
A Complete Statistics Beginner's Guide to Data Science!!

Introduction:

In this article, we will learn all the important statistical concepts that are required for data science roles.

1. Difference between parameter and statistic

In day-to-day work we keep talking about the population and the sample. So it is important to know the terminology used to describe each of them.

A parameter is a number that describes the population data, and a statistic is a number that describes the data in a sample.

2. Statistics and their types

Wikipedia defines statistics as “a discipline that deals with the collection, organization, analysis, interpretation and presentation of data”.

This means that, as part of statistical analysis, we collect, organize and extract meaningful information from data, either through visualizations or mathematical explanations.

Statistics is broadly classified into two types:

Descriptive statistics
Inferential statistics

Descriptive statistics:

As the name suggests, in descriptive statistics we describe the data using the mean, standard deviation, charts or probability distributions.

Basically, as part of the descriptive statistics, we measure the following:

Frequency: the number of times a data point occurs
Central tendency: the centrality of the data: mean, median and mode
Dispersion: the spread of the data: range, variance and standard deviation
Measures of position: percentiles and quantile ranges
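As a rough illustration of the first and last items in this list, the sketch below counts frequencies and computes quartiles using only the Python standard library. The data set is hypothetical and invented for this example; central tendency and dispersion are covered in section 4.

```python
from collections import Counter
import statistics

# Hypothetical sample: number of laptops owned per household
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Frequency: how many times each value occurs
frequency = Counter(data)
print("Most common values:", frequency.most_common(3))

# Position: the three quartile cut points (25th, 50th and 75th percentiles).
# statistics.quantiles uses the "exclusive" method by default; other tools
# may use a slightly different convention and give slightly different cuts.
q1, q2, q3 = statistics.quantiles(data, n=4)
print("Quartiles:", q1, q2, q3)
```

The second quartile is the median, so it should always match `statistics.median(data)`.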

Inferential statistics:

In inferential statistics, we estimate the population parameters, or we perform hypothesis tests to evaluate assumptions made about the population parameters.

In simple terms, we take the descriptive statistics computed on a sample and generalize them to the population.

For instance, suppose we are conducting a survey on the number of two-wheeler owners in a city. The city has a total population of 5 lakh (500,000) people. We take a sample of 1,000 people, as it is impractical to collect data from the entire population.

From the survey, we find that 800 of the 1,000 people (80%) own a two-wheeler. We can then generalize this result to the population and conclude that about 4 lakh (400,000) of the 5 lakh people own a two-wheeler.
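The arithmetic of this survey example can be sketched in a few lines of Python. The variable names are made up for illustration, and the standard error line is an addition not discussed in the article: it assumes a simple random sample and only hints at how uncertain the 80% figure is.

```python
import math

population_size = 500_000   # 5 lakh people in the city
sample_size = 1_000         # people surveyed
owners_in_sample = 800      # surveyed people who own a two-wheeler

# Sample statistic: proportion of owners in the sample
p_hat = owners_in_sample / sample_size

# Point estimate of the population parameter, scaled to the whole city
estimated_owners = p_hat * population_size

# Standard error of the sample proportion (assumes simple random sampling)
standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)

print(f"Sample proportion: {p_hat:.0%}")
print(f"Estimated owners in the city: {estimated_owners:,.0f}")
print(f"Standard error of the proportion: {standard_error:.4f}")
```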

3. Data types and measurement level

At a high level, data is classified into two types: qualitative and quantitative.

Qualitative data is not numerical. Some examples are eye color, car brand, city, etc.

Quantitative data, on the other hand, is numerical and is further divided into continuous and discrete data.

Continuous data: Can be represented in decimal format. Some examples are height, weight, time, distance, etc.

Discrete data: Cannot be represented in decimal format. Some examples are the number of laptops and the number of students in a class.

Discrete data is further split into categorical and count data.

Categorical data: Data that can be divided into groups. Some examples are age group, sex, etc.

Count data: Data made up of non-negative integers. Example: the number of children a couple has.

[Figure: Types of data (author's image)]

Measurement level

In statistics, the level of measurement is a classification that describes the relationship between the values of a variable.

There are four fundamental levels of measurement. They are:

Nominal scale
Ordinal scale
Interval scale
Ratio scale

1. Nominal scale: This scale contains the least amount of information, since the data only has names / labels. It can be used for classification. We cannot perform mathematical operations on nominal data because the options have no numerical value (any numbers associated with the names can only be used as labels).

Example: Which country do you belong to? India, Japan, Korea.

2. Ordinal scale: Compared to the nominal scale, the ordinal scale carries more information because, together with the labels, it has an order / direction.

Example: Income level: high income, medium income, low income.

3. Interval scale: It is a numerical scale. The interval scale carries more information than the nominal and ordinal scales. Along with the order, we know the difference between two values (the interval indicates the distance between two entities).

The mean, median and mode can be used to describe the data.

Example: temperature, income, etc.

4. Ratio scale: The ratio scale carries the most information about the data. Unlike the other three scales, the ratio scale has a true zero point. Put simply, the ratio scale combines the properties of the nominal, ordinal and interval scales.

Example: weight, height, etc.

4. Moments of business decision

We have four moments of business decision that help us understand the data.

4.1. Measures of central tendency

(Also known as the first moment business decision)

It describes the centrality of the data. Put simply, it is the part of descriptive statistical analysis in which a single central value represents the entire data set.

The central tendency of a data set can be measured by:

Mean: It is the sum of all data points divided by the total number of values in the data set. The mean cannot always be trusted because it is influenced by outliers.

Median: It is the middle value of a sorted data set. If the size of the data set is even, the median is calculated by taking the average of the two middle values.

Mode: It is the most repeated value in the data set. Data with only one mode is called unimodal, data with two modes is called bimodal and data with more than two modes is called multimodal.
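The three measures can be compared on a small example. This is a minimal sketch with made-up income figures, chosen so that a single outlier shows why the mean can be misleading while the median stays put.

```python
import statistics

# Hypothetical monthly incomes (in thousands); the last value is an outlier
incomes = [30, 32, 35, 35, 38, 40, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value of the sorted data
modes = statistics.multimode(incomes)       # every most-frequent value

print("Mean:", round(mean_income, 2))
print("Median:", median_income)
print("Mode(s):", modes)

# Even-sized data set: the median is the average of the two middle values
even_median = statistics.median([10, 20, 30, 40])

# Two values tie for most frequent, so this data set is bimodal
bimodal = statistics.multimode([1, 1, 2, 3, 3])
print("Even-sized median:", even_median, "| bimodal modes:", bimodal)
```

`statistics.multimode` is used instead of `statistics.mode` because it returns all tied values, which makes unimodal and bimodal data easy to tell apart.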

4.2. Measures of dispersion

(Also known as the second moment business decision)

It describes the spread of the data around its center.

Dispersion can be measured using:

Variance: It is the average squared distance of all data points from their mean. The problem with variance is that the units are also squared.

Standard deviation: It is the square root of the variance. It brings the measure back to the original units.

Range: It is the difference between the maximum and minimum values of a data set.

Measure	Population	Sample
Mean	µ = (Σ Xᵢ)/N	x̄ = (Σ xᵢ)/n
Median	The middle value of the sorted data	The middle value of the sorted data
Mode	Most frequently occurring value	Most frequently occurring value
Variance	σ² = Σ(Xᵢ – µ)²/N	s² = Σ(xᵢ – x̄)²/(n – 1)
Standard deviation	σ = √(Σ(Xᵢ – µ)²/N)	s = √(Σ(xᵢ – x̄)²/(n – 1))
Range	Maximum – minimum	Maximum – minimum
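The difference between the population and sample columns comes down to the denominator. The sketch below, using a hypothetical set of heights, computes both versions with the Python standard library: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n – 1.

```python
import statistics

# Hypothetical data set (heights in cm)
heights = [150, 160, 165, 170, 180]

# Treating the data as the full population (divide by N)
pop_variance = statistics.pvariance(heights)
pop_std = statistics.pstdev(heights)

# Treating the data as a sample (divide by n - 1)
sample_variance = statistics.variance(heights)
sample_std = statistics.stdev(heights)

# Range: maximum minus minimum
data_range = max(heights) - min(heights)

print("Population variance / std:", pop_variance, pop_std)
print("Sample variance / std:", sample_variance, round(sample_std, 3))
print("Range:", data_range)
```

Note that the variance is in squared centimetres, while the standard deviation and the range are back in centimetres.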

4.3. Skewness

(Also known as the third moment business decision)

It measures the asymmetry in the data. The two types of skewness are:

Positive / skewed to the right: The data is said to be positively skewed if most of the data is concentrated on the left side and has a tail to the right.

Negative / skewed to the left: The data is said to be negatively skewed if most of the data is concentrated on the right side and has a tail to the left.

The skewness formula is E[((X – µ)/σ)³] = E[Z³], where Z = (X – µ)/σ is the standardized value.
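Read as "the average of the cubed standardized values", the skewness formula can be sketched directly in Python. The helper name `skewness` and the three data sets are hypothetical, and this is the population form of the formula; statistical libraries often apply an additional small-sample correction.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])

right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]          # long tail to the right
left_skewed = [-15, -4, -3, -3, -3, -2, -2, -1]   # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

print("Right-skewed:", round(skewness(right_skewed), 3))  # positive value
print("Left-skewed:", round(skewness(left_skewed), 3))    # negative value
print("Symmetric:", round(skewness(symmetric), 3))        # approximately zero
```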

[Figure: Positively skewed data (author's image)]

[Figure: Negatively skewed data (author's image)]

4.4. Kurtosis

(Also known as the fourth moment business decision)

It describes the sharpness of the central peak or the heaviness of the tails. The three types of kurtosis are:

Positive / leptokurtic: Has a sharp peak and heavier tails than the normal distribution.

Negative / platykurtic: Has a wide, flat peak and lighter tails than the normal distribution.

Mesokurtic: The normal distribution.

The kurtosis formula is E[((X – µ)/σ)⁴] – 3 = E[Z⁴] – 3, where Z = (X – µ)/σ is the standardized value.
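The kurtosis formula can be sketched the same way, as the average of the fourth powers of the standardized values minus 3. The helper name `excess_kurtosis` and both data sets are made up for illustration, and this is the uncorrected population form.

```python
import statistics

def excess_kurtosis(values):
    """Population excess kurtosis: mean of the fourth-power z-scores, minus 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 4 for x in values]) - 3

# Evenly spread values with no pronounced peak
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Most values sit exactly at the center, with two values far away from it
peaked = [5, 5, 5, 5, 5, 5, 5, 5, -10, 20]

print("Flat data:", round(excess_kurtosis(flat), 3))      # negative value
print("Peaked data:", round(excess_kurtosis(peaked), 3))  # positive value
```

Subtracting 3 sets the normal distribution as the zero reference point, which is why the result is usually called excess kurtosis.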

[Figure: Kurtosis (author's image)]

Together, skewness and kurtosis are called shape statistics.

5. Central limit theorem (CLT)

Instead of analyzing the data of the entire population, we almost always take a sample for analysis. The problem with sampling is that “the sample mean is a random variable: it varies from sample to sample”. The random sample we draw can never be an exact representation of the population. This phenomenon is called sampling variation.

To deal with sampling variation, we use the central limit theorem. According to the central limit theorem:

1. The distribution of the sample means follows a normal distribution if the population is normal.

2. The distribution of the sample means is approximately normal even when the population is not normal, provided the sample size is large enough.

3. The grand average of all sample mean values gives us the population mean.

4. As a rule of thumb, the sample size should be at least 30. In practice, the condition on the sample size (n) is:

n > 10 (k₃)², where k₃ is the sample skewness.

n > 10 (k₄), where k₄ is the sample kurtosis.
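Points 2 and 3 can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the exponential population, the sample size of 50 and the 2,000 repetitions are arbitrary choices, not values from the article.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# A clearly non-normal (right-skewed) population: exponential with mean 1
population = [random.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.fmean(population)

sample_size = 50      # n, larger than the rule-of-thumb 30
num_samples = 2_000   # how many samples we draw

# Draw many random samples and record the mean of each one
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

# The grand mean of the sample means should be close to the population mean
grand_mean = statistics.fmean(sample_means)

# The spread of the sample means should be close to sigma / sqrt(n)
spread_of_means = statistics.stdev(sample_means)
expected_spread = statistics.pstdev(population) / sample_size ** 0.5

print(f"Population mean:      {population_mean:.3f}")
print(f"Grand mean of means:  {grand_mean:.3f}")
print(f"Spread of the means:  {spread_of_means:.3f}")
print(f"sigma / sqrt(n):      {expected_spread:.3f}")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is strongly skewed.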

6. Probability distributions

In statistical terms, a distribution function is a mathematical expression that describes the probability of different possible outcomes for an experiment.

Please read my article on the different types of probability distributions.

7. Graphical representations

Graphical representation refers to the use of tables or graphs to visualize, analyze and interpret numerical data.

For a single variable (univariate analysis), we have a bar chart, a line diagram, a frequency diagram, a dot plot, a box plot and the normal QQ plot.

We will discuss the box plot and the normal QQ plot.

7.1. Box plot

A box plot is a way to visualize the distribution of data based on a five-number summary. It is used to identify outliers in the data.

The five numbers are the minimum, first quartile (Q1), median (Q2), third quartile (Q3) and maximum.

The box region contains the middle 50% of the data. The region covering the lowest 25% of the data is called the lower whisker, and the region covering the highest 25% of the data is called the upper whisker.

The interquartile range (IQR) is the difference between the third and the first quartile. IQR = Q3 – Q1.

Outliers are the data points below the lower whisker and beyond the upper whisker.

The formula for the outlier boundaries is Q ± 1.5 * (IQR)

Outliers below the lower whisker are the points smaller than Q1 – 1.5 * (IQR)

Outliers beyond the upper whisker are the points larger than Q3 + 1.5 * (IQR)
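The outlier rule can be sketched without any plotting library. The data set below is hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions exist and can shift the boundaries slightly.

```python
import statistics

# Hypothetical data set; the value 45 looks suspicious
data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]

# Quartiles of the data (inclusive method: treats data as the whole population)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boundaries beyond which a point is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Five-number summary:", min(data), q1, q2, q3, max(data))
print("IQR:", iqr)
print("Boundaries:", lower_fence, upper_fence)
print("Outliers:", outliers)
```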

See my article on detecting outliers using a box plot.

7.2. Normal QQ plot

A normal QQ plot is a kind of scatter plot that is drawn by plotting two sets of quantiles against each other. It is used to check whether the data is normal or not.

On the x-axis we have the theoretical Z scores and on the y-axis we have the actual sample quantiles. If the points of the scatter plot fall along a straight line, the data is said to be normal.
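The points of a normal QQ plot can be computed by hand, which makes the construction easier to follow. In this sketch the function name `normal_qq_points` and the sample are hypothetical, and the plotting positions (i + 0.5) / n are one common convention assumed here; plotting libraries may use a slightly different one.

```python
import statistics

def normal_qq_points(values):
    """Return (theoretical z-score, sample quantile) pairs for a normal QQ plot."""
    ordered = sorted(values)            # the sample quantiles (y-axis)
    n = len(ordered)
    standard_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
    # Theoretical z-scores (x-axis) at the assumed plotting positions
    z_scores = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(z_scores, ordered))

# Hypothetical sample
sample = [4.1, 5.0, 5.2, 4.7, 5.9, 5.5, 4.4, 5.1, 4.9, 5.3]

points = normal_qq_points(sample)
for z, quantile in points:
    print(f"z = {z:6.3f}   sample quantile = {quantile}")
```

The resulting pairs can be passed to any plotting library as a scatter plot; points lying close to a straight line suggest the sample is approximately normal.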

8. Hypothesis testing

Hypothesis testing in statistics is a way of testing assumptions made about population parameters.

See my article on hypothesis testing to read it in detail.

Final notes:

Thanks for reading to the end. By the end of this article, we have become familiar with the important statistical concepts.

I hope this article is informative. Feel free to share it with your fellow students.
