Hypothesis Testing for Data Science Beginners

This article was published as part of the Data Science Blogathon.

Introduction

Before looking at hypothesis testing, let's clarify what exactly a hypothesis is:

"A hypothesis is a proposed explanation for an unexplained occurrence that does not fit into current theory".

Hypothesis testing is the process by which an analyst tests an assumption about a population parameter. The methodology the analyst uses depends on the nature of the data and the reason for the analysis.

Hypothesis testing

Steps to perform the hypothesis test:

  1. Define null and alternative hypotheses

  2. Examine data, check assumptions

  3. Calculate test statistic

  4. Determine the corresponding p-value

  5. Make a decision about the null hypothesis.

Let's work through these steps with an example.

Problem: Considering Italian adults aged 18 to 30 living in Italy, do men have a significantly higher mean body mass index (BMI) than women?

Here the population is Italian adults (18-30) in Italy and the parameter of interest is the mean body mass index (BMI).

Step 1: Define the hypotheses

  • Null: There is no difference in the mean BMI
    H0: μ1 = μ2 [μ1 represents the population mean BMI for males and μ2 represents the population mean BMI for females]
    Here H0 says the two means are equal to each other
  • Alternative: There is a significant difference in the mean BMI
    HA: μ1 ≠ μ2 [μ1 represents the population mean BMI for males and μ2 represents the population mean BMI for females]
    Here HA says the two means are not equal to each other

  • Significance level = 5%

Step 2: Examine data and verify assumptions

In this step, the data was filtered to include only Italian adults between 18 and 30 years old. After that, we compute summary statistics such as the mean, the minimum, the maximum, the standard deviation and the sample size for both males and females.
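The summary statistics described above can be sketched with Python's standard `statistics` module. The BMI values below are made-up numbers for illustration only, not data from the example, and `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(values):
    """Return the summary statistics mentioned in Step 2 for one group."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical BMI values for illustration only (not real survey data).
bmi_male = [22.1, 24.5, 27.3, 21.8, 25.0, 23.4]
bmi_female = [21.0, 26.2, 23.7, 22.5, 24.9, 20.8]

male_stats = summarize(bmi_male)
female_stats = summarize(bmi_female)
print("males:  ", male_stats)
print("females:", female_stats)
```

In a real analysis the two lists would come from the filtered data set, and the sample sizes would be far larger than six per group.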

Some of the assumptions that we must verify are the following:

  • Samples are considered simple random samples
  • The samples are independent of each other
  • Both response populations are approximately normal or the sample sizes are large enough.

Step 3: Calculate the test statistic

The test statistic is a measure of how far our sample statistic is from our hypothesized population parameter, in terms of estimated standard errors.

  • Test statistic t = (best estimate - null value) / estimated standard error
  • The best estimate is the difference between the male and female sample means
  • The null value is the difference hypothesized under the null hypothesis (here, 0)
  • The estimated standard error for the difference of two means depends on the approach we use.
  • The two approaches you can use are the pooled approach and the unpooled approach.
  • The pooled approach assumes that the variances of the two populations are equal.
  • The unpooled approach drops the assumption of equal variances.
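As a rough sketch of the calculation above, the snippet below computes the estimated standard error under both the equal-variance (pooled) and the unequal-variance (unpooled) approach, then forms the test statistic. The function names and the summary numbers are assumptions chosen for illustration; they are not taken from the example data.

```python
import math


def pooled_se(s1, n1, s2, n2):
    """Standard error of the mean difference, assuming equal population variances."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def unpooled_se(s1, n1, s2, n2):
    """Standard error of the mean difference without the equal-variance assumption."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)


def compute_statistic(mean1, mean2, se, null_value=0.0):
    """(best estimate - null value) / estimated standard error."""
    return ((mean1 - mean2) - null_value) / se


# Hypothetical summary statistics, for illustration only.
mean_m, sd_m, n_m = 24.0, 4.0, 50
mean_f, sd_f, n_f = 23.0, 4.0, 50

se_pooled = pooled_se(sd_m, n_m, sd_f, n_f)
se_unpooled = unpooled_se(sd_m, n_m, sd_f, n_f)
t_pooled = compute_statistic(mean_m, mean_f, se_pooled)

print(f"pooled SE = {se_pooled:.3f}, unpooled SE = {se_unpooled:.3f}")
print(f"test statistic (pooled) = {t_pooled:.3f}")
```

When both groups have the same size and the same sample standard deviation, as in these made-up numbers, the two standard errors coincide. They diverge when the group sizes and the sample variances both differ.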

Step 4: Determine the p-value

The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic equal to our value t or more extreme.

We calculate this probability using the t distribution with df = n1 + n2 - 2 degrees of freedom.

We need to check both sides because the alternative hypothesis is two-sided: our alternative is "not equal". Thus, we have to look at both the upper and lower tails of the distribution.
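A minimal sketch of this two-tailed calculation, assuming SciPy is available, uses the survival function `scipy.stats.t.sf` (the upper-tail probability) and doubles it to cover both tails. The test statistic and sample sizes below are hypothetical inputs, and `two_sided_p_value` is an illustrative helper name.

```python
from scipy import stats


def two_sided_p_value(t_stat, df):
    """Probability of a statistic at least as extreme as t_stat in either tail."""
    return 2 * stats.t.sf(abs(t_stat), df)


# Hypothetical inputs for illustration only.
t_stat = 1.25
n1, n2 = 50, 50
df = n1 + n2 - 2  # degrees of freedom when equal variances are assumed

p_value = two_sided_p_value(t_stat, df)
print(f"two-sided p-value = {p_value:.3f}")
```

Taking the absolute value first means a negative test statistic gives the same p-value as its positive mirror image, which matches the idea of checking both tails.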

The distribution chart looks as shown below with its corresponding sample size and degrees of freedom:

Distribution curve

In the graph above, we can see both the positive value of our test statistic and, mirrored on the other side, its negative value. If the difference in the population mean BMI between men and women were really zero, it would be quite probable to observe a difference in sample means as large as ours, or more extreme: there is almost a 20 percent chance of seeing that. Because this p-value is larger than our 5% significance level, we do not reject the null hypothesis.

Step 5: Make a decision

If the p-value is greater than the significance level, there is only weak evidence against the null hypothesis. Therefore, we do not reject the null hypothesis.
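The decision rule can be written as a small function. This is only a sketch: `decide` is a hypothetical name, and treating a p-value exactly equal to the significance level as a rejection is a common convention assumed here, not something the article specifies.

```python
def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level alpha."""
    if p_value <= alpha:  # assumed convention for the boundary case
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# A p-value of roughly 20%, as in the example above, at the 5% level.
print(decide(0.20))  # prints "fail to reject the null hypothesis"
```

Failing to reject the null hypothesis does not prove that the two means are equal. It only says the data do not provide strong enough evidence of a difference.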

In summary, hypothesis tests are used to test theories about a parameter of interest. Here, that parameter is the difference in the population means. The basic steps of this hypothesis test are as follows. First, we define our hypotheses. Next, we examine our data, check our assumptions and calculate our test statistic. With this test statistic, we determine the corresponding p-value and, finally, we make a decision based on this value.

The assumptions for the two-sample t-test for the population means are that both data sets are simple random samples and are independent of each other. We need to ensure that both response populations are approximately normally distributed. Otherwise, we need sample sizes large enough to apply the central limit theorem. Whether or not the population variances are equal is also crucial, because it determines whether we use the pooled or the unpooled approach. Finally, we need to know how to interpret the p-value, the decision and our final conclusion. These are all very important when conducting a hypothesis test.

For more articles, see this profile:
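For readers who prefer a library call, the choice between the equal-variance and unequal-variance approaches corresponds to the `equal_var` argument of SciPy's `scipy.stats.ttest_ind`. The sketch below assumes SciPy is installed and uses made-up BMI values that are far too few for a real analysis; they only show the mechanics.

```python
from scipy import stats

# Hypothetical BMI values for illustration only (not real survey data).
bmi_male = [22.1, 24.5, 27.3, 21.8, 25.0, 23.4]
bmi_female = [21.0, 26.2, 23.7, 22.5, 24.9, 20.8]

# Equal-variance approach: assumes both populations share one variance.
pooled = stats.ttest_ind(bmi_male, bmi_female, equal_var=True)

# Unequal-variance approach (Welch's t-test): drops that assumption.
unpooled = stats.ttest_ind(bmi_male, bmi_female, equal_var=False)

print(f"equal variances:   t = {pooled.statistic:.3f}, p = {pooled.pvalue:.3f}")
print(f"unequal variances: t = {unpooled.statistic:.3f}, p = {unpooled.pvalue:.3f}")
```

By default `ttest_ind` performs a two-sided test, which matches the "not equal" alternative used in this example. With equal group sizes the two test statistics coincide and only the degrees of freedom, and therefore the p-values, differ.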

https://likhithakakanuru.medium.com/
