Whenever a sample is collected and interpreted, we also need to check how reliable it is and how consistent it is with the population before drawing any inference about that population. Statistical hypothesis testing does this for us. For example, suppose the average marks of a class are 85, and a sample of 15 students from that class has an average of 80.
We may then want to know whether this sample average is in line with the population average of 85. Hypothesis testing is like a litmus test that tells us whether to reject or retain an assumption or a claim, except that its verdict is probabilistic rather than deterministic. It is a technique for comparing two datasets, or a sample with the dataset it was drawn from.
Let us first see what a hypothesis is and take a look at some of the terms used in hypothesis testing.
A hypothesis is an assumption about the population parameters that we want to verify. Every test involves two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement of no difference and is denoted by H0. It asserts that there is no real difference between the sample and the population, and that any observed difference is accidental or due to chance. An alternative hypothesis is a statement against the null hypothesis.
It is the contradiction of the null hypothesis. It is usually denoted by H1. For example, a sample of 50 light bulbs is tested for their life and we want to test if the average lifetime of the bulbs is 300 days. Then we will set up the null hypothesis as “the lifetime is 300 days” and the alternative hypothesis will be “the lifetime is not 300 days”.
To test a hypothesis we need to have a single value based on the sample observations that can be compared with a pre-defined value so as to reach a decision. This value is computed using a certain formula and follows a particular probability distribution under some assumptions. Since the value calculated is used for testing and is derived from the sample, it is called a test statistic.
We all are familiar with the game of darts. Consider a simpler version of such a game in which an aim on the outer ring results in the disqualification (rejected, straightaway) of the aimer whereas an aim on the inner two circles results in qualification (acceptance) of the aimer for further rounds (only qualified, not yet the winner).
Just as this dartboard is divided into areas of rejection and acceptance, a probability curve is divided into an acceptance region and a rejection region (also called the critical region). If the test statistic falls in the critical region, the null hypothesis is rejected; otherwise it is not rejected. Hence, the critical region can be defined as the set of values of the test statistic that lead to the rejection of H0.
A point to be noted here is that rejecting the null hypothesis is a much stronger statement than accepting it (as in the example above, where the aimer is only qualified and is not the winner). The reason is that we deal with a sample rather than the population itself. In R.A. Fisher’s own words:
“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”
The confidence with which a decision is taken depends on the significance level chosen. The significance level is the probability of rejecting the null hypothesis when it is true. It is the size of the critical region and is expressed in terms of percentage. For example, a significance level of 5% means that the null hypothesis will be rejected 5 out of 100 times when it is true. The significance level is denoted by α.
One-Tailed Test and Two-Tailed Test:
As we have seen above, the critical region is a portion of the probability curve. This portion can lie on either end of the curve or on both ends of the curve. A test is recognized as one tailed or two tailed depending upon which side of the curve the critical area lies which further depends on the nature of our alternative hypothesis. For example, in the lightbulb problem if we want to test that the lifetime of bulbs is greater than 300 days then our alternative hypothesis will be “lifetime of bulbs >300 days” (right-tailed).
If we want to test if it is less than 300 days then the alternative hypothesis becomes “lifetime of bulbs < 300 days” (left-tailed). If we do not care whether it is greater or less and just want to test if it is 300 days or not, then the alternative hypothesis becomes “lifetime of bulbs ≠ 300 days” (two-tailed). For a test statistic that follows the standard normal distribution, the critical region at the 5% level is z > 1.645 for the right-tailed test, z < −1.645 for the left-tailed test, and |z| > 1.96 for the two-tailed test.
For a two-tailed test, the critical region is divided into two parts, one for the right side and other for the left side. While for one-tailed test it remains undivided. So, if we are dealing with a two-tailed test at significance level (size of the critical region) α% then on each side we have (α/2)% of the area.
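The way the critical region is split between the tails can be checked numerically. The following is a minimal sketch in base R, assuming a test statistic that follows the standard normal distribution; the variable names are illustrative.

```r
# Critical values of the standard normal distribution at significance level alpha.
alpha <- 0.05

# Right-tailed test: reject H0 when z is above this value.
crit_right <- qnorm(1 - alpha)

# Left-tailed test: reject H0 when z is below this value.
crit_left <- qnorm(alpha)

# Two-tailed test: alpha is split equally, so each tail holds alpha / 2.
crit_two <- qnorm(1 - alpha / 2)

print(round(c(right = crit_right, left = crit_left, two_sided = crit_two), 3))
```

The one-tailed critical values are about ±1.645 and the two-tailed one is about 1.96, which are the values usually read from a standard normal table.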
In simple terms, the p-value is a measure of the evidence against the null hypothesis. Consider a coin that someone says is biased. To test this claim we set up the null hypothesis that “the coin is unbiased” against the alternative hypothesis that “the coin is biased”. To test for bias the coin is tossed 20 times, and suppose that 18 heads and 2 tails are obtained. We would expect a similar number of heads and tails if the coin were unbiased, but to quantify how unusual this result is we need evidence in the form of a p-value.
The p-value is the probability, computed assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed. In this case, the results at least as extreme as the observed one are 18, 19 or 20 heads and, since the test is two-sided, 18, 19 or 20 tails. Calculating the probabilities of these results (using the binomial distribution) under the null hypothesis we get:
P(18 heads and 2 tails) = P(18 tails and 2 heads) = 0.000181
P(19 heads and 1 tail) = P(19 tails and 1 head) = 0.00001907
P(20 heads and 0 tail) = P(20 tails and 0 head) = 0.0000009536
Adding up the probabilities we get the p-value as 0.0004.
This p-value is compared with the significance level, which we will take here as 0.05. If the p-value is greater than the significance level then the evidence against the null hypothesis is weak, and we do not reject the null hypothesis. If the p-value is less than the significance level then the evidence against the null hypothesis is strong and hence we reject the null hypothesis.
In this case, the p-value is very small compared to the significance level. Therefore we reject the null hypothesis and conclude that the coin is biased.
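The arithmetic of the coin example can be reproduced with a short sketch in base R. It uses `dbinom` for the binomial probabilities and `binom.test` as a cross-check; the variable names are illustrative.

```r
n_tosses <- 20
n_heads <- 18

# Probability of 18, 19 or 20 heads under H0: P(head) = 0.5.
upper_tail <- sum(dbinom(n_heads:n_tosses, size = n_tosses, prob = 0.5))

# By symmetry, the same probability applies to 18, 19 or 20 tails.
p_value <- 2 * upper_tail
print(p_value)

# Cross-check with the built-in exact binomial test (two-sided).
check <- binom.test(n_heads, n_tosses, p = 0.5, alternative = "two.sided")
print(check$p.value)
```

Both values come out at about 0.0004, in agreement with the sum computed by hand.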
Types Of Error:
Statistics is probabilistic and so is hypothesis testing. There is always a probability of making a wrong decision. While making decisions, four possibilities arise:
- Rejecting H0 when it is true
- Accepting H0 when it is false
- Accepting H0 when it is true
- Rejecting H0 when it is false
Clearly, the last two decisions are correct. The first two decisions wrongly reject or wrongly accept the null hypothesis, so they are errors. The first one rejects the null hypothesis when it is, in fact, true; it is called the type I error. The probability of committing a type I error is denoted by α.
The second one accepts the null hypothesis when it is false; it is called the type II error. The probability of committing a type II error is denoted by β.
Consider an example of testing whether a new toothpaste is better than the previous toothpaste in fighting dental cavities. The hypotheses are H0: there is no difference between the two toothpastes, against H1: the new toothpaste is better than the old one. Now suppose that the new toothpaste is actually better. If our test accepts the null hypothesis that there is no difference between the toothpastes then we commit a type II error.
While testing a hypothesis, our aim is to reduce both types of error, but for a fixed sample size it is not possible to minimize both simultaneously: making one smaller generally makes the other larger. So we fix the probability of type I error (α) in advance at a satisfactory level and try to minimize the probability of type II error (β). α is also known as the significance level or the size of the critical region.
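The trade-off between the two errors can be illustrated with a small sketch in base R. It assumes a right-tailed z-test loosely based on the light bulb example; the true mean of 305 days and the standard deviation of 20 days are hypothetical values chosen only for illustration, and `type2_error` is an illustrative helper, not a built-in function.

```r
# Hypothetical setting, loosely based on the light bulb example.
mu0 <- 300       # mean lifetime under H0 (days)
mu_true <- 305   # assumed true mean (illustrative value)
sigma <- 20      # assumed population standard deviation (illustrative value)
n <- 50          # sample size

# Probability of a type II error for a right-tailed z-test at level alpha.
type2_error <- function(alpha) {
  z_crit <- qnorm(1 - alpha)                    # critical value
  shift <- (mu_true - mu0) / (sigma / sqrt(n))  # standardized true difference
  pnorm(z_crit - shift)                         # P(H0 not rejected | H1 true)
}

alphas <- c(0.10, 0.05, 0.01)
betas <- sapply(alphas, type2_error)
print(data.frame(alpha = alphas, beta = round(betas, 3)))
```

With the sample size held fixed, β grows as α is made smaller, which is why α is fixed first and β is then reduced by other means such as a larger sample.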
Large Sample Tests:
We know that for large sample sizes, the sampling distribution of a statistic such as the sample mean can be approximated by the normal distribution, thanks to the Central Limit Theorem. This forms the basis of the large sample tests. Let us take a look at some of the tests and also how to perform them in R.
Test of Significance For Single Mean:
Consider a random sample of size n (≥ 30) from a normal population with mean µ and variance σ². We know that the sample mean (x̄) of this sample will also be normally distributed as N(µ, σ²/n). Thus, the standard normal variate corresponding to x̄ is:
z = (x̄ − µ) / (σ/√n)
If the population standard deviation is not known, which is usually the case, then we use sample variance as its estimate. Let us take an example of a pizza delivery boy who claims that he takes on an average 8.9 minutes to reach his destination to deliver pizzas. To check on this claim the agency that hires him notes his time taken for 50 orders. It gets a mean of 9.3 minutes with a standard deviation of 1.6 minutes.
Now let us check if the average time taken to deliver a pizza is 8.9 minutes or not. We start by setting up the null hypothesis, H0: the average time taken to deliver a pizza is 8.9 minutes (µ = 8.9), against the alternative hypothesis, H1: the average time taken to deliver a pizza is not 8.9 minutes (µ ≠ 8.9).
According to the situation we have: sample mean (x̄) = 9.3 minutes, population mean under H0 (µ) = 8.9 minutes, sample standard deviation (s) = 1.6 minutes, which is used as an estimate of the population standard deviation (σ), and sample size (n) = 50. Since the sample is large (≥ 30), we apply the large sample test; for a small sample with unknown σ the t-test is used instead. On substituting the values in the test statistic formula we get z = (9.3 − 8.9) / (1.6/√50) ≈ 1.77.
This is a two-tailed test, so the critical region lies on both sides of the curve. The critical value from the standard normal table is 1.96 (we have taken a significance level of 0.05). The calculated value of the test statistic, about 1.768, is less than the tabulated value of 1.96. Hence, we do not reject the null hypothesis at the 0.05 level of significance.
This test can be performed in R as well.
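A minimal sketch in base R, working directly from the summary figures of the pizza example, could look like this. The variable names are illustrative, and the sample standard deviation is used in place of σ as described above.

```r
x_bar <- 9.3   # sample mean (minutes)
mu0 <- 8.9     # claimed mean under H0 (minutes)
s <- 1.6       # sample standard deviation, used as an estimate of sigma
n <- 50        # number of orders observed
alpha <- 0.05  # significance level

# Test statistic for a single mean.
z <- (x_bar - mu0) / (s / sqrt(n))

# Two-tailed critical value and p-value.
z_crit <- qnorm(1 - alpha / 2)
p_value <- 2 * pnorm(-abs(z))

print(round(c(z = z, critical = z_crit, p_value = p_value), 4))

# TRUE would mean H0 is rejected at level alpha.
reject_h0 <- abs(z) > z_crit
print(reject_h0)
```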
Since we are dealing with a two-tailed test, the p-value is calculated as p = P(Z ≤ -z) + P(Z ≥ z) = 2*P(Z ≤ -z), which is approximately 0.077 here. The p-value is greater than the significance level of 0.05, so we do not have sufficient evidence against the null hypothesis and do not reject it.
Test Of Significance For Difference Of Means:
In the above test, we had one sample and we compared the value of the sample mean to that of the population mean. We can also compare the value of two sample means to find out if they belong to populations with identical means or to check if one population is superior or inferior to other.
For this, two samples of sizes n1 and n2 are taken from the same or different populations with means µ1 and µ2 and variances σ1² and σ2² respectively. Also, we know that the sample means x̄1, x̄2, and their difference, x̄1 − x̄2, are distributed as:
x̄1 ~ N(µ1, σ1²/n1), x̄2 ~ N(µ2, σ2²/n2), x̄1 − x̄2 ~ N(µ1 − µ2, σ1²/n1 + σ2²/n2)
Under the null hypothesis that the population means are equal (µ1 = µ2), the test statistic is given as:
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
If the population standard deviations are not known then the sample variances s1² and s2² are used as:
z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
For better understanding, consider an example where it is required to check if the mean level of pay of one state is greater than that of another state. Two samples of employees, of sizes 1200 and 1000, are taken. The means and standard deviations of the samples (in thousands of rupees) are given as:
Here, we have to test the null hypothesis, H0: there is no difference between the average pay of the two states, i.e., µ1 = µ2, against the alternative hypothesis, H1: the mean level of pay of state 1 is greater than that of state 2, i.e., µ1 > µ2 (right-tailed test). The population standard deviations are not known, so they are estimated using the sample standard deviations. The value of the test statistic is |z| = 24.28. The tabulated value for the right-tailed test at the 5% level of significance is 1.645.
The calculated value is much greater than the tabulated value at the 5% significance level, and thus we reject the null hypothesis and conclude that the mean pay of state 1 is higher than the mean pay of state 2.
In R, this test can be done as follows:
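The sketch below shows one way to code the right-tailed test for a difference of means in base R. The function `two_mean_z_test` is an illustrative helper, not a built-in. Only the sample sizes (1200 and 1000) come from the example; the means and standard deviations passed in the call are placeholder values, so the resulting z is not expected to equal 24.28.

```r
# Right-tailed large-sample z-test for H0: mu1 = mu2 against H1: mu1 > mu2,
# computed from summary statistics.
two_mean_z_test <- function(mean1, mean2, sd1, sd2, n1, n2, alpha = 0.05) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)      # standard error of the difference
  z <- (mean1 - mean2) / se                # test statistic
  z_crit <- qnorm(1 - alpha)               # right-tailed critical value
  p_value <- pnorm(z, lower.tail = FALSE)  # P(Z >= z) under H0
  list(z = z, critical = z_crit, p_value = p_value, reject_h0 = z > z_crit)
}

# Placeholder means and standard deviations (NOT the figures from the example).
result <- two_mean_z_test(mean1 = 50, mean2 = 48, sd1 = 10, sd2 = 12,
                          n1 = 1200, n2 = 1000)
print(result)
```

Substituting the actual sample means and standard deviations for the placeholders gives the test described above.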
Test Of Significance For Difference Of Standard Deviations:
This test is used to check whether the standard deviations of two samples differ significantly or not. Let s1 and s2 be the standard deviations of two independent samples. Then, under the null hypothesis that the population standard deviations do not differ, i.e., σ1 = σ2, the test statistic for large samples is given as:
z = (s1 − s2) / √(σ1²/(2n1) + σ2²/(2n2))
where σ1² and σ2² are the population variances and n1 and n2 are the sample sizes for sample 1 and sample 2 respectively. The sample variance is used as an estimate of the population variance when it is not known. Consider a farmer who cultivates two sets of plots whose variability is as follows:
We have to test whether the difference in variability between the two sets of plots is significant. The null hypothesis can be stated as H0: there is no significant difference between the variability of the two sets of plots, i.e., σ1 = σ2, against the alternative hypothesis, H1: the two sets of plots have significantly different variability, i.e., σ1 ≠ σ2. We have s1 = 34, s2 = 28, n1 = 40, n2 = 60, σ1² = 34² = 1156 and σ2² = 28² = 784. Substituting these values in the test statistic we get z ≈ 1.31.
This value is less than the tabulated value at the 0.05 level of significance, which is 1.96. Hence, we cannot reject the null hypothesis, and we conclude that the difference between the variability of the two sets of plots is not significant.
In R, it can be done as follows:
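A minimal sketch in base R, using the figures of the farmer example and the sample variances as estimates of the population variances, could be written as follows. The variable names are illustrative.

```r
s1 <- 34; n1 <- 40   # standard deviation and size of the first set of plots
s2 <- 28; n2 <- 60   # standard deviation and size of the second set of plots
alpha <- 0.05        # significance level

# Large-sample standard error of the difference of standard deviations.
se <- sqrt(s1^2 / (2 * n1) + s2^2 / (2 * n2))

# Test statistic, two-tailed critical value and p-value.
z <- (s1 - s2) / se
z_crit <- qnorm(1 - alpha / 2)
p_value <- 2 * pnorm(-abs(z))

print(round(c(z = z, critical = z_crit, p_value = p_value), 4))

# TRUE would mean H0 is rejected at level alpha.
reject_h0 <- abs(z) > z_crit
print(reject_h0)
```

The computed z is about 1.31, below the critical value of 1.96, so the null hypothesis is not rejected.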