Application of Chi-Square Distribution

In statistics, distribution theory underpins hypothesis testing. Of the many distributions in use, the Chi-Square distribution is among the most widely used for testing. An interesting fact is that it appears in both parametric and non-parametric tests.

A Chi-Square variable with one degree of freedom is the square of a standard normal variable, and the Chi-Square distribution has an additive property: the sum of two independent Chi-Square variables is again a Chi-Square variable, with degrees of freedom equal to the sum of the two. Let's discuss the different uses of the Chi-Square distribution in hypothesis testing in real-life situations.
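Both properties can be checked numerically. The following is a minimal base R sketch; the evaluation points and the degrees of freedom (2 and 3) are arbitrary choices for illustration, not values from the article.

```r
# Check 1: if Z ~ N(0, 1), then P(Z^2 <= q) = P(-sqrt(q) <= Z <= sqrt(q)),
# which should equal the Chi-Square(1) CDF evaluated at q.
q <- c(0.5, 1, 2.5, 4)                      # arbitrary evaluation points
cdf_chisq1 <- pchisq(q, df = 1)
cdf_from_normal <- 2 * pnorm(sqrt(q)) - 1
print(all.equal(cdf_chisq1, cdf_from_normal))

# Check 2 (additivity): the density of the sum of independent Chi-Square(2)
# and Chi-Square(3) variables is the convolution of the two densities,
# which should match the Chi-Square(5) density.
x <- 4                                      # arbitrary evaluation point
conv_density <- integrate(
  function(t) dchisq(t, df = 2) * dchisq(x - t, df = 3),
  lower = 0, upper = x
)$value
print(c(convolution = conv_density, chisq5 = dchisq(x, df = 5)))
```

The second check uses numerical integration, so the two densities agree only up to the integration error.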

Testing for a Population Variance:

Suppose we have two cricket teams, A and B. Team A has a good player, David, and team B has a good player, Rohit. Assume that Rohit is better than David. From this alone we cannot conclude that team B is more likely to win than team A.

It may be that the teamwork in team A is much better than in team B. Similarly, to understand the underlying distribution of a population, it is not enough to test the mean; we have to test the variance as well.

Now, consider a random sample of size 10: 205, 203, 191, 196, 200, 201, 200, 200, 200, 198. We want to test whether the population variance is equal to 6 or not. The hypotheses are

$H_0: \sigma^2 = \sigma_0^2\ (= 6)$ vs $H_1: \sigma^2 \neq \sigma_0^2\ (= 6)$. The test statistic is $\chi^2 = (n-1)s^2/\sigma_0^2$, which follows the Chi-Square distribution with $n-1$ degrees of freedom under the null hypothesis, where $n$ is the sample size and $s^2$ is the sample variance. Now, let's see how we can easily perform the test in R using the library EnvStats.
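The statistic can also be computed by hand in base R, which makes the formula explicit. This is a sketch of the computation rather than the article's original code; the variable names are illustrative, and the EnvStats call in the final comment is an assumption about how the packaged test would be invoked.

```r
# Sample from the text and the hypothesised variance under H0
x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
sigma0_sq <- 6

n <- length(x)                      # sample size
s_sq <- var(x)                      # sample variance (divisor n - 1)
chi_sq_stat <- (n - 1) * s_sq / sigma0_sq
print(chi_sq_stat)

# Assumed packaged equivalent, if EnvStats is installed:
# EnvStats::varTest(x, alternative = "two.sided", sigma.squared = 6)
```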

So, the value of the test statistic is 22.06667. Next, we find the critical values of the $\chi^2$ distribution with 9 degrees of freedom.

Using qchisq(0.05/2, 9, lower.tail = FALSE) and qchisq(0.05/2, 9, lower.tail = TRUE), we get the critical region [0, 2.700389) ∪ (19.02277, ∞); the lower end is 0 because a Chi-Square variable cannot be negative. The value of the test statistic falls in the critical region, so we reject the null hypothesis that the variance is equal to 6 at the 5% level of significance.
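The decision rule can be written out as a short base R sketch. The statistic is entered as the value quoted in the text; the variable names are illustrative.

```r
alpha <- 0.05
df <- 9                              # n - 1 for a sample of size 10

# Two-sided critical values: alpha / 2 in each tail
lower_crit <- qchisq(alpha / 2, df, lower.tail = TRUE)
upper_crit <- qchisq(alpha / 2, df, lower.tail = FALSE)

chi_sq_stat <- 22.06667              # value of the statistic quoted in the text
reject <- chi_sq_stat < lower_crit || chi_sq_stat > upper_crit

print(c(lower = lower_crit, upper = upper_crit))
print(reject)
```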

We can also decide whether to reject the null hypothesis based on the p-value: we reject when the p-value is smaller than the chosen level of significance. The alternative hypothesis can be left-sided, right-sided or two-sided, depending on the question, and the confidence level can be changed as required.
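A minimal base R sketch of the p-values for the three alternatives follows. Doubling the smaller tail probability is one common convention for the two-sided p-value and is an assumption here, not necessarily what a given package reports.

```r
x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
sigma0_sq <- 6
df <- length(x) - 1
stat <- df * var(x) / sigma0_sq

p_greater <- pchisq(stat, df, lower.tail = FALSE)  # H1: variance > 6
p_less <- pchisq(stat, df, lower.tail = TRUE)      # H1: variance < 6
p_two_sided <- 2 * min(p_less, p_greater)          # H1: variance != 6 (assumed convention)

print(c(greater = p_greater, less = p_less, two_sided = p_two_sided))
```

With these data, the right-sided and two-sided p-values fall below 0.05, while the left-sided one does not.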

Test for Statistical Independence:

If we have two categorical variables and want to find out whether there is any statistical dependency between them, we use the Chi-Square test. Suppose we want to check whether smoking is associated with cancer. Note that the test detects association, not causation. We have data in a table format where we have the frequencies as follows,

Now, we have to check whether smoking and having cancer are statistically independent or not. The test statistic is

$$\chi^2 = \sum_{i=1}^{l} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

which follows the Chi-Square distribution with $(l-1)(k-1)$ degrees of freedom under the null hypothesis, where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies of the $(i,j)$th cell, and $l$ and $k$ are the numbers of levels of the two categorical variables. Here we assume that each $E_{ij}$ is at least 5. For a 2 × 2 table, Yates' continuity correction is commonly applied, and chisq.test() applies it by default in that case.

Using R we can easily perform the test with the function chisq.test() from the stats package as follows,

Here the p-value is 0.1128901, which is greater than 0.05, so we cannot reject the null hypothesis of independence at the 5% level of significance.

The above example uses a 2 × 2 contingency table, but the same test applies to an l × k contingency table for two categorical variables.
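To make the mechanics concrete, here is a base R sketch on a HYPOTHETICAL 2 × 2 table. The counts and labels are invented for illustration; they are not the article's data and do not reproduce the p-value quoted above. The sketch computes the statistic by hand and compares it with chisq.test().

```r
# HYPOTHETICAL counts, for illustration only (not the article's data)
tab <- matrix(c(30, 20,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoker = c("yes", "no"),
                              cancer = c("yes", "no")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
manual_stat <- sum((tab - expected)^2 / expected)
manual_df <- (nrow(tab) - 1) * (ncol(tab) - 1)
manual_p <- pchisq(manual_stat, manual_df, lower.tail = FALSE)

# chisq.test() without the continuity correction matches the manual result
res <- chisq.test(tab, correct = FALSE)
print(c(manual = manual_stat, chisq.test = unname(res$statistic)))
print(c(manual = manual_p, chisq.test = res$p.value))

# Default call: Yates' continuity correction is applied for 2 x 2 tables
res_yates <- chisq.test(tab)
print(unname(res_yates$statistic))
```

The same code works for a larger table; only the matrix changes, and the continuity correction no longer applies.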

Test for Goodness of Fit:

From the name of this test, one can easily understand its purpose. In this test, there is one categorical variable to which we fit a model and then check how good the fit is. Suppose one has three dice and records how many sixes appear when they are thrown simultaneously, repeating this 100 times. Suppose the data obtained are,

Now we want to check whether the dice are fair or not. If the dice are fair, then the number of sixes in a throw follows the binomial distribution with n = 3 and p = 1/6, so this is our null distribution. We use R to check whether Binomial(3, 1/6) is a good fit for the data. Here the null hypothesis is that the dice are fair.

The statistic we use here is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

which follows the Chi-Square distribution with $k-1$ degrees of freedom under the null hypothesis, where $O_i$ and $E_i$ are the observed and expected frequencies for the $i$th level of the variable and $k$ is the number of levels of the categorical variable (here $k = 4$, since the number of sixes can be 0, 1, 2 or 3). So, the result is as follows,

Here the p-value of the test is less than 0.05, so we reject the null hypothesis at the 5% level of significance and conclude that the binomial fit is not good. This is how a goodness of fit test is performed for a categorical variable.
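The procedure can be sketched in base R as below. The observed counts are HYPOTHETICAL, invented for illustration because the article's data table is not reproduced here; only the Binomial(3, 1/6) null distribution comes from the text.

```r
# HYPOTHETICAL observed counts of 0, 1, 2 and 3 sixes in 100 throws
observed <- c(47, 35, 15, 3)
names(observed) <- 0:3

# Null probabilities under fair dice: Binomial(3, 1/6)
probs <- dbinom(0:3, size = 3, prob = 1 / 6)
expected <- sum(observed) * probs

# Manual statistic, degrees of freedom and p-value
stat <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(stat, df, lower.tail = FALSE)

# Same test via chisq.test(); the expected count for three sixes is below 5,
# so R would warn that the approximation may be poor (suppressed here)
res <- suppressWarnings(chisq.test(observed, p = probs))
print(round(expected, 2))
print(c(statistic = unname(res$statistic), df = unname(res$parameter),
        p.value = res$p.value))
```

When an expected count is this small, merging sparse categories or calling chisq.test() with simulate.p.value = TRUE are common ways to avoid relying on the approximation.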

Test for Homogeneity:

Suppose there are TV channels for movies, sports, news, and music. We want to check whether the distribution of channel preference is identical for adult males and females or differs between them. This is a test of the homogeneity of TV viewing across the adult male and female populations.

So, basically, we have one categorical variable with several levels. We have samples from two or more different populations, and we want to check whether the populations are distributed identically over the levels of the variable. The null hypothesis is that the distribution over the levels is identical in all populations, and the statistic used to test it is

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies of the $j$th level of the variable in the $i$th population. Suppose there are $n$ levels and $m$ populations. Then $\chi^2$ follows the Chi-Square distribution with $(m-1)(n-1)$ degrees of freedom under the null hypothesis. Here, too, the data form an m × n contingency table.

Consider the following data:

We want to check whether the types of TV channels watched are distributed identically or differently between males and females. We have 100 males and 100 females in the sample. Here we apply the Chi-Square test using R.

Running the above code gives the result:

Here the p-value is less than 0.01, so we reject the null hypothesis at the 1% level of significance. We conclude that the male and female populations are not homogeneous in the TV channels they watch.
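A homogeneity test of this kind can be sketched in base R as follows. The counts are HYPOTHETICAL, chosen only so that each group has 100 people as in the text; they are not the article's data.

```r
# HYPOTHETICAL counts: rows are populations, columns are channel types
tab <- rbind(male = c(30, 40, 20, 10),
             female = c(40, 10, 20, 30))
colnames(tab) <- c("movies", "sports", "news", "music")

res <- chisq.test(tab)

print(res$expected)                 # expected counts under homogeneity
print(c(statistic = unname(res$statistic),
        df = unname(res$parameter), # (m - 1)(n - 1) = (2 - 1)(4 - 1) = 3
        p.value = res$p.value))
```

The computation is the same as for the independence test; what differs is the sampling design, since here the group sizes are fixed in advance.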

So, these are the Chi-Square tests. For all the non-parametric tests that use the Chi-Square statistic, the testing procedure is more or less the same, with slight changes depending on the hypothesis being tested.

