# Application of Chi-Square Distribution

In statistics, distribution theory plays a central role in hypothesis testing. Among the many distributions, the Chi-Square distribution is one of the most important for testing. An interesting fact about its use is that the Chi-Square distribution appears in both parametric and non-parametric tests.

Basically, a Chi-Square variable with one degree of freedom is the square of a standard normal variable, and the Chi-Square distribution has the additive property (the sum of two independent Chi-Square variables is also a Chi-Square variable, with the degrees of freedom added). Let's discuss the different uses of the Chi-Square distribution in hypothesis testing in real-life situations.
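Both facts can be checked numerically in base R (a quick sketch; the variable names are mine). The first check uses the exact identity P(Z² ≤ q) = 2Φ(√q) − 1, and the second approximates the additive property by simulation:

```r
# A Chi-Square(1) variable is the square of a standard normal:
# P(Z^2 <= q) = P(-sqrt(q) <= Z <= sqrt(q)) = 2*pnorm(sqrt(q)) - 1
q <- 1.5
all.equal(pchisq(q, df = 1), 2 * pnorm(sqrt(q)) - 1)  # TRUE

# Additive property: the sum of two independent Chi-Square(1)
# variables behaves like a Chi-Square(2) variable (1 + 1 = 2 df).
set.seed(1)
z1 <- rnorm(1e5)^2   # Chi-Square with 1 df
z2 <- rnorm(1e5)^2   # independent Chi-Square with 1 df
s  <- z1 + z2
mean(s)              # close to 2, the mean of a Chi-Square(2) variable
```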

**Testing for a Population Variance:**

Suppose we have two cricket teams, A and B. Team A has a good player, David, and team B has a good player, Rohit. Even if Rohit is better than David, we cannot conclude from this alone that team B is more likely to win than team A.

It may happen that the team understanding in team A is much better than in team B. Similarly, to understand the underlying distribution of a population, testing the mean alone is not enough; we also have to test the variance.

Now, consider a random sample of size 10: 205, 203, 191, 196, 200, 201, 200, 200, 200, 198. We want to test whether the population variance is equal to 6 or not. The hypotheses for this test are:

**H₀: σ² = σ₀² (= 6)** vs **H₁: σ² ≠ σ₀² (= 6)**

Here the test statistic is

**χ² = (n − 1)s² / σ₀²**

which follows the Chi-Square distribution with df (n − 1) under the null hypothesis, where n is the sample size and s² is the sample variance. Now, let's see how easily we can perform the test in R using the library *EnvStats*.

```r
library(EnvStats)

x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
varTest(x, sigma.squared = 6, alternative = "two.sided", conf.level = 0.95)
```

```
> library(EnvStats)
> x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
> varTest(x, sigma.squared = 6, alternative = "two.sided", conf.level = 0.95)

Results of Hypothesis Test
--------------------------

Null Hypothesis:                 variance = 6

Alternative Hypothesis:          True variance is not equal to 6

Test Name:                       Chi-Squared Test on Variance

Estimated Parameter(s):          variance = 14.71111

Data:                            x

Test Statistic:                  Chi-Squared = 22.06667

Test Statistic Parameter:        df = 9

P-value:                         0.01734015

95% Confidence Interval:         LCL = 6.960081
                                 UCL = 49.029964
```

So, here the value of the test statistic is 22.06667. Now, we have to find the critical values of the χ² distribution with df 9. Using **qchisq(0.05/2, 9, lower.tail = F)** and **qchisq(0.05/2, 9, lower.tail = T)**, we get the critical region **{(0, 2.700389) ∪ (19.02277, ∞)}**. The value of the test statistic falls in the critical region, so we reject the null hypothesis that the variance is equal to 6 at the 5% level of significance.
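The same numbers can be reproduced in base R without EnvStats, which makes the formula χ² = (n − 1)s²/σ₀² explicit (a sketch using only base functions; the variable names are mine):

```r
x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
n <- length(x)
sigma0_sq <- 6

# Test statistic: (n - 1) * s^2 / sigma0^2
stat <- (n - 1) * var(x) / sigma0_sq   # 22.06667

# Two-sided p-value: twice the smaller tail probability
p_value <- 2 * min(pchisq(stat, df = n - 1),
                   pchisq(stat, df = n - 1, lower.tail = FALSE))

# Critical values at the 5% level of significance
lower <- qchisq(0.025, df = n - 1)                      # 2.700389
upper <- qchisq(0.025, df = n - 1, lower.tail = FALSE)  # 19.02277

round(c(statistic = stat, p.value = p_value), 5)
```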

We can also decide whether to reject the null hypothesis based on the p-value. The alternative hypothesis can be made left-sided, right-sided, or two-sided depending on the question, and the confidence level can be changed as required.

**Test for statistical independence:**

If we have two categorical variables and we want to find out whether there is any statistical dependency between them, we use the Chi-Square test of independence. Suppose we want to check whether smoking is associated with cancer. We have the frequencies in a table as follows:

|            | Cancer | No Cancer |
|------------|--------|-----------|
| Smoker     | 35     | 40        |
| Non-Smoker | 30     | 60        |

Now, we have to check whether smoking and having cancer are statistically independent. Here the test statistic is

**χ² = Σᵢ Σⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ**

which follows the Chi-Square distribution with df (l − 1)(k − 1) under the null hypothesis, where *Oᵢⱼ* and *Eᵢⱼ* are the observed and expected values of the (i, j)-th cell, and l and k are the numbers of levels of the two categorical variables. Here we assume that each *Eᵢⱼ* is at least 5; otherwise we use Yates' correction while testing.

Using R, we can easily perform the test with the function chisq.test() from the *stats* library as follows,

```r
library(stats)

data <- as.data.frame(matrix(c(35, 30, 40, 60), nrow = 2))
names(data) <- c("Cancer", "No Cancer")
row.names(data) <- c("Smoker", "Non-Smoker")
data
chisq.test(data, correct = TRUE)  # the argument is `correct`, not `correction`
```

```
> library(stats)
> data <- as.data.frame(matrix(c(35, 30, 40, 60), nrow = 2))
> names(data) <- c("Cancer", "No Cancer")
> row.names(data) <- c("Smoker", "Non-Smoker")
> data
           Cancer No Cancer
Smoker         35        40
Non-Smoker     30        60
> chisq.test(data, correct = TRUE)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Pearson's Chi-squared test with
                                 Yates' continuity correction

Data:                            data

Test Statistic:                  X-squared = 2.513288

Test Statistic Parameter:        df = 1

P-value:                         0.1128901
```

Here the p-value is 0.1128901, which is greater than 0.05, so we cannot reject the null hypothesis of independence at the 5% level of significance.

The above example uses a 2 × 2 contingency table, but the same test works for any k × l contingency table of two categorical variables.
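The expected frequencies Eᵢⱼ behind chisq.test() can be computed by hand as (row total × column total) / grand total; the sketch below checks this against the `$expected` component that chisq.test() returns:

```r
obs <- matrix(c(35, 40, 30, 60), nrow = 2, byrow = TRUE,
              dimnames = list(c("Smoker", "Non-Smoker"),
                              c("Cancer", "No Cancer")))

# Expected count of cell (i, j) = row_i total * col_j total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# chisq.test() stores the same expected counts
all.equal(expected, chisq.test(obs)$expected)  # TRUE
```

Here every expected count is above 5 (the smallest is 75 × 65 / 165 ≈ 29.5), so the assumption behind the test is satisfied.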

**Test for Goodness of Fit:**

From the name of this test, one can easily understand its purpose: there is one categorical variable, we fit a model to it, and we check how good the fit is. Suppose someone has 3 dice, throws them simultaneously 100 times, and records how many sixes appear on each throw. Suppose the data obtained are:

| Number of sixes | 0  | 1  | 2  | 3  |
|-----------------|----|----|----|----|
| Frequency       | 37 | 29 | 24 | 10 |

Now we want to check whether the dice are fair. If the dice are fair, then the number of sixes per throw follows the binomial distribution with n = 3, p = 1/6, so this is our null distribution. We use R to check whether the Binomial(3, 1/6) model is a good fit. Here the null hypothesis is that the dice are fair.

```r
x <- c(37, 29, 24, 10)
nullprob <- dbinom(c(0, 1, 2, 3), 3, 1/6)
chisq.test(x, p = nullprob)
```

The statistic we use here is

**χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ**

which follows the Chi-Square distribution with df (n − 1) under the null hypothesis, where *Oᵢ* and *Eᵢ* are the observed and the expected values for the i-th level of the variable and n is the number of levels of the categorical variable. So, the result is as follows,

```
> x <- c(37, 29, 24, 10)
> nullprob <- dbinom(c(0, 1, 2, 3), 3, 1/6)
> chisq.test(x, p = nullprob)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Chi-squared test for given probabilities

Data:                            x

Test Statistic:                  X-squared = 246.8211

Test Statistic Parameter:        df = 3

P-value:                         3.186801e-53
```

Here the p-value of the test is far less than 0.05, so we reject the null hypothesis at the 5% level of significance and conclude that the Binomial fit is not good. This is how a goodness-of-fit test is performed for a categorical variable.
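The statistic can also be built up by hand from the formula above (a sketch; the expected counts are just 100 times the binomial probabilities):

```r
x <- c(37, 29, 24, 10)                    # observed counts of 0, 1, 2, 3 sixes
nullprob <- dbinom(0:3, size = 3, prob = 1/6)
expected <- 100 * nullprob                # expected counts under fair dice

stat <- sum((x - expected)^2 / expected)  # the X-squared statistic
pchisq(stat, df = 3, lower.tail = FALSE)  # the p-value
```

Note that two of the expected counts (about 6.9 and 0.46) fall below 5, which is why chisq.test() issues a warning that the Chi-Square approximation may be inaccurate for these data.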

**Test for Homogeneity:**

Suppose there are TV channels for movies, sports, news, and music. We want to check whether the distributions of channel preference are identical between adult males and adult females, or differ. This is a test of the homogeneity of TV watching in the adult male and female populations.

So, basically, we have one categorical variable with several levels, and samples from two or more populations, and we want to check whether the populations are distributed identically among the levels of the variable. The statistic for testing the null hypothesis, that the distribution of the levels of the variable is identical across the populations, is

**χ² = Σᵢ Σⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ**

where *Oᵢⱼ* and *Eᵢⱼ* are the observed and expected frequencies of the j-th level of the variable in the i-th population. If there are n levels and m populations, then χ² follows the Chi-Square distribution with df (m − 1)(n − 1) under the null hypothesis. Here we also get an m × n contingency table.

Consider the following data:

|         | Movies | Sports | News | Music |
|---------|--------|--------|------|-------|
| Males   | 28     | 20     | 39   | 13    |
| Females | 35     | 11     | 19   | 35    |

We want to check whether the preferred types of TV channels are identical between males and females. We have 100 males and 100 females in the sample. Here we use the Chi-Square test in R.

```r
males <- c(28, 20, 39, 13)
females <- c(35, 11, 19, 35)

data <- as.data.frame(rbind(males, females))
names(data) <- c("Movies", "Sports", "News", "Music")
data
chisq.test(data)
```

Running the above code, we get the result:

```
> males <- c(28, 20, 39, 13)
> females <- c(35, 11, 19, 35)
> data <- as.data.frame(rbind(males, females))
> names(data) <- c("Movies", "Sports", "News", "Music")
> data
        Movies Sports News Music
males       28     20   39    13
females     35     11   19    35
> chisq.test(data)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Pearson's Chi-squared test

Data:                            data

Test Statistic:                  X-squared = 20.37057

Test Statistic Parameter:        df = 3

P-value:                         0.0001422209
```

Here the p-value is less than 0.01, so we reject the null hypothesis at the 1% level of significance and conclude that the male and female populations are not homogeneous in their TV-channel preferences.
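Since both samples have size 100, the expected frequency of each level in each population is simply half the column total; the sketch below rebuilds the statistic by hand from the formula above:

```r
males   <- c(28, 20, 39, 13)
females <- c(35, 11, 19, 35)
obs <- rbind(males, females)

# Expected frequency E_ij = row_i total * col_j total / grand total;
# with equal sample sizes of 100 this is just column_total / 2.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

stat <- sum((obs - expected)^2 / expected)  # the X-squared statistic
pchisq(stat, df = 3, lower.tail = FALSE)    # the p-value
```

This is the same arithmetic chisq.test() performs, which is why the test of homogeneity and the test of independence share one statistic; only the sampling scheme and the interpretation differ.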

So, these are the main Chi-Square tests. Basically, in all the non-parametric tests that use Chi-Square statistics, the testing procedure is more or less the same, with slight changes depending on the hypothesis of interest.