# What Is Heteroscedasticity in Regression Analysis

Linear regression analysis rests on several assumptions, some of which are particularly important. Let me state them first.

1. The regression model is linear in parameters.
2. The mean of residuals is zero.
3. Homoscedasticity of residuals or equal variance of residuals.
4. No perfect multicollinearity.
5. No autocorrelation of residuals.

Let’s come to the third point. We all know what a residual is: the difference between the actual value and the predicted value of the dependent variable. We assume that the residuals have equal variance. When this assumption is violated, that is, when the variance of the residuals is not equal (constant), the problem is called heteroscedasticity.

### What Causes Heteroscedasticity?

There are various causes for the presence of heteroscedasticity in our regression model. Some of them are:

1. The presence of outliers in our data.
2. Mixing observations with different scales of measurement (such as mixing high-income households with low-income households).
3. When we use an incorrect transformation of the data to perform the regression.

### What Are the Consequences of Heteroscedasticity?

1. If heteroscedasticity exists, our OLS (ordinary least squares) estimators are still unbiased and consistent, but they are no longer efficient. This makes our inferences about the regression less reliable.
2. If we have heteroscedasticity, we will get p-values that are smaller than they should be. This happens because heteroscedasticity increases the variance of the coefficient estimates, but the OLS procedure does not detect this increase. So we evaluate the t-values and F-values using an underestimated variance. As a result, we may conclude that a model is statistically significant when it actually is not.

### How to detect heteroscedasticity?

It is customary to check for heteroscedasticity of the residuals once you build a linear regression model. We can use either a graphical method or statistical tests to detect heteroscedasticity in our model. First, we will discuss the graphical method. I am going to use the R programming language and environment (RStudio) for the detection.

### Graphical method

Basically, what we do in this graphical method is develop a model for our data set and create a plot of the residuals. If the plot looks random, there is no heteroscedasticity. But if there is a specific, deterministic pattern (such as a fan shape), then heteroscedasticity is present in our model. It’s very simple. Okay! Let’s take the very popular “cars” data set and fit a model.
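The original R code is not shown here; a minimal sketch of fitting the model and plotting its residuals might look like this:

```r
# Fit a simple linear model on R's built-in cars data set
model <- lm(dist ~ speed, data = cars)

# Plot residuals against fitted values: a random scatter suggests
# homoscedasticity, a fan shape suggests heteroscedasticity
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```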

Let’s have a look at the plot we created. We can see randomness in the plot, so we can say that no heteroscedasticity is present. That’s great. But suppose heteroscedasticity were present in our data; what would the plot look like then? Typically, the spread of the residuals would widen or narrow systematically as the fitted values increase, producing a fan or cone shape.

### Statistical tests

Next, we will discuss a more theoretical approach to detecting heteroscedasticity. There are several statistical tests for detection. I will discuss two of them.

•  Breusch-Pagan Test
•  Goldfeld-Quandt Test

### Breusch-Pagan Test

The Breusch-Pagan test is a simple but powerful test. It can be used to detect whether one or more of the independent variables are causing heteroscedasticity. There are five steps to the Breusch-Pagan test.
Step 1: First, we run the regular regression model and collect the residuals.

Step 2: Then we estimate the variance of the residuals.

Step 3: We compute the squares of the standardized residuals.

Step 4: We fit another regression with all our independent variables, taking the squared standardized residuals as the dependent variable.

Step 5: We calculate the ESS (explained sum of squares) of this auxiliary regression, divide it by 2, and compare the result with the χ2 table’s critical value for the appropriate degrees of freedom, or we use the p-value. If the p-value is less than the significance level, we reject the null hypothesis that the variances of the residuals are equal.
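As a rough sketch (assuming the classic, non-studentized form of the test, with the cars data set and its single explanatory variable), the steps above could be carried out by hand in R like this:

```r
model <- lm(dist ~ speed, data = cars)    # Step 1: fit the model, collect residuals
u <- resid(model)
n <- length(u)

sigma2 <- sum(u^2) / n                    # Step 2: estimate the residual variance
g <- u^2 / sigma2                         # Step 3: squared standardized residuals

aux <- lm(g ~ speed, data = cars)         # Step 4: auxiliary regression
ess <- sum((fitted(aux) - mean(g))^2)     # Step 5: explained sum of squares

bp <- ess / 2                             # test statistic, ~ chi-square(k) under H0
p_value <- pchisq(bp, df = 1, lower.tail = FALSE)  # k = 1 explanatory variable
```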

Now in R, the task is very simple. Let’s have a look.

R-code:
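A minimal sketch using `bptest()` from the lmtest package (by default it runs the studentized version of the test):

```r
# install.packages("lmtest")  # if not already installed
library(lmtest)

model <- lm(dist ~ speed, data = cars)
bptest(model)   # prints the BP statistic, degrees of freedom, and p-value
```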

The p-value is 0.07297, which is greater than 0.05. So we can’t reject the null hypothesis. That is, we can conclude that there is no heteroscedasticity.

### Goldfeld-Quandt Test

Step 1. First, we arrange the data in ascending order of the independent variable Xj.

Step 2. We omit the middle observations (approximately 20%) of the sorted data and fit two separate regressions, one for small values of Xj and one for large values of Xj, and record the residual sum of squares (RSS) for each regression, say RSS1 for small values of Xj and RSS2 for large values of Xj.

Step 3. Then we will calculate the ratio F = RSS2/RSS1, which will follow an F distribution with d.f. = [n – d – 2(k+1)]/2 both in the numerator and the denominator, where d is the number of omitted observations, n is the total number of observations, and k is the number of explanatory variables.

Step 4. We reject H0 (the residuals’ variances are equal) if F > Fα with [n – d – 2(k+1)]/2 degrees of freedom in the numerator and denominator, or we can use the p-value to check.
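A hand-rolled sketch of these steps on the cars data set (the 20% omitted fraction and the split points are illustrative assumptions) might look like this:

```r
sorted <- cars[order(cars$speed), ]       # Step 1: sort by the independent variable
n <- nrow(sorted)                         # 50 observations
d <- round(0.2 * n)                       # Step 2: omit ~20% in the middle
m <- (n - d) / 2
lower <- sorted[1:m, ]                    # small values of speed
upper <- sorted[(n - m + 1):n, ]          # large values of speed

rss1 <- sum(resid(lm(dist ~ speed, data = lower))^2)
rss2 <- sum(resid(lm(dist ~ speed, data = upper))^2)

k <- 1                                    # one explanatory variable
df <- (n - d - 2 * (k + 1)) / 2           # Step 3: degrees of freedom
f_stat <- rss2 / rss1
p_value <- pf(f_stat, df, df, lower.tail = FALSE)  # Step 4
```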

Now in R, we will write:
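A minimal sketch using `gqtest()` from the lmtest package (note that by default it omits no middle observations):

```r
library(lmtest)

model <- lm(dist ~ speed, data = cars)
gqtest(model)   # prints the GQ statistic, degrees of freedom, and p-value
```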

Again, the p-value is 0.1498, which is greater than 0.05, so there is no heteroscedasticity. By both the graphical method and the statistical tests, we can conclude that our model is homoscedastic.

### How to Fix Heteroscedasticity

Once you find heteroscedasticity in your model, you should fix the issue. We can use a weighted least squares regression model or a transformation of the dependent variable.

### Weighted least square regression

In weighted regression, we assign a weight to each data point based on the variance of its fitted value. We give smaller weights to observations associated with higher variances to reduce their influence on the sum of squared residuals. Weighted regression then minimizes the sum of the weighted squared residuals. If we use the correct weights, heteroscedasticity is replaced by homoscedasticity. It’s a very good approach for removing heteroscedasticity.
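A common sketch, under the assumption that the error standard deviation grows roughly in proportion to the mean response (so the weights are taken as the inverse of the squared fitted values):

```r
ols <- lm(dist ~ speed, data = cars)

# Assumed weighting scheme: inverse of squared fitted values;
# other schemes are appropriate under other variance structures
w <- 1 / fitted(ols)^2

wls <- lm(dist ~ speed, data = cars, weights = w)
summary(wls)
```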

### Transform the dependent variable

Transforming the data is harder than WLS because it involves more manipulation, and the results are harder to interpret because the original units of the data are gone. What we do is transform our original data into different values that produce residuals with more similar variances.
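One common choice (an assumption here, not prescribed above) is a log transformation of the dependent variable, which often stabilizes a variance that grows with the mean:

```r
log_model <- lm(log(dist) ~ speed, data = cars)

# Re-check the residuals of the transformed model
plot(fitted(log_model), resid(log_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```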

Author: Kuntal Roy Chowdhury