# Effect Of Multicollinearity and VIF in R

### Problems Due To Multicollinearity

We start the topic of multicollinearity with a light-hearted example that illustrates what multicollinearity is and what its effect can be. One day Ram's father asked him, "Ram, why is your bank balance so low?" Ram became afraid and listed several different issues. His father slowly noted down all the issues and asked him to identify the most important ones, the main reasons for the low bank balance. But Ram was unable to determine the important issues behind this situation because of the relationships among all the issues he had listed.

This is the main idea behind multicollinearity. Collinearity means a linear relationship, and multicollinearity means a linear relationship among multiple variables. Here the low bank balance is to be explained by several causes: the low bank balance is the explained (dependent) variable, and the causes are the explanatory (independent) variables. The trouble is that there are linear relationships among the explanatory variables themselves.

In Statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the multiple regressions may change erratically in response to small changes in the model or the data.

Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

Now, before going to the definition of multicollinearity, we will discuss collinearity. Collinearity is a linear association between *two* explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them.

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if the correlation between two independent variables is equal to 1 or −1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is an approximately linear relationship between two or more independent variables.

Now we discuss this mathematically. Consider first the model with a single explanatory variable (k = 1),

y_{i} = β_{0} + β_{1}x_{i1} + ε_{i}.

Here collinearity means the linear relationship between y and x_{1}, which is measured by the correlation coefficient r and lies in [−1, 1]. Now consider the situation where there are k explanatory variables, so the regression model looks like

y_{i} = β_{0} + β_{1}x_{i1} + β_{2}x_{i2} + ⋯ + β_{k}x_{ik} + ε_{i}.

Mathematically, a set of variables is multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

λ_{0} + λ_{1}x_{i1} + λ_{2}x_{i2} + ⋯ + λ_{k}x_{ik} = 0

holding for all observations i = 1, 2, …, N, where the λ_{j} are constants (not all zero) and x_{ij} is the i^{th} observation of the j^{th} explanatory variable. The ordinary least squares estimates involve inverting the matrix X^{T}X, where X

is an *N* × (*k*+1) matrix, where *N* is the number of observations and *k* is the number of explanatory variables (with *N* required to be greater than or equal to *k*+1). If there is an exact linear relationship (perfect multicollinearity) among the independent variables, at least one of the columns of X is a linear combination of the others, and so the rank of X (and therefore of X^{T}X) is less than *k*+1, and the matrix X^{T}X will not be invertible.
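This rank deficiency is easy to verify numerically. The following sketch (using made-up numbers, not the article's data) builds a design matrix whose third column is an exact multiple of the second, and checks the rank of X^{T}X:

```r
# Minimal sketch: a design matrix with one column an exact linear
# combination of another is rank deficient, so X'X cannot be inverted.
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1            # perfect collinearity: x2 = 2 * x1
X  <- cbind(1, x1, x2)  # N x (k+1) design matrix with intercept, k = 2

XtX <- t(X) %*% X
qr(XtX)$rank            # rank 2, less than k + 1 = 3
# solve(XtX) would fail here, since the matrix is exactly singular
```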

Perfect multicollinearity is fairly common when working with raw data sets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied.

__Effects of Multicollinearity__

Now we discuss the effects of multicollinearity. If two independent variables contain essentially the same information, one gains little by using both in the regression model. Moreover, multicollinearity leads to unstable estimates, because it tends to inflate the variances of the regression coefficients. We now examine these effects in more detail.
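A small simulation (with assumed data, not from the article) makes the variance inflation visible: the same slope is estimated far less precisely when a near-copy of the predictor is also in the model.

```r
# Sketch: compare the standard error of x1's coefficient when the second
# predictor is nearly collinear with x1 versus independent of it.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2_collinear   <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
x2_independent <- rnorm(n)                  # unrelated to x1
y_col <- 1 + 2 * x1 + 2 * x2_collinear   + rnorm(n)
y_ind <- 1 + 2 * x1 + 2 * x2_independent + rnorm(n)

se_col <- summary(lm(y_col ~ x1 + x2_collinear))$coefficients["x1", "Std. Error"]
se_ind <- summary(lm(y_ind ~ x1 + x2_independent))$coefficients["x1", "Std. Error"]
c(collinear = se_col, independent = se_ind)  # se_col is far larger
```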

One consequence of a high degree of multicollinearity is that, even if the matrix X^{T}X is invertible, a computer algorithm may fail to obtain an approximate inverse, and if it does obtain one, the result may be numerically inaccurate. But even when an accurate inverse is available, the following consequences arise:

In the presence of multicollinearity, the estimate of one variable’s impact on the dependent variable Y while controlling for the others tends to be less precise than if the predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in an independent variable, holding the other variables constant.

If one explanatory variable, say X_{1}, is highly correlated with another independent variable X_{2} in the given data set, then we have a set of observations for which X_{1} and X_{2} have a particular linear stochastic relationship. We don’t have a set of observations for which changes in X_{1} are independent of changes in X_{2}, so we have an imprecise estimate of the effect of independent changes in X_{1}.

In some sense, the collinear variables contain the same information about the dependent variable. If nominally “different” measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables are accorded different names and perhaps employ different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy.

One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable, a type II error.

Another issue with multicollinearity is that small changes to the input data can lead to large changes in the model, even resulting in changes of sign of parameter estimates.
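This instability can be demonstrated with a short simulation (again with assumed data): when two predictors are nearly identical, a small perturbation of the response can move the individual coefficient estimates substantially.

```r
# Sketch: refit the same near-collinear model after adding a small amount
# of noise to the response, and compare the coefficient estimates.
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # nearly identical to x1
y  <- x1 + x2 + rnorm(n)

b_before <- coef(lm(y ~ x1 + x2))
b_after  <- coef(lm(y + rnorm(n, sd = 0.1) ~ x1 + x2))
rbind(b_before, b_after)  # individual slopes can swing noticeably
```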

__Remedies of Multicollinearity__

Now we discuss the remedies of multicollinearity. The main solution is to keep only one of two independent variables that are highly correlated in the regression model. Also make sure you have not fallen into the dummy variable trap: including a dummy variable for every category (e.g., summer, autumn, winter, and spring) together with a constant term in the regression guarantees perfect multicollinearity.
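The dummy variable trap can be seen directly in a design matrix. The sketch below (with a hypothetical seasonal factor) shows that an intercept plus one dummy per season is rank deficient, because the dummies sum to the constant column:

```r
# Sketch of the dummy variable trap: one dummy for *every* season plus an
# intercept makes the design matrix rank deficient.
season <- factor(rep(c("summer", "autumn", "winter", "spring"), 3))
D <- model.matrix(~ season - 1)    # one dummy per season, no intercept
X_trap <- cbind(Intercept = 1, D)  # add the constant term back

qr(X_trap)$rank                    # rank 4, not 5: perfect multicollinearity
# lm() avoids the trap automatically by dropping one reference level.
```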

Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically, you should obtain somewhat higher variance from the smaller data sets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but pay attention to how much they vary.
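A minimal sketch of this subset check, using random halves of a simulated data frame (the data and variable names here are illustrative):

```r
# Fit the same model on two random halves of the data and on the full
# data, then compare how stable the coefficient estimates are.
set.seed(3)
df_sim <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
idx    <- sample(nrow(df_sim), nrow(df_sim) / 2)

coef_half1 <- coef(lm(y ~ ., data = df_sim[idx, ]))
coef_half2 <- coef(lm(y ~ ., data = df_sim[-idx, ]))
coef_full  <- coef(lm(y ~ ., data = df_sim))
round(rbind(coef_half1, coef_half2, coef_full), 3)  # compare stability
```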

Leave the model as it is, despite multicollinearity. The presence of multicollinearity doesn’t reduce the accuracy of extrapolating the fitted model to new data, provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.

Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you’ve dropped a variable). The omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.

Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula for the variance inflation factor, which expresses the variance of a regression coefficient estimate in terms of the sample size and the degree of multicollinearity. As a common rule of thumb, if the VIF of a predictor exceeds 10, the model is considered seriously affected by multicollinearity.
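The VIF of a predictor can also be computed by hand as 1 / (1 − R_{j}^{2}), where R_{j}^{2} comes from regressing that predictor on all the other predictors. A quick sketch with simulated data (the article itself uses `faraway::vif` on the `divusa` data below):

```r
# Compute a VIF manually: regress x1 on the other predictors, take R^2,
# and apply VIF = 1 / (1 - R^2). x1 and x2 are built to be correlated.
set.seed(4)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well above 1, because x1 and x2 are correlated
```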

Now we take an example using real-life data on divorce in the USA from 1920 to 1996 (the `divusa` data set from the `faraway` package), and we show the effect of multicollinearity.

**Code:**

```r
install.packages("faraway")
library(faraway)
str(divusa)
## keep the dependent variable and the independent variables together ##
df <- data.frame(divusa[, 2:7])
head(df)
## round the correlations to 2 decimal places ##
round(cor(df), 2)
library(GGally)
## draw a corrgram to visualize the correlation matrix ##
ggcorr(df)
## now we fit the multiple linear regression ##
model <- lm(divorce ~ ., data = df)
## the low p-value indicates that the model is significant ##
summary(model)
## check the effect of multicollinearity: if VIF > 10, ##
## multicollinearity is a serious problem ##
vif(model)
```

**Output:**

```
str(divusa)
'data.frame':	77 obs. of  7 variables:
 $ year      : int  1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
 $ divorce   : num  8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
 $ unemployed: num  5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
 $ femlab    : num  22.7 22.8 22.9 23 23.1 ...
 $ marriage  : num  92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
 $ birth     : num  118 120 111 110 111 ...
 $ military  : num  3.22 3.56 2.46 2.21 2.29 ...

df <- data.frame(divusa[, 2:7])
head(df)
  divorce unemployed femlab marriage birth military
1     8.0        5.2  22.70     92.0 117.9   3.2247
2     7.2       11.7  22.79     83.0 119.8   3.5614
3     6.6        6.7  22.88     79.7 111.2   2.4553
4     7.1        2.4  22.97     85.2 110.5   2.2065
5     7.2        5.0  23.06     80.3 110.9   2.2889
6     7.2        3.2  23.15     79.2 106.6   2.1735

round(cor(df), 2)
           divorce unemployed femlab marriage birth military
divorce       1.00      -0.21   0.91    -0.53 -0.72     0.02
unemployed   -0.21       1.00  -0.26    -0.27 -0.31    -0.40
femlab        0.91      -0.26   1.00    -0.65 -0.60     0.05
marriage     -0.53      -0.27  -0.65     1.00  0.67     0.26
birth        -0.72      -0.31  -0.60     0.67  1.00     0.14
military      0.02      -0.40   0.05     0.26  0.14     1.00
```

```r
library(GGally)
ggcorr(df)
```

```
model <- lm(divorce ~ ., data = df)
summary(model)

Call:
lm(formula = divorce ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8611 -0.8916 -0.0496  0.8650  3.8300

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.48784    3.39378   0.733   0.4659
unemployed  -0.11125    0.05592  -1.989   0.0505 .
femlab       0.38365    0.03059  12.543  < 2e-16 ***
marriage     0.11867    0.02441   4.861 6.77e-06 ***
birth       -0.12996    0.01560  -8.333 4.03e-12 ***
military    -0.02673    0.01425  -1.876   0.0647 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.65 on 71 degrees of freedom
Multiple R-squared:  0.9208,	Adjusted R-squared:  0.9152
F-statistic: 165.1 on 5 and 71 DF,  p-value: < 2.2e-16
```

```
vif(model)
unemployed     femlab   marriage      birth   military
  2.252888   3.613276   2.864864   2.585485   1.249596
```