Multiple regression models sometimes suffer from problems like multicollinearity and, due to the large number of variables, increased complexity of data collection and model maintenance. In this article, we will learn how we can use stepwise regression to overcome these challenges.
Our objective is to build regression models that are as complete and realistic as possible: we want every variable that is even remotely related to the dependent variable to be included. At the same time, we want to include as few variables as possible.
Theory and experience give us some direction as to which variables should be included in the regression model. Even so, manually filtering through and comparing regression models can be tedious.
Luckily, several approaches exist for automatically performing feature selection, that is, for identifying the variables that yield the best regression results. The process of determining this subset of predictor variables is called variable selection.
The three main approaches of variable selection are as follows:
The forward selection method begins with no candidate variables in the model. Then, at each step, we add the variable that most improves the fit (for example, the one with the lowest p-value or the largest reduction in AIC). The process stops when no remaining variable yields a significant improvement.
Forward selection is mostly used when a large group of variables exists. For example, suppose there are more than fifty variables in a data set. A reasonable approach would be to obtain the best n variables by forward selection and then apply the all-possible-subsets algorithm to that subset. This procedure is also a good choice when multicollinearity is a problem.
The backward selection (or backward elimination) method begins with all the variables believed to be potentially significant. Then, at each step, we eliminate the most insignificant variable. This continues until no insignificant variables remain and no further variable can be deleted without a statistically significant loss of fit.
Stepwise regression is a combination of forward and backward selection. It starts with no predictor variables and sequentially adds the variable that most reduces the sum of squared errors. At the same time, predictors already in the model are removed at later stages if they have become insignificant as a result of the inclusion of additional variables.
The process carries on until an equilibrium is reached: no significant reduction in the sum of squared residuals can be gained by adding a variable to the regression, and a significant increase in the sum of squared residuals would arise if any variable were removed.
Implementation Of Stepwise Regression In R
There are various packages for stepwise regression in R. Here I have shown two methods. Loading the necessary packages:
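The article does not show the setup code, so here is an assumed version. The packages follow from the functions used below (stepAIC, train, mape); the Boston housing data from MASS and the 80/20 split are assumptions inferred from the use of medv and a 14-column test set.

```r
library(MASS)     # stepAIC() and the Boston data set
library(caret)    # train() and trainControl()
library(leaps)    # backend required by caret's "leapSeq" method
library(Metrics)  # mape()

set.seed(123)  # assumed seed, for reproducibility only
data("Boston")

# Assumed 80/20 train/test split of the Boston housing data
train_index <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
train <- Boston[train_index, ]
test  <- Boston[-train_index, ]
```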
The first method uses the stepAIC() function from the MASS package, which chooses the best model by AIC. We use the option direction = "both" for stepwise regression; forward and backward selection can be performed by choosing "forward" and "backward" respectively. Fitting the model with all the predictor variables and saving it in
model_lm <- lm(medv ~ ., data = train)
This is the multiple linear regression model. The summary given below shows that model_lm contains all the predictor variables in the data set.
Checking the summary of the model
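The summary call itself is not shown in the article; it would simply be:

```r
summary(model_lm)  # coefficient estimates, p-values and R-squared for the full model
```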
Fitting the stepwise regression model and saving it
model_stepAIC <- stepAIC(model_lm, direction = "both", trace = FALSE)
From the summary of the stepwise regression model, it is clear that only the variables which are highly significant are considered in the updated model.
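The summary referred to above would be produced with:

```r
summary(model_stepAIC)  # only the predictors retained by stepwise selection appear here
```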
Now we predict the value of medv or House Price and save it in
Pred_model_stepAIC <- predict(model_stepAIC, test[, -14])
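These predictions can be scored the same way as the caret model later in the article. The variable name mape_stepAIC is my own; the article only defines mape_caret.

```r
# MAPE of the stepAIC model on the test set, using Metrics::mape(actual, predicted)
mape_stepAIC <- mape(test$medv, Pred_model_stepAIC)
mape_stepAIC
```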
In the second method, I have used the caret package with the option method = "leapSeq" to carry out stepwise regression. The other methods, to fit forward and backward selection, are given below:
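A sketch of those two variants: caret exposes them through the same interface, only the method string changes (the ctrl object here is a local stand-in for the 10-fold CV control defined later in the article).

```r
ctrl <- trainControl(method = "cv", number = 10)

# Forward selection via the leaps backend
model_forward <- train(medv ~ ., data = train, method = "leapForward",
                       tuneGrid = data.frame(nvmax = 1:10), trControl = ctrl)

# Backward elimination via the leaps backend
model_backward <- train(medv ~ ., data = train, method = "leapBackward",
                        tuneGrid = data.frame(nvmax = 1:10), trControl = ctrl)
```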
We use 10-fold cross-validation to estimate statistical error measures such as the RMSE and MAE. These estimates are used to compare the models and to automatically choose the best one.
We also specify the tuning parameter
train.control <- trainControl(method = "cv", number = 10)
Saving the model in model_caret
model_caret <- train(medv ~ ., data = train, method = "leapSeq", tuneGrid = data.frame(nvmax = 1:10), trControl = train.control)
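To see which subset size the cross-validation picked, we can inspect the fitted object:

```r
model_caret$results             # cross-validated RMSE and MAE for each value of nvmax
model_caret$bestTune            # the nvmax with the lowest RMSE
summary(model_caret$finalModel) # which predictors are kept at each subset size
```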
So the best number of predictor variables that minimizes the RMSE is 4. Now, predicting the values of House Price with the caret model and saving it in
Pred_model_caret <- predict(model_caret, test[, -14])
Calculating the MAPE (mean absolute percentage error) and saving it in mape_caret
mape_caret <- mape(test$medv, Pred_model_caret)
Advantages And Disadvantages
Advantages of Stepwise Regression:
- It is faster than most other automatic model-selection methods.
- It can manage the predictor variables in the regression model, eliminating or inserting them according to their significance.
Disadvantages of Stepwise Regression:
- With smaller data sets, the selected model may be unstable and end up with a higher sum of squared residuals out of sample.
- The method adds or removes variables in a fixed order, so we end up with one particular combination of predictors. That combination may not be the one closest to reality.
To know more about the other methods, see the links given below: