Ridge Regression and Its Application

In this article, we will learn the practical implementation, advantages, and disadvantages of ridge regression. Ordinary least squares (OLS) regression chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed values of the dependent variable and those predicted by the linear function. In other words, it tries to find the plane that minimizes the sum of squared errors (SSE) between the observed and predicted values.

OLS works quite well when its assumptions are fulfilled: a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normally distributed residuals, and little or no multicollinearity.

But in many real-life scenarios, these assumptions are violated, and we need alternative approaches. Penalized/regularized regression techniques such as ridge, lasso, and elastic net regression work very well in these cases. In this article, I explain ridge regression, a way of building regression models when the number of predictor variables in a dataset exceeds the number of observations, or when the data suffers from multicollinearity (independent variables are highly correlated).

Understanding The Regularization Method

Regularization methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization. L2 regularization works as a small addition to the OLS objective that penalizes the magnitude of the coefficients, making the parameter estimates more stable. The L2 penalty term, which equals λ times the sum of the squared coefficients (the squared magnitude of the coefficient vector), is given by

λ(β₁² + β₂² + … + βₚ²)

And the ridge regression objective adds this penalty to the OLS sum of squared errors:

Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²,  where ŷᵢ is the value predicted by the linear function.

The amount of the penalty can be fine-tuned using a constant called lambda (λ), and selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ grows towards ∞, the impact of the penalty increases and all coefficients are shrunk towards zero. The ideal penalty is therefore somewhere between 0 and ∞.

In this way, ridge regression constrains the magnitude of the coefficients, reduces their fluctuations, and progressively shrinks them towards zero, which reduces the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better, because it is less sensitive to extreme variation in the data such as outliers.

Note that, in contrast to ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge regression so that all the predictors are on the same scale; see the note and sketch below.
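
As a practical note, the glmnet() function used later in this article standardizes the predictors internally by default (standardize = TRUE), so explicit scaling is usually unnecessary there. If you build the design matrix yourself or use another implementation, a minimal sketch of explicit standardization looks like this (predictors below is an assumed numeric matrix of predictor columns, not an object from this article):
# Scale each predictor column to mean 0 and standard deviation 1
predictors_scaled <- scale(predictors)
# glmnet's internal standardization can also be controlled explicitly, e.g.
# fit <- glmnet(predictors_scaled, y, alpha = 0, standardize = FALSE)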

Advantages and Disadvantages Of Ridge Regression

Advantages 
  • Least squares regression doesn’t differentiate “important” from “less-important” predictors in a model, so it includes all of them. This can lead to overfitting and, when predictors outnumber observations, a failure to find a unique solution. Ridge regression avoids these problems.
  • Ridge regression works in part because it doesn’t require unbiased estimators; while least squares produces unbiased estimates, their variances can be so large that the estimates may be wholly inaccurate.
  • Ridge regression adds just enough bias to make the estimates reasonably reliable approximations to true population values.
  • One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method, when you have large multivariate data with the number of predictors (p) larger than the number of observations (n).
  • The ridge estimator is especially good at improving the least-squares estimate when multicollinearity is present.
Disadvantages  
  • Firstly, ridge regression includes all the predictors in the final model, unlike stepwise regression methods, which generally select models involving a reduced set of variables.
  • A ridge model does not perform feature selection. If greater interpretability is needed, where we want to reduce the signal in our data to a smaller subset of variables, then a lasso model may be preferable.
  • Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. Lasso regression is an alternative that overcomes this drawback, as the sketch below illustrates.
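
A minimal sketch on simulated data illustrating this contrast (the data, lambda value, and object names here are purely illustrative and are separate from the Boston example used later):
library(glmnet)
set.seed(1)
# Simulated data: 100 observations, 10 predictors, only the first 3 truly matter
x <- matrix(rnorm(100 * 10), nrow = 100)
y <- x[, 1] + 0.5 * x[, 2] - 0.5 * x[, 3] + rnorm(100)
# Ridge (alpha = 0) shrinks all coefficients but keeps them non-zero
coef(glmnet(x, y, alpha = 0, lambda = 0.5))
# Lasso (alpha = 1) sets some coefficients exactly to zero
coef(glmnet(x, y, alpha = 1, lambda = 0.5))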

For the mathematical and geometric interpretation of ridge regression, see the link below: More Info

Ridge Regression In R

Loading the MASS package to get the data set
library(MASS)
data <- Boston

Splitting the dataset into training and testing data
train <- data[1:400, ]
test <- data[401:506, ]

Loading libraries required for Ridge regression
library(tidyverse)
library(caret)
library(glmnet)
library(MASS)
library(Metrics)
library(car)   # for vif(), used later to check multicollinearity

We need to know a few things about the glmnet package:

  • The glmnet package provides the function glmnet() for ridge regression. Rather than accepting a formula and a data frame, it requires a matrix of predictors and a response vector.
  • We must specify alpha = 0 for ridge regression (alpha = 1 gives lasso, and 0 < alpha < 1 gives elastic net).
  • Ridge regression also involves tuning the hyperparameter lambda (λ), discussed earlier.
  • For classification with penalized logistic regression, we specify family = “binomial”; see the short sketch below.
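
A minimal sketch of the classification case on simulated data (illustrative only; the Boston housing example below is a regression problem and does not use this):
library(glmnet)
set.seed(1)
# Simulated binary outcome: 200 observations, 5 predictors
x <- matrix(rnorm(200 * 5), nrow = 200)
y <- rbinom(200, size = 1, prob = plogis(x[, 1] - x[, 2]))
# Ridge-penalized logistic regression with cross-validated lambda
logit_ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
coef(logit_ridge, s = "lambda.min")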

For more details about this package: More Info

There is another function, lm.ridge() in the MASS package, which can also be used; a minimal sketch is shown below. Please see the link for more details about the function: More Info [Page Number: 79]
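
A minimal sketch of lm.ridge(), assuming the train data frame created above; unlike glmnet(), it accepts a formula interface and a vector of lambda values to try:
# Ridge regression via MASS::lm.ridge over a grid of lambda values
ridge_mass <- lm.ridge(medv ~ ., data = train, lambda = seq(0, 10, by = 0.1))
# Candidate lambda chosen by generalized cross-validation (GCV)
ridge_mass$lambda[which.min(ridge_mass$GCV)]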

Preparing the training data as a matrix of predictors (dropping the intercept column) for training the regression model
x.train <- model.matrix(medv ~ ., train)[, -1]

We save the response variable, the median house value medv, in a vector y.train
y.train <- train$medv

We find the best value of lambda for the given data set with the cross-validation function cv.glmnet()
set.seed(123)
cv <- cv.glmnet(x.train, y.train, alpha = 0)
plot(cv)

Displaying the best lambda value
cv$lambda.min
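
cv.glmnet() also reports cv$lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; it is often worth comparing this more conservative choice with lambda.min:
cv$lambda.1se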

We fit the final model on the training data using the best lambda value.
model_ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)

Displaying the regression coefficients below
coef(model_ridge)

Preparing the test data set as a matrix of predictors (discarding the intercept column) for predicting the values of the response variable.
x.test <- model.matrix(medv ~ ., test)[, -1]

We save the predicted values of the response variable (housing price) in a vector prediction_ridge
prediction_ridge <- as.vector(predict(model_ridge, newx = x.test))

Saving the RMSE, SSE, and MAPE values of the test-set predictions in Accuracy_ridge; RMSE() comes from caret, while sse() and mape() come from the Metrics package. The column is named MAPE so that it matches Accuracy_lm when the two are combined later.
Accuracy_ridge <- data.frame(
  RMSE = RMSE(prediction_ridge, test$medv),
  SSE = sse(test$medv, prediction_ridge),
  MAPE = mape(test$medv, prediction_ridge))

Now we fit the multiple linear regression model on the training data set
names(train)
model_lm <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat, data = train)

From the summary of the model, we can find the p-values of the individual predictor variables and decide which variables to keep in the model; we then refit using only the significant predictors
summary(model_lm)
model_lm <- lm(medv ~ crim + zn + nox + rm + dis + rad + tax + ptratio + lstat, data = train)
summary(model_lm)

We check for multicollinearity with the help of the function vif() from the car package (loaded earlier).
vif(model_lm)

We also exclude predictor variables with high VIF values to avoid multicollinearity, although some multicollinearity can be tolerated (a VIF above roughly 5–10 is a common rule of thumb for concern); here the variable tax is dropped.
model_lm <- lm(medv ~ crim + zn + nox + rm + dis + rad + ptratio + lstat, data = train)

Below are the summary of the updated final model with all the significant variables and the VIF values of those variables. The R-squared and adjusted R-squared values are close to each other, which suggests that the predictors retained in the model are all contributing useful information.
summary(model_lm)

vif(model_lm)

We compute predictions on the test data set with the multiple linear regression model trained on the training data set
prediction_lm <- predict(model_lm, test[, -14])  # column 14 is the response, medv

We find out the RMSE, SSE, and MAPE of the regression model and save them in Accuracy_lm
Accuracy_lm <- data.frame(
  RMSE = RMSE(prediction_lm, test$medv),
  SSE = sse(test$medv, prediction_lm),
  MAPE = mape(test$medv, prediction_lm))

We save the RMSE, SSE and MAPE values of both linear and ridge regression models in Accuracy.
Accuracy <- rbind.data.frame(Accuracy_ridge = Accuracy_ridge, Accuracy_lm = Accuracy_lm)
Accuracy

From the Accuracy table above, it is clear that even though the least squares estimates are unbiased, the predictive accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces their variance and standard errors. With other models such as lasso and elastic net regression, we may be able to obtain even better accuracy.

This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.
