Lasso And Elastic Net Regression

In this article, we will learn the details of lasso and elastic net regression. In my previous article, I described the ridge regression technique and how it fares well against multiple linear regression in terms of accuracy. (You can read my previous article from this link: Click). But there are some areas where this technique falls short.

For example, it shrinks the coefficients towards zero but does not set any of them exactly to zero, so it does not perform feature selection. In this article, I introduce two methods, lasso and elastic net regression, which deal with these issues very well and perform both variable selection and regularization.

Lasso Regression

Lasso (least absolute shrinkage and selection operator) is a regression analysis method that applies L1 regularization, penalizing the absolute size of the regression coefficients much as ridge regression penalizes their squared size. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. Because of this, the penalty can force some of the parameter estimates to be exactly zero. Hence, much like the best subset selection method, lasso performs variable selection among the given variables.

The tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the least squares (OLS) estimates. As lambda increases, shrinkage occurs and the coefficients of the less important features shrink to zero, removing some features altogether.

So, a major advantage of lasso is that it combines shrinkage with variable selection. When the number of features is very large, lasso allows us to efficiently find a sparse model that involves only a small subset of them.
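To make this concrete, here is a minimal sketch using the glmnet package directly (the same engine behind the caret models later in this article) on the Boston housing data; the two lambda values passed to coef are chosen purely for illustration.

library(MASS)      # Boston housing data, also used later in this article
library(glmnet)    # fits the whole lasso path

x <- as.matrix(Boston[, -14])   # the 13 predictors
y <- Boston$medv                # response: median home value

fit <- glmnet(x, y, alpha = 1)  # alpha = 1 requests the lasso penalty
coef(fit, s = 0.01)             # small lambda: close to least squares
coef(fit, s = 1)                # larger lambda: some coefficients shrink exactly to zero

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # choose lambda by 10-fold cross-validation
cv_fit$lambda.min               # lambda with the lowest cross-validated error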

The cost function is given below, where the last term is the L1 regularization penalty.
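In standard notation, with coefficients β and tuning parameter λ ≥ 0, the lasso estimate is

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

where λ Σ|βj| is the L1 penalty; setting λ = 0 recovers ordinary least squares.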

The method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, “The Lasso minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models”.

In his article titled Regression Shrinkage and Selection via the Lasso, Tibshirani tells us about this technique in relation to various other statistical methods such as subset selection and ridge regression. He goes on to say that “lasso can even be extended to generalized regression models and tree-based models. In fact, this technique provides possibilities for even conducting statistical estimations.”

Traditional methods such as cross-validation and stepwise regression handle overfitting and perform feature selection well with a small set of features, but penalized regression techniques are a great alternative when we are dealing with a large set of features.

Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear. [Source: Wikipedia]

Though originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators, in a straightforward fashion. Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics, and convex analysis. [Source: Wikipedia]
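As a quick illustration of one such extension, the glmnet package fits a lasso-penalized logistic regression simply by changing the family argument; the binary outcome below is constructed from the Boston data purely for illustration and is not part of this article's worked example.

# Reusing x (the Boston predictor matrix) from the sketch above
above_median <- as.numeric(Boston$medv > median(Boston$medv))    # illustrative 0/1 outcome
logit_lasso <- cv.glmnet(x, above_median, family = "binomial", alpha = 1)
coef(logit_lasso, s = "lambda.min")   # sparse coefficients for the penalized logistic model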

As discussed above, lasso can set coefficients to zero, while ridge regression, which appears superficially similar, cannot. This is due to the difference in the shape of the constraint boundaries in the two cases.

Geometrically, the constraint region of lasso regression is a square rotated so that its corners lie on the axes (a diamond, in two dimensions), while the constraint region of ridge regression is a sphere, which is rotationally invariant and therefore has no corners. The elliptical contours of the least squares objective are likely to touch the lasso region at one of its corners, where some components of the coefficient vector are exactly zero. In the case of the sphere, the boundary points at which some components are zero are not distinguished from the others, so the contours are no more likely to touch a point where some coefficients are zero than one where none of them are.

In machine learning applications, both ridge regression and lasso have their respective advantages. Both techniques tackle overfitting, which is generally present in a realistic statistical model, and the choice often depends on the computing power and data available. Ridge regression is faster to compute than lasso, but lasso has the advantage of removing unnecessary parameters from the model entirely.

One important limitation of lasso regression is that, for grouped variables, the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.

Elastic Net Regression

Elastic net is a mix of both L1 and L2 regularization: a penalty is applied both to the sum of the absolute values of the coefficients and to the sum of their squared values.
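In the parameterization used by the glmnet package (which underlies the caret models in this article), the penalty added to the least-squares loss is

$$\lambda\left(\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}\right),$$

so that α = 1 gives the pure lasso penalty and α = 0 gives the pure ridge penalty.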

Lambda is a shared penalization parameter, while alpha sets the ratio between L1 and L2 regularization in the elastic net. Hence, we expect a hybrid behavior: coefficients are still cut, but the cut is less abrupt than with lasso penalization alone. The alpha hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used. The usual approach to optimizing the lambda hyper-parameter is through cross-validation, by minimizing the cross-validated mean squared prediction error, but in elastic net regression the optimal lambda also depends on the alpha hyper-parameter.

This article takes a cross-validated approach that uses grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter for the data set.

Computing Ridge, Lasso, And Elastic Net Regression

In my previous article, I used the glmnet package to demonstrate ridge regression in R. In this article, I use the caret package, which allows a better comparison between the techniques.

Loading the MASS package to get the data set
library(MASS)
data <- Boston

Splitting the dataset into training and testing data
train <- data[1:400, ]
test <- data[401:506, ]

Setting up a grid range of lambda values
lambda <- 10^seq(-3, 3, length = 100)

Loading the required libraries
library(tidyverse)
library(caret)
library(Metrics)

We fit the ridge regression model on the training data using k-fold cross-validation

set.seed(123)
ridge <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda))

plot(ridge$finalModel, xlab = "L2 Norm")

Displaying the regression coefficients below
coef(ridge$finalModel, ridge$bestTune$lambda)

We save the predicted values of the response variable in a vector prediction_ridge
prediction_ridge <- predict(ridge, test)

Saving the RMSE, SSE and MAPE values in Accuracy_ridge
Accuracy_ridge <- data.frame(
  RMSE = RMSE(prediction_ridge, test$medv),
  SSE = sse(test$medv, prediction_ridge),
  MAPE = mape(test$medv, prediction_ridge))

The only difference between the R code used for ridge and lasso regression is that for lasso regression, we need to specify the argument alpha = 1 instead of alpha = 0 (for ridge regression).

Now executing the Lasso Regression

set.seed(123)
lasso <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda))

plot(lasso$finalModel, xlab = "L1 Norm")

If we look at the plot, the x-axis is the maximum permissible value the L1 norm can take. So when we have a small L1 norm, we have a lot of regularization. Therefore, an L1 norm of zero gives an empty model, and as you increase the L1 norm, variables will “enter” the model as their coefficients take non-zero values.
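To see this numerically, we can compare the fitted coefficients at a weak and a strong penalty; the two s values below are chosen purely for illustration (coef interpolates along the fitted lambda path).

coef(lasso$finalModel, s = 0.01)  # weak penalty: close to the least squares fit
coef(lasso$finalModel, s = 1)     # stronger penalty: some coefficients are exactly zero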

Displaying the regression coefficients below
coef(lasso$finalModel, lasso$bestTune$lambda)

We save the predicted values of the response variable in a vector prediction_lasso
prediction_lasso <- predict(lasso, test)

Saving the RMSE, SSE and MAPE values in Accuracy_lasso

Accuracy_lasso <- data.frame(
  RMSE = RMSE(prediction_lasso, test$medv),
  SSE = sse(test$medv, prediction_lasso),
  MAPE = mape(test$medv, prediction_lasso))

The elastic net regression model does not require us to specify particular values of lambda and alpha; instead, we let the caret package select the best tuning parameters automatically. With tuneLength = 10, caret tests a range of possible alpha and lambda combinations and selects the pair with the best cross-validated performance, resulting in a final elastic net model.

Now executing the Elastic Net Regression

set.seed(123)
elasticnet <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10), tuneLength = 10)
plot(elasticnet$finalModel, xlab = "Elasticnet Regularization")
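The alpha and lambda combination that the grid search selected can be inspected directly; bestTune is a one-row data frame holding the chosen values.

elasticnet$bestTune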

Displaying the regression coefficients below
coef(elasticnet$finalModel, elasticnet$bestTune$lambda)

We save the predicted values of the response variable in a vector predictions_elasticnet
predictions_elasticnet <- predict(elasticnet, test)

Saving the RMSE, SSE and MAPE values in Accuracy_elasticnet
Accuracy_elasticnet <- data.frame(
  RMSE = RMSE(predictions_elasticnet, test$medv),
  SSE = sse(test$medv, predictions_elasticnet),
  MAPE = mape(test$medv, predictions_elasticnet))

We finally combine the RMSE, SSE and MAPE values of the three regression techniques into a data frame Accuracy.
Accuracy <- rbind.data.frame(
  Accuracy_ridge = Accuracy_ridge,
  Accuracy_lasso = Accuracy_lasso,
  Accuracy_elasticnet = Accuracy_elasticnet)
Accuracy

Here both lasso and elastic net regression perform feature selection in addition to shrinkage. On the other hand, the lasso achieves poorer accuracy here because there is a high degree of collinearity among the features. Further, the lasso solution is not uniquely determined when the number of predictors exceeds the number of observations, whereas ridge regression can handle this case.

Conclusion

From our example, we see that the penalized regression models perform much better than multiple linear regression. In general, lasso regression performs better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, a hybrid of the two, performs well in all of these scenarios.
