Multivariate Adaptive Regression Splines (MARSplines) is a non-parametric regression technique introduced by Jerome H. Friedman in 1991. It extends linear models to automatically capture nonlinearities and interactions between variables.
What is a non-parametric regression?
In the most popular regression techniques, such as generalized linear models (GLM) and multiple linear regression (LM), the dependent variable is hypothesized to depend linearly on the predictor variables, and its values are predicted from the values of the independent predictor variables. For example:
House_Prices = Constant + 2.5*No_of_rooms + 2.0*No_of_floors + 3.2*Total_Area
Here the independent variables are the number of rooms, the number of floors and the total area of the house; the dependent variable is the house price. The numerical values multiplied by the independent variables are the regression coefficients. The higher the value of a regression coefficient, the greater the influence of that independent variable on the dependent variable. If the variables are scaled properly, direct comparisons between these coefficients become more meaningful and useful.
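As an illustration (a minimal sketch: the data are simulated, and the variable names and coefficient values are taken from the hypothetical equation above), ordinary least squares in base R recovers coefficients of this kind:

```r
# Simulate house data whose prices follow the parametric equation above
set.seed(42)
n <- 200
No_of_rooms  <- sample(2:8, n, replace = TRUE)
No_of_floors <- sample(1:3, n, replace = TRUE)
Total_Area   <- runif(n, 50, 300)
House_Prices <- 10 + 2.5 * No_of_rooms + 2.0 * No_of_floors +
  3.2 * Total_Area + rnorm(n, sd = 1)

# A linear model recovers the constant and the regression coefficients
fit <- lm(House_Prices ~ No_of_rooms + No_of_floors + Total_Area)
coef(fit)
```

The fitted coefficients come out close to the true values 2.5, 2.0 and 3.2 used in the simulation.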
The example given above is a parametric regression, which assumes the form of the relationship between the dependent and independent variables prior to the regression modeling. Nonparametric regression makes no such assumption about the relationship between the variables. Instead, it constructs the relationship from the data itself, using coefficients and so-called basis functions that are derived entirely from the data.
The MARSplines algorithm performs such a regression while searching the data for nonlinearities, which helps to maximize the predictive accuracy of the model. MARSplines is therefore a step forward in producing good results where the relationship between the predictors and the dependent variable is difficult to establish.
The MARSplines model equation is given below:

y = B0 + B1*h1(X) + B2*h2(X) + ... + BM*hM(X)

In this equation the output vector y is predicted as a function of the predictor variables X. The constant (intercept) term is B0; each hm(X) is a function from a set of one or more basis functions, and Bm is its associated coefficient.
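To make the equation concrete, here is a small base-R sketch (with made-up coefficients and knots) that evaluates a MARS-style model as an intercept plus a weighted sum of basis functions:

```r
# Two hinge basis functions with a hypothetical knot at 5
h1 <- function(x) pmax(0, x - 5)   # max(0, x - 5)
h2 <- function(x) pmax(0, 5 - x)   # max(0, 5 - x)

# y = B0 + B1*h1(x) + B2*h2(x), with made-up coefficients
B0 <- 10; B1 <- 2; B2 <- -3
mars_predict <- function(x) B0 + B1 * h1(x) + B2 * h2(x)

mars_predict(c(3, 5, 8))  # 4 10 16 -- piecewise linear with a kink at x = 5
```

The prediction is linear on each side of the knot but bends at it, which is exactly what lets MARS model nonlinearity.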
How does the MARSplines algorithm work?
The MARSplines method partitions the input data into different regions, each with its own regression equation. The partitioning is done with hinge (rectifier) functions, which come in mirrored pairs of the form:

max(0, x - t) and max(0, t - x)

where t is a constant called a knot. The knot is what allows the model to bend, so a nonlinear relationship can be approximated by piecewise-linear segments. A single hinge function produces only one such bend; in practice, a large number of hinge functions and their products are combined to capture complex nonlinearities and interactions.
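A hinge pair is one line of base R; the sketch below shows how the two mirrored functions split the data at a knot t (the values are arbitrary):

```r
# A mirrored pair of hinge (rectifier) functions with knot t
hinge_pair <- function(x, t) cbind(right = pmax(0, x - t),
                                   left  = pmax(0, t - x))

# Each input activates one side of the knot; the other side is zero
hinge_pair(c(1, 4, 7), t = 4)
```

For x = 1 only the left hinge is nonzero, for x = 7 only the right one, and at the knot itself both are zero.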
The model is built in two stages:
1. FORWARD PASS
2. BACKWARD PASS
In the forward pass we first build the model with just the intercept term. The algorithm then adds basis functions in pairs, choosing at each step the pair (for some variable and knot) that most reduces the prediction error, until the improvement becomes negligible or a maximum number of terms is reached. The two basis functions in a pair are mirror images of each other: they share the same variable and knot, one taking the form max(0, x - t) and the other max(0, t - x). Each new basis function may also consist of a term already in the model multiplied by a new hinge function, which is how interactions enter. Because basis functions are added one pair after another in a greedy way, this stage is known as a greedy algorithm.
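One greedy forward step can be sketched as follows. This is a simplification under stated assumptions: candidate knots are taken at the observed data values, fit is judged by residual sum of squares, and the real algorithm additionally considers products with terms already in the model:

```r
# One greedy forward step: pick the knot whose mirrored hinge pair
# most reduces the residual sum of squares
set.seed(1)
x <- runif(100, 0, 10)
y <- ifelse(x < 6, x, 6) + rnorm(100, sd = 0.3)  # true kink at x = 6

best <- list(rss = Inf, knot = NA)
for (t in sort(unique(x))) {
  fit <- lm(y ~ pmax(0, x - t) + pmax(0, t - x))
  rss <- sum(resid(fit)^2)
  if (rss < best$rss) best <- list(rss = rss, knot = t)
}
best$knot  # the selected knot lands near the true kink at 6
```

The search recovers a knot close to the place where the underlying relationship actually bends, which is the essence of the forward pass.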
The backward pass removes from the model the basis functions that contribute least to the fit. Nonparametric models exhibit a high degree of flexibility that can ultimately result in overfitting: the model compromises its accuracy when presented with a new dataset. To combat this problem, the backward pass, also known as the "pruning pass", limits the complexity of the model by reducing the number of its terms, deleting the least significant basis functions one at a time until the best submodel, typically judged by generalized cross-validation (GCV), is found.
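The GCV criterion used in pruning trades fit against complexity; a minimal sketch of the standard formula (n observations, C effective parameters, the example numbers are made up) is:

```r
# GCV score: residual sum of squares penalized by model complexity
# n = number of observations, C = effective number of parameters
gcv <- function(rss, n, C) (rss / n) / (1 - C / n)^2

# A model with slightly higher RSS but fewer effective parameters
# can still win on GCV, which is what drives the pruning pass
gcv(rss = 100, n = 50, C = 10)  # simpler model
gcv(rss = 95,  n = 50, C = 20)  # more complex model, worse GCV
```

Here the simpler model scores lower (better) despite its larger raw error, so the pruning pass would prefer it.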
Implementation of MARSplines in R
The MARSplines algorithm is available in the R package earth, which we install with:

install.packages("earth")
Now we load the package to use the function earth():

library(earth)
I have used a dataset known as Boston, which is present in the MASS package, to show you a comparison between MARSplines and other regression and penalized regression techniques.
library(MASS)
data <- Boston
Splitting the dataset into two parts:
train <- data[1:400, ]
test <- data[401:506, ]
We fit the model and save it in Fit:

Fit <- earth(medv ~ ., data = train)
We call the summary of the model to see the values of the parameters:

summary(Fit)
To see the importance of the input variables we use the function evimp():

evimp(Fit)
The summary also reports the model's GCV (generalized cross-validation) score, the criterion earth uses during pruning.
We predict the house prices for the test set and save them in Predictions:
Predictions <- predict(Fit, test)
Lastly, to see the accuracy of the model on the test dataset, we compute the root mean squared error, for example with rmse() from the Metrics package:

library(Metrics)
rmse(test$medv, Predictions)
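rmse() is not part of base R; if you prefer to avoid an extra package, the metric is one line (shown here on toy vectors):

```r
# Root mean squared error: typical size of the prediction errors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(3, 5, 7), c(2, 5, 9))  # sqrt((1 + 0 + 4)/3), about 1.29
```

The same function applied to test$medv and Predictions reproduces the accuracy figure discussed below.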
From the RMSE value we can conclude that the MARSplines algorithm works very well on regression problems. On this dataset the model also performed better than linear regression and penalized regression techniques such as ridge, lasso and elastic net.
[Note: Compare the accuracy of MARS with the accuracies of the techniques from my previous article: https://stepupanalytics.com/lasso-and-elastic-net-regression/]
You can also use the MARSplines algorithm with mars() from the mda package.
Two important features of the MARSplines algorithm:
- It can be applied to multiple dependent variables. The algorithm determines a common set of basis functions in the predictors, but estimates different coefficients for each dependent variable.
- Because MARSplines can handle multiple dependent variables, it is easy to apply the algorithm to classification problems as well.
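As a sketch of the multi-response feature (assuming the earth and MASS packages are installed; the choice of responses and predictors here is purely illustrative), the dependent variables are bound into a matrix on the left-hand side of the formula:

```r
library(earth)
library(MASS)

# One MARS model for two responses: earth selects a common set of
# basis functions but fits separate coefficients for each response
fit2 <- earth(cbind(medv, nox) ~ lstat + rm + dis, data = Boston)
dim(fit2$coefficients)  # rows: basis functions; columns: one per response
```

The coefficient matrix has one column per dependent variable, which is exactly the "common basis functions, different coefficients" behaviour described above.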