Model Selection Based on Cross Validation in R
Introduction to Model Selection
In statistics, model selection based on cross-validation plays a vital role. The prediction problem is about predicting a response (either continuous or discrete) using a set of predictors (some of them may be continuous, others discrete). The solution to such a problem is to build a prediction model on a training sample. This process presents two related challenges:
We often have many candidate models (e.g. regression, tree, neural net, SVM, etc.). Each model may have many sub-models specified by hyper-parameters that need to be tuned for optimal prediction performance (e.g. variable selection, shrinkage/penalty factor, smoothing/complexity parameter, Bayesian hyper-priors, etc.). In order to choose the (approximately) best model, we need to objectively estimate the performance of the different models and their sub-models.
Once we decide on the best model, we want to estimate its test error by making a prediction on a new sample, which provides an objective judgment on the performance of this final model.
Besides the test error, other model assessment tools such as ROC Curve or Calibration Plot may be useful.
To estimate the test errors of different models and to assess the final model objectively, we should ideally split the dataset into three parts:
- Training Sample: used for model estimation, i.e. estimating the model parameters. It can be 50% of the data.
- Validation Sample: used for model selection, i.e. estimating the test error of each candidate model. It can be 25% of the data.
- Test Sample: used for model assessment, i.e. estimating the test error of the final chosen model. It can be the remaining 25% of the data.
If the data is insufficient to be split into three parts, we can drop the Test Sample, provided we can accept (slightly) biased model assessments. We can also drop the Validation Sample and instead use CV (Cross-Validation) or the Bootstrap method on the Training Sample to estimate the test errors for model selection.
Basically, the CV and Bootstrap methods use the training sample efficiently by generating small internal validation samples. Alternatively, we can use analytical estimators such as AIC or BIC, which may not be available for some models. In general, my recommendation is to use 5-fold or 10-fold CV.
Model Selection Based on Cross Validation (CV)
Because analytical model selection metrics such as AIC or BIC are not universally available for all models (e.g. trees, SVM), we usually use:
- Cross-Validation (CV) for model selection, or
- Bootstrap for model selection
In general, Cross-Validation (CV) and the Bootstrap have similar performance. However, the Bootstrap method (e.g. the “.632+” estimator) is generally more computationally intensive than CV. What’s more, the concept of CV is simple and easy to communicate.
How CV Works Consider a 10-fold CV: we split the Training Sample into 10 roughly equal-sized parts. For the 1st part, we fit the model to the other 9 parts of the data and calculate the prediction error, e1, of the fitted model on this 1st part; we then repeat this process on the 2nd part, producing e2, and so on and so forth.
Note that in this way each observation is predicted exactly once, as an out-of-sample prediction. In the end, we have 10 error estimates, e1, e2, . . . , e10, which can be averaged into one number e (the CV Error) and which can be used to compute the standard error of e.
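The procedure above can be sketched in base R. Here a simulated dataset and a simple linear model stand in for the reader's own data and candidate model; all variable names are illustrative:

```r
set.seed(1)

# Simulated Training Sample: 200 observations, one predictor
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Randomly assign each observation to one of K = 10 folds
K <- 10
fold <- sample(rep(1:K, length.out = n))

# For fold k: fit on the other 9 parts, predict the held-out part
errs <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])
  pred <- predict(fit, newdata = dat[fold == k, ])
  mean((dat$y[fold == k] - pred)^2)    # e_k: mean squared prediction error
})

cv_error <- mean(errs)                 # the CV Error
cv_se    <- sd(errs) / sqrt(K)         # one common estimate of its standard error
```

Each observation falls in exactly one fold, so every prediction in `errs` is out-of-sample.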
Alternatively, if the computing resources are ample, we can, for example, repeat the CV procedure 20 times, using a different 10-fold partition each time, and estimate the standard error of the CV Error. In the model selection process, it is very informative to plot this CV Error, along with its standard error bar, to facilitate model comparison.
Often we apply a one-standard-error rule, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
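The one-standard-error rule takes only a few lines of R. The CV errors and standard errors below are hypothetical numbers for five models of increasing complexity (model 1 being the simplest):

```r
# Hypothetical CV results for five models of increasing complexity
cv_err <- c(4.1, 3.2, 2.9, 2.8, 2.85)   # CV Error per model
cv_se  <- c(0.3, 0.25, 0.2, 0.2, 0.2)   # standard error of each CV Error

best      <- which.min(cv_err)           # model with the lowest CV Error
threshold <- cv_err[best] + cv_se[best]  # one standard error above it

# One-standard-error rule: the most parsimonious model under the threshold
chosen <- min(which(cv_err <= threshold))
```

Here the lowest error belongs to model 4, but model 3 is within one standard error of it, so the rule picks the simpler model 3.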
Comparison Between 5-fold CV (CV-5) and 10-fold CV (CV-10)
Bias If the training sample size is small, CV will, in general, overestimate the test error (because each training fold does not contain enough data for the model to show its full “strength”). This estimation bias will be negligible if we have a sufficient sample size. For CV-5, each training model uses only 80% of the Training Sample; for CV-10, 90%. Thus the bias of the CV-5 error is higher than that of the CV-10 error.
Variance For the same reason above (CV-5: 80%; CV-10: 90%), the training models of CV-10 are more similar (correlated) to each other than those of CV-5. Thus the variance of CV-10 error is higher than that of CV-5.
Computing Time CV-10, which fits the model 10 times, will take about twice as long to run as CV-5, which fits it 5 times.
Since it is generally hard to know the bias-variance trade-off between CV-5 and CV-10, we recommend using the computing time to make a choice: if a model takes a long time to fit, use CV-5; otherwise use CV-10.
Examples of Model Selection and Assessment
Selection of hyper-parameters Consider the best subset regression of size p. We can choose the best size, p_best, by using CV-10 on the Training Sample; thus we may not need a Validation Sample here. We then fit a new best subset regression of size p_best on the entire Training Sample. The fitted regression is then assessed using the Test Sample, yielding an estimate of the test error, say, RMSE (Root Mean Squared Error).
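Choosing the subset size by CV-10 can be sketched in base R. As a simplification, the candidate models below are nested (the first p predictors) rather than a true best-subset search, and the data are simulated; the names are illustrative:

```r
set.seed(2)

# Simulated Training Sample: 10 predictors, only the first 3 matter
n <- 300; P <- 10
X <- matrix(rnorm(n * P), n, P)
y <- X[, 1] + X[, 2] + X[, 3] + rnorm(n)
dat <- data.frame(y = y, X)              # columns: y, X1, ..., X10

K <- 10
fold <- sample(rep(1:K, length.out = n))

# CV-10 error of the candidate model that uses the first p predictors
cv_err <- sapply(1:P, function(p) {
  form <- reformulate(names(dat)[2:(p + 1)], response = "y")
  mean(sapply(1:K, function(k) {
    fit <- lm(form, data = dat[fold != k, ])
    mean((dat$y[fold == k] - predict(fit, dat[fold == k, ]))^2)
  }))
})

p_best <- which.min(cv_err)

# Refit the chosen model on the entire Training Sample
final_fit <- lm(reformulate(names(dat)[2:(p_best + 1)], "y"), data = dat)
```

In a real application, `final_fit` would then be assessed on the held-out Test Sample.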
When we are satisfied with this model, we can combine the Training, Validation, and Test Samples and fit a new best subset regression of size p_best on the pooled data. This final model can then be deployed in Production, where it is used for prediction on new data.
Feature selection Suppose we have a dataset of 100 observations and 1000 variables.
Naturally, we want to select some important variables before modeling. Unless we are using unsupervised learning methods (e.g. PCA, ICA, etc.) to select/summarize the variables, any supervised variable selection, where knowledge of the response is utilized, must be carried out independently and repeatedly in each training set of the CV procedure.
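A small simulation makes the point concrete. With 100 observations, 1000 pure-noise predictors, and a response unrelated to any of them, screening variables on the full data before CV leaks information and produces an optimistically low error, while re-screening inside each fold does not. The setup and helper names below are illustrative:

```r
set.seed(3)

# 100 observations, 1000 pure-noise predictors, response unrelated to all of them
n <- 100; P <- 1000
X <- matrix(rnorm(n * P), n, P)
y <- rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))

# Screen: keep the 10 predictors most correlated with the response
top10 <- function(X, y) order(abs(cor(X, y)), decreasing = TRUE)[1:10]

cv_mse <- function(keep_fixed = NULL) {
  mean(sapply(1:K, function(k) {
    tr <- fold != k
    # Fair CV re-screens inside each training fold; the unfair version
    # reuses the variables screened once on the full data
    keep <- if (is.null(keep_fixed)) top10(X[tr, ], y[tr]) else keep_fixed
    fit  <- lm(y ~ ., data = data.frame(y = y[tr], X[tr, keep, drop = FALSE]))
    pred <- predict(fit, newdata = data.frame(X[!tr, keep, drop = FALSE]))
    mean((y[!tr] - pred)^2)
  }))
}

unfair <- cv_mse(keep_fixed = top10(X, y))  # screening done before CV
fair   <- cv_mse()                          # screening repeated in each fold
```

Since the response is pure noise, the fair CV error sits near var(y), while the unfair error is misleadingly small.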
In other words, CV is fair only if the variable selection is considered part of the prediction model.
ROC curve When the response is dichotomous, we often use two metrics in lieu of the test error:
- Sensitivity = 1 – (Misclassification Error when Response = TRUE)
- Specificity = 1 – (Misclassification Error when Response = FALSE)
There is often a key tuning parameter, called the Cut-off, that dictates the trade-off between sensitivity and specificity. This trade-off can be illustrated by the ROC (Receiver Operating Characteristic) curve.
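Tracing the curve amounts to sweeping the cut-off and recording both metrics at each value. The scores below are simulated stand-ins for any classifier's output (e.g. fitted probabilities):

```r
set.seed(4)

# Simulated scores: higher score -> more likely Response = TRUE
n <- 500
truth <- rbinom(n, 1, 0.5) == 1
score <- ifelse(truth, rnorm(n, mean = 1), rnorm(n, mean = 0))

# Sweep the cut-off and record sensitivity and specificity at each value
cutoffs <- seq(-3, 4, by = 0.1)
roc <- t(sapply(cutoffs, function(cutoff) {
  pred <- score > cutoff
  c(sens = mean(pred[truth]),      # TRUE cases correctly flagged
    spec = mean(!pred[!truth]))    # FALSE cases correctly passed
}))

# The ROC curve plots sensitivity against 1 - specificity:
# plot(1 - roc[, "spec"], roc[, "sens"], type = "l")
```

A very low cut-off flags everything (sensitivity near 1, specificity near 0); a very high cut-off does the opposite, which is exactly the trade-off the curve displays.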