Feature selection in R (i.e. picking important variables) using the Boruta package

Variable selection, or feature selection, is an important aspect of model building that every analyst must learn. After all, it helps in building predictive models free from correlated variables, biases and unwanted noise.

A lot of novice analysts assume that keeping all available variables will result in the best model, since no information is lost. Sadly, that is not true!

How many times has removing a variable from a model increased your model accuracy?
It has certainly happened to me. Such variables are often correlated with others and hinder the achievement of higher model accuracy. Today, we’ll learn one way to get rid of such variables in R. I must say, R has an incredible CRAN repository, and one package available for variable selection is the Boruta package.

In this article, we’ll focus on understanding the theory and practical aspects of using Boruta Package. I’ve followed a stepwise approach to help you understand better.

I’ve also drawn a comparison of Boruta with other traditional feature selection algorithms. Using this, you can arrive at a more meaningful set of features, which can pave the way for a robust prediction model. The terms “features”, “variables” and “attributes” are used interchangeably, so don’t get confused!

Description

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. The package derives its name from a demon in Slavic mythology who dwelled in pine forests. This technique proves especially valuable when a data set comprising many variables is given for model building.

Boruta can be your algorithm of choice for such data sets, particularly when you are interested in understanding the mechanisms related to the variable of interest, rather than just building a black-box predictive model with good prediction accuracy.

Below is the stepwise working of the Boruta algorithm:

  • Firstly, it adds randomness to the given data set by creating shuffled copies of all features (called shadow features).
  • Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature, where higher means more important.
  • At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and steadily removes features deemed highly unimportant.
  • Finally, the algorithm stops either when all features get confirmed or rejected, or when it reaches a specified limit of random forest runs (a minimal sketch of these steps follows this list).
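
To make the shadow-feature idea concrete, here is a hand-rolled sketch of a single Boruta-style comparison on a small simulated data frame. All names here (x1, x2, y, the shadow_ prefix) are made up for illustration; the package performs these steps internally, repeats them over many iterations, and works with Z scores rather than the raw importance values used below.

    # Hypothetical toy data: two predictors and a binary response
    library(randomForest)
    set.seed(123)
    df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    df$y <- factor(ifelse(df$x1 + rnorm(100, sd = 0.5) > 0, "yes", "no"))

    # Step 1: add shuffled copies of every real feature (the shadows)
    shadows <- as.data.frame(lapply(df[c("x1", "x2")], sample))
    names(shadows) <- paste0("shadow_", names(shadows))
    extended <- cbind(df, shadows)

    # Step 2: fit a random forest and extract Mean Decrease Accuracy
    rf  <- randomForest(y ~ ., data = extended, importance = TRUE)
    imp <- importance(rf, type = 1)

    # Step 3: a real feature scores a "hit" when it beats the best shadow
    max_shadow <- max(imp[grepl("^shadow_", rownames(imp)), ])
    imp[!grepl("^shadow_", rownames(imp)), , drop = FALSE] > max_shadow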

Usage and Details of the Boruta Package

Boruta iteratively compares the importance of attributes with the importance of shadow attributes, created by shuffling the original ones. Attributes that have significantly worse importance than the shadow ones are consecutively dropped. On the other hand, attributes that are significantly better than the shadows are confirmed. Shadows are re-created in each iteration.

The algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision; they are marked Tentative. You may try extending maxRuns or lowering pValue to clarify them, but in some cases their importance fluctuates too much for Boruta to converge.
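
For instance, a run with an increased maxRuns might look like the sketch below. The iris data set is used purely for illustration; maxRuns, pValue and doTrace are genuine arguments of Boruta::Boruta().

    library(Boruta)
    set.seed(123)

    # Run Boruta on a toy data set; doTrace = 2 prints progress per iteration
    boruta_out <- Boruta(Species ~ ., data = iris,
                         maxRuns = 200, pValue = 0.01, doTrace = 2)
    print(boruta_out)   # counts of Confirmed / Tentative / Rejected attributes
    plot(boruta_out)    # importance box plots; shadow attributes in blue by default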

Alternatively, you can use the TentativeRoughFix function, which performs a weaker test to make a final decision, or simply treat Tentative attributes as undecided in further analysis.
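
Continuing the sketch above (the boruta_out object is assumed from the previous snippet), resolving the Tentative attributes and extracting the final feature set could look like this:

    # Resolve remaining Tentative attributes with a simpler test
    boruta_final <- TentativeRoughFix(boruta_out)

    # List the confirmed features and summarise the importance statistics
    getSelectedAttributes(boruta_final, withTentative = FALSE)
    attStats(boruta_final)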

Boruta in Action in R (Practical)
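
The worked example itself is not reproduced here, so the snippet below gives a self-contained sketch of a typical end-to-end workflow: run Boruta, extract the selected features, and refit a model on them. The iris data and the reformulate() call are illustrative choices, not the original article’s data set.

    library(Boruta)
    library(randomForest)
    set.seed(456)

    # 1. Feature selection
    boruta_res <- Boruta(Species ~ ., data = iris, doTrace = 0)
    selected <- getSelectedAttributes(boruta_res, withTentative = FALSE)

    # 2. Refit a model using only the selected features
    fml <- reformulate(selected, response = "Species")
    rf_final <- randomForest(fml, data = iris)
    print(rf_final)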


Author: Ashar Ahmad
