Machine Learning in Python – sklearn Feature Selection

The data features we use to train our machine learning models have a huge influence on the performance we can achieve. Irrelevant or only partially relevant features can negatively impact model performance. In this article, we will learn how to implement feature selection with scikit-learn.

Let’s get started.

Feature Selection

Feature selection is a process where we automatically select those features in our data that contribute most to the prediction variable or output in which we are interested.
Having irrelevant features in our data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modelling our data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: Less data means that algorithms train faster.

We can learn more about feature selection with scikit-learn in the Feature selection section of the scikit-learn user guide.

Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python.

Each recipe was designed to be complete and standalone so that we can copy-and-paste it directly into our project and use it immediately.

Each recipe uses the Pima Indians onset of diabetes dataset to demonstrate a feature selection method. This is a binary classification problem where all of the attributes are numeric.

Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.
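A minimal sketch of this recipe is shown below using SelectKBest with the chi2 score function. The CSV file name and column names are assumptions here; point them at wherever your copy of the Pima Indians diabetes data lives.

```python
# Feature extraction with univariate statistical tests (chi-squared)
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Assumed local path and column names for the Pima Indians diabetes CSV
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]
y = data.values[:, 8]

# Select the 4 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=4)
fit = selector.fit(X, y)
print(fit.scores_)          # chi-squared score for each attribute
features = fit.transform(X)
print(features[0:5, :])     # first rows of the reduced dataset
```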


We can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.

Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those that remain.
It uses model accuracy to identify which attributes (and combinations of attributes) contribute most to predicting the target attribute.

We can learn more about the RFE class in the scikit-learn documentation. The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of the algorithm does not matter too much as long as it is skillful and consistent.
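A sketch of this recipe follows, using the RFE class wrapped around LogisticRegression. The dataset path, column names and the liblinear solver choice are assumptions, not requirements of the method.

```python
# Recursive Feature Elimination with a logistic regression estimator
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Assumed local path and column names for the Pima Indians diabetes CSV
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]
y = data.values[:, 8]

# Any reasonably skillful, consistent estimator works here
model = LogisticRegression(solver='liblinear')
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X, y)
print("Num features:", fit.n_features_)
print("Selected features:", fit.support_)   # True for retained attributes
print("Feature ranking:", fit.ranking_)     # 1 marks a selected attribute
```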

We can see that RFE chose the top 3 features as preg, mass and pedi.
These are marked True in the support_ array and given a ranking of 1 in the ranking_ array.

Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.
Generally, this is called a data reduction technique. A property of PCA is that we can choose the number of dimensions or principal components in the transformed result.

In the example below, we use PCA and select 3 principal components. Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.
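Below is a sketch of this recipe using the PCA class with 3 components; as before, the CSV path and column names are assumptions.

```python
# Feature extraction with Principal Component Analysis (3 components)
import pandas as pd
from sklearn.decomposition import PCA

# Assumed local path and column names for the Pima Indians diabetes CSV
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]

pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained variance:", fit.explained_variance_ratio_)
print(fit.components_)   # the 3 principal axes in the original feature space
```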

We can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
In the example below, we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. We can learn more about the ExtraTreesClassifier class in the scikit-learn API.
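A minimal sketch of this recipe is shown below; the dataset path, column names and the number of trees are assumptions.

```python
# Feature importance with an extra-trees ensemble
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Assumed local path and column names for the Pima Indians diabetes CSV
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.csv', names=names)
X = data.values[:, 0:8]
y = data.values[:, 8]

model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, y)
print(model.feature_importances_)   # one importance score per input attribute
```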

We can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.

Summary

In this post, we discovered feature selection for preparing machine learning data in Python with scikit-learn.
We learned about 4 different automatic feature selection techniques:

  • Univariate Selection.
  • Recursive Feature Elimination.
  • Principal Component Analysis.
  • Feature Importance.

