Random Forest Using R

Random Forest is an ensemble of multiple decision trees combined to obtain better predictive performance than any single constituent decision tree could achieve on its own.

Classification and Regression Trees (CART)

Two of the very basic ideas behind any algorithm used for predictive analysis are:

  1. prediction on the basis of regression and
  2. prediction on the basis of classification.

Both are equally important concepts in data science. Having said that, there are also several differences between the two.

In the case of regression, the predicted outcome is a numeric variable, and a continuous one at that.
For a classification task, the predicted outcome is not numeric; it represents categorical classes or factors, i.e. the outcome variable takes a limited number of values, which may be binary (dichotomous) or multinomial (having more than two classes).

In this analysis we work only on the ‘classification’ side of predictive tasks, keeping our focus not on regression trees but only on classification trees, the two halves suggested by the name ‘Classification and Regression Trees’.

Coming now to classification trees, we see that a good deal of decision-making is involved in such classification tasks: decisions need to be made about how the outcome variable is classified when we predict.

These decisions are certainly not human-made in such cases. We, as students trying to do some analysis, merely set the parameters of the decision criteria and nothing more. The entire decision-making process during the classification task is carried out by machine learning algorithms such as decision trees and random forests.

The name ‘tree’ used here is quite analogous to the trees we are familiar with from game theory. A tree is a combination of decision nodes and sub-nodes linked to one another in a logical manner. Based on the result of the test at each node, the algorithm decides which rows of the data set satisfy the condition generated at that node and segregates those rows according to its own decision criteria.

This entire process is self-learned by the machine (the computer software here). The machine learns from its decisions on each row of the training sample and later applies the same learned knowledge to the testing sample, which gives us an idea of how accurate its self-learning is. Together, these concepts give us the name decision tree learning.

By decision trees, we mean decision tree models of self-learning that the machine uses to segregate and classify observations on its own.

However, a single decision tree classifier sometimes cannot achieve a good enough classification. In most such cases we therefore turn to another machine learning algorithm, Random Forest, which, as stated above, is an ensemble of multiple decision trees combined to obtain better predictive performance than any single constituent decision tree could achieve alone.

So our focus will primarily be on the Random Forest classifier and its comparison with a single decision tree.

The Objective of the Coding Problem and the Relevant Methodology

We need to build a classifier for the “salary” attribute. We can perform various data pre-processing steps and transformations (e.g. grouping attribute values, converting them to binary, etc.), but we will explain why we have chosen to do so.

We will split the data set into training and testing sets in order to set the parameters properly and evaluate the quality of the classifier. We could use any tool, such as classifiers in R, Weka, Python, Orange, scikit-learn or other software. Whatever we use, we need to explain the classifier involved and make sure we are producing valid results.

However, here we will limit ourselves to the most common classifiers under the broad umbrella of ‘CART’, viz. decision trees and random forests. We will use R as our software tool and write out the entire analysis in R Markdown format.

Data Set

We have a single dataset named ‘Train’, and we divide it in a 7:3 ratio into a training dataset and an in-sample testing dataset respectively. As usual, we will build the model in R using the training dataset and test it on the testing dataset.
The dataset ‘Train’ is available as train.csv.

Data Description:

The data set consists of the following variables:
ID
Age
Employment class
Fnlwgt
Education level
Education years
Marital status
Occupation
Relationship status
Race
Sex
Capital gain
Capital loss
Work hours per week
Native country
Salary
Some variables are categorical while some are numeric.

Coding Demonstration:

We hereby start with the modeling using Decision Trees and Random Forests

Loading the libraries
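
The exact set of packages is not shown here; a plausible set, based on the functions referenced later in this post (caret's train(), rpart.plot's prp() and randomForest()), would be:

```r
library(caret)          # train(), trainControl(), confusionMatrix()
library(rpart)          # CART implementation behind method = "rpart"
library(rpart.plot)     # prp() for visualising fitted trees
library(randomForest)   # randomForest() for the ensemble model
library(ggplot2)        # exploratory plots
```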

Getting the data

Reading the structure of the data

Reading the first few rows
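
A minimal sketch of these three steps, assuming the file is saved locally as train.csv:

```r
# Read the data; stringsAsFactors = TRUE keeps the categorical columns as factors
train <- read.csv("train.csv", stringsAsFactors = TRUE)

# Structure of the data: variable names, types and a preview of values
str(train)

# First few rows of the data frame
head(train)
```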

It seems from looking at the initial rows of the Data Frame that the variable ‘ID’ is not useful for any analysis.
The variable ‘Fnlwgt’ is a kind of final sampling weight.

Getting some plots done to understand the data visually:
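
As one example of how these plots could be produced, a histogram of Age filled by Salary might look like the sketch below; the column names Age and Salary are assumptions based on the data description above.

```r
# 'Age' and 'Salary' are assumed column names taken from the data description
ggplot(train, aes(x = Age, fill = Salary)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
  labs(title = "Age distribution by salary class", x = "Age", y = "Count")
```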

So, from the graph, it is pretty clear that the non-salaried people are mostly concentrated in the younger age groups, while the salaried people are concentrated around the middle age groups.

We see that both the salaried and the non-salaried people are concentrated around Fnlwgt values of 150000 to 200000, so there are no stark differences between the two groups. However, Fnlwgt is just the final sampling weight assigned; since it makes no sense to use it as a predictor, we will drop it.

We see that both the non-salaried and the salaried groups are present across all values of Education years. This suggests that years of education does not have that much impact on the split between salaried and non-salaried people.

The highest number of non-salaried people work about 40 hours per week, and the same is true of salaried people. However, people working fewer than 40 hours per week are mostly non-salaried, because the number of salaried workers at those hours is much smaller.

Capital gain is 0 for all non-salaried people. For salaried people, the count is highest at a capital gain of 0 and gradually decreases as capital gain increases. Capital gain is therefore heavily skewed, so it is better to bin it.

Capital loss for non-salaried people is concentrated at the value 0, and for salaried people it is also highest at 0, but the number of non-salaried people at a capital loss of zero is much, much higher than the number of salaried people. Capital loss is also heavily skewed, so it is better to bin it as well.

Non-salaried people far outnumber salaried people at all education levels except Doctorate, Masters and Prof-school.

The number of Non-Salaried people is always higher across all Marital Statuses.

Non-salaried people are more numerous across all occupations.

The number of non-salaried people is also greater across all relationship statuses.

Non-salaried people are most numerous among Whites, and their count is higher than that of salaried people across all races.

Interestingly, although non-salaried people are more numerous for both males and females, the disparity is smaller for males than for females. Now we will draw several bivariate plots.

We see that there is no definite pattern here.

To see the trend between Work hours per week and Age, we create the scatter plot sketched below.
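
A sketch of such a scatter plot, again with assumed column names:

```r
# Assumed column names; adjust to match the actual data frame
ggplot(train, aes(x = Age, y = Work.hours.per.week, colour = Salary)) +
  geom_point(alpha = 0.4) +
  labs(title = "Work hours per week vs Age", x = "Age", y = "Work hours per week")
```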
Again, we see no definite pattern here.

Setting Up The Data

Converting the variables ‘Capital gain’ and ‘Capital loss’ into categories.

We divide the variables Capital.gain and Capital.loss into 3 groups, namely ‘None’, ‘Low’ and ‘High’.
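
The exact cut-off values are not stated here; one illustrative way to bin the two variables, with ‘None’ for zero and the median of the non-zero values separating ‘Low’ from ‘High’, is sketched below. The new column names are hypothetical.

```r
# Illustrative binning: 'None' for zero, and 'Low'/'High' split at the median of
# the non-zero values. The actual cut-offs used in the original analysis may differ.
bin_into_groups <- function(x) {
  cut(x,
      breaks = c(-Inf, 0, median(x[x > 0]), Inf),
      labels = c("None", "Low", "High"))
}

# New (hypothetical) column names holding the binned versions
train$Capital.gain.group <- bin_into_groups(train$Capital.gain)
train$Capital.loss.group <- bin_into_groups(train$Capital.loss)
```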

Now we drop the unnecessary variables, viz. ‘ID’, ‘Native country’, ‘Fnlwgt’, ‘Capital gain’ and ‘Capital loss’.
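
A sketch of dropping these columns, assuming the dotted column-name style used above:

```r
# Column names are assumptions based on the data description; adjust as needed
drop_cols <- c("ID", "Native.country", "Fnlwgt", "Capital.gain", "Capital.loss")
train <- train[, !(names(train) %in% drop_cols)]
```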

Train Test Split
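
A sketch of the 7:3 split described earlier, using caret's createDataPartition(); the seed value is arbitrary.

```r
set.seed(123)  # arbitrary seed, purely for reproducibility of the split

# 70 % of the rows go to the training set and the rest to the testing set,
# keeping the class proportions of 'Salary' similar in both parts
split_index <- createDataPartition(train$Salary, p = 0.7, list = FALSE)
training <- train[split_index, ]
testing  <- train[-split_index, ]
```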

We check for missing values.
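
One simple way to do this check:

```r
# Count of missing values in each column; all zeros means no missing data
colSums(is.na(train))
```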

So, there are no missing values. Now we need to train the decision tree classifier.

Training The Decision Tree Classifier Model for Understanding

The caret package provides the train() method for training our data with various algorithms; we just need to pass different parameter values for different algorithms. Before calling train(), we will first use the trainControl() method, which controls the computational nuances of train().
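
A sketch of those two calls for this problem; the resampling scheme (repeated 10-fold cross-validation) and the tuneLength value are assumptions, while split = "information" matches the criterion described below.

```r
# Assumed resampling scheme: repeated 10-fold cross-validation
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Decision tree via rpart, with information gain as the splitting criterion;
# tuneLength (the number of cp values tried) is illustrative
dtree_fit <- train(Salary ~ ., data = training,
                   method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)
```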

So, we have fit a decision tree model with ‘rpart’ as the method and ‘information gain’ as the splitting criterion.
Now we can check the result of train() by printing the fit object. It shows the accuracy metrics for different values of cp, the complexity parameter of our decision tree.

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00157208.

So, in the above table, we see the accuracy parameters.

Visualisation of The Fitted Tree

We want to visualize the fitted tree by using the prp() method
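
A sketch of that call, pulling the final rpart model out of the caret fit object (dtree_fit is the object name assumed in the sketches above):

```r
# Plot the final rpart tree that caret selected
prp(dtree_fit$finalModel)
```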

We can clearly see that this tree is not that good: it has grown deep, with many child nodes further down, so there is a high chance of overfitting. We need to do something else, which we will show later.

Prediction Using This Decision Tree Model

Our model was built with the optimal cp value selected above.

We now want to see the confusion matrix for the predictions made.
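
A sketch of the prediction and the confusion matrix, reusing the object names assumed earlier:

```r
# Predict the Salary class on the held-out testing set
dtree_pred <- predict(dtree_fit, newdata = testing)

# Confusion matrix and accuracy of the decision tree predictions
confusionMatrix(dtree_pred, testing$Salary)
```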

The accuracy is 84 % when information gain is used as the splitting criterion.
Let us now try to fit a different decision tree using the Gini index as the criterion.
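
The same train() call, with the splitting criterion switched to the Gini index (object names again assumed):

```r
# Same setup as before, but splitting on the Gini index
dtree_fit_gini <- train(Salary ~ ., data = training,
                        method = "rpart",
                        parms = list(split = "gini"),
                        trControl = trctrl,
                        tuneLength = 10)
```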

 

So, in the above table, we see the accuracy parameters.

Visualisation of the newly fitted tree
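
Again a sketch, using prp() on the new fit:

```r
# Plot the final tree of the Gini-based fit
prp(dtree_fit_gini$finalModel)
```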

We see that this tree is much better, as it has a more limited number of nodes.

Prediction using this newly fitted Decision Tree

We now look at the confusion matrix for this model.
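
A sketch, mirroring the earlier prediction step:

```r
# Predictions and confusion matrix for the Gini-based tree
dtree_pred_gini <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(dtree_pred_gini, testing$Salary)
```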

So we see that the accuracy is again 84 %, and this tree gives us a slightly better classification result. Now we will proceed to the random forest mechanism.

Training the Random Forest Model

We now fit the model.
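
A sketch of the fit, assuming the randomForest() function with its defaults and importance tracking switched on:

```r
set.seed(123)  # arbitrary seed for reproducibility

# Random forest with the default 500 trees; importance = TRUE stores the
# variable importance measures used in the plots below
rf_fit <- randomForest(Salary ~ ., data = training, importance = TRUE)
rf_fit
```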

So we see that the random forest fit has 500 trees by default in it.

The error rate is 15.91 %, which is fairly high. We need to check the predictions and the model performance statistics too.
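
A sketch of the error-rate plot discussed next:

```r
# Out-of-bag error rate as a function of the number of trees
plot(rf_fit, main = "Error rate vs number of trees")
```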

So, from the plot, we see that the error rate becomes stable after about 400 decision trees; beyond 400 trees there is no significant reduction in the error rate. We also want to plot which variables are more important than others.
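
A sketch of the variable importance plot:

```r
# Variable importance plot (mean decrease in accuracy and in Gini)
varImpPlot(rf_fit, main = "Variable importance")
```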

This gives a detailed view of each variable's impact. For a more precise picture, we look at the numeric table as follows.
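
A sketch of the numeric importance table:

```r
# Numeric variable importance table
importance(rf_fit)
```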

Based on the random forest variable importance, variables could be selected for further predictive modeling or other machine learning techniques.

Here we see that variables like ‘Age’, ‘Relationship.status’, ‘Occupation’, ‘Marital.status’, etc. are more influential than the other variables.

We now go for predicting using this fitted model.

Predicting using Random Forests
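
A sketch of the prediction step, reusing the rf_fit object assumed above:

```r
# Predict the Salary class on the testing set with the random forest
rf_pred <- predict(rf_fit, newdata = testing)
```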

So we predicted the values of ‘Salary’ using the Random Forest. We now need to check how good the prediction was.

We go for the Confusion Matrix.
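
A sketch of the confusion matrix call:

```r
# Confusion matrix and accuracy of the random forest predictions
confusionMatrix(rf_pred, testing$Salary)
```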

We see that the accuracy of the model is 89 %, which is good. So, the model worked well in such a case.

However, we also need to check the model performance, so we look at the Gini-based cumulative lift chart as follows:
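
The original chart-building code is not shown; a minimal hand-rolled sketch of a cumulative gains (Lorenz-style) curve built from the random forest's predicted probabilities could look like this. Taking the second factor level of ‘Salary’ as the positive class is an assumption.

```r
# Probability of the positive class; taking the second factor level as
# "positive" is an assumption about how 'Salary' is coded
pos_class <- levels(testing$Salary)[2]
rf_prob   <- predict(rf_fit, newdata = testing, type = "prob")[, pos_class]

# Sort by descending score and accumulate the share of positives captured
ord       <- order(rf_prob, decreasing = TRUE)
is_pos    <- as.numeric(testing$Salary[ord] == pos_class)
cum_pos   <- cumsum(is_pos) / sum(is_pos)
pop_share <- seq_along(is_pos) / length(is_pos)

# Cumulative gains (Lorenz-style) curve against the diagonal baseline
plot(pop_share, cum_pos, type = "l",
     xlab = "Cumulative share of observations",
     ylab = "Cumulative share of positives",
     main = "Cumulative gains curve")
abline(0, 1, lty = 2)
```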

So the Gini index is about 0.14, which is not that good and the area under the Lorenz curve is too meager, indicating that the Random Forest classifier selected here does not suit the data and gives us inefficient results.

So the model requires further, different procedures to be applied. For further studies, the latest updates, or interview tips on data science and machine learning, subscribe to our emails.
