Random Forest

CART (Classification and Regression Trees)

An Introduction and Motivation for choosing CART:

Two of the most basic ideas behind any algorithm used for predictive analysis are:

  1. prediction on the basis of regression and
  2. prediction on the basis of classification.

Both are equally important concepts in data science; that said, there are several dissimilarities between the two.
In regression, the predicted outcome is a numeric variable, and a continuous one at that.
In a classification task, the predicted outcome is not numeric at all; it represents categorical classes or factors, i.e. the outcome variable takes a limited number of values, which may be binary (dichotomous) or multinomial (more than two classes).
In our analysis we are motivated to work only on the ‘classification’ side of predictive tasks, keeping our focus not on regression trees but only on classification trees, the two halves named by ‘Classification and Regression Trees’.
Now coming to the point of classification trees, we see that a great deal of decision science is involved in such tasks: decisions must be taken about how the outcome variable is classified when we predict.
These are certainly not human-made decisions. As analysts we merely set the parameters of the decision criteria and nothing more; the entire decision-making process during classification is carried out by machine learning algorithms such as ‘Decision Trees’ and ‘Random Forest’.
The name ‘tree’ used here is quite analogous to the tree we are familiar with in ‘Game Theory’. A tree is a combination of ‘Decision Nodes’ and ‘Sub-Nodes’ linked to one another in a logical manner. Based on the result of the criterion at each node, the algorithm decides which rows of the data set satisfy the condition generated at that node and segregates those rows accordingly.
This entire process is self-learnt by the machine (here, computer software). The machine learns from its decisions on each row of the training sample and later applies the same learnt knowledge to the testing sample, giving us an idea of the accuracy of its self-learning.
All these concepts together give us the specific nomenclature ‘Decision Tree Learning’.
By decision trees we mean self-learning decision tree models used by the machine to segregate and classify on its own.
However, sometimes a single decision tree classifier does not suffice for accurate classification. In most such cases we require another machine learning algorithm called Random Forest, which is an ensemble of multiple decision trees taken together to obtain better predictive performance than could be obtained from any of the constituent learners alone, i.e. a single decision tree model here.
So our focus will be primarily on Random Forest Classifier algorithm and its comparison with the single decision tree algorithm.

Objective of our Coding problem and the relevant Methodology:

We need to build a classifier that classifies the “salary” attribute. We can apply various kinds of data pre-processing and transformation (e.g. grouping values of attributes, converting them to binary, etc.), but we will explain why we chose to do so.
We will be splitting the data set into training and testing sets to accurately set the parameters and evaluate the quality of the classifier.
We may use any tool, such as the classifiers in R, Weka, Python, Orange, scikit-learn or other pieces of software. While doing so, though, we need to explain the classifier used and make sure that we are producing valid results.
However, here we will limit ourselves to the most common classifiers used out of the broad umbrella of ‘CART’, viz. Decision Trees and Random Forests.
We will be using ‘R’ as our desired software tool and write out the entire analysis in an R-Markdown Format.

Data Set:

We have a single data set, named ‘Train’, which we divide in a 7:3 ratio into a training set and an in-sample testing set respectively. As usual, we will build the model in R using the training set and test it on the testing set.
The data set ‘Train’ is available here:
[train.csv: https://drive.google.com/file/d/0B7cB2Fb0wrTEMEdiU2FVcXdXZkE/view?usp=sharing]

Data Description:

The data set consists of the following variables:
ID
Age
Employment class
Fnlwgt
Education level
Education years
Marital status
Occupation
Relationship status
Race
Sex
Capital gain
Capital loss
Work hours per week
Native country
Salary
Some variables are categorical while some are numeric.

Coding Demonstration:

We now start the modelling using Decision Trees and Random Forests.

Loading the libraries
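
A minimal sketch of the library calls this analysis relies on; the exact set of packages is an assumption based on the methods used below (caret for training, rpart and rpart.plot for single trees, randomForest for the ensemble, ggplot2 for the plots):

```r
library(caret)         # train(), trainControl(), confusionMatrix()
library(rpart)         # CART decision trees
library(rpart.plot)    # prp() for visualising fitted trees
library(randomForest)  # randomForest() ensemble classifier
library(ggplot2)       # exploratory plots
```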

Getting the data:

The file path on my system is: “C:\UP ANALYTICS\Project\Forests”
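
A sketch of reading the data in, assuming the file is named train.csv and sits in the folder above (the data frame name train is illustrative):

```r
# R accepts forward slashes in Windows paths
setwd("C:/UP ANALYTICS/Project/Forests")

# Keep the categorical columns as factors on the way in
train <- read.csv("train.csv", stringsAsFactors = TRUE)
```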

Reading the structure of the data

Reading the first few rows
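
Both inspection steps in one short sketch:

```r
str(train)   # variable names, types and a preview of the values
head(train)  # first six rows of the data frame
```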

It seems from looking at the initial rows of the Data Frame that the variable ‘ID’ is not useful for any analysis.
The variable ‘Fnlwgt’ is a kind of final sampling weight.

Making some plots to understand the data visually:
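
The figures themselves are not reproduced here; as an illustration, the first plot (Age split by Salary class) could be drawn like this, assuming the column names Age and Salary as read.csv would create them:

```r
# Density of Age for each Salary class
ggplot(train, aes(x = Age, fill = factor(Salary))) +
  geom_density(alpha = 0.5) +
  labs(fill = "Salary")
```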


So, from the graph it is pretty clear that the non-salaried people mostly accumulate in the younger age groups, while the salaried people are concentrated around the middle age groups.


We see that both the salaried and the non-salaried people are concentrated around Fnlwgt values of 150000 to 200000, so there is no stark difference between the two groups.
However, Fnlwgt is just the final sampling weight assigned to each record. As it makes no sense to use it as a predictor, we will drop it.


We see that both the non-salaried and the salaried groups are present across all values of Education years.
This suggests that years of education alone does not have much impact on the split between salaried and non-salaried people.


The highest number of non-salaried people work about 40 hours per week, and the same holds for salaried people.
However, people working fewer than 40 hours per week are mostly non-salaried, because the number of salaried workers working fewer than 40 hours per week is much smaller.


Capital gain is 0 for all non-salaried people.
For salaried people the count is highest at a capital gain of 0 and gradually decreases as capital gain increases.
However, capital gain is heavily skewed, so it is better to bin it.


Capital loss for non-salaried people accumulates most heavily at a value of 0, and the same holds for salaried people.
The occurrence of non-salaried people at a capital loss of zero is, however, much higher than that of salaried people.
Capital loss is also heavily skewed, so it is better to bin it as well.

Non-salaried people greatly outnumber salaried people at all education levels except Doctorate, Masters and Prof-school.

The number of non-salaried people is higher across all marital statuses.

Non-salaried people outnumber salaried people across all professions.

The number of non-salaried people is also greater across all relationship statuses.

Non-salaried people are most numerous among Whites, and their count exceeds that of salaried people across every race.

Interestingly, although non-salaried people are more numerous among both males and females, the disparity is smaller for males than for females.
Now we will make several bivariate plots; a sketch of one of them follows.
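
As an illustration, the Work hours per week versus Age plot described below could be drawn like this (the column name Work.hours.per.week is an assumption about how read.csv renamed the header):

```r
# Scatter plot of weekly work hours against age, coloured by salary class
ggplot(train, aes(x = Age, y = Work.hours.per.week, colour = factor(Salary))) +
  geom_point(alpha = 0.3) +
  labs(colour = "Salary")
```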

We see no definite pattern here.

To see the trend between Work hours per week and Age we draw this scatter plot.
Again, there is no definite pattern here.

To see the trend between Work hours per week and Education years we draw this joint plot. Again, there is no definite pattern here.

Setting up the Data

Converting the variables ‘Capital gain’ and ‘Capital loss’ into categories.

We divide the variables Capital.gain and Capital.loss into three groups each, namely ‘None’, ‘Low’ and ‘High’, as sketched below.
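
A sketch of the binning with cut(); the breakpoints separating ‘Low’ from ‘High’ and the new column names are illustrative, as the original write-up does not state them:

```r
# New grouped versions; zero stays in its own 'None' bin
train$Capital.gain.group <- cut(train$Capital.gain,
                                breaks = c(-Inf, 0, 5000, Inf),
                                labels = c("None", "Low", "High"))
train$Capital.loss.group <- cut(train$Capital.loss,
                                breaks = c(-Inf, 0, 1500, Inf),
                                labels = c("None", "Low", "High"))
```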

Now, we drop the variables which are unnecessary, viz. ‘ID’, ‘Native country’, ‘Fnlwgt’, ‘Capital gain’ and ‘Capital loss’.

We see that most of the variables are already in their required format, but Salary is numeric, so we need to convert it to a factor. A sketch of both steps follows.
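
A sketch of the drop and the conversion (column names are assumptions about how read.csv renamed the headers):

```r
# Drop the unused columns, keeping the binned capital variables
drop_cols <- c("ID", "Native.country", "Fnlwgt", "Capital.gain", "Capital.loss")
train     <- train[, !(names(train) %in% drop_cols)]

# Salary arrives as a numeric code; a factor makes R treat this as classification
train$Salary <- as.factor(train$Salary)
```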

Train Test Split

We check for missing values.

So, there are no missing values.
Now we need to train the decision tree classifier.
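
Before training, here is a minimal sketch of the missing-value check and the 7:3 split, assuming caret's createDataPartition() is used (the seed and object names are illustrative):

```r
sum(is.na(train))   # 0 confirms there are no missing values

set.seed(123)
idx      <- createDataPartition(train$Salary, p = 0.7, list = FALSE)
training <- train[idx, ]
testing  <- train[-idx, ]
```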

Training the Decision Tree classifier Model for the sake of understanding.

The caret package provides the train() method for training our data with various algorithms; we just need to pass different parameter values for different algorithms. Before calling train(), we first use the trainControl() method, which controls the computational nuances of the train() method.
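
A sketch of the information-gain fit; the resampling scheme, seed and tuneLength are assumptions, as the original values are not shown in the text:

```r
# Repeated 10-fold cross-validation controls the resampling inside train()
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(3333)
dtree_info <- train(Salary ~ ., data = training,
                    method     = "rpart",
                    parms      = list(split = "information"),
                    trControl  = trctrl,
                    tuneLength = 10)
dtree_info   # accuracy for each candidate value of cp
```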

So, we have fit a decision tree model with ‘rpart’ as the method, where the splitting criterion was information gain.
Now we can check the result of our train() call by printing the fit object.
It shows us the accuracy metrics for different values of cp, the complexity parameter of our decision tree.

So, in the above table we see the accuracy parameters.

Visualisation of the fitted tree.

We visualise the fitted tree using the prp() method.
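
For example (train() keeps the underlying rpart object in finalModel; the styling arguments are illustrative):

```r
prp(dtree_info$finalModel, box.palette = "Blues", tweak = 1.2)
```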


We can clearly see that this tree is not that good, as it has grown very deep in its later child nodes.
So there is a high chance of overfitting.
We need to do something else, which we will show later.

Prediction using this decision tree model

cp = 0.001886496 is the value with which our model was built.

We now want to see the confusion matrix for the predictions made.
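
A sketch of the prediction step on the held-out testing set:

```r
pred_info <- predict(dtree_info, newdata = testing)
confusionMatrix(pred_info, testing$Salary)   # accuracy and related statistics
```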

The accuracy is 84% when information gain is the splitting criterion.
Let us try to fit a different decision tree using the Gini index as the criterion.
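
The same setup as before, only with the split criterion changed to "gini":

```r
set.seed(3333)
dtree_gini <- train(Salary ~ ., data = training,
                    method     = "rpart",
                    parms      = list(split = "gini"),
                    trControl  = trctrl,
                    tuneLength = 10)
dtree_gini
```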

So, in the above table we see the accuracy parameters.

Visualisation of the newly fitted tree.


We see that this tree is much better, as it has more nodes but still a limited number of them.

Prediction using this newly fitted Decision Tree.

We now look at the confusion matrix for this model.
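
As before, a sketch:

```r
pred_gini <- predict(dtree_gini, newdata = testing)
confusionMatrix(pred_gini, testing$Salary)
```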

So we see that the accuracy is 85%.
This tree therefore gives us a slightly better classification result.
Now we will proceed to the Random Forests Mechanism.

Training the Random Forest Model

We now fit the model.
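
A minimal sketch with the randomForest package's defaults (ntree = 500); importance = TRUE stores the variable-importance measures used later, and the seed is illustrative:

```r
set.seed(3333)
rf_fit <- randomForest(Salary ~ ., data = training, importance = TRUE)
rf_fit   # prints the OOB error rate and the confusion matrix
```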

So we see that the random forest fit contains 500 trees by default.
The out-of-bag (OOB) error rate is 9.88%, which is quite a bit.
We need to check the predictions and the model performance statistics too; a sketch follows.
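
For example:

```r
pred_rf <- predict(rf_fit, newdata = testing)
confusionMatrix(pred_rf, testing$Salary)
```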


So, from the plot we see that the error rate becomes stable once about 400 decision trees have been grown.
Beyond 400 trees there is no significant reduction in the error rate.
We also want to plot which variables are more important than others.
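
A sketch of both plots:

```r
plot(rf_fit)        # OOB error rate as trees are added; flattens near 400
varImpPlot(rf_fit)  # ranks the predictors by importance
```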