Random Forest Algorithm

“Data will talk to you if you are willing to listen to it” – Jim Bergeson

Table of Contents

  • Real-time analogy
  • Properties
  • How does it work?
  • Pseudocode
  • Feature importance
  • Important Hyperparameters
  • Implementing Random Forest in Python
  • Advantages and Disadvantages
  • Applications

Real-Time Analogy

Most of you have probably gone through a job interview at least once. The basic idea behind the random forest algorithm is similar to a company’s interview process. Most companies hold not just one round of interviews but several, such as an aptitude test, a technical interview, and an HR round, to make sure they hire the best candidate.

Assume that the cumulative score across all interview stages is what counts. At each round, the candidate is evaluated independently by a panel, which generally has 2-5 members. Each member interviews you by asking different questions and scores you individually on a set of parameters.

That is where randomness comes into play. It is possible that, out of 5 panel members, 3 want to hire you and 2 decline your application, or that all 5 find you good enough. The result always goes with the majority vote.

Just as the whole interview process is broken down into various rounds, in a random forest algorithm the whole sample matrix is divided among various decision trees. Each decision tree gives its own target result depending on its own attributes, much like the verdict each panel member gives on the candidate at each round of the interview. Among these results, the majority vote is taken as the final decision.

I hope you now have an overview of the random forest algorithm. Grab yourself a cup of coffee and let’s explore it in more detail.

Properties of Random Forest Algorithm 

  • Random forest is a predictive modeling algorithm (not a descriptive one).
  • The random forest can be used for both classification and regression tasks.
  • It works well with default hyper-parameters.
  • It can be used to rank the importance of variables in a regression or classification problem.
  • The forest error rate depends on the correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
  • The forest error rate also depends on the strength of each individual tree. A tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate.
  • It runs efficiently on large datasets.

How does it work?

Random forest is one of the most powerful supervised machine learning algorithms. It is capable of performing both classification and regression. As the name suggests, it builds a forest out of multiple decision trees, each grown with some randomness. To understand random forests better, you first need to be clear on the concept of a decision tree.

Consider a matrix S of training samples, where fA1 is feature A of sample 1, fB1 is feature B of sample 1, and so on.

Now randomly pick samples from matrix S to create subsets, and grow a separate decision tree from each subset.
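The sampling step can be sketched in a few lines of plain Python. This is only an illustration: it assumes each subset is drawn with replacement (a bootstrap sample), which is the usual choice for random forests, and both the toy matrix `S` and the helper name `bootstrap_sample` are made up for this example.

```python
import random

# Toy training matrix S: each row is one sample [fA, fB, fC, class label].
# The values are invented purely for illustration.
S = [
    [5.1, 3.5, 1.4, 0],
    [4.9, 3.0, 1.4, 0],
    [6.2, 2.9, 4.3, 1],
    [5.9, 3.0, 5.1, 2],
    [6.7, 3.1, 4.7, 1],
]


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement (hypothetical helper)."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
subsets = [bootstrap_sample(S, rng) for _ in range(3)]  # one subset per tree

for i, subset in enumerate(subsets, start=1):
    print(f"subset for tree {i}: {subset}")
```

Because rows are drawn with replacement, a subset can contain the same sample more than once and leave other samples out, which is what makes the trees differ from one another.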

We use these decision trees to make a prediction. Each decision tree may give a different target output. For classification, the final prediction is the class chosen by majority voting; for regression, it is the average of the values predicted by the trees.

Suppose the forest has 4 trees. Each tree has to classify the unseen observation as class 1, 2 or 3. Let’s say the first tree classified the new observation as class 2. The second tree classified it as class 1. The third tree classified it as class 3. And finally, the fourth tree classified it as class 1. Now let’s count the votes.

Class 1 received two votes, while classes 2 and 3 received one vote each. Class 1 has the most votes, so the new observation is classified as class 1.
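The vote count for this four-tree example can be reproduced with a short sketch. The function name `majority_vote` is an assumption made for illustration, not part of any library.

```python
from collections import Counter

# Predictions from the four trees in the example: classes 2, 1, 3, 1.
tree_votes = [2, 1, 3, 1]


def majority_vote(votes):
    """Return the class with the most votes, plus the full tally."""
    tally = Counter(votes)
    winner, _count = tally.most_common(1)[0]
    return winner, tally


winner, tally = majority_vote(tree_votes)
print("votes per class:", dict(tally))
print("predicted class:", winner)
```

In this sketch a tie would be resolved in favour of the class that appears first in the list of votes; a real implementation may break ties differently.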

Pseudocode

  1. Randomly select ‘m’ features from the total features, where ‘m’ is smaller than the total number of features.
  2. Among these ‘m’ features, pick the best split for the node, starting with the root node, and repeat this process to grow each of the decision trees.
  3. Predict the outcome for a new observation using each of these decision trees.
  4. Count the votes for each target value predicted by the trees.
  5. The target with the highest number of votes is taken as the final prediction of the random forest algorithm.
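The steps above can be sketched by hand with scikit-learn decision trees. This is a simplified illustration under several assumptions, not how a production forest is written: it uses scikit-learn’s bundled copy of the wine data, draws a bootstrap sample of the rows for each tree, relies on `max_features="sqrt"` to pick the ‘m’ random features at every split, and measures accuracy on the training data only to show that the voting works.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
n_classes = len(np.unique(y))
rng = np.random.default_rng(0)

# Steps 1-2: grow several trees; each one sees a bootstrap sample of the rows
# and considers only a random subset of features when choosing each split.
n_trees = 25
trees = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Step 3: every tree predicts a class for every observation.
all_preds = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_samples)

# Steps 4-5: count the votes per observation and keep the most voted class.
forest_pred = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in all_preds.T]
)

train_accuracy = (forest_pred == y).mean()
print(f"training accuracy of the hand-made forest: {train_accuracy:.3f}")
```

In practice you would use `RandomForestClassifier`, which performs the same steps internally and evaluates far more efficiently.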

Feature Importance

The random forest algorithm is considered one of the most powerful algorithms partly because it can estimate the relative importance of each feature/variable in the dataset. Building a model with only the most important features has the following benefits:

  1. It makes the model simpler to interpret.
  2. It reduces the variance and hence over-fitting.
  3. It reduces the computational cost and time of training.

By looking at the table of feature importances, you can decide which features to drop because they do not contribute enough to the prediction.
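As a rough sketch, such a table can be produced from the `feature_importances_` attribute of a fitted scikit-learn forest. The example assumes scikit-learn’s bundled wine data and 100 trees; both choices are for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:30s} {score:.3f}")
```

The scores are normalized to sum to 1, so each value can be read as a feature’s share of the total importance. Features at the bottom of the table are candidates for removal.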

Important Hyperparameters

n_estimators: number of trees in the forest.

  • Set it as high as your hardware can comfortably handle. More trees generally give more stable predictions, although the gains level off while training time keeps growing.
  • Optional parameter
  • Default value = 100 in scikit-learn 0.22 and later (it was 10 in earlier versions)

criterion: function used to measure the quality of a split.

  • Optional parameter
  • Default = “gini” (for classification)
  • This parameter is tree-specific

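max_features: maximum number of features considered when looking for the best split.

  • Optional parameter
  • In older scikit-learn versions the default was “auto”, which meant sqrt(n_features) for classification and all features for regression. “auto” was removed in scikit-learn 1.3; the default is now “sqrt” for classification and 1.0 (all features) for regression.
  • If max_features = “sqrt”, each split considers only sqrt(n_features) randomly chosen features.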

max_depth: maximum depth or height of the tree.

  • Optional parameter
  • Default = None (nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples)

min_samples_split: minimum number of samples a node must contain before it can be split. If a node has fewer samples than this threshold, it is not split further and becomes a leaf.

  • Optional parameter
  • Default = 2

min_samples_leaf: minimum number of data points allowed in a leaf node.

  • A smaller leaf makes the model more prone to capturing noise in the training data.
  • Optional parameter (Default = 1)

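random_state: seed for the random number generator that controls the bootstrap sampling and the random selection of features. Fixing it makes results reproducible, so that different models can be compared fairly.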

  • Optional parameter
  • Default = None

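bootstrap: whether each tree is built on a bootstrap sample, that is, data points drawn with replacement. If set to False, the whole dataset is used to build each tree.

  • Optional parameter
  • Default = True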
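A minimal sketch of how these hyperparameters are passed to scikit-learn’s `RandomForestClassifier`. The specific values (200 trees, seed 42) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    criterion="gini",      # function that measures the quality of a split
    max_features="sqrt",   # features considered when looking for the best split
    max_depth=None,        # grow each tree until its leaves cannot be split further
    min_samples_split=2,   # samples a node needs before it can be split
    min_samples_leaf=1,    # samples required in every leaf
    bootstrap=True,        # build each tree on a bootstrap sample
    random_state=42,       # fixed seed for reproducible results
)

# get_params() returns the settings the model will use when it is fitted.
params = model.get_params()
print(params["n_estimators"], params["max_features"], params["bootstrap"])
```

Spelling out the values explicitly, even where they match the defaults, keeps the model’s behaviour the same across scikit-learn versions whose defaults differ.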

By now you must be excited enough to get your hands dirty. So hold your breath and let’s start coding from scratch.

Implementing Random Forest in Python

You can use the wine dataset to implement the random forest algorithm. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The dataset therefore has 13 attributes, all of them continuous.

Random Forest Using Python scikit-learn 

Download the Dataset: Wine.csv
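The following is a minimal end-to-end sketch rather than the original notebook. Since the layout of Wine.csv is not shown here, it assumes scikit-learn’s bundled copy of the wine data (`load_wine`), a 70/30 train/test split, and 100 trees; all of these are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data: 13 continuous attributes, 3 wine classes.
X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples for testing, keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train the forest on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```

If you work from the CSV file instead, load it with your preferred tool and split it into the feature matrix and the class column before calling `train_test_split`; the rest of the sketch stays the same.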

Advantages and Disadvantages

Advantages:

  • The random forest algorithm can be used for both classification and regression.
  • Adding more trees does not make the model over-fit; averaging over many trees reduces variance.
  • Accuracy generally improves as more trees are added, although the gains level off after a point.
  • It can also model categorical values.
  • It can handle missing values, although some implementations require them to be imputed first.
  • It can also be used for feature selection: to identify the importance or rank of each feature/variable in the dataset.

Disadvantages:

  • A large number of trees increases the execution time of the algorithm, which can make it unsuitable for real-time applications where run-time is an important factor.
  • It requires more computational resources than a single decision tree.

Applications of Random Forest

The random forest algorithm has a wide range of applications.

  • Banking: to predict whether customers will repay their debt on time.
  • Healthcare management: to identify diseases in patients and to find the right combination of components when validating a medicine.
  • E-commerce: to predict whether a customer will buy a particular item, based on the items they purchased earlier.
  • Stock market: to predict stock behavior and to estimate the expected profit or loss on the stocks purchased.

