Decision Tree In Machine Learning

You might be amazed if I tell you that the decision tree has many analogies in real life and influences a wide area of machine learning. Its many applications include detecting fraudulent financial statements, fault diagnosis, healthcare management, agriculture, pharmacology, manufacturing, and production.

A decision tree is a supervised machine learning algorithm. As the name suggests, it is a tree-like structure with its root node at the top. It is one of the most popular techniques for both classification and regression tasks (hence CART, Classification And Regression Tree). It is mostly used for classification, as it does not work as well for regression problems.

• What is a decision tree?
• When to use it?
• Terminologies related to a decision tree
• Classification tree v/s Regression tree
• Types of a decision tree with examples
• Assumptions
• Pseudo-code of decision tree algorithm
• When to split and to terminate the tree?
• Methods of splitting attribute test condition
• Measures of selecting the best split
• How to avoid Over-fitting or under-fitting?
• Implementing decision trees in R
• Results
• Conclusion
• Applications of a Decision tree
• Quick notes

What is a Decision Tree?

A decision tree is a flowchart-like structure in which each internal node represents a test or condition on an attribute, each branch represents an outcome of the test, and each leaf/terminal node holds a class label. It is considered a non-parametric method, which means it makes no assumptions about the data distribution or the classifier structure.

Let’s look at the following example. The objective is to find out whether a person will cheat or not. Refund, marital status and taxable income are the candidate features/attributes, with Refund being the best attribute. Keeping these attributes in mind, a decision tree model is designed to predict the class label.

When to use it?

If you can interpret the relationship between the target variable and the input variables from the plots, you can go straight to linear regression. But if there is a non-linear or otherwise complex relationship that you cannot see from the data or the plots, a decision tree should be preferred. Only cross-validation, however, can confirm the choice.

Terminologies related to Decision Tree

• Root node: the topmost tree node, which divides into two or more homogeneous sets.
• Decision node: a sub-node that splits further into other sub-nodes.
• Terminal/Leaf node: the lowermost nodes, i.e. nodes with no children, each representing a class label (the decision taken after evaluating all attributes).
• Splitting: dividing a node into two or more sub-nodes. Splitting continues until a fully grown tree is produced, i.e. until the stopping criteria on the class attribute are met. But a fully grown tree is likely to over-fit the data, which leads to poor accuracy on unseen observations. This is where pruning comes into the picture.
• Pruning: the process of reducing the size of the tree by removing nodes that play a minimal role in classifying an instance, without reducing predictive accuracy as measured on a cross-validation set.
• Branch: a sub-section of a decision tree is called a branch.

The path from the root to a leaf node defines a classification rule.

Classification tree v/s Regression tree

The major differences between a classification tree and a regression tree are:

In a classification tree, the idea behind splitting the dataset is to achieve homogeneity. Look at the following figure carefully.

If you look at the data closely, you will find that you can divide it broadly into five regions, as follows:

You could split it into even more regions or sub-nodes; for example, region R4 could be split into two parts. For now, let’s proceed with these five regions only and start by drawing a decision tree.

Hence splitting the tree has resulted in homogeneous sub-nodes.

In a regression tree, the target values do not have classes; we fit a model to the target variable using the independent variables. The data is ordered and split at different candidate points. At each candidate split point, the errors between actual and predicted values are squared and summed to give the ‘sum of squared errors’ (SSE). The split point that yields the lowest SSE is chosen as the root node/split point, and this process is applied recursively. Because the best split point is chosen greedily at each step, this is called greedy splitting.

Note: it is the target variable that decides the type of decision tree to be used. The predictors may be categorical or numerical.

Types of Decision Trees

• Categorical Variable Decision Tree: a decision tree with a categorical target variable is called a categorical variable decision tree.

e.g.: will the student pass the exam? Yes or no.

• Continuous Variable Decision Tree: a decision tree with a continuous target variable is called a continuous variable decision tree.

e.g.: will a customer pay a renewal premium to the bank or not? To predict this, the bank needs to know the customer’s income, which is a significant variable. A decision tree can therefore be built to predict the customer’s income from his occupation and various other variables, and the predicted income in turn indicates whether the customer will pay the renewal premium.

Assumptions

• In the beginning, the whole dataset is considered the root.
• The best attribute is selected as the root node using a statistical approach.
• Records are split recursively to produce homogeneous sub-nodes.
• If the attributes are continuous, they are discretized before the model is built.

Pseudo-code for Decision Tree Algorithm

1. Select the most powerful attribute as the root node.
2. Split the training set into sub-nodes so that the records in each sub-node share the same value of the chosen attribute.
3. Repeat steps 1 and 2 on each sub-node until the stopping criteria on the class attribute are met.
4. If the tree is fully grown, prune it by removing unwanted nodes, making sure prediction accuracy is not affected.
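
Steps 1-3 above can be sketched as a short recursive routine. This is a minimal illustration in Python (rather than the article's R), using the Gini impurity described later as the measure of the "most powerful attribute"; all function and attribute names here are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, attributes):
    """Grow a decision tree following steps 1-3 above.

    rows       -- list of dicts mapping attribute name -> value
    labels     -- one class label per row
    attributes -- attribute names still available for splitting
    Returns a nested dict {attribute: {value: subtree}} or a class label (leaf).
    """
    # Termination: pure node, or no attributes left -> majority label (a leaf).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 1: pick the attribute whose split gives the lowest weighted impurity.
    def weighted_gini(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            total += len(subset) / len(labels) * gini(subset)
        return total

    best = min(attributes, key=weighted_gini)

    # Step 2: split into sub-nodes, one per value of the chosen attribute,
    # so every record in a sub-node shares that attribute value.
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        # Step 3: recurse on each sub-node with the remaining attributes.
        tree[best][value] = grow_tree(sub_rows, sub_labels,
                                      [a for a in attributes if a != best])
    return tree

# Toy demo (hypothetical data): Refund fully determines the label, so it is
# chosen as the root and each branch ends in a pure leaf.
demo = grow_tree(
    [{"Refund": "Yes"}, {"Refund": "Yes"}, {"Refund": "No"}, {"Refund": "No"}],
    ["No Cheat", "No Cheat", "Cheat", "Cheat"],
    ["Refund"],
)
print(demo)
```

Step 4, pruning, is covered in its own section later in the article.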

When to split and to terminate the tree?

How to split the training record?

Define an attribute test condition at each step to divide the tree into homogeneous sub-nodes. Perform this step recursively. To implement this step, the algorithm must provide a method to specify a test condition and to evaluate the goodness of each test condition. The splitting criteria are different for the classification tree and the regression tree. Various algorithms are used to split the nodes into sub-nodes.

Terminate splitting when:

• You have records that all belong to the same class, or records with identical attribute values (homogeneity).
• The user-defined criteria on the class attribute are met.
• A minimum count on the number of training instances assigned to each node is reached: if the count falls below the minimum, the node is made a terminal/leaf node and is not split further. E.g., you can stop splitting once a node holds 100 records or fewer.

Methods of Splitting Attribute Test Condition

• Binary attributes: a test condition that generates only two potential outcomes.

• Nominal attributes: the test condition can have multiple outcomes, depending on the number of distinct values the corresponding attribute can hold.

• Ordinal attributes: the test condition can have binary or multiple outcomes, provided the grouping does not violate the order of the attribute values.

• Continuous attributes: the test condition can be expressed as a binary comparison or as a set of ranges of outcomes.

Measures for selecting the best split

Gini index: it is an impurity measure. The Gini index tells us how good a split is.

• If you select two items at random from a population, the Gini measure asks whether they belong to the same class; the probability that they do is 1 only if the population is pure.
• It performs only binary splits (as in CART).
• For an outcome variable with m classes, the Gini impurity index of a partition is defined as:

Gini = 1 - (p1² + p2² + … + pm²), where pk is the proportion of class k.

• The Gini impurity lies in the range [0, (m-1)/m] for the m-class scenario.
• It lies in the range [0, 0.5] for the 2-class scenario, where m = 2.
• A lower value of the Gini impurity implies higher purity or homogeneity.
• It works with a categorical target variable.

Steps to calculate the Gini index:

Step 1: For each sub-node, calculate (p² + q²), where p and q are the probabilities of success and failure respectively. (This is 1 minus the Gini impurity, so a higher value means a purer node.)

Step 2: Calculate the weighted score for each split.

Step 3: Compare the weighted scores of the candidate splits. The input variable with the higher value is chosen.

e.g.: Suppose you want to group students based on whether they will play cricket or not (the target). The population can be split using Class or Gender, and the question is which one to choose to produce more homogeneous groups. The Gini index value tells us which one to pick.

Gini index Table

The weighted Gini score for gender is greater than that for class, indicating greater purity, so we split on gender.
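
The comparison can be reproduced in a few lines. This is a Python sketch; the counts (2 of 10 girls and 13 of 20 boys play cricket; 6 of 14 students in Class IX and 9 of 16 in Class X play) are taken from the worked reduction-in-variance example later in the article and assumed to describe the same population:

```python
def gini_score(p):
    """CART score p^2 + q^2 for a binary node (q = 1 - p).
    Note: the Gini impurity is 1 - (p^2 + q^2), so a HIGHER score means a purer node."""
    return p ** 2 + (1 - p) ** 2

female, male = gini_score(2 / 10), gini_score(13 / 20)
weighted_gender = (10 / 30) * female + (20 / 30) * male

ix, x = gini_score(6 / 14), gini_score(9 / 16)
weighted_class = (14 / 30) * ix + (16 / 30) * x

print(round(weighted_gender, 2), round(weighted_class, 2))  # 0.59 0.51 -> gender is purer
```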

Entropy: entropy is a measure of the randomness of the elements. Mathematically, for m classes with proportions p1 … pm, it is defined as:

Entropy = -(p1 · log₂ p1 + p2 · log₂ p2 + … + pm · log₂ pm)

• The entropy value lies in the range [0, log₂ m] for the m-class scenario and [0, 1] for the 2-class scenario.
• Entropy is 0 when the sample is completely homogeneous.

• E.g.: let the probabilities of obtaining 1, 2 and 3 on a die be 0.5, 0.25 and 0.25 respectively. Then the entropy is calculated as:

Entropy = -(0.5 · log₂ 0.5) - (0.25 · log₂ 0.25) - (0.25 · log₂ 0.25) = 1.5

(The figure of ≈ 0.45 results if base-10 logarithms are used instead.)
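
A quick check of the dice calculation (a minimal Python sketch; base-2 logarithms are used, consistent with the [0, log₂ m] range above, which gives 1.5 bits; base-10 logarithms would give about 0.45 instead):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5
```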

Information gain: information gain is the decrease in entropy after the dataset is split on an attribute.

Steps to calculate information gain:

Step 1: calculate the overall entropy of the target.

Consider the following dataset.

Step 2: Split the dataset on each candidate attribute. Calculate the entropy of each branch and combine them as a weighted sum, Entropy (T, X).

Step 3: Find gain = Entropy (T) – Entropy (T, X)

Step 4: Choose the attribute with the highest gain i.e. Outlook.
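
Since the weather dataset itself is not reproduced here, the gain computation can be illustrated on a small stand-in sample. This is a Python sketch; the 'outlook'/'windy' attributes and their values are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain = Entropy(T) - Entropy(T, X): overall entropy minus the
    weighted entropy of the branches produced by splitting on X."""
    n = len(labels)
    branch_entropy = sum(
        values.count(v) / n * entropy([l for x, l in zip(values, labels) if x == v])
        for v in set(values)
    )
    return entropy(labels) - branch_entropy

# Toy sample where 'outlook' predicts the label perfectly and 'windy' not at all:
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy   = ["yes", "no", "yes", "no", "yes", "no"]
play    = ["no", "no", "yes", "yes", "yes", "yes"]

print(information_gain(outlook, play) > information_gain(windy, play))  # True
```

Here splitting on `outlook` yields pure branches (zero remaining entropy), so its gain equals the overall entropy of the target, while `windy` gives no gain at all.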

Reduction in variance: reduction in variance is used for continuous target variables, i.e. regression problems. The split with the lowest weighted variance is selected as the criterion for splitting the population. Variance is given by:

Variance = Σ (x - x̄)² / n

where x̄ (x bar) is the mean, x is an actual value, and n is the total number of values.

Steps to calculate variance:

Step 1: Calculate the variance at each node.

Step 2: Calculate the weighted variance for each split.

Step 3: Choose the split with the lower weighted variance.

Example: – Let’s assign 1 for ‘playing cricket’ and 0 for ‘not playing cricket’.

Calculating Variance for Root node:

Mean (root) = (15*1 + 15*0)/30 = 0.5

Variance (root) = (15*(1–0.5)²+15*(0–0.5)²) / 30 = 0.25

Calculating mean and variance for female sub-node:

Mean (female) = (2*1+8*0)/10=0.2

Variance (female) = (2*(1–0.2)²+8*(0–0.2)²) / 10 = 0.16

Calculating mean and variance for male sub-node:

Mean (Male) = (13*1+7*0)/20=0.65

Variance (Male) = (13*(1–0.65)² + 7*(0–0.65)²) / 20 = 0.23

Weighted Variance (Gender) = (10/30)*0.16 + (20/30)*0.23 = 0.21

Calculating mean and variance for Class IX sub-node:

Mean of Class IX node = (6*1+8*0)/14=0.43

Variance = (6*(1–0.43)²+8*(0–0.43)²) / 14= 0.24

Calculating mean and variance for Class X sub-node:

Mean of Class X node = (9*1+7*0)/16=0.56

Variance = (9*(1–0.56)²+7*(0–0.56)²) / 16 = 0.25

Weighted Variance (Class) = (14/30)*0.24 + (16/30)*0.25 = 0.25

The weighted variance for the gender split (0.21) is lower than that for the class split (0.25) and lower than the parent node's variance, so gender is chosen as the splitting criterion.
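
The whole worked example can be verified in a few lines (a minimal Python sketch using the counts above):

```python
def variance(values):
    """Population variance: sum((x - mean)^2) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# 1 = plays cricket, 0 = does not (counts taken from the worked example above)
female = [1] * 2 + [0] * 8    # 2 of 10 girls play
male   = [1] * 13 + [0] * 7   # 13 of 20 boys play
ix     = [1] * 6 + [0] * 8    # Class IX: 6 of 14 play
x      = [1] * 9 + [0] * 7    # Class X: 9 of 16 play

weighted_gender = (10 / 30) * variance(female) + (20 / 30) * variance(male)
weighted_class  = (14 / 30) * variance(ix) + (16 / 30) * variance(x)

print(round(weighted_gender, 3), round(weighted_class, 3))  # 0.205 0.246 -> split on gender
```

(The article's figure of 0.21 for gender comes from rounding the male variance to 0.23 before weighting; the unrounded value is 0.205.)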

How to avoid over-fitting or under-fitting?

Splitting a decision tree produces a fully grown tree, and this process continues until a user-defined criterion is met. But a fully grown tree is likely to over-fit the data, giving poor performance or low accuracy on unseen observations: the tree may work well on the training set yet fail to give good predictions on the test dataset. Sometimes the model may also under-fit the data, i.e. show a high test error. This is when pruning comes into the picture.

Pruning is simply reducing the size of a tree that has grown too large by removing irrelevant nodes so that the misclassification error is reduced. Nodes are removed in such a way that the performance/accuracy of the tree model is not affected. It is of two types: pre-pruning and post-pruning.

Pre-pruning:

• Also called forward pruning or online pruning.
• Pre-pruning a tree means terminating some of its branches prematurely as the tree is being generated, based on some measure.
• These measures can be the Gini index, information gain, etc.

Post-pruning:

• Grow the tree fully first, then remove nodes in a bottom-up fashion.
• If the error is reduced (or unchanged), replace the sub-tree with a leaf node.
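
Reduced-error post-pruning, one common way to do this, can be sketched as follows. This is an illustrative Python sketch on a nested-dict tree representation ({attribute: {value: subtree}}, with leaves as class labels), not an optimized implementation; the validation-set comparison mirrors the rule above:

```python
from collections import Counter

def classify(tree, row):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(row[attr])
    return tree

def prune(tree, rows, labels):
    """Walk the grown tree bottom-up and replace a subtree with a
    majority-class leaf whenever doing so does not increase the error
    on a held-out validation set (rows/labels)."""
    if not isinstance(tree, dict):  # already a leaf
        return tree

    attr, branches = next(iter(tree.items()))
    # Recurse first, so children are pruned bottom-up.
    pruned = {attr: {v: prune(sub,
                              [r for r in rows if r[attr] == v],
                              [l for r, l in zip(rows, labels) if r[attr] == v])
                     for v, sub in branches.items()}}

    if not labels:  # no validation data reaches this node; keep it
        return pruned

    def errors(t):
        return sum(classify(t, r) != l for r, l in zip(rows, labels))

    leaf = Counter(labels).most_common(1)[0][0]
    # If a single leaf does at least as well, replace the subtree with it.
    return leaf if errors(leaf) <= errors(pruned) else pruned
```

For example, a subtree that splits on a noisy attribute collapses to a leaf when the validation labels show the split adds errors, while splits that genuinely separate the classes are kept.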

Implementing decision tree in R

We are going to use the cardiotocographic dataset to implement the decision tree. The problem statement is to determine whether a patient is normal, suspect or pathological.

The response (normal, suspect or pathological) is encoded as 1, 2 and 3 respectively using as.factor.

When we run the model on the validation dataset, we get the output above.

Results:

The figure above shows the misclassification (confusion) tables for the training and validation data. The columns represent the actual values and the rows the predicted values. It shows, for example, that 1298 patients who were actually normal were predicted as normal, while 130 patients who were actually suspect were also predicted as normal; the latter are the misclassified cases.
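
The layout of such a table (rows = predicted class, columns = actual class) can be illustrated with a small stand-in. This is a Python sketch; the labels below are hypothetical and not the actual CTG counts:

```python
def confusion_matrix(predicted, actual, classes):
    """Build a predicted-rows x actual-columns table, matching the layout
    described above (rows = predicted class, columns = actual class)."""
    table = {p: {a: 0 for a in classes} for p in classes}
    for p, a in zip(predicted, actual):
        table[p][a] += 1
    return table

# Hypothetical labels standing in for the CTG output
# (1 = normal, 2 = suspect, 3 = pathological).
actual    = [1, 1, 1, 2, 2, 3]
predicted = [1, 1, 2, 1, 2, 3]

m = confusion_matrix(predicted, actual, classes=[1, 2, 3])
print(m[1][2])  # 1 -- the 'predicted normal, actually suspect' cell (the misclassified count)
```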

Conclusion

The decision tree is a classification and regression tree (CART), mostly used for classification. It starts with the root node and divides into various sub-nodes. The best attribute is selected as the root node, and the tree is split into sub-nodes so that the records within a node share attribute values. The purity of the nodes is measured using criteria like the Gini index, information gain/entropy and reduction in variance. A fully grown tree results in over-fitting of the data, which leads to poor performance when predicting test values, hence pruning is done: sub-nodes that do not really affect the accuracy of the model are trimmed from the decision tree. This reduces the depth of the tree and increases prediction accuracy.

Advantages:

• The data type is not a constraint for this modeling technique, as it can handle both numerical and categorical variables.
• It can be used for both linear and non-linear relationships.
• It can handle missing values and heavily skewed data to an extent, so it requires less data cleaning than other modeling techniques.
• The decision tree is highly versatile.
• It is simple to interpret and visualize.
• There are very few hyperparameters to tune, unlike in some other modeling techniques such as k-NN.

Disadvantages:

• It is prone to over-fitting.
• It gives lower prediction accuracy than many other techniques.
• It is sensitive to outliers.
• Sometimes a strong correlation between candidate variables leads to the selection of input variables that improve the statistical model but whose effect is not the outcome of interest.
• For categorical variables, information gain gives a biased result for attributes with a large number of categories.
• A decision tree has high variance: small variations in the data can result in very different trees.

Applications of the Decision Tree

• Business management: many companies maintain their own databases to improve their customer and product services, and they use decision trees to extract information from these databases.
• Fraud detection: decision trees are used to detect fraudulent financial statements.
• Fault detection and diagnosis: in the engineering domain, decision trees help detect faults in machines, such as a faulty bearing in rotary machines.
• Healthcare management: decision trees are used to make predictions in the healthcare sector; for example, a decision tree model was used to predict the causes of developmental delay in children.
• Energy consumption: they can also be used to estimate the electricity used by an individual.

Quick notes

• A tree too large will over-fit the training data whereas a small tree might fail to capture the important structural information from the sample space.
• To split the tree, define an attribute test condition at each step to divide a node into homogeneous sub-nodes, and repeat this until homogeneity is achieved.