Decision Tree In Machine Learning
You may be surprised to learn that the decision tree has many analogies in real life and influences a wide area of machine learning. Its many applications include detection of fraudulent financial statements, fault diagnosis, healthcare management, agriculture, pharmacology, manufacturing, production, and more.
A decision tree is a supervised machine learning algorithm. As the name suggests, it is a tree-like structure with its root node at the top. It is one of the most popular techniques for classification and regression tree (CART) tasks. It is used mostly for classification, as it does not work as well for regression problems.
Table of contents
 What is a decision tree?
 When to use it?
 Terminologies related to a decision tree
 Classification tree v/s Regression tree
 Types of a decision tree with examples
 Assumptions
 Pseudocode of decision tree algorithm
 When to split and to terminate the tree?
 Methods of splitting attribute test condition
 Measures of selecting the best split
 How to avoid Overfitting or underfitting?
 Implementing decision trees in R
 Results
 Conclusion
 Advantages and disadvantages
 Applications of a Decision tree
 Quick notes
What is a Decision Tree?
A decision tree is a flowchart-like structure in which each internal node represents a test or condition on an attribute, each branch represents an outcome of the test, and each leaf/terminal node holds a class label. It is considered a non-parametric method, which means that it makes no assumptions about the space distribution or the classifier structure.
Let’s look at the following example. Here the objective is to predict whether a person will cheat or not. Refund, marital status, and taxable income are the candidate features/attributes, Refund being the best attribute. Keeping these attributes in mind, a decision tree model is designed to predict the class label.
When to use it?
If you can interpret the relationship between the target variable and the input variables by looking at the plots, you can go straight to linear regression. But if there is a nonlinear or otherwise complex relationship that you cannot visualize from the data or the plots, a decision tree should be preferred. Only cross-validation, however, can confirm the choice.
Terminologies related to Decision Tree
 Root node: the topmost node of the tree, which divides into two or more sets that we try to make homogeneous.
 Decision node: a sub-node that splits further into other sub-nodes.
 Terminal/leaf node: the lowermost nodes, i.e. nodes with no children; each represents a class label (the decision taken after evaluating all attributes).
 Splitting: dividing a node into two or more sub-nodes. Splitting grows the tree fully, until the criteria on the class attribute are met. But a fully grown tree is likely to overfit the data, which leads to poor accuracy on unseen observations. This is where pruning comes into the picture.
 Pruning: the process of reducing the size of the tree by removing nodes that play a minimal role in classifying an instance, without reducing the predictive accuracy as measured on a cross-validation set.
 Branch: a subsection of a decision tree.
The path from the root to a leaf node defines a classification rule.
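As a sketch, the cheat-prediction tree from the example above can be represented as a nested structure, where following one root-to-leaf path evaluates a classification rule. The attribute values and the income test are illustrative, since the original figure is not reproduced here:

```python
# A sketch of the cheat-prediction tree as a nested dict; the branch values
# and the "HighTaxableIncome" test are illustrative assumptions.

tree = {
    "attribute": "Refund",
    "branches": {
        "Yes": {"label": "No"},  # leaf: person does not cheat
        "No": {
            "attribute": "MaritalStatus",
            "branches": {
                "Married": {"label": "No"},
                "Single": {
                    "attribute": "HighTaxableIncome",
                    "branches": {
                        "Yes": {"label": "Yes"},
                        "No": {"label": "No"},
                    },
                },
            },
        },
    },
}

def classify(node, record):
    """Walk one root-to-leaf path: each step tests one attribute,
    and the path taken is a classification rule."""
    while "label" not in node:          # internal node: apply its test
        node = node["branches"][record[node["attribute"]]]
    return node["label"]                # leaf: class label

person = {"Refund": "No", "MaritalStatus": "Single", "HighTaxableIncome": "Yes"}
print(classify(tree, person))  # Yes
```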
Classification tree v/s Regression tree
The two major differences between the classification tree and the regression tree are:
In Classification tree, the notion of splitting the dataset is to achieve homogeneity. Look at the following figure carefully.
If you look at these data closely, you will find that you can divide them broadly into 5 regions as follows:
You can even split them into more regions or subnodes like you can split the region R4 into two parts. For now, let’s proceed with these five regions only. Let’s start by drawing a decision tree.
Hence splitting the tree has resulted in homogeneous subnodes.
In a regression tree, the target values have no classes; we fit a model to the target variable using each of the independent variables. The data is sorted, and candidate split points are evaluated: at each candidate split point, the errors are squared and summed to obtain the sum of squared errors (SSE). The split point that yields the least SSE is chosen as the root node/split point, and this process is applied recursively. Because the best split point is chosen greedily at each step, this is called greedy splitting.
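Greedy SSE splitting can be sketched for a single numeric predictor as follows; the data below is hypothetical:

```python
# Greedy splitting for a regression tree: try the midpoint between every
# pair of adjacent x values and keep the split with the lowest total SSE.

def sse(values):
    """Sum of squared errors of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (split_point, total_sse) minimising SSE over candidate midpoints."""
    pairs = sorted(zip(x, y))
    best_point, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for xv, yv in pairs if xv <= point]
        right = [yv for xv, yv in pairs if xv > point]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_point, best_sse = point, total
    return best_point, best_sse

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.2, 4.8, 20.0, 20.5, 19.5]
print(best_split(x, y))  # splits at 6.5, between the two clusters
```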
Note: it is the target variable that decides which type of decision tree to use. The predictors may be categorical or numerical.
Types of Decision Trees
 Categorical variable decision tree: a decision tree with a categorical target variable is called a categorical variable decision tree.
e.g.: will the student pass the exam? Yes or no.
 Continuous variable decision tree: a decision tree with a continuous target variable is called a continuous variable decision tree.
e.g.: predicting whether a customer will pay a renewal premium to the bank. To predict this, the bank needs to know the customer’s income, which is a significant variable. A decision tree can therefore be built to predict a customer’s income from his occupation and various other variables, which in turn indicates whether the customer will pay the renewal premium.
Assumptions
 In the beginning, the whole dataset is considered the root.
 Best attribute is selected as the root node using some statistical approach.
 Records are split recursively to produce homogeneous subnodes.
 If the attributes are continuous, they are discretized before building the model.
Pseudocode for Decision Tree Algorithm
 Select the most powerful (best) attribute as the root node.
 Split the training set into sub-nodes such that each sub-node holds records with the same attribute value.
 Repeat steps 1 and 2 until the criteria on the class attribute are met.
 If the tree is fully grown, perform pruning, i.e. remove unwanted nodes, in a way that does not affect the prediction accuracy.
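The pseudocode above can be sketched as a recursive builder. Here the "most powerful" attribute is taken to be the one with the lowest weighted Gini impurity (one of the split measures discussed below), records are (attribute dict, label) pairs, and the weather sample is a tiny hypothetical dataset:

```python
# Recursive decision tree builder following the pseudocode: pick the best
# attribute, split, and recurse until the sub-node is homogeneous.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(records, attributes):
    """Return a leaf label, or (attribute, {value: subtree}) for an internal node."""
    labels = [lab for _, lab in records]
    if len(set(labels)) == 1 or not attributes:       # homogeneous, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class label

    def weighted_gini(attr):                          # impurity after splitting on attr
        total = 0.0
        for value in {r[attr] for r, _ in records}:
            subset = [lab for r, lab in records if r[attr] == value]
            total += len(subset) / len(records) * gini(subset)
        return total

    best = min(attributes, key=weighted_gini)         # step 1: most powerful attribute
    rest = [a for a in attributes if a != best]
    return (best, {value: build([(r, lab) for r, lab in records if r[best] == value], rest)
                   for value in {r[best] for r, _ in records}})  # steps 2-3: split, recurse

weather = [({"Outlook": "Sunny", "Windy": "No"}, "No"),
           ({"Outlook": "Sunny", "Windy": "Yes"}, "No"),
           ({"Outlook": "Rain", "Windy": "No"}, "Yes")]
print(build(weather, ["Outlook", "Windy"]))  # splits on Outlook first
```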
When to split and to terminate the tree?
How to split the training record?
Define an attribute test condition at each step to divide the records into more homogeneous sub-nodes, and perform this step recursively. To implement it, the algorithm must provide a method to specify a test condition and to evaluate the goodness of each test condition. The splitting criteria differ between the classification tree and the regression tree, and various algorithms are used to split nodes into sub-nodes.
Terminate splitting when:
 You have records that belong to the same class or when you have identical attribute values (homogeneity).
 Criteria of the class attribute are met as defined by the user.
 Set a minimum count on the number of training instances assigned to each terminal/leaf node: if the count at a node is less than this minimum, the node is taken as a final node. E.g., you can stop splitting once a node holds 100 or fewer records.
Methods of Splitting Attribute Test Condition
 Binary attributes: a test condition that generates only two potential outcomes.
 Nominal attributes: It can have multiple outcomes depending upon the number of distinct values that the corresponding attribute can hold.
 Ordinal Attributes: it can have binary or multiple potential outcomes.
 Continuous attributes: test condition can have either binary outcomes or different ranges of outcomes.
Measures for selecting the best split
Gini index: an impurity measure that tells us how good a split is.
 Suppose you select two items at random from a population and ask whether they belong to the same class. If the population is pure, the probability that they do is 1 (and the impurity is 0).
 It performs only binary splits.
 For an outcome variable with m classes, the Gini impurity index of a partition is defined as:
Gini = 1 – (p_{1}^{2} + p_{2}^{2} + … + p_{m}^{2}), where p_{i} is the proportion of class i in the partition.
 Gini index values lie in the range [0, (m–1)/m] for the m-class scenario.
 They lie in the range [0, 0.5] for the 2-class scenario, where m = 2.
 A lower value of the Gini index implies higher purity/homogeneity.
 It works with a categorical target variable.
Steps to calculate the Gini index:
Step 1: For each sub-node, calculate the Gini impurity 1 – (p^{2} + q^{2}), where p and q are the probabilities of success and failure respectively.
Step 2: Calculate the weighted Gini impurity of each candidate split.
Step 3: Compare the weighted values of the candidate splits. The input variable with the lower weighted impurity, i.e. the purer sub-nodes, is chosen.
e.g.: Suppose you want to group students based on whether they will play cricket or not (the target), and the population can be split using Class and Gender. The question is which one to choose to produce more homogeneous groups; the Gini index values will tell us.
Gini index table
The weighted Gini impurity for the Gender split is lower than that for the Class split, i.e. Gender gives purer sub-nodes. Hence we split on Gender.
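The Gender-vs-Class choice can be checked numerically with the Gini impurity 1 – (p² + q²), using the counts given in the worked variance example later in this article (2 of 10 girls and 13 of 20 boys play cricket; 6 of 14 Class IX and 9 of 16 Class X students play):

```python
# Weighted Gini impurity of the two candidate splits in the cricket example.

def gini(p):
    """Gini impurity of a two-class node, where p is the fraction of one class."""
    return 1 - (p ** 2 + (1 - p) ** 2)

weighted_gender = (10 / 30) * gini(2 / 10) + (20 / 30) * gini(13 / 20)
weighted_class = (14 / 30) * gini(6 / 14) + (16 / 30) * gini(9 / 16)
print(round(weighted_gender, 2), round(weighted_class, 2))  # 0.41 0.49
# Gender gives the purer (lower-impurity) sub-nodes, so it is chosen.
```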
Entropy: entropy is a measure of the randomness of the elements. Mathematically it is defined as:
Entropy = – Σ p_{i} log_{2}(p_{i}), summed over the m classes, where p_{i} is the proportion of class i.
 Entropy values lie in the range [0, log_{2} m] for the m-class scenario and [0, 1] for the 2-class scenario.
 Entropy is 0 when the sample is completely homogeneous.
 E.g.: let the probabilities of obtaining 1, 2 and 3 on a die be 0.5, 0.25 and 0.25 respectively. Then the entropy is calculated as:
Entropy = – (0.5 * log_{2}(0.5)) – (0.25 * log_{2}(0.25)) – (0.25 * log_{2}(0.25)) = 0.5 + 0.5 + 0.5 = 1.5
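This calculation is easy to verify; note that the result depends on the log base (base 2 matches the [0, log₂ m] range quoted above and gives 1.5 here, while base 10 would give about 0.45):

```python
# Entropy of the dice example, using log base 2.
import math

def entropy(probs):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5
```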
Information gain: information gain measures the decrease in entropy after the dataset is split on an attribute.
Steps to calculate information gain:
Step 1: calculate the overall entropy of the target.
Consider the following dataset.
Step 2: Split the dataset for different attributes. Calculate the entropy for each branch. And add them.
Step 3: Find Gain = Entropy(T) – Entropy(T, X), where Entropy(T, X) is the weighted entropy of the branches after splitting T on attribute X.
Step 4: Choose the attribute with the highest gain i.e. Outlook.
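The steps above can be sketched as follows; since the dataset figure is not reproduced here, this uses a small hypothetical sample with an Outlook attribute:

```python
# Information gain: Gain = Entropy(T) - Entropy(T, X), where Entropy(T, X)
# is the weighted entropy of the branches after splitting T on attribute X.
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr):
    labels = [lab for _, lab in records]
    after = 0.0
    for value in {r[attr] for r, _ in records}:            # each branch of the split
        subset = [lab for r, lab in records if r[attr] == value]
        after += len(subset) / len(records) * entropy(subset)  # weighted entropy
    return entropy(labels) - after

data = [({"Outlook": "Sunny"}, "No"), ({"Outlook": "Sunny"}, "No"),
        ({"Outlook": "Overcast"}, "Yes"), ({"Outlook": "Rain"}, "Yes")]
print(info_gain(data, "Outlook"))  # 1.0 -- this split removes all uncertainty
```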
Reduction in variance: reduction in variance is used for continuous target variables, i.e. regression problems. The split with the lowest weighted variance is selected as the criterion for splitting the population. Variance is given by:
Variance = Σ (x – x̄)² / n
where x̄ is the mean, x an actual value and n the total number of values.
Steps to calculate the reduction in variance:
Step 1: Calculate the variance at each node.
Step 2: Calculate the weighted variance of each split.
Step 3: Choose the split with the lowest weighted variance.
Example: – Let’s assign 1 for ‘playing cricket’ and 0 for ‘not playing cricket’.
Calculating Variance for Root node:
Mean (root) = (15*1 + 15*0)/30 = 0.5
Variance (root) = (15*(1–0.5)²+15*(0–0.5)²) / 30 = 0.25
Calculating mean and variance for female subnode:
Mean (female) = (2*1+8*0)/10=0.2
Variance (female) = (2*(1–0.2)²+8*(0–0.2)²) / 10 = 0.16
Calculating mean and variance for male subnode:
Mean (Male) = (13*1+7*0)/20=0.65
Variance (Male) = (13*(1–0.65)² + 7*(0–0.65)²) / 20 = 0.23
Weighted Variance (Gender) = (10/30)*0.16 + (20/30)*0.23 = 0.21
Calculating mean and variance for Class IX subnode:
Mean of Class IX node = (6*1+8*0)/14=0.43
Variance = (6*(1–0.43)²+8*(0–0.43)²) / 14= 0.24
Calculating mean and variance for Class X subnode:
Mean of Class X node = (9*1+7*0)/16=0.56
Variance = (9*(1–0.56)²+7*(0–0.56)²) / 16 = 0.25
Weighted Variance (Class) = (14/30)*0.24 + (16/30)*0.25 = 0.25
The weighted variance for the Gender split (0.21) is lower than that for the Class split (0.25), and also lower than the parent node’s variance, so Gender is chosen as the splitting criterion.
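The arithmetic above is easy to verify. The unrounded weighted variances come out as 0.205 (Gender) and about 0.246 (Class), which round to the 0.21 and 0.25 quoted in the text:

```python
# Verifying the weighted-variance calculation for the cricket example.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

root = [1] * 15 + [0] * 15              # 15 play cricket (1), 15 do not (0)
female = [1] * 2 + [0] * 8
male = [1] * 13 + [0] * 7
weighted_gender = (10 / 30) * variance(female) + (20 / 30) * variance(male)

class_ix = [1] * 6 + [0] * 8
class_x = [1] * 9 + [0] * 7
weighted_class = (14 / 30) * variance(class_ix) + (16 / 30) * variance(class_x)

print(variance(root), round(weighted_gender, 3), round(weighted_class, 3))
# 0.25 0.205 0.246 -> the Gender split reduces variance the most
```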
How to avoid overfitting or underfitting?
Splitting a decision tree yields a fully grown tree, and the process continues until a user-defined criterion is met. But this fully grown tree is likely to overfit the data, giving poor performance or low accuracy on unseen observations. That means the tree might work well on the training set but fail to give good predictions on the test dataset. Sometimes the model might also underfit the data, i.e. give a high test error. This is where pruning comes into the picture.
Pruning is simply reducing the size of a tree that has grown too large by removing irrelevant nodes, so that the misclassification error is reduced. Nodes are removed in such a way that the performance/accuracy of the tree model is not affected. Pruning is of two types: pre-pruning and post-pruning.
Pre-pruning:
 Also called forward pruning or online pruning.
 Pre-pruning a tree means terminating some of its branches prematurely, as the tree is generated, based on some measure.
 These measures can be the Gini index, information gain, etc.
Post-pruning:
 Grow the tree entirely first, then remove nodes in a bottom-up fashion: if the error is reduced, replace the subtree with a leaf node.
Implementing decision tree in R
We are going to use the cardiotocographic dataset to implement the decision tree. The problem statement is to find whether the patient is normal, suspect or pathological.
#Read data file
data <- read.csv("C:/Users/DELL/Downloads/Cardiotocographic.csv")
data$NSPF <- as.factor(data$NSP)

#Partition data into training and validation sets
set.seed(1234)
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[pd == 1, ]
validate <- data[pd == 2, ]

#Decision tree with party
library(party)
tree <- ctree(NSPF ~ LB + AC + FM, data = train,
              controls = ctree_control(mincriterion = 0.9, minsplit = 200))
print(tree)
plot(tree, type = "simple")

#Prediction
predict(tree, validate)
The response states normal, suspect and pathological are encoded as 1, 2 and 3 respectively using as.factor.
When we run the model on validating dataset, we get the above output.
#Decision tree with rpart
library(rpart)
tree1 <- rpart(NSPF ~ LB + AC + FM, train)
library(rpart.plot)
rpart.plot(tree1)
predict(tree1, validate)

#Misclassification error for training data
tab <- table(predict(tree), train$NSPF)
print(tab)
1 - sum(diag(tab)) / sum(tab)

#Misclassification error for validation data
testpred <- predict(tree, newdata = validate)
tab <- table(testpred, validate$NSPF)
print(tab)
1 - sum(diag(tab)) / sum(tab)
Results:
The figures above show the misclassification tables for the training data and the validation data. The columns represent the actual values and the rows the predicted values. For example, 1,298 patients who were actually normal were predicted as normal, while 130 patients who were actually suspect were predicted as normal; the latter are misclassified.
Conclusion
The decision tree is a classification and regression tree (CART), used mostly for classification. It starts with a root node and divides into various sub-nodes, with the best attribute selected as the root. The tree is split into sub-nodes such that each holds records with identical attribute values. The purity of the nodes is measured using criteria such as the Gini index, entropy and information gain. A fully grown tree overfits the data, which leads to poor performance when predicting test values; hence pruning is done. Sub-nodes that do not really affect the accuracy of the model are trimmed from the decision tree, which reduces the depth of the tree and increases the prediction accuracy.
Advantages and Disadvantages of Decision Tree
Advantages
 The data type is not a constraint to this modeling technique as it can handle both numerical and categorical variables.
 It can be used for both linear and nonlinear relationships.
 It can handle missing values and heavily skewed data to an extent, so it requires less data cleaning than other modeling techniques.
 The decision tree is highly versatile.
 It is simple to interpret and visualize.
 The number of hyperparameters to be tuned is small, unlike in some other modeling techniques such as kNN.
Disadvantages
 It is prone to overfitting.
 It often gives lower prediction accuracy than other methods.
 It is sensitive to outliers.
 Sometimes a strong correlation between candidate variables leads to the selection of input variables that improve the model statistically, even though the outcome might not be the one of interest.
 In the case of categorical variables, information gain gives a biased result for attributes with a large number of categories.
 A decision tree has high variance: small variations in the data can result in very different trees.
Applications of the Decision Tree
 Business management: many companies maintain their own databases to improve their customer or product services. Hence they use the decision tree to extract information from databases.
 Fraudulent statement detection: the decision tree is used for detecting fraudulent financial statements.
 Fault detection and diagnosis: in the engineering domain, it helps to detect faults in machines, such as a faulty bearing in rotary machinery.
 Healthcare management: decision trees are used to make predictions in the healthcare sector; for example, a decision tree model was used to predict the cause of developmental delays in children.
 Energy consumption: it can also be used to estimate the electricity used by an individual.
Quick notes
 A tree too large will overfit the training data whereas a small tree might fail to capture the important structural information from the sample space.
 To split the tree, define an attribute test condition at each step to divide the tree into homogeneous subnodes. Repeat this until you achieve homogeneity.