Decision Tree and Its Implementation In R
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, but they are also a popular tool in machine learning. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
Important Terminology Related to Decision Trees
Let’s look at the basic terminology used with decision trees:
- Root Node: It represents the entire population or sample, which is then divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: A sub-node that splits into further sub-nodes is called a decision node.
- Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
- Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
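To make these terms concrete, here is a minimal sketch that fits a small tree and labels each node by its role. It assumes the rpart package and the built-in iris data, neither of which is part of the original walkthrough; the `role` column is a hypothetical helper added only for illustration.

```r
library(rpart)  # assumed tree-fitting package

# Fit a small classification tree on the built-in iris data (illustrative only).
fit <- rpart(Species ~ ., data = iris, method = "class")

# fit$frame has one row per node; the row names are the node numbers.
nodes <- fit$frame[, c("var", "n")]

# Node 1 is the root; rpart marks terminal nodes with var == "<leaf>".
nodes$role <- ifelse(rownames(nodes) == "1", "root",
              ifelse(nodes$var == "<leaf>", "leaf/terminal", "decision"))
print(nodes)
```

The root holds every observation, and the leaf counts add up to the size of the sample, since each observation ends in exactly one terminal node.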
These are the terms commonly used for decision trees. Like every algorithm, decision trees have advantages and disadvantages; the important ones are listed after the tree types below.
Types of Decision Trees
The type of decision tree depends on the type of target variable. There are two types:
Regression Trees: Decision trees with a continuous target variable are termed regression trees. We are all familiar with the idea of linear regression as a way of making quantitative predictions. In simple linear regression, a real-valued dependent variable Y is modeled as a linear function of a real-valued independent variable X plus noise. In multiple regression, we allow multiple independent variables X1, X2, ..., Xp and frame the model in the same way. This works well as long as the variables are independent and each has a strictly additive effect on Y. Even if the variables are not independent, it is possible to incorporate some interactions. However, as the number of variables grows, this gets harder and harder. Moreover, the relationship may no longer be linear. Thus arises the need for regression trees.
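As a minimal sketch of a regression tree, the snippet below fits one with the rpart package on the built-in mtcars data. Both are assumptions chosen for illustration and are not the data used later in this article. It shows that each observation is predicted by the mean response of the leaf it falls into.

```r
library(rpart)  # assumed tree-fitting package

# Regression tree: the target (mpg) is continuous, so method = "anova".
fit_reg <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Fitted values for the training data.
pred <- predict(fit_reg)

# fit_reg$where gives the leaf (row of fit_reg$frame) of each observation,
# so this is the mean response per leaf.
leaf_means <- tapply(mtcars$mpg, fit_reg$where, mean)
print(leaf_means)
```

Unlike a linear model, the fitted values here take only as many distinct values as there are leaves.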
Classification Tree: A classification tree is very similar to the regression tree, except it is used to predict a qualitative response rather than a quantitative one. In the case of the classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
In interpreting the results of a classification tree, we are often interested not only in the class predictions corresponding to a particular terminal node region but also in the class proportion among the training observations that fall in the region.
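The two quantities described above, the predicted class and the class proportions in a region, can both be read from a fitted tree. The following is a minimal sketch, assuming the rpart package and the built-in iris data (illustrative choices, not part of the original analysis).

```r
library(rpart)  # assumed tree-fitting package

fit_cls <- rpart(Species ~ ., data = iris, method = "class")

# Class proportions among the training observations in each observation's leaf.
probs <- predict(fit_cls, type = "prob")

# Predicted class: the most commonly occurring class in that leaf.
cls <- predict(fit_cls, type = "class")

print(head(data.frame(predicted = cls, round(probs, 3))))
```

The predicted class is simply the column with the largest proportion, so the proportions carry strictly more information than the class label alone.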
Advantages of Decision Trees
- Easy to Explain: Decision trees are very easy to understand, even for people from a non-analytical background. Reading and interpreting them requires no statistical knowledge. In fact, they are even easier to interpret than linear regression!
- Useful in Data Exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable.
- Less Data Cleaning Required: Decision trees are fairly robust to outliers and missing values.
- Non-Parametric Method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the distribution of the data or the structure of the classifier.
Disadvantages of Decision Trees
- Overfitting: Overfitting is one of the main practical difficulties for decision tree models. It can be mitigated by setting constraints on model parameters and by pruning.
- Lack of Predictive Accuracy: A single decision tree generally has lower predictive accuracy than some other regression and classification approaches.
- Non-Robust: Decision trees are non-robust, meaning that a small change in the data can cause a large change in the final estimated tree.
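The two remedies for overfitting mentioned above, constraints on model parameters and pruning, can be sketched as follows. This assumes the rpart package and the built-in iris data; the control values are arbitrary illustrations rather than recommended settings.

```r
library(rpart)  # assumed tree-fitting package

set.seed(1)  # rpart's internal cross-validation is random

# Constrain tree growth via rpart.control (values chosen only for illustration).
ctrl <- rpart.control(minsplit = 10, maxdepth = 5, cp = 0.001)
big_tree <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)

# Choose the complexity parameter with the lowest cross-validated error,
# then prune the tree back to that size.
cp_table <- big_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(big_tree, cp = best_cp)

printcp(pruned)
```

Picking the cp value that minimizes `xerror` is one common heuristic; a stricter alternative is to take the smallest tree whose error lies within one standard error of the minimum.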
Decision Tree In R
For this, we will use the Carseats data set (available in the ISLR package), which contains data on sales of child car seats at 400 different stores. It is a data frame with 400 observations on the following 11 variables:
- Sales: Unit sales (in thousands) at each location
- CompPrice: Price charged by the competitor at each location
- Income: Community income level (in thousands of dollars)
- Advertising: Local advertising budget for the company at each location (in thousands of dollars)
- Population: Population size in the region (in thousands)
- Price: Price company charges for car seats at each site
- ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
- Age: Average age of the local population
- Education: Education level at each location
- Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
- US: A factor with levels No and Yes to indicate whether the store is in the US or not
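The code that produced the tree printed below is not shown here, so the following is only a sketch of how such a tree could be fitted. It assumes the ISLR package for the data and the rpart package for the model. The binary target `High` (Yes when Sales exceed 8 thousand units) is also an assumption, suggested by the Yes/No predictions in the output, and `carseats` is a hypothetical name for a local copy of the data.

```r
library(ISLR)   # assumed source of the Carseats data
library(rpart)  # assumed tree-fitting package

carseats <- Carseats  # local copy, so the packaged data stays untouched
str(carseats)         # 400 observations of 11 variables

# Assumption: binary target, "Yes" when Sales exceed 8 (thousand units).
carseats$High <- factor(ifelse(carseats$Sales > 8, "Yes", "No"))

# Exclude Sales from the predictors, since High is derived from it.
fit <- rpart(High ~ . - Sales, data = carseats, method = "class")
print(fit)
```

Whether this reproduces the printed tree exactly depends on the settings the original analysis used, which are not stated.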
## n= 400
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##   1) root 400 164 No (0.59000000 0.41000000)
##     2) ShelveLoc=Bad,Medium 315 98 No (0.68888889 0.31111111)
##       4) Price>=92.5 269 66 No (0.75464684 0.24535316)
##         8) Advertising< 13.5 224 41 No (0.81696429 0.18303571)
##          16) CompPrice< 124.5 96 6 No (0.93750000 0.06250000) *
##          17) CompPrice>=124.5 128 35 No (0.72656250 0.27343750)
##            34) Price>=109.5 107 20 No (0.81308411 0.18691589)
##              68) Price>=126.5 65 6 No (0.90769231 0.09230769) *
##              69) Price< 126.5 42 14 No (0.66666667 0.33333333)
##               138) Age>=49.5 22 2 No (0.90909091 0.09090909) *
##               139) Age< 49.5 20 8 Yes (0.40000000 0.60000000) *
##            35) Price< 109.5 21 6 Yes (0.28571429 0.71428571) *
##         9) Advertising>=13.5 45 20 Yes (0.44444444 0.55555556)
##          18) Age>=54.5 20 5 No (0.75000000 0.25000000) *
##          19) Age< 54.5 25 5 Yes (0.20000000 0.80000000) *
##       5) Price< 92.5 46 14 Yes (0.30434783 0.69565217)
##        10) Income< 57 10 3 No (0.70000000 0.30000000) *
##        11) Income>=57 36 7 Yes (0.19444444 0.80555556) *
##     3) ShelveLoc=Good 85 19 Yes (0.22352941 0.77647059)
##       6) Price>=142.5 12 3 No (0.75000000 0.25000000) *
##       7) Price< 142.5 73 10 Yes (0.13698630 0.86301370) *
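This summary describes every node of the tree: the split condition, the number of observations (n), the number of misclassified observations (loss), the predicted class (yval), and the class probabilities (yprob). Terminal nodes are marked with an asterisk.
Now we will plot the tree using the code below.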
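The original plotting code is not included here, so the following is a minimal sketch using the base plotting methods that ship with rpart. It refits the tree so that it runs on its own, under the same assumptions as before: Carseats from the ISLR package, and a binary target `High` defined as Sales above 8. The names `carseats` and `fit` are illustrative.

```r
library(ISLR)   # assumed source of the Carseats data
library(rpart)  # assumed tree-fitting package

# Assumption: binary target, "Yes" when Sales exceed 8 (thousand units).
carseats <- Carseats
carseats$High <- factor(ifelse(carseats$Sales > 8, "Yes", "No"))
fit <- rpart(High ~ . - Sales, data = carseats, method = "class")

# Draw the branches with uniform vertical spacing and some margin for labels.
plot(fit, uniform = TRUE, margin = 0.1)

# Add split labels; use.n = TRUE also prints class counts at each leaf,
# and pretty = 0 keeps factor level names unabbreviated.
text(fit, use.n = TRUE, pretty = 0, cex = 0.7)
```

Add-on packages such as rpart.plot can produce more polished diagrams, but the base methods above need nothing beyond rpart itself.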