# Decision Tree

#### Introduction

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, and they are also a popular tool in machine learning. In the machine learning setting, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter (differentiator) among the input variables.

# Types of Decision Trees

The type of decision tree depends on the type of target variable. There are two types:

• Regression Trees: Decision trees with a continuous target variable are called regression trees.

We are all familiar with the idea of linear regression as a way of making quantitative predictions. In simple linear regression, a real-valued dependent variable Y is modeled as a linear function of a real-valued independent variable X plus noise. In multiple regression, we allow several independent variables X1, X2, ..., Xp and fit the same kind of model. This works well as long as the variables are independent and each has a strictly additive effect on Y. Even if the variables are not independent, it is possible to incorporate some interactions. However, as the number of variables grows, this gets harder and harder. Moreover, the relationship may no longer be linear. This is where regression trees come in.
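To make the idea concrete, here is a minimal base R sketch of the single step a regression tree repeats: scan candidate cut points on one predictor and keep the one that minimizes the residual sum of squares (RSS) when each side is predicted by its mean. The helper name `best_split` and the toy data are assumptions made up for illustration; they do not come from any package.

```r
# Find the cut point on one numeric predictor x that minimizes the residual
# sum of squares when each side of the cut is predicted by its own mean.
# `best_split` is a hypothetical helper written for illustration only.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate cut points: midpoints between consecutive distinct x values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- which.min(rss)
  list(cut        = cuts[best],
       rss        = rss[best],
       left_mean  = mean(y[x <  cuts[best]]),
       right_mean = mean(y[x >= cuts[best]]))
}

# Toy data with a step-shaped (clearly non-linear) relationship
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(5.1, 4.9, 5.0, 5.2, 9.8, 10.1, 10.0, 9.9)

res <- best_split(x, y)
print(res$cut)         # chosen cut point, between x = 4 and x = 5
print(res$left_mean)   # prediction for observations left of the cut
print(res$right_mean)  # prediction for observations right of the cut
```

A full regression tree would apply the same search over every predictor and then recurse into each resulting region; a straight line fitted to this toy data would miss the jump that a single split captures.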

• Classification Trees: A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. In a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall in that region.
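The prediction rule and the class proportions described above can be sketched in a few lines of base R. The labels below are made-up toy values standing in for the training observations in one terminal node. The Gini index at the end is an addition not mentioned in the text; it is one common way to summarize how mixed a node is.

```r
# Class labels of the training observations that fall in one terminal node
# (toy values, made up for illustration)
node_labels <- factor(c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"))

# Class proportions within the node
props <- prop.table(table(node_labels))

# Predicted class: the most commonly occurring class in the node
predicted <- names(props)[which.max(props)]

# Gini index, a common measure of node impurity (0 means a pure node)
gini <- sum(props * (1 - props))

print(props)      # No = 0.25, Yes = 0.75
print(predicted)  # "Yes"
print(gini)       # 0.375
```

Two nodes can predict the same class while differing a lot in purity, which is why the proportions are worth reporting alongside the prediction.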

# Advantages

1. Easy to explain: Decision trees are easy to understand, even for people from a non-analytical background. Reading and interpreting them requires no statistical knowledge. In fact, they can be even easier to interpret than linear regression!
2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relationships between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable.
3. Less data cleaning required: Decision trees are, to a fair degree, not influenced by outliers and missing values.
4. Non-parametric method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the underlying distribution of the data or the structure of the classifier.

# Disadvantages

1. Overfitting: Overfitting is one of the main practical difficulties with decision tree models. It can be mitigated by setting constraints on model parameters and by pruning.
2. Lack of predictive accuracy: A single decision tree generally does not reach the predictive accuracy of some other regression and classification approaches.
3. Non-Robust: Decision trees are non-robust, meaning that a small change in the data can cause a large change in the final estimated tree.
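The constraints and pruning mentioned under overfitting can be sketched as follows. This assumes the rpart package (the text does not name a package) and uses the built-in iris data purely as a stand-in. The value `cp = 0.05` is an arbitrary illustrative choice, not a recommendation.

```r
library(rpart)  # recommended package shipped with standard R installations

# Grow a deliberately large classification tree by relaxing the constraints:
# cp = 0 removes the complexity penalty, minsplit = 2 allows tiny nodes
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# Prune back with a larger complexity parameter. In practice cp is usually
# chosen from the cross-validated error shown by printcp(full_tree).
pruned_tree <- prune(full_tree, cp = 0.05)

# Compare tree sizes by counting terminal nodes (leaves)
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
cat("Leaves before pruning:", n_leaves(full_tree), "\n")
cat("Leaves after pruning: ", n_leaves(pruned_tree), "\n")
```

The unconstrained tree keeps splitting to fit individual training observations, while the pruned tree keeps only the splits that reduce the error enough to justify the added complexity.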

# Decision Tree in R

For this, we will use the Carseats data set, which contains data on sales of child car seats at 400 different stores. It is a data frame with 400 observations on the following 11 variables:

1. Sales: Unit sales (in thousands) at each location
2. CompPrice: Price charged by competitor at each location
3. Income: Community income level (in thousands of dollars)
4. Advertising: Local advertising budget for company at each location (in thousands of dollars)
5. Population: Population size in region (in thousands)
6. Price: Price company charges for car seats at each site
7. ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
8. Age: Average age of the local population
9. Education: Education level at each location
10. Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
11. US: A factor with levels No and Yes to indicate whether the store is in the US or not
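A quick way to check the variable list above against the actual data is to load the data frame and inspect its structure. The text does not say where the data come from; this sketch assumes the copy distributed in the ISLR package, and it skips the inspection if that package is not installed.

```r
# Assumption: Carseats is taken from the ISLR package
# (install it once with install.packages("ISLR"))
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Carseats", package = "ISLR")
  str(Carseats)         # variable names and types (numeric vs factor)
  print(dim(Carseats))  # number of rows and columns
} else {
  message("Package 'ISLR' is not installed; skipping the inspection.")
}
```

The factor variables (ShelveLoc, Urban, US) should show up as factors in the `str()` output, which matters later because tree-fitting functions treat factors and numeric variables differently when searching for splits.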