Decision Tree

Introduction

A decision tree is a decision support tool that uses
a tree-like graph or model of decisions and their possible
consequences, including chance event outcomes, resource costs, and
utility. It is one way to display an algorithm that only contains
conditional control statements.
Decision trees are commonly used in operations research, specifically
in decision analysis, to help identify the strategy most likely to reach a
goal, but they are also a popular tool in machine learning. In this
technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter/differentiator among the input variables.
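
Viewed as code, a fitted decision tree is just a nest of conditional
control statements. As a purely illustrative sketch (the variable names
and the threshold below are hypothetical, loosely borrowed from the
car-seat example later in this article):

predict_high_sales <- function(Price, ShelveLoc) {
  # Each if/else plays the role of one split in the tree
  if (Price < 97) {
    "Yes"
  } else if (ShelveLoc == "Good") {
    "Yes"
  } else {
    "No"
  }
}

predict_high_sales(Price = 90, ShelveLoc = "Bad")   # returns "Yes"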

Types of Decision Trees

The type of a decision tree depends on the type of target variable we have. There are two types:

  • Regression Trees: Decision trees with a continuous target variable are termed regression trees.

We are all familiar with the idea of linear regression as a way of
making quantitative predictions. In simple linear regression, a
real-valued dependent variable Y is modeled as a linear function of a
real-valued independent variable X plus noise. In multiple regression,
we allow several independent variables X1, X2, . . ., Xp and frame the
model accordingly (the standard forms are shown after this list).
This works well as long as the variables are independent and each
has a strictly additive effect on Y. Even if the variables are not
independent, it is possible to incorporate some amount of interaction,
but this gets tougher and tougher as the number of variables grows.
Moreover, the relationship may no longer be linear. Hence the need
for regression trees.

  • Classification Trees: A classification tree is very similar to a
    regression tree, except that it is used to predict a qualitative
    response rather than a quantitative one. In the case of a classification
    tree, we predict that each observation belongs to the most commonly
    occurring class among the training observations in the region to which
    it belongs. In interpreting the results of a classification tree, we are
    often interested not only in the class prediction corresponding to a
    particular terminal-node region, but also in the class proportions among
    the training observations that fall into that region.
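
For reference, the regression models mentioned above take the standard form

  Y = β0 + β1·X + ε                          (simple linear regression)
  Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε     (multiple regression)

where ε is the noise term. A regression tree abandons this single
additive formula: it splits the predictor space into regions and, for
each region, predicts the mean response of the training observations
that fall into it.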

Advantages

1. Easy to Explain: Decision trees are very easy to
understand, even for people from a non-analytical background. No
statistical knowledge is required to read and interpret them. In fact,
they are even easier to interpret than linear regression!
2. Useful in Data Exploration: A decision tree is one
of the fastest ways to identify the most significant variables and the
relations between two or more variables. With the help of decision
trees, we can create new variables/features that have better power to
predict the target variable.
3. Less Data Cleaning Required: Decision trees are fairly robust to outliers and missing values, so less data cleaning is needed.
4. Non-Parametric Method: A decision tree is
considered a non-parametric method, meaning it makes no assumptions
about the underlying data distribution or the structure of the
classifier.

Disadvantages

1. Overfitting: Overfitting is one of the most common
practical difficulties for decision tree models. It can be addressed
by setting constraints on the model parameters and by pruning, as
sketched below.
2. Lack of predictive accuracy: A single tree generally does not achieve the same level of predictive accuracy as some other regression and classification approaches.
3. Non-Robust: Decision trees are non-robust, meaning that a small change in the data can cause a large change in the final estimated tree.
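
As a minimal sketch of pruning, assuming the tree package and R's
built-in iris data (rpart, with rpart.control() and prune(), is a
common alternative; neither package is prescribed by this article):

library(tree)                                 # assumed package for this sketch

fit <- tree(Species ~ ., data = iris)         # grow an unconstrained tree
cv_fit <- cv.tree(fit, FUN = prune.misclass)  # cross-validated error per size
best <- cv_fit$size[which.min(cv_fit$dev)]    # size with the lowest CV error
pruned <- prune.misclass(fit, best = best)    # cut the tree back to that size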

DECISION TREE IN R:

For this, we will use the Carseats data set (from the ISLR package),
which contains data on sales of child car seats at 400 different stores in the US.
It is a data frame with 400 observations on the following 11 variables:

  1. Sales: Unit sales (in thousands) at each location
  2. CompPrice: Price charged by competitor at each location
  3. Income: Community income level (in thousands of dollars)
  4. Advertising: Local advertising budget for company at each location (in thousands of dollars)
  5. Population: Population size in region (in thousands)
  6. Price: Price company charges for car seats at each site
  7. ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  8. Age: Average age of the local population
  9. Education: Education level at each location
  10. Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
  11. US: A factor with levels No and Yes to indicate whether the store is in the US or not

R-CODE:

library(ISLR)                                     # provides the Carseats data set
attach(Carseats)
high <- ifelse(Carseats$Sales < 8, "No", "Yes")   # flag stores with high unit sales
Car <- cbind(Carseats, high)                      # append the label column
head(Car)                                         # inspect the first few rows
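
The snippet above only constructs the labeled data frame. A plausible
continuation that actually fits and draws the tree, assuming the tree
package (this continuation is a sketch, not part of the original code):

library(tree)                               # assumed; rpart would also work
Car$high <- as.factor(Car$high)             # tree() expects a factor response
fit <- tree(high ~ . - Sales, data = Car)   # exclude Sales, which defines high
summary(fit)                                # variables used, size, error rate
plot(fit)
text(fit, pretty = 0)                       # add split labels to the plot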