The post Decision Tree and Its Implementation In R appeared first on StepUp Analytics.
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, but they are also a popular tool in machine learning. In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.
Let’s look at the basic terminology used with decision trees: root node, splitting, decision node, leaf/terminal node, branch/sub-tree, and pruning.
These are the terms commonly used for decision trees. Every algorithm has advantages and disadvantages, and below are the important factors one should know.
The type of decision tree is based on the type of target variable we have. It can be of two types:
Regression Trees: Decision trees with a continuous target variable are termed regression trees. We are all familiar with the idea of linear regression as a way of making quantitative predictions. In simple linear regression, a real-valued dependent variable Y is modeled as a linear function of a real-valued independent variable X plus noise. In multiple regression, we let there be multiple independent variables X1, X2, . . ., Xp and frame the model. This all goes well as long as the variables are independent and each has a strictly additive effect on Y. Even if the variables are not independent, it is possible to incorporate some amount of interaction. However, with a larger number of variables it gets tougher and tougher, and the relationship may no longer be linear. Thus arises the need for regression trees.
Classification Tree: A classification tree is very similar to the regression tree, except it is used to predict a qualitative response rather than a quantitative one. In the case of the classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
In interpreting the results of a classification tree, we are often interested not only in the class predictions corresponding to a particular terminal node region but also in the class proportion among the training observations that fall in the region.
For this, we will use the Carseats data set, which contains data on sales of child car seats at 400 different stores in the US. It is a data frame with 400 observations on 11 variables.
R-Code
library(ISLR)   # assumed source of the Carseats data set
library(rpart)
attach(Carseats)
high <- ifelse(Carseats$Sales < 8, "No", "Yes")
Car <- cbind(Carseats, high)
tree <- rpart(high ~ . - Sales, Carseats, method = "class")
tree
## n= 400
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 400 164 No (0.59000000 0.41000000)
##    2) ShelveLoc=Bad,Medium 315 98 No (0.68888889 0.31111111)
##      4) Price>=92.5 269 66 No (0.75464684 0.24535316)
##        8) Advertising< 13.5 224 41 No (0.81696429 0.18303571)
##         16) CompPrice< 124.5 96 6 No (0.93750000 0.06250000) *
##         17) CompPrice>=124.5 128 35 No (0.72656250 0.27343750)
##           34) Price>=109.5 107 20 No (0.81308411 0.18691589)
##             68) Price>=126.5 65 6 No (0.90769231 0.09230769) *
##             69) Price< 126.5 42 14 No (0.66666667 0.33333333)
##              138) Age>=49.5 22 2 No (0.90909091 0.09090909) *
##              139) Age< 49.5 20 8 Yes (0.40000000 0.60000000) *
##           35) Price< 109.5 21 6 Yes (0.28571429 0.71428571) *
##        9) Advertising>=13.5 45 20 Yes (0.44444444 0.55555556)
##         18) Age>=54.5 20 5 No (0.75000000 0.25000000) *
##         19) Age< 54.5 25 5 Yes (0.20000000 0.80000000) *
##      5) Price< 92.5 46 14 Yes (0.30434783 0.69565217)
##       10) Income< 57 10 3 No (0.70000000 0.30000000) *
##       11) Income>=57 36 7 Yes (0.19444444 0.80555556) *
##    3) ShelveLoc=Good 85 19 Yes (0.22352941 0.77647059)
##      6) Price>=142.5 12 3 No (0.75000000 0.25000000) *
##      7) Price< 142.5 73 10 Yes (0.13698630 0.86301370) *
The summary gives all the details and descriptions of the parameters.
summary(tree)
Now, we will plot the tree using the code below.
plot(tree)
text(tree)
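The base-graphics plot can be hard to read. As an optional extra (this is our addition, assuming the rpart.plot package is installed), a cleaner plot can be drawn in one call:

```r
# Optional: a more readable tree plot via the rpart.plot package
library(rpart.plot)
# extra = 104 shows the class, the per-class probabilities and the
# percentage of observations at each node
rpart.plot(tree, type = 2, extra = 104)
```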
The post Decision Tree In Machine Learning appeared first on StepUp Analytics.
A decision tree is a supervised machine learning algorithm. As the name suggests, it is a tree-like structure with its root node at the top. It is one of the most popular techniques for classification and regression tree (CART) tasks. Mostly it is used for classification, as it doesn’t work as well for regression problems.
A decision tree is a flowchart-like structure in which each internal node represents a test or a condition on an attribute, each branch represents an outcome of the test and each leaf/terminal node holds a class label. It is considered to be a non-parametric method which means that it makes no assumptions about the space distribution and the classifier structure.
Let’s look at the following example. Here the objective is to find out whether the person will cheat or not. Refund, marital status, and taxable income are the potential features/attributes, Refund being the best attribute. Keeping these attributes in mind, a decision tree model is designed to predict the class label.
If you can interpret the relationship between the target variable and the input variables by looking at the plots, you can go straight for linear regression. But if there exists a non-linear or otherwise complex relationship that you cannot visualize from the data or the plots, a decision tree should be preferred. However, only cross-validation can confirm this.
The path from the root to a leaf node is defined as a classification rule.
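As an illustration (not part of the original article), the rpart.plot package can print these root-to-leaf rules for any fitted rpart tree; the iris model below is just a stand-in example:

```r
library(rpart)
library(rpart.plot)
fit <- rpart(Species ~ ., data = iris, method = "class")
# One row per leaf: the predicted class plus the rule path from the root
rpart.rules(fit)
```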
The two major differences between the classification tree and the regression tree are the type of target variable (categorical versus continuous) and the prediction made at a leaf (the majority class versus the mean of the training observations in that region).
In Classification tree, the notion of splitting the dataset is to achieve homogeneity. Look at the following figure carefully.
If you look at these data closely, you will find that you can divide them broadly into 5 regions as follows:
You can even split them into more regions or sub-nodes like you can split the region R4 into two parts. For now, let’s proceed with these five regions only. Let’s start by drawing a decision tree.
Hence splitting the tree has resulted in homogeneous sub-nodes.
In a regression tree, the target values do not have classes; we fit the model to the target variable using each of the independent variables. The data is lined up and split at different points. At each candidate split point, the residuals on either side are squared and summed to get the ‘sum of squared errors’ or SSE. The split point which yields the least SSE is chosen as the root node/split point. This process is applied recursively. As the best split point is chosen greedily at each step, this is called greedy splitting.
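Greedy splitting on a single predictor can be sketched in a few lines of R (an illustrative helper, not from the article): score each candidate split point by the SSE it leaves behind, and pick the point with the least SSE.

```r
# Find the split point on one predictor x that minimises the
# sum of squared errors of the two resulting sub-nodes
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  sse <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}
```

Repeating this over every predictor, and then recursively inside each sub-node, is exactly the recursive partitioning described above.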
Note: It is the target variable that decides the type of decision tree to be used. The predictor may be categorical or numerical
e.g.: Will the student pass the exam? Yes or no.
e.g.: To predict whether a customer will pay a renewal premium to the bank or not. For this, the bank must know the customer’s income, which is a significant variable. Hence a decision tree can be built to predict the customer’s income from his occupation and various other variables, which in turn helps to say whether the customer will pay the renewal premium or not.
How to split the training record?
Define an attribute test condition at each step to divide the tree into homogeneous sub-nodes. Perform this step recursively. To implement this step, the algorithm must provide a method to specify a test condition and to evaluate the goodness of each test condition. The splitting criteria are different for the classification tree and the regression tree. Various algorithms are used to split the nodes into sub-nodes.
Terminate splitting when all records in a node belong to the same class, or when a user-defined stopping criterion (such as a minimum node size or maximum depth) is met.
Gini Index: It is an impurity measure. The Gini index tells us how good a split is.
Step 1: Calculate (p^{2} + q^{2}) where p and q are the probability of success and failure respectively.
Step 2: Calculate the weighted Gini index for each split.
Step 3: Compare the weighted Gini score of each split. The input variable with the higher value is chosen.
e.g.: Suppose you want to group students based on whether they will play cricket or not (the target), and the population can be split using Class and Gender. Now the question is which one to choose to produce more homogeneous groups. The weighted Gini score will tell us which one to choose between Class and Gender.
The weighted Gini score for gender is greater than that of class giving more purity. Hence we split using gender.
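Using the counts from the variance example later in this article (2 of 10 girls and 13 of 20 boys play cricket; 6 of 14 in Class IX and 9 of 16 in Class X), the weighted Gini scores can be checked with a few lines of R. The helper function is ours, not from the article:

```r
# Gini score (p^2 + q^2) for a node with `success` positives out of `total`
gini_score <- function(success, total) {
  p <- success / total
  p^2 + (1 - p)^2
}
# Weighted Gini score for each candidate split
gini_gender <- (10/30) * gini_score(2, 10) + (20/30) * gini_score(13, 20)
gini_class  <- (14/30) * gini_score(6, 14) + (16/30) * gini_score(9, 16)
gini_gender  # 0.59 -> higher (purer), so split on Gender
gini_class   # ~0.51
```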
Entropy: Entropy is the measure of randomness of the elements. Mathematically it is defined as Entropy = −Σ pᵢ log(pᵢ), where pᵢ is the proportion of elements in class i. For example, with class proportions 0.5, 0.25 and 0.25 (using log base 10):
Entropy = −(0.5 * log(0.5)) − (0.25 * log(0.25)) − (0.25 * log(0.25)) = 0.45
Information gain: information gain is based on a decrease in entropy after a dataset split on an attribute.
Steps to calculate information gain:
Step 1: Calculate the overall entropy of the target.
Consider the following dataset.
Step 2: Split the dataset for different attributes. Calculate the entropy for each branch. And add them.
Step 3: Find gain = Entropy (T) – Entropy (T, X)
Step 4: Choose the attribute with the highest gain i.e. Outlook.
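The steps above can be wrapped into two small helper functions (illustrative, not from the article). The choice of log base only rescales the numbers; it does not change which attribute wins:

```r
# Overall entropy of a target vector (log base 2 here)
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}
# Information gain from splitting target y by attribute x:
# gain = Entropy(T) - weighted sum of branch entropies
info_gain <- function(y, x) {
  w <- table(x) / length(x)                # branch weights
  entropy(y) - sum(w * sapply(split(y, x), entropy))
}
```

Calling `info_gain` for each candidate attribute and taking the maximum implements Step 4.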
Reduction in variance: Reduction in variance is used for continuous target variables, i.e. regression problems. The split with the lowest weighted variance is selected for splitting the population. Variance is given by:
Variance = Σ (x − x̄)² / n
where x̄ is the mean, x is an actual value and n is the total number of values.
Steps to calculate variance:
Step 1: Calculate the variance at each node.
Step 2: Calculate the weighted variance for each split.
Example: – Let’s assign 1 for ‘playing cricket’ and 0 for ‘not playing cricket’.
Calculating Variance for Root node:
Mean (root) = (15*1 + 15*0)/30 = 0.5
Variance (root) = (15*(1–0.5)²+15*(0–0.5)²) / 30 = 0.25
Calculating mean and variance for female sub-node:
Mean (female) = (2*1+8*0)/10=0.2
Variance (female) = (2*(1–0.2)²+8*(0–0.2)²) / 10 = 0.16
Calculating mean and variance for male sub-node:
Mean (Male) = (13*1+7*0)/20=0.65
Variance (Male) = (13*(1–0.65)²+7*(0–0.65)²) / 20 = 0.23
Weighted Variance (Gender) = (10/30)*0.16 + (20/30) *0.23 = 0.21
Calculating mean and variance for Class IX sub-node:
Mean of Class IX node = (6*1+8*0)/14=0.43
Variance = (6*(1–0.43)²+8*(0–0.43)²) / 14= 0.24
Calculating mean and variance for Class X sub-node:
Mean of Class X node = (9*1+7*0)/16=0.56
Variance = (9*(1–0.56)²+7*(0–0.56)²) / 16 = 0.25
Weighted Variance (Class) = (14/30)*0.24 + (16/30) *0.25 = 0.25
The weighted variance for the Gender split (0.21) is lower than that for the Class split (0.25), so Gender is chosen as the splitting criterion.
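The arithmetic above can be reproduced with a short R sketch (the helper function is illustrative, not from the article): encode playing cricket as 1 and not playing as 0, then compare weighted variances.

```r
# Weighted population variance over a list of sub-node vectors
weighted_var <- function(groups) {
  n <- sum(lengths(groups))
  sum(sapply(groups, function(g) (length(g) / n) * mean((g - mean(g))^2)))
}
gender <- list(female = rep(c(1, 0), c(2, 8)),    # 2 of 10 girls play
               male   = rep(c(1, 0), c(13, 7)))   # 13 of 20 boys play
class_ <- list(ix = rep(c(1, 0), c(6, 8)),        # 6 of 14 in Class IX play
               x  = rep(c(1, 0), c(9, 7)))        # 9 of 16 in Class X play
weighted_var(gender)  # ~0.21 -> lower, so split on Gender
weighted_var(class_)  # ~0.25
```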
Splitting a decision tree continues until a fully grown tree is produced or a user-defined criterion is met. But this fully grown tree is likely to over-fit the data, giving poor performance or low accuracy on unseen observations. That means the tree might work well on the training set but fail to give good predictions on the test dataset. Sometimes the model might also under-fit the data, i.e. give a high error even on the training set. This is when pruning comes into the picture.
Pruning is nothing but reducing the size of a tree that has grown too large by removing irrelevant nodes, so that the misclassification error is reduced. You remove nodes in such a way that the performance/accuracy of the tree model is not hurt. It is of two types: pre-pruning and post-pruning.
Pre-pruning: Stop growing the tree early, before it perfectly fits the training data, using a stopping criterion such as a minimum number of observations per node or a significance threshold for a split.
Post-pruning: Grow the tree fully, then evaluate each sub-tree on a validation set.
If the error is reduced, replace the sub-tree with a leaf node.
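With rpart, post-pruning is usually driven by the complexity parameter cp. The sketch below is generic (the iris model is a stand-in, not code from the article): pick the cp value with the lowest cross-validated error and prune to it.

```r
library(rpart)
# Grow a deliberately large tree (small cp), then prune it back
fit <- rpart(Species ~ ., data = iris, method = "class", cp = 0.001)
printcp(fit)   # cross-validated error (xerror) for each cp value
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)  # drop sub-trees that don't reduce CV error
```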
We are going to use the cardiotocographic dataset to implement the decision tree. The problem statement is to find whether the patient is normal, suspect or pathological.
# Read data file
data <- read.csv("C:/Users/DELL/Downloads/Cardiotocographic.csv")
data$NSPF <- as.factor(data$NSP)

# Partition data into training and validation sets
set.seed(1234)
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[pd == 1, ]
validate <- data[pd == 2, ]

# Decision tree with party
library(party)
tree <- ctree(NSPF ~ LB + AC + FM, data = train,
              controls = ctree_control(mincriterion = 0.9, minsplit = 200))
print(tree)
plot(tree, type = "simple")

# Prediction
predict(tree, validate)
The response (normal, suspect or pathological) is encoded as 1, 2 and 3 respectively and converted to a factor using as.factor.
When we run the model on the validation dataset, we get the above output.
# Decision tree with rpart
library(rpart)
tree1 <- rpart(NSPF ~ LB + AC + FM, train)
library(rpart.plot)
rpart.plot(tree1)
predict(tree1, validate)

# Misclassification error for train data
tab <- table(predict(tree), train$NSPF)
print(tab)
1 - sum(diag(tab)) / sum(tab)

# Misclassification error for validate data
testpred <- predict(tree, newdata = validate)
tab <- table(testpred, validate$NSPF)
print(tab)
1 - sum(diag(tab)) / sum(tab)
The above figure shows the misclassification tables for the training and validation data. The columns represent the actual values and the rows the predicted values. For example, 1298 patients who were actually normal were predicted as normal, while 130 patients who were actually suspect were predicted as normal; the latter are the misclassified cases.
The decision tree is a classification and regression tree (CART) method; mostly it is used for classification. It starts with the root node and divides into various sub-nodes, with the best attribute selected at the root. The tree is split into sub-nodes so that each node holds observations with similar attribute values. The purity of the nodes is measured using criteria such as the Gini index, information gain (entropy) and reduction in variance. A fully grown tree over-fits the data, which leads to poor performance when predicting test values; hence pruning is done. Sub-nodes which do not really affect the accuracy of the model are trimmed from the decision tree. This reduces the depth of the tree and improves prediction accuracy.
Advantages: Decision trees are simple to understand and interpret, require little data preparation, handle both numerical and categorical variables, and make no assumptions about the data distribution.
Disadvantages: A fully grown tree easily over-fits the training data, small changes in the data can produce a very different tree, and the greedy splitting procedure does not guarantee a globally optimal tree.
The post Churn Modelling for Mobile Telecommunications appeared first on StepUp Analytics.
Churn is one of the biggest threats to the telecommunications industry. Every telecom operator deploys the models that best suit its needs to avoid the voluntary or involuntary churn of a customer; this is called churn modelling. Below I will take you through the terms frequently used in building this model.
Predicting Churn: Key to a Protective Strategy
Here, we have a sample telecom dataset on which we will run churn modelling using R code.
library(rattle)        # The weather data set and normVarNames().
library(randomForest)  # Impute missing values using na.roughfix().
library(rpart)         # Decision tree.
library(tidyr)         # Tidy the data set.
library(ggplot2)       # Visualize data.
library(dplyr)         # Data preparation and pipes %>%.
library(lubridate)     # Handle dates.
library(corrgram)      # Correlation diagrams.
Loading data directly from the web
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
               skip = 4, colClasses = c("character", "NULL"),
               header = FALSE, sep = ":")[[1]]
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data",
                header = FALSE, col.names = c(nm, "Churn"))
nobs <- nrow(dat)
colnames(dat)

dsname <- "dat"
ds <- get(dsname)
dim(ds)
(vars <- names(ds))
target <- "Churn"
ds$phone.number <- NULL
ds$churn <- (as.numeric(ds$Churn) - 1)
ds$Churn <- NULL
ds$state <- NULL

## Split ds into train and test: 75% of the sample size
smp_size <- floor(0.75 * nrow(ds))

## Set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ds)), size = smp_size)
train <- ds[train_ind, ]
test <- ds[-train_ind, ]
dim(train)
dim(test)

corrgram(train, lower.panel = panel.ellipse, upper.panel = panel.pie)
Fitting a Model
lm.fit <- lm(churn ~ ., data = train)
# Multiple R-squared: 0.1784, Adjusted R-squared: 0.1724
pred.lm.fit <- predict(lm.fit, test)
RMSE.lm.fit <- sqrt(mean((pred.lm.fit - test$churn)^2))
RMSE.lm.fit  # 0.3232695

# Building a simpler model, similar R-squared
lm.fit.step <- lm(churn ~ international.plan + voice.mail.plan +
                    total.day.charge + total.eve.minutes +
                    total.night.charge + total.intl.calls +
                    total.intl.charge + number.customer.service.calls,
                  data = train)
# Multiple R-squared: 0.1767, Adjusted R-squared: 0.174
pred.lm.fit.step <- predict(lm.fit.step, test)
RMSE.lm.fit.step <- sqrt(mean((pred.lm.fit.step - test$churn)^2))
RMSE.lm.fit.step  # 0.3227848 <- simpler, and better RMSE
# Logistic regression using a generalized linear model
glm.step <- glm(churn ~ international.plan + voice.mail.plan +
                  total.day.charge + total.eve.minutes +
                  total.night.charge + total.intl.calls +
                  total.intl.charge + number.customer.service.calls,
                family = binomial, data = train)
pred.glm.step <- predict.glm(glm.step, newdata = test, type = "response")
RMSE.glm.step <- sqrt(mean((pred.glm.step - test$churn)^2))
RMSE.glm.step  # 0.3179586 <- better than the linear model
# Build a decision tree based on the selected variables
rpart.fit.step <- rpart(churn ~ international.plan + voice.mail.plan +
                          total.day.charge + total.eve.minutes +
                          total.night.charge + total.intl.calls +
                          total.intl.charge + number.customer.service.calls,
                        data = train, method = "class")
pred.rpart.step <- predict(rpart.fit.step, test)  # see correction below
RMSE.rpart.step <- sqrt(mean((pred.rpart.step - test$churn)^2))
RMSE.rpart.step  # 0.6742183 <- much worse: these are class probabilities, not labels

# Correction: forgot type = "class"
pred.rpart.step <- as.numeric(predict(rpart.fit.step, test, type = "class")) - 1
RMSE.rpart.step <- sqrt(mean((pred.rpart.step - test$churn)^2))
RMSE.rpart.step  # 0.2423902 <- better than the linear model
sum(pred.rpart.step == test$churn) / nrow(test)  # 0.941247: 94% of tests correctly matched
# Build random forest ensemble
set.seed(415)
rf.fit.step <- randomForest(as.factor(churn) ~ international.plan + voice.mail.plan +
                              total.day.charge + total.eve.minutes +
                              total.night.charge + total.intl.calls +
                              total.intl.charge + number.customer.service.calls,
                            data = train, importance = TRUE, ntree = 2000)
varImpPlot(rf.fit.step)
pred.rf.fit.step <- as.numeric(predict(rf.fit.step, test)) - 1
RMSE.rf.fit.step <- sqrt(mean((pred.rf.fit.step - test$churn)^2))
RMSE.rf.fit.step  # 0.2217221 <- improvement over the linear model, so a non-linear, decision-tree approach is better
sum(pred.rf.fit.step == test$churn) / nrow(test)  # 0.9508393: 95% of tests correctly matched
| Algorithm | RMSE | Comment |
|---|---|---|
| Linear Model | 0.3232695 | |
| Simpler Linear Model | 0.3227848 | simpler, slightly better RMSE |
| Logistic Regression | 0.3179586 | better than the linear model |
| Decision Tree (without type = "class") | 0.6742183 | much worse: class probabilities compared against 0/1 labels |
| Decision Tree (with type = "class") | 0.2423902 | better than the linear model |
| Random Forest | 0.2217221 | improvement over the linear model, so a non-linear, decision-tree approach is better |