The post What Is Classification appeared first on StepUp Analytics.

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

Unlike regression where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications from medicine to marketing.

Let’s say you own a shop and you want to figure out if one of your customers is going to come visit your shop again or not. The answer to that question can only be a ‘Yes’ or ‘No’.

These kinds of problems in Machine Learning are known as classification problems.

Classification problems normally have a categorical output like ‘Yes’ or ‘No’, ‘1’ or ‘0’, ‘True’ or ‘False’. Let’s go through another example:

Say you want to check if on a particular day, a game of cricket is possible or not.

In this case the weather conditions are the predictors (the independent factors) and, based on them, the outcome can be either ‘Play’ or ‘Don’t Play’.

Just like Classification, there are two other types of problems in Machine Learning, and they are:

**Regression and Clustering**

In the image above, we have the list for all the different algorithms or solutions used for each of the problems.

Five algorithms commonly used to solve classification problems are:

- Decision Tree
- Naive Bayes
- Random Forest
- Logistic Regression
- KNN

Based on the kind of problem statement and the data in hand, we decide the kind of classification algorithm to be used.

A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
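The node-and-leaf idea can be sketched in a few lines of Python. Below is a one-level tree (a “decision stump”) on a made-up play-cricket table; the data, the feature names, and the single-split simplification are illustrative assumptions, not from the article:

```python
# One-level decision tree (a "stump"): a single decision node that picks the
# feature/threshold split with the fewest misclassifications, plus two leaves.

def fit_stump(X, y):
    best = None  # (errors, feature, threshold, left-leaf label, right-leaf label)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = max(set(left), key=left.count)   # majority class per leaf
            r_lab = max(set(right), key=right.count)
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

# Toy data: [temperature, humidity] -> play or don't play
X = [[30, 40], [32, 45], [25, 50], [20, 90], [18, 85], [22, 95]]
y = ['Play', 'Play', 'Play', "Don't Play", "Don't Play", "Don't Play"]
stump = fit_stump(X, y)
print(predict_stump(stump, [28, 48]))   # prints Play
```

A real decision tree simply applies this split search recursively inside each leaf until the leaves are pure enough.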

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties independently contribute to the probability that the fruit is an apple, and that is why the method is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

where:

- *P(c|x)* is the posterior probability of the *class* (c, *target*) given the *predictor* (x, *attributes*).
- *P(c)* is the prior probability of the *class*.
- *P(x|c)* is the likelihood, i.e. the probability of the *predictor* given the *class*.
- *P(x)* is the prior probability of the *predictor*.
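Under the naive assumption the likelihood factorizes over features, so the posterior can be computed directly. A minimal Python sketch on a made-up fruit table (the counts and the add-one smoothing are illustrative assumptions):

```python
# Naive Bayes on a toy fruit dataset: each row is (color, shape), the label
# is the fruit. Per the naive assumption, P(x|c) factorizes per feature.
from collections import Counter

data = [(('red', 'round'), 'apple'), (('red', 'round'), 'apple'),
        (('green', 'round'), 'apple'), (('yellow', 'long'), 'banana'),
        (('yellow', 'long'), 'banana'), (('green', 'long'), 'banana')]

labels = [c for _, c in data]
prior = {c: n / len(data) for c, n in Counter(labels).items()}  # P(c)

def likelihood(value, idx, c):
    """P(feature idx = value | class c), with add-one (Laplace) smoothing."""
    in_class = [x for x, ci in data if ci == c]
    hits = sum(1 for x in in_class if x[idx] == value)
    return (hits + 1) / (len(in_class) + 2)

def posterior(x):
    # P(c|x) is proportional to P(c) * product of P(x_i|c); P(x) cancels
    # out when comparing classes, so we just normalize at the end.
    scores = {c: prior[c] * likelihood(x[0], 0, c) * likelihood(x[1], 1, c)
              for c in prior}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior(('red', 'round')))
```

A red, round fruit comes out overwhelmingly ‘apple’, exactly as the worked formula above predicts.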

Random Forest is a supervised learning algorithm. As its name suggests, it creates a forest and makes it somehow random. The “forest” it builds is an ensemble of Decision Trees, most of the time trained with the “bagging” method. The general idea of bagging is that a combination of learning models improves the overall result.

**To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.**

One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees would look like:

Imagine a guy named Andrew who wants to decide which places he should travel to during a one-year vacation trip. He asks people who know him for advice. First, he goes to a friend, who asks Andrew where he has traveled in the past and whether he liked those places or not. Based on the answers, the friend will give Andrew some advice.

This is a typical decision tree algorithm approach. Andrew’s friend created rules to guide the recommendation by using Andrew’s answers.

Afterward, Andrew asks more and more of his friends to advise him, and they again ask him different questions from which they can derive recommendations. Finally, he chooses the places that were recommended to him the most, which is the typical Random Forest algorithm approach.
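The bagging idea above can be sketched as: draw bootstrap samples, fit a small tree to each (here a one-split stump on a randomly chosen feature), and merge the trees by majority vote. The toy data and the stump-sized trees are simplifying assumptions for illustration:

```python
# Bagging sketch: fit several one-split "trees", each on a bootstrap sample
# and a randomly chosen feature, then merge their predictions by majority vote.
import random
random.seed(0)

def fit_stump(X, y, feature):
    best = None  # (errors, threshold, left label, right label)
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        if not left or not right:
            continue
        l = max(set(left), key=left.count)
        r = max(set(right), key=right.count)
        err = sum(v != l for v in left) + sum(v != r for v in right)
        if best is None or err < best[0]:
            best = (err, t, l, r)
    if best is None:                      # degenerate sample: constant feature
        maj = max(set(y), key=y.count)
        return lambda row: maj
    _, t, l, r = best
    return lambda row: l if row[feature] <= t else r

def fit_forest(X, y, n_trees=25):
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        f = random.randrange(len(X[0]))                          # random feature
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], f))
    return trees

def forest_predict(trees, row):
    votes = [tree(row) for tree in trees]
    return max(set(votes), key=votes.count)  # merge the trees: majority vote

X = [[30, 40], [32, 45], [25, 50], [20, 90], [18, 85], [22, 95]]
y = [1, 1, 1, 0, 0, 0]
forest = fit_forest(X, y)
print(forest_predict(forest, [28, 48]))
```

This is Andrew asking many friends: each “friend” (tree) sees a slightly different slice of his history, and the final answer is whatever most friends recommend.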

The k-nearest-neighbours algorithm is a classification algorithm, and it is supervised: it takes a bunch of labelled points and uses them to learn how to label other points. To label a new point, it looks at the labelled points closest to that new point (those are its nearest neighbours), and has those neighbours vote, so whichever label the most of the neighbours have is the label for the new point (the “k” is the number of neighbours it checks).

*k*-Nearest Neighbour is a lazy learning algorithm that stores all instances corresponding to training data points in n-dimensional space. When an unknown discrete data point is received, it analyzes the closest k saved instances (the nearest neighbours) and returns the most common class as the prediction; for real-valued data, it returns the mean of the k nearest neighbours.

The distance-weighted nearest neighbour variant weights the contribution of each of the k neighbours according to its distance from the query point, giving greater weight to the closest neighbours.

Usually, KNN is robust to noisy data since it averages over the k nearest neighbours.
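A minimal sketch of the voting scheme, including the distance-weighted variant, on made-up points:

```python
# k-nearest-neighbours by hand: euclidean distance, then a majority (or
# distance-weighted) vote among the k closest labelled points.
import math
from collections import defaultdict

def knn_predict(X, y, query, k=3, weighted=False):
    dists = sorted((math.dist(row, query), label) for row, label in zip(X, y))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # distance weighting gives closer neighbours a larger say
        votes[label] += 1 / (d + 1e-9) if weighted else 1
    return max(votes, key=votes.get)

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(X, y, [2, 2]))              # closest cluster is A
print(knn_predict(X, y, [7, 7], weighted=True))
```

Note that KNN does no training at all; the work happens at query time, which is why it is called lazy.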

- What is Logistic Regression?
- How it works
- Logistic VS. Linear Regression
- Advantages / Disadvantages
- When to use it
- Implementation in Python

**Logistic regression** is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent the binary/categorical outcome, we use dummy variables.

You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a **logit** function.

**Logistic regression** was developed by statistician **David Cox** in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage.

Like all regression analyses, **logistic regression** is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

**Application of Logistic Regression:** it is used in healthcare, the social sciences, and various machine learning applications for advanced research and analytics.

Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.

These probabilities must then be transformed into binary values in order to actually make a prediction. This is where the logistic function, also called the sigmoid function, comes in. The sigmoid function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 are then transformed into either 0 or 1 using a threshold classifier.

The picture below illustrates the steps that logistic regression goes through to give you your desired output.

Below you can see what the logistic function (sigmoid function) looks like:
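In code, the sigmoid and the threshold classifier amount to:

```python
# The sigmoid squashes any real number into (0, 1); a threshold then turns
# the probability into a hard 0/1 class.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))                    # 0.5, the midpoint of the S-curve
print(classify(2.3), classify(-2.3)) # 1 0
```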

We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. You can maximize the likelihood using different methods like an optimization algorithm.

Newton’s Method is such an algorithm and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton’s Method, you could also use Gradient Descent.
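As a sketch, here is maximum likelihood by plain gradient ascent on a made-up one-feature dataset (the learning rate and iteration count are arbitrary illustrative choices):

```python
# Maximum-likelihood fit of a one-feature logistic regression by gradient
# ascent: repeatedly nudge (w, b) in the direction that increases the
# log-likelihood of the labelled data.
import math

X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]   # feature values
y = [0, 0, 0, 1, 1, 1]               # binary labels

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    grad_w = grad_b = 0.0
    for xi, yi in zip(X, y):
        p = 1 / (1 + math.exp(-(w * xi + b)))   # predicted probability
        grad_w += (yi - p) * xi                 # d(log-likelihood)/dw
        grad_b += (yi - p)                      # d(log-likelihood)/db
    w += lr * grad_w
    b += lr * grad_b

def predict(x):
    return 1 / (1 + math.exp(-(w * x + b)))

print(round(predict(0.8), 3), round(predict(3.8), 3))
```

After fitting, points on the left of the data come out with probability near 0 and points on the right near 1. Newton’s Method would use second-derivative information to take fewer, larger steps toward the same optimum.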

You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, but linear regression gives a continuous outcome. A good example of a continuous outcome would be a model that predicts the value of a house. That value will always be different based on parameters like its size or location. A discrete outcome will always be one thing (you have cancer) or another (you don’t have cancer).

It is a widely used technique because it is very efficient, does not require many computational resources, is highly interpretable, doesn’t require input features to be scaled, needs little hyperparameter tuning, is easy to regularize, and outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable, as well as attributes that are very similar (correlated) to each other. Therefore, feature engineering plays an important role in the performance of logistic (and linear) regression. Another advantage of logistic regression is that it is incredibly easy to implement and very efficient to train. I typically start with a logistic regression model as a benchmark and try more complex algorithms from there.

Because of its simplicity and the fact that it can be implemented relatively easy and quick, Logistic Regression is also a good baseline that you can use to measure the performance of other more complex Algorithms.

A disadvantage is that we can’t solve non-linear problems with logistic regression, since its decision surface is linear. Just take a look at the example below, with two binary features:

It is clearly visible that we can’t draw a line that separates these two classes without a huge error. Using a simple decision tree would be a much better choice.

Logistic Regression is also not one of the most powerful algorithms out there and can be easily outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data. This means that logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, Logistic Regression can only predict a categorical outcome. It is also an Algorithm that is known for its vulnerability to overfitting.

As I already mentioned, logistic regression separates your input into two “regions” by a linear boundary, one for each class. Therefore it is required that your data is linearly separable, like the data points in the image below:

In other words: you should think about using logistic regression when your Y variable takes on only two values (i.e., when you are facing a classification problem). Note that you could also use logistic regression for multiclass classification.

The implementation in Python proceeds through the following steps:

- Importing the essential libraries
- Importing the dataset
- Splitting the dataset into the training set and test set
- Feature scaling
- Fitting logistic regression to the training set
- Predicting the test set results
- Making the confusion matrix

Tip: the confusion matrix shows that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.

- Visualizing the training set results
- Visualizing the test set results
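The listed steps can be sketched end to end. Note that scikit-learn and the synthetic stand-in dataset below are assumptions on my part; the article does not prescribe a library or include its data:

```python
# The steps above, end to end: split, scale, fit, predict, evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Importing the dataset (synthetic stand-in for the original CSV)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Splitting the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)        # reuse the training-set statistics

# Fitting logistic regression to the training set
clf = LogisticRegression().fit(X_train, y_train)

# Predicting the test set results and making the confusion matrix
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

The diagonal of the confusion matrix counts the correct predictions, the off-diagonal entries the incorrect ones, exactly as in the 89-vs-11 tip above.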


The post Classification Techniques On Life Expectancy Data appeared first on StepUp Analytics.

In this article, we will learn about classification techniques. We as humans have been blessed with the concept of classification. We classify everything, from our closets, where all the jeans go on one rack and all the shirts on another meant only for shirts, to the apps on our phones and the files on our computers, where we have separate folders for each kind of file or app.

Now, a more “data-scientific” definition of classification is that it is a form of data analysis that extracts models describing important data classes, or the task of predicting the value of a categorical variable (class or target); basically, finding out which of a set of predefined categories a new observation belongs to. A very common example is email, where we wish to classify certain emails as spam and others as not spam. The machine achieves this by learning from training data whose classes are already known.

Classification algorithms can only be used when we have discrete labels as outputs. A situation like the above example, where emails are classified as spam or not and there are only two possible outcomes, is called binary classification.

Another type is multi-label classification, in which multiple labels may be assigned to one instance. This is mostly used for audio, video, and text classification, and for sentiment classification in sentiment analysis.

Anyway, that was the basics and the sort of prerequisite information required to move forward with this article.

In this article, we will classify countries by continent in the Life Expectancy dataset; Continent is the label and will be used as the class.

This is a very small dataset with 6 columns and 223 rows, one for each country. The columns are Rank, Country, Overall Life, Male Life, Female Life, and Continent.

To perform this classification we will use 5 different classification techniques and algorithms and calculate the precision and accuracy for each of the algorithms and compare them. The 5 classification algorithms are:

- **KNN** — K Nearest Neighbour uses similarity measures such as distance functions to classify new data points after going through training.
- **SVM** — Support Vector Machine is a supervised learning algorithm; it builds a model from the training set that assigns new points to one category or the other. The decision boundary can be linear or non-linear according to the problem.
- **OneR** — One Rule generates one rule for each predictor in the data and then selects the rule with the smallest error as the answer. Even though it seems (and is) a very simple algorithm, since it generates only one rule, it is known to perform better than some of the more complex classification algorithms.
- **RIPPER** — RIPPER is a rule-based learner that builds a set of rules identifying the classes while minimizing the amount of error, where error is the number of training examples misclassified by the rules. It is a direct way of performing rule-based classification.
- **C4.5** — C4.5 is a statistical classifier, as it generates a decision tree. It builds the tree from the training data just as ID3 does, and at each node chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or another. This is, in a way, an indirect method of rule-based classification.
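Before moving to the R analysis, OneR is small enough to sketch in a few lines of Python (the toy weather table is made up for illustration):

```python
# OneR: for each feature, build one rule (feature value -> majority class),
# then keep the single feature whose rule makes the fewest training errors.
from collections import Counter, defaultdict

def one_r(X, y):
    best = None  # (errors, feature index, rule dict)
    for f in range(len(X[0])):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[row[f]].append(label)
        rule = {v: Counter(labels).most_common(1)[0][0]
                for v, labels in groups.items()}
        errors = sum(label != rule[row[f]] for row, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    return best[1], best[2]

# Toy weather data: (outlook, windy) -> play?
X = [('sunny', 'no'), ('sunny', 'yes'), ('rainy', 'no'),
     ('rainy', 'yes'), ('overcast', 'no'), ('overcast', 'yes')]
y = ['Play', 'Play', "Don't", "Don't", 'Play', 'Play']

feature, rule = one_r(X, y)
print(feature, rule)   # outlook (feature 0) gives the lowest-error rule here
```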

Let us now begin with the analysis using R Programming and let us see which classifier performs the best. We will use the following libraries/packages throughout the code: e1071, class, caret, rJava, RWeka.

```r
# loading libraries
library("e1071")
library(class)
library(caret)
library(rJava)
library(RWeka)
```

The first step of data preprocessing will involve the following:

- Importing the data set in R using the *read.csv()* function.
- Performing some visual descriptive analysis by looking at the data set and getting a summary of it using the *str()* and *summary()* functions.
- Converting the class label, Continent, to a categorical variable by factoring it.
- Removing irrelevant columns that won’t be used in the analysis, such as the first column, Rank.

```r
# importing the csv file in R
dataset <- read.csv(file.choose())
# displaying the first rows
head(dataset)
str(dataset)
# dimensions
dim(dataset)
# converting Continent to factor
dataset[c("Continent")] <- lapply(dataset[c("Continent")], factor)
# removing the first (irrelevant) column
dataset <- dataset[, -1]
str(dataset)
summary(dataset)
```

head(), str(), dim() functions

str(), summary() functions after the removal of the first column and the factor conversion

Although the Continent column was already of factor type, we ran the conversion nevertheless. With this view of the data we can get a clear idea of what it looks like; the *head()* function enables that, and the summary functions show some vital descriptive information.

Most importantly, we can see how many countries lie in each continent, which will help us later while checking the accuracy. We can also observe the means of the overall, male, and female life expectancy, which are **72.49**, **70.04**, and **75.02** respectively. Medians, quartiles, min, and max values can also be observed.

For the second part of the data pre-processing we will:

- Divide the dataset into training and test sets in an 80:20 ratio, using sampling to generate a random permutation of training and test elements.
- Save the train and test samples in a list in the output variable.
- View the train and test samples by printing the output variable.

```r
# sampling 80% of the data for training
traindata <- sample(seq_len(nrow(dataset)), size = floor(0.80 * nrow(dataset)))
data_train <- dataset[traindata, ]
data_test <- dataset[-traindata, ]
t_train <- dataset$Continent[traindata]
t_test <- dataset$Continent[-traindata]
output <- list(data_train, data_test, t_train, t_test)
# a view of the divided data (train and test)
print(output)
```

KNN classification will be performed with the help of the preProcess and train methods available in the caret package. The tuneLength argument of the train method is set to 20 based on the fit-model results; it lets caret automatically try a grid of K values and select the best one.

In our case, K is chosen to be 5; accuracy (the largest value) was used to select the optimal model.

```r
# KNN
set.seed(12345)
knn_train_test <- output
let_train <- knn_train_test[[1]]
let_test <- knn_train_test[[2]]
# preprocessing and training
trainX <- let_train[, names(let_train) != "Continent"]
preProcValues <- preProcess(x = trainX, method = c("center", "scale"))
print(preProcValues)
```

```r
# fit model: use caret's train() to find the best k
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Continent ~ ., data = let_train, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)
print(knnFit)
plot(knnFit)
# make predictions
knnPredict <- predict(knnFit, newdata = let_test)
knnPredict
# confusion matrix
confusionMatrix(knnPredict, let_test$Continent)
# accuracy
knnoutput <- mean(knnPredict == let_test$Continent)
knnoutput
```

preprocessing using the caret package and finding the knn fit i.e. value of K

plot depicting the choice of the value of K by using accuracy

prediction, confusion matrix, and accuracy

- First and foremost, we observe that the best value of K has been chosen as 5, based on the highest accuracy (by repeated cross-validation).
- The plot also shows the highest accuracy value of 0.562 at K = 5, with close competition from K = 17 at an accuracy of 0.559.
- The **accuracy on the test set with KNN is 44%.**

The SVM classification function will be deployed with the help of the tune method in the e1071 package. The SVM fit will use a linear kernel, with cost = 1 chosen by the tune method.

```r
# SVM
set.seed(12345)
train_test <- output
let_train <- train_test[[1]]
let_test <- train_test[[2]]
# fit model
svmfit <- svm(Continent ~ ., data = let_train, kernel = "linear", scale = FALSE)
svmfit
```

```r
# tune to check the best performance
tuned <- tune(svm, Continent ~ ., data = let_train, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)
# make predictions
p <- predict(svmfit, let_test, type = "class")
length(let_test$Continent)
table(p, let_test$Continent)
# confusion matrix
confusionMatrix(p, let_test$Continent)
# accuracy
svmoutput <- mean(p == let_test$Continent)
svmoutput
```

fitting the model using svmfit()

tuning to check for the best performance and predicting the classes

creating the confusion matrix and calculating the accuracy

- We have observed an **accuracy of 55% with SVM** on the test set.

The OneR classification function will be deployed with the help of the OneR method in the RWeka package.

```r
# OneR
set.seed(12345)
oner_train_test <- output
let_train <- oner_train_test[[1]]
let_test <- oner_train_test[[2]]
# fitting the model
model <- OneR(Continent ~ ., let_train)
model
# prediction
pred <- predict(model, let_test)
pred
table(pred, let_test$Continent)
summary(model)
# confusion matrix
confusionMatrix(pred, let_test$Continent)
# accuracy
acc <- mean(pred == let_test$Continent)
acc
```

model 1/2

model 2/2

prediction() function, table, and summary of the model

confusion matrix and accuracy

- We observe the modeling of 178 instances, mapped using the training data.
- From the model summary, only 38 of the 178 training instances are correctly classified while 140 are wrongly classified, because Africa has been taken as the one rule.
- This makes the **accuracy on the test set only 20% with the OneR algorithm.**

The RIPPER classification function will be deployed with the help of the JRip method in the RWeka package.

```r
# RIPPER algorithm
set.seed(12345)
ripper_train_test <- output
let_train <- ripper_train_test[[1]]
let_test <- ripper_train_test[[2]]
# fitting the model using Weka's JRip
model1 <- JRip(Continent ~ ., data = let_train)
model1
# prediction
pred1 <- predict(model1, let_test)
pred1
table(pred1, let_test$Continent)
summary(model1)
# confusion matrix
confusionMatrix(pred1, let_test$Continent)
# accuracy
acc <- mean(pred1 == let_test$Continent)
acc
```

modeling, prediction, tabulation of the prediction and summary

confusion matrix and accuracy

- From the model summary, 95 of the 178 training instances are correctly classified while 83 are wrongly classified.
- The confusion matrix clearly shows which continent was classified as what. **48% accuracy** can be observed on the test set using the RIPPER algorithm.

The C4.5 classification function has been deployed with the help of the J48 method in the RWeka package.

```r
# C4.5 algorithm
set.seed(12345)
c45_train_test <- output
let_train <- c45_train_test[[1]]
let_test <- c45_train_test[[2]]
# fit model using Weka's J48
fit <- J48(Continent ~ ., data = let_train)
# summarize the fit
summary(fit)
# make predictions
c45predictions <- predict(fit, let_test)
# tabulate the predictions
tb <- table(c45predictions, let_test$Continent)
# confusion matrix
confusionMatrix(c45predictions, let_test$Continent)
# accuracy
c45output <- mean(c45predictions == let_test$Continent)
c45output
```

summary of the fit, confusion matrix, and accuracy

- By summarizing the fit we can see that 138 of the 178 training instances are correctly classified while 40 are wrongly classified.
- **The accuracy obtained on the test set is 48%** using the C4.5 algorithm.
- The accuracy is very similar to that of the RIPPER algorithm.

Finally, let us list out all the accuracy values from the various classifiers used in this article.

- KNN — 44%
- SVM — 55%
- OneR — 20%
- RIPPER — 48%
- C4.5 — 48%

Clearly, SVM has outperformed all the other classification techniques by a good margin. RIPPER and C4.5 were the closest, both showing 48% accuracy. The OneR algorithm performed the worst, with only 20% accuracy.

