# Classification Techniques On Life Expectancy Data

**A Taxonomy of Continents**

In this article, we will learn about classification techniques. We as humans have been blessed with the concept of classification. We classify everything from our closets, where all the jeans go on one rack and all the shirts on another meant only for shirts, to the apps on our phones and the files on our computers, where each kind of file or app gets its own folder.

Now a more “data scientific” definition of classification is that it is a form of data analysis that extracts models describing important data classes, or the task of predicting the value of a categorical variable (the class or target). Basically, it means finding out which of a set of predefined categories a new observation belongs to. A very common example is email, where we wish to classify certain messages as spam and the rest as not spam. The machine achieves this by learning from training data whose classes are already known.

Classification algorithms can only be used when we have discrete labels as outputs. A situation like the email example above, where there are only two possible outcomes, is called binary classification.
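To make the spam example concrete, here is a minimal sketch of what a binary classification target looks like in R: a two-level factor attached to some predictor columns. The column names and values are hypothetical, not from any real spam dataset.

```r
# Toy email data: two hypothetical predictors and a two-level factor label.
emails <- data.frame(
  n_links   = c(0, 7, 1, 12),
  has_offer = c(FALSE, TRUE, FALSE, TRUE),
  label     = factor(c("not_spam", "spam", "not_spam", "spam"))
)

# A binary classification problem is simply one whose target has two levels.
levels(emails$label)
nlevels(emails$label)  # 2
```

A multi-class problem such as the continent classification below looks identical, except the target factor has more than two levels.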

Another type is multi-label classification, in which multiple labels may be assigned to a single instance. This is mostly used for audio and video classification, text classification, sentiment classification in sentiment analysis, etc.

That covers the basics and the prerequisite information needed to move forward with this article.

In this article, we will classify countries by continent: the Continent column of the Life Expectancy dataset will serve as the class label.

This is a very small dataset with 6 columns and 223 rows, one for each country. The columns are Rank, Country, Overall Life, Male Life, Female Life, and Continent.

To perform this classification we will use 5 different classification algorithms, calculate the accuracy of each, and compare them. The 5 classification algorithms are:

- **KNN** — the K Nearest Neighbours algorithm uses similarity measures, such as distance functions, to classify new data points after training.
- **SVM** — Support Vector Machine is a supervised learning algorithm that builds a model from the training set and assigns new points to one category or the other. The decision boundary can be linear or non-linear depending on the problem.
- **OneR** — the One Rule algorithm generates one rule for each predictor in the data and then selects the rule with the smallest error as its answer. Although it generates only a single rule and is therefore very simple, it is known to perform better than some far more complex classifiers.
- **RIPPER** — RIPPER is a rule-based learner that builds a set of rules identifying the classes while minimizing the error, defined as the number of training examples misclassified by the rules. It is a direct method of rule-based classification.
- **C4.5** — C4.5 is a statistical classifier that generates a decision tree. It builds the tree from the training data just as ID3 does, and at each node it chooses the attribute that most effectively splits the samples into subsets enriched in one class or the other. It can be seen as an indirect method of rule-based classification.
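Of the five, OneR is the easiest to demystify with a few lines of code. The sketch below hand-rolls the idea on a toy weather-style data frame (hypothetical columns, not the life-expectancy data, and not RWeka's `OneR()` implementation used later): for each predictor, map every value to its majority class, count the training errors that rule makes, and keep the predictor with the fewest errors.

```r
# Toy data: two candidate predictors and a class column.
toy <- data.frame(
  outlook = c("sunny", "sunny", "rainy", "rainy", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "yes", "no", "no", "yes", "yes")
)

# For one predictor, build the rule "each value -> its majority class"
# and count how many training rows that rule misclassifies.
one_rule <- function(data, predictor, class) {
  tab <- table(data[[predictor]], data[[class]])
  majority <- colnames(tab)[apply(tab, 1, which.max)]
  errors <- sum(tab) - sum(apply(tab, 1, max))
  list(predictor = predictor,
       rule = setNames(majority, rownames(tab)),
       errors = errors)
}

# Score every predictor and keep the rule with the smallest error.
rules <- lapply(c("outlook", "windy"), function(p) one_rule(toy, p, "play"))
best <- rules[[which.min(sapply(rules, `[[`, "errors"))]]
best$rule  # the single rule OneR would use
```

Here "outlook" wins with one training error, so every prediction is made from that lone rule; this is also why OneR can collapse badly on skewed data, as we will see when it picks Africa as its one rule below.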

Let us now begin with the analysis using R Programming and let us see which classifier performs the best. We will use the following libraries/packages throughout the code: e1071, class, caret, rJava, RWeka.

```r
# loading libraries
library(e1071)
library(class)
library(caret)
library(rJava)
library(RWeka)
```

**Data Preprocessing**

The first step of data preprocessing will involve the following:

- Importing the data set into R using the *read.csv()* function.
- Performing some visual descriptive analysis by looking at the data set and getting a summary of it using the *summary()* and *str()* functions.
- Converting the class label, Continent, into a categorical variable by factoring it.
- Removing irrelevant columns that won't be used in the analysis, such as the first column, Rank.

```r
# importing the csv file into R
dataset <- read.csv(file.choose())

# displaying the first few rows
head(dataset)
str(dataset)

# dimensions
dim(dataset)

# converting Continent to a factor
dataset[c("Continent")] <- lapply(dataset[c("Continent")], factor)

# removing the first (irrelevant) column, Rank
dataset <- dataset[, -1]
str(dataset)
summary(dataset)
```

**Output**

head(), str(), dim() functions

str(), summary() functions after the removal of the first column and the factor conversion

**Observation**

Although the Continent column was already of factor type, we ran the conversion anyway. With this view of the data, enabled by the *head()* function, we get a clear idea of how the data looks, and the summary functions show some vital descriptive information.

Most importantly, we can see how many countries lie in each continent, which will help us later when checking the accuracy. We can also observe the mean overall, male, and female life expectancies, which are **72.49**, **70.04**, and **75.02** respectively. The medians, quartiles, and min and max values can also be observed.

For the second part of the data pre-processing we will:

- Dividing the dataset into training and test sets in an 80:20 ratio, using sampling to generate a random permutation of training and test indices.
- Saving the train and test samples in a list in the output variable.
- Viewing the train and test samples by printing the output variable.

```r
# sampling 80% of the rows for training
traindata <- sample(seq_len(nrow(dataset)), size = floor(0.80 * nrow(dataset)))
data_train <- dataset[traindata, ]
data_test <- dataset[-traindata, ]
t_train <- dataset$Continent[traindata]
t_test <- dataset$Continent[-traindata]

output <- list(data_train, data_test, t_train, t_test)

# a view of the divided data (train and test)
print(output)
```

**KNN — K Nearest Neighbor Algorithm**

KNN classification will be performed with the help of the preProcess and train methods available in the caret package. The tuneLength argument of train will be set to 20; based on the fitted model results, this lets caret automatically select the best value of K.

In our case, K is chosen to be 5. Accuracy (the largest value) was used to select the optimal model.

```r
# KNN
# setting the seed for reproducibility
set.seed(12345)
knn_train_test <- output
let_train <- knn_train_test[[1]]
let_test <- knn_train_test[[2]]

# preprocessing: center and scale the predictors
trainX <- let_train[, names(let_train) != "Continent"]
preProcValues <- preProcess(x = trainX, method = c("center", "scale"))
print(preProcValues)
```

```r
# fit model: using caret's train() to find the best k
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
knnFit <- train(Continent ~ ., data = let_train, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)
print(knnFit)
plot(knnFit)

# make predictions
knnPredict <- predict(knnFit, newdata = let_test)
knnPredict

# confusion matrix
confusionMatrix(knnPredict, let_test$Continent)

# accuracy
knnoutput <- mean(knnPredict == let_test$Continent)
knnoutput
```

**Output**

preprocessing using the caret package and finding the knn fit i.e. value of K

plot depicting the choice of the value of K by using accuracy

prediction, confusion matrix, and accuracy

**Observation**

- First and foremost we observe that the best value of K has been chosen to be 5 by looking at the highest accuracy (by repeated cross-validation).
- The plot also shows the highest accuracy value of 0.562 at K=5 and a very close competition is given by K=17 at the accuracy of 0.559.
- The **accuracy by KNN is 44%**.

**SVM — Support Vector Machines**

SVM classification will be performed using the svm and tune functions from the e1071 package. The fit uses a linear kernel, and the tune method is used to check a range of cost values, from which a cost of 1 is chosen.

```r
# SVM
# setting the seed
set.seed(12345)
train_test <- output
let_train <- train_test[[1]]
let_test <- train_test[[2]]

# fit model
svmfit <- svm(Continent ~ ., data = let_train, kernel = "linear", scale = FALSE)
svmfit
```

```r
# tune to check for the best cost
tuned <- tune(svm, Continent ~ ., data = let_train, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)

# make predictions
p <- predict(svmfit, let_test, type = "class")
length(let_test$Continent)
table(p, let_test$Continent)

# analyse results: confusion matrix
confusionMatrix(p, let_test$Continent)

# accuracy
svmoutput <- mean(p == let_test$Continent)
svmoutput
```

**Output**

fitting the model using svm()

tuning to check for the best performance and predicting the classes

creating the confusion matrix and calculating the accuracy

**Observation**

- We have observed an **accuracy of 55% with SVM**.

**OneR — One Rule Algorithm**

```r
# OneR
# setting the seed
set.seed(12345)
oner_train_test <- output
let_train <- oner_train_test[[1]]
let_test <- oner_train_test[[2]]

# fitting the model
model <- OneR(Continent ~ ., let_train)
model

# prediction
pred <- predict(model, let_test)
pred
table(pred, let_test$Continent)
summary(model)

# confusion matrix
confusionMatrix(pred, let_test$Continent)

# accuracy
acc <- mean(pred == let_test$Continent)
acc
```

**Output**

model 1/2

model 2/2

predict() function, table, and summary of the model

confusion matrix and accuracy

**Observation**

- We observe the modeling of 178 instances, mapped using the training data.
- Looking at the model summary, only 38 of those instances are correctly classified while 140 are wrongly classified, because Africa has been taken as the one rule.
- This makes the **accuracy only 20% with the OneR algorithm**.

**RIPPER – Repeated Incremental Pruning to Produce Error Reduction Algorithm**

RIPPER classification will be performed with the help of the JRip method in the RWeka package.

```r
# RIPPER algorithm
# setting the seed
set.seed(12345)
ripper_train_test <- output
let_train <- ripper_train_test[[1]]
let_test <- ripper_train_test[[2]]

# fitting the model using RWeka's JRip
model1 <- JRip(Continent ~ ., data = let_train)
model1

# prediction
pred1 <- predict(model1, let_test)
pred1
table(pred1, let_test$Continent)
summary(model1)

# confusion matrix
confusionMatrix(pred1, let_test$Continent)

# accuracy
acc <- mean(pred1 == let_test$Continent)
acc
```

**Output**

modeling, prediction, tabulation of the prediction and summary

confusion matrix and accuracy

**Observation**

- From the model summary, 95 instances are correctly classified while 83 are wrongly classified.
- The confusion matrix clearly shows which continent was classified as what.
- **48% accuracy** can be observed using the RIPPER algorithm.

**C4.5 Algorithm**

C4.5 classification will be performed with the help of the J48 method in the RWeka package.

```r
# C4.5 algorithm
# setting the seed
set.seed(12345)
c45_train_test <- output
let_train <- c45_train_test[[1]]
let_test <- c45_train_test[[2]]

# fit model using RWeka's J48
fit <- J48(Continent ~ ., data = let_train)

# summarize the fit
summary(fit)

# make predictions
c45predictions <- predict(fit, let_test)

# tabulate predictions against the true classes
tb <- table(c45predictions, let_test$Continent)

# confusion matrix
confusionMatrix(c45predictions, let_test$Continent)

# accuracy
c45output <- mean(c45predictions == let_test$Continent)
c45output
```

**Output**

summary of the fit, confusion matrix, and accuracy

**Observation**

- By summarizing the fit we can see that 138 instances are correctly classified while 40 are wrongly classified.
- **The accuracy obtained is 48%** using the C4.5 algorithm.
- The accuracy is very similar to that of the RIPPER algorithm.

**Conclusion**

Finally, let us list out all the accuracy values from the various classifiers used in this article.

**Accuracy Values — by Classifiers**

- KNN — 44%
- SVM — 55%
- OneR — 20%
- RIPPER — 48%
- C4.5 — 48%
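The comparison above can be made explicit in code. This sketch simply collects the accuracies reported in this article into a named vector and picks out the winner; it does not rerun any of the models.

```r
# Reported test-set accuracies from the runs above, one per classifier.
accuracies <- c(KNN = 0.44, SVM = 0.55, OneR = 0.20, RIPPER = 0.48, C4.5 = 0.48)

# Rank the classifiers from best to worst.
sort(accuracies, decreasing = TRUE)

# The best-performing classifier.
names(which.max(accuracies))  # "SVM"
```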

Clearly, SVM has outperformed all the other classification techniques by a good margin. RIPPER and C4.5 were the closest competitors, both at 48% accuracy. The OneR algorithm performed the worst, with only 20% accuracy.