Classification Techniques On Life Expectancy Data

A Taxonomy of Continents

In this article, we will learn about classification techniques. Humans classify things instinctively: in a closet, the jeans go on one rack and the shirts on another, and on our phones and computers we keep separate folders for each kind of app or file.

A more “data scientific” definition is that classification is a form of data analysis that builds models describing important data classes, or equivalently the task of predicting the value of a categorical variable (the class or target). In other words, it determines which of a set of predefined categories a new observation belongs to. A very common example is email filtering, where we wish to classify certain emails as spam and others as not spam. The machine achieves this by learning from training data whose class is already known.

Classification algorithms can only be used when the outputs are discrete labels. A situation like the email example, where there are only two possible outcomes (spam or not spam), is called binary classification.

Another type is multi-label classification, in which multiple labels may be assigned to one instance. This is mostly used for audio and video classification, text classification, sentiment analysis, and so on. When each instance receives exactly one label out of more than two classes, as with the continents in this article, the task is multi-class classification.

Those are the basics needed to move forward with this article.

In this article, we will classify the countries in the Life Expectancy dataset by continent; the Continent column is the class label.

This is a very small dataset with 6 columns and 223 rows, one for each country. The columns are Rank, Country, Overall Life, Male Life, Female Life, and Continent.

To perform this classification we will use 5 different classification algorithms, calculate the precision and accuracy of each, and compare them. The 5 classification algorithms are:

  • KNN — The K Nearest Neighbour algorithm classifies a new data point by the classes of the training points most similar to it, where similarity is measured with a distance function.
  • SVM — Support Vector Machine is a supervised learning algorithm that builds a model from the training set and assigns new points to one of the categories. The decision boundary can be linear or non-linear, depending on the problem.
  • OneR — OneR, short for One Rule, generates one rule for each predictor in the data and then selects the rule with the smallest error as its answer. Although it is a very simple algorithm, it is known to perform better than some of the more complex classification algorithms.
  • RIPPER — RIPPER is a rule-based learner that builds a set of rules identifying the classes while minimizing the error, defined as the number of training examples misclassified by the rules. It is a direct way of performing rule-based classification.
  • C4.5 — C4.5 is a statistical classifier that generates a decision tree. It builds the tree from the training data just as ID3 does: at each node, C4.5 chooses the attribute that most effectively splits its set of samples into subsets enriched in one class or another. This is, in a way, an indirect method of rule-based classification.

Let us now begin the analysis in R and see which classifier performs best. We will use the following libraries/packages throughout the code: e1071, class, caret, rJava, RWeka.

Data Preprocessing

The first step of data preprocessing will involve the following:

  • Importing the data set in R by using the read.csv() function.
  • Performing some descriptive analysis by looking at the data set and getting a summary of it with the summary() and str() functions.
  • Converting the class label, Continent, to a categorical variable by factoring it.
  • Removing irrelevant columns that won’t be used in the analysis, such as the first column, Rank.
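
The steps above can be sketched in R roughly as follows. This is only a minimal sketch: the article's CSV file is not reproduced here, so the code first writes a tiny stand-in file with made-up rows (hypothetical countries and values) to a temporary location. The names `csv_path` and `life` are assumptions, not the article's own.

```r
# Stand-in for the real CSV: same six columns, made-up rows (assumption)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Rank,Country,Overall Life,Male Life,Female Life,Continent",
  "1,Country A,82.1,79.5,84.8,Europe",
  "2,Country B,74.3,71.9,76.8,Asia",
  "3,Country C,61.0,59.2,62.9,Africa"
), csv_path)

# Import the data set; read.csv() turns "Overall Life" into Overall.Life
life <- read.csv(csv_path)

# Look at the raw data
head(life)
str(life)
dim(life)

# Convert the class label to a factor and drop the first column (Rank)
life$Continent <- factor(life$Continent)
life <- life[, -1]

# Inspect the result
str(life)
summary(life)
```

Calling factor() on a column that is already a factor is harmless, so the conversion is safe to run regardless of how the column was read in.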

Output

head(), str(), dim() functions

str(), summary() functions after the removal of the first column and the factor conversion

Observation

Although the Continent column was already of factor data type, we still ran the command to convert it to a factor. The head() function gives a clear idea of how the data looks, and the summary functions show some vital descriptive information.

Most importantly, we can see how many countries lie in each continent, which will help us later when checking the accuracy. We can also observe the mean overall, male, and female life expectancy, which are 72.49, 70.04, and 75.02 respectively. Medians, quartiles, and min and max values can also be observed.

For the second part of the data pre-processing we will:

  • Divide the dataset into training and test sets in an 80:20 ratio, using random sampling to pick the training and test rows.
  • Save the train and test samples in a list in the Output variable.
  • Print the Output variable to see the train and test samples.
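
One way such an 80:20 split could look in base R is sketched below. The list name `Output` comes from the description above; the seed, the placeholder data frame `life`, and the use of sample() on row indices are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, only so the split is reproducible

# Placeholder with the article's row count; real code would use the imported data
life <- data.frame(row_id = seq_len(223))

# Randomly pick 80% of the row indices for training; the rest form the test set
train_idx <- sample(nrow(life), size = floor(0.8 * nrow(life)))

# Keep both samples in a list called Output
Output <- list(
  train = life[train_idx, , drop = FALSE],
  test  = life[-train_idx, , drop = FALSE]
)

sapply(Output, nrow)  # 178 training rows and 45 test rows
```

With 223 rows this gives 178 training rows, which matches the 178 instances reported in the model summaries later in the article.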

KNN — K Nearest Neighbor Algorithm

KNN classification will be performed with the help of the preProcess and train methods available in the caret package. The tuneLength argument of the train method is set to 20, so train evaluates 20 candidate values of K and automatically selects the best one based on the results of the fitted models.

In our case, K is chosen to be 5. Accuracy was used to select the optimal model, taking the largest value.
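
A hedged sketch of what this caret workflow might look like follows. The article's data is not available in text form, so the block generates a synthetic stand-in with made-up values; the three-continent setup, the 10-fold cross-validation with 3 repeats, and all variable names are assumptions.

```r
library(caret)

# Synthetic stand-in for the prepared data (made-up values, not the article's)
set.seed(1)
n <- 150
life <- data.frame(Continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)))
centre <- c(Africa = 62, Asia = 72, Europe = 79)
life$Female.Life  <- rnorm(n, mean = centre[as.character(life$Continent)], sd = 4)
life$Male.Life    <- life$Female.Life - rnorm(n, mean = 5, sd = 1)
life$Overall.Life <- (life$Male.Life + life$Female.Life) / 2
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- life[train_idx, ]
test_set  <- life[-train_idx, ]

# Repeated cross-validation; the fold and repeat counts are assumptions
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Centre and scale the predictors, then try 20 candidate values of K
knn_fit <- train(Continent ~ ., data = train_set, method = "knn",
                 trControl = ctrl, preProcess = c("center", "scale"),
                 tuneLength = 20)

knn_fit$bestTune$k  # K with the highest cross-validated accuracy
plot(knn_fit)       # accuracy against K

# Predict on the held-out rows and summarise the result
pred <- predict(knn_fit, newdata = test_set)
cm <- confusionMatrix(data = pred, reference = test_set$Continent)
cm$table
accuracy <- unname(cm$overall["Accuracy"])
accuracy
```

For method = "knn", a tuneLength of 20 makes caret try the odd values of K from 5 to 43, which is consistent with K=5 and K=17 both appearing among the candidates discussed below.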

Output

preprocessing using the caret package and finding the knn fit i.e. value of K

plot depicting the choice of the value of K by using accuracy

prediction, confusion matrix, and accuracy

Observation

  • First and foremost, we observe that the best value of K has been chosen to be 5, as it gives the highest accuracy under repeated cross-validation.
  • The plot also shows the highest accuracy value of 0.562 at K=5, with K=17 a very close competitor at 0.559.
  • The accuracy obtained by KNN on the predictions is 44%.

SVM — Support Vector Machines 

The SVM classification function will be deployed with the help of the tune method from the e1071 package. The SVM fit will be tuned by choosing a linear kernel and a cost of 1, as selected by the tune method.
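
The fit-then-tune workflow could be sketched with e1071 as below. As the original data is not reproduced here, the block uses a synthetic stand-in with made-up values, and the grid of cost values passed to tune() is an assumption rather than the article's actual grid.

```r
library(e1071)

# Synthetic stand-in for the prepared data (made-up values, not the article's)
set.seed(1)
n <- 150
life <- data.frame(Continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)))
centre <- c(Africa = 62, Asia = 72, Europe = 79)
life$Female.Life  <- rnorm(n, mean = centre[as.character(life$Continent)], sd = 4)
life$Male.Life    <- life$Female.Life - rnorm(n, mean = 5, sd = 1)
life$Overall.Life <- (life$Male.Life + life$Female.Life) / 2
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- life[train_idx, ]
test_set  <- life[-train_idx, ]

# Initial fit with a linear kernel and cost = 1
svm_fit <- svm(Continent ~ ., data = train_set, kernel = "linear", cost = 1)

# Cross-validated search over candidate cost values (hypothetical grid)
tuned <- tune(svm, Continent ~ ., data = train_set, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
tuned$best.parameters

# Predict with the best model, then build the confusion matrix and accuracy
pred <- predict(tuned$best.model, newdata = test_set)
table(Predicted = pred, Actual = test_set$Continent)
accuracy <- mean(pred == test_set$Continent)
accuracy
```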

 

Output

fitting the model using svmfit()

tuning to check for the best performance and predicting the classes

creating the confusion matrix and calculating the accuracy

Observation

  • We have observed an accuracy of 55% with SVM.

OneR — One Rule Algorithm
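
A minimal sketch of fitting OneR through RWeka is given here, since the package list at the start of the article includes RWeka. The data is a synthetic stand-in with made-up values, and using all three life expectancy columns as predictors is an assumption.

```r
library(RWeka)  # needs a working Java installation

# Synthetic stand-in for the prepared data (made-up values, not the article's)
set.seed(1)
n <- 150
life <- data.frame(Continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)))
centre <- c(Africa = 62, Asia = 72, Europe = 79)
life$Female.Life  <- rnorm(n, mean = centre[as.character(life$Continent)], sd = 4)
life$Male.Life    <- life$Female.Life - rnorm(n, mean = 5, sd = 1)
life$Overall.Life <- (life$Male.Life + life$Female.Life) / 2
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- life[train_idx, ]
test_set  <- life[-train_idx, ]

# Fit the one-rule model and print the rule it selected
oner_fit <- OneR(Continent ~ ., data = train_set)
oner_fit

# summary() evaluates the model on the data it was trained on
summary(oner_fit)

# Predictions, confusion matrix, and accuracy on the held-out rows
pred <- predict(oner_fit, newdata = test_set)
table(Predicted = pred, Actual = test_set$Continent)
accuracy <- mean(pred == test_set$Continent)
accuracy
```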

Output

model 1/2

model 2/2

prediction() function, table, and summary of the model

confusion matrix and accuracy

Observation

  • We observe the modeling of 178 instances, mapped using the training data.
  • The model summary reports only 38 correctly classified instances against 140 wrongly classified ones, because Africa has been taken as the one rule. These counts add up to 178, the size of the training set, so they appear to describe the fit on the training data rather than the test predictions.
  • The accuracy with the OneR algorithm is only 20%.

RIPPER – Repeated Incremental Pruning to Produce Error Reduction Algorithm

The RIPPER classification function will be deployed with the help of the JRip method in the RWeka package.
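
As a rough illustration, a JRip call might look like the sketch below. The data is again a synthetic stand-in with made-up values, and the variable names are assumptions.

```r
library(RWeka)  # needs a working Java installation

# Synthetic stand-in for the prepared data (made-up values, not the article's)
set.seed(1)
n <- 150
life <- data.frame(Continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)))
centre <- c(Africa = 62, Asia = 72, Europe = 79)
life$Female.Life  <- rnorm(n, mean = centre[as.character(life$Continent)], sd = 4)
life$Male.Life    <- life$Female.Life - rnorm(n, mean = 5, sd = 1)
life$Overall.Life <- (life$Male.Life + life$Female.Life) / 2
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- life[train_idx, ]
test_set  <- life[-train_idx, ]

# Learn the rule set; printing the model lists the rules in order
jrip_fit <- JRip(Continent ~ ., data = train_set)
jrip_fit

# Predict the held-out rows and tabulate predictions against actual classes
pred <- predict(jrip_fit, newdata = test_set)
table(Predicted = pred, Actual = test_set$Continent)
accuracy <- mean(pred == test_set$Continent)
accuracy
```

Printing the fitted model is the quickest way to read the learned rules, each of which is a conjunction of conditions on the life expectancy columns followed by the continent it predicts.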

Output

modeling, prediction, tabulation of the prediction and summary

confusion matrix and accuracy

Observation

  • The summary reports 95 correctly classified instances and 83 wrongly classified ones. These add up to 178, the size of the training set, so they appear to describe the fit on the training data.
  • The confusion matrix clearly shows which continent was classified as what.
  • An accuracy of 48% is observed with the RIPPER algorithm.

C4.5 Algorithm

The C4.5 classification function has been deployed with the help of the J48 method in the RWeka package.
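
The sketch below shows one possible J48 call and compares accuracy on the training rows with accuracy on the held-out rows, since the two can differ noticeably for decision trees. The synthetic data (made-up values) and all variable names are assumptions.

```r
library(RWeka)  # needs a working Java installation

# Synthetic stand-in for the prepared data (made-up values, not the article's)
set.seed(1)
n <- 150
life <- data.frame(Continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)))
centre <- c(Africa = 62, Asia = 72, Europe = 79)
life$Female.Life  <- rnorm(n, mean = centre[as.character(life$Continent)], sd = 4)
life$Male.Life    <- life$Female.Life - rnorm(n, mean = 5, sd = 1)
life$Overall.Life <- (life$Male.Life + life$Female.Life) / 2
train_idx <- sample(n, size = floor(0.8 * n))
train_set <- life[train_idx, ]
test_set  <- life[-train_idx, ]

# Grow the C4.5 decision tree and print its splits
j48_fit <- J48(Continent ~ ., data = train_set)
j48_fit

# Accuracy on the rows the tree was fitted to
train_accuracy <- mean(predict(j48_fit, newdata = train_set) == train_set$Continent)

# Accuracy on the held-out rows, with the confusion matrix
pred <- predict(j48_fit, newdata = test_set)
table(Predicted = pred, Actual = test_set$Continent)
test_accuracy <- mean(pred == test_set$Continent)

c(train = train_accuracy, test = test_accuracy)
```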

Output

summary of the fit, confusion matrix, and accuracy

Observation

  • By summarizing the fit we can see that 138 instances are correctly classified while 40 are wrongly classified.
  • The accuracy obtained with the C4.5 algorithm is 48%.
  • The accuracy is very similar to that of the RIPPER algorithm.

Conclusion 

Finally, let us list out all the accuracy values from the various classifiers used in this article.

Accuracy Values — by Classifiers

  • KNN — 44%
  • SVM — 55%
  • OneR — 20%
  • RIPPER — 48%
  • C4.5 — 48%

Clearly, SVM has outperformed all the other classification techniques, finishing 7 percentage points ahead of the next best. RIPPER and C4.5 came closest, both with 48% accuracy. The OneR algorithm performed the worst, with only 20% accuracy.
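
For reference, accuracy is the share of predictions that match the actual class, and precision for a class is the share of predictions of that class that are correct. The base R sketch below computes both from a confusion matrix; the eight label pairs are made up purely for illustration and are not results from the article.

```r
# Made-up actual and predicted labels (illustration only)
lv <- c("Africa", "Asia", "Europe")
actual    <- factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe", "Europe", "Africa"), levels = lv)
predicted <- factor(c("Africa", "Asia",   "Asia", "Asia", "Europe", "Africa", "Europe", "Africa"), levels = lv)

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)
cm

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.75

# Precision per class: correct predictions of a class over all predictions of it
precision <- diag(cm) / rowSums(cm)
precision
```

Here 6 of the 8 predictions are correct, giving an accuracy of 0.75, while precision is 2/3 for Africa and Asia and 1 for Europe.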

 
