Learning From Imbalanced Datasets

An imbalanced dataset, a problem often found in real-world applications, can have a seriously negative effect on the classification performance of machine learning algorithms. There have been many attempts at dealing with the classification of imbalanced datasets. In this article, we will look at the measures and techniques used to address this problem.

What Is an Imbalanced Dataset?

A dataset is called imbalanced if it contains many more samples from one class than from the rest of the classes. Datasets are imbalanced when at least one class is represented by only a small number of training examples (called the minority class) while the other classes make up the majority.

In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria.

For example, in the medical diagnosis of a certain cancer, if the cancer is regarded as the positive class and non-cancer (healthy) as negative, then missing a cancer (the patient is actually positive but is classified as negative; this is called a "false negative") is much more serious, and thus more expensive, than a false-positive error.

The patient could lose his or her life because of the delay in correct diagnosis and treatment. Similarly, if carrying a bomb is the positive class, then missing a terrorist who carries a bomb onto a flight is far more costly than searching an innocent person.

The imbalanced dataset problem appears in many real-world applications such as text categorization, fault detection, fraud detection, oil-spill detection in satellite images, toxicology, cultural modeling, and medical diagnosis.

Because of this unequal class distribution, the performance of the existing classifiers tends to be biased towards the majority class.

The Most Common Techniques To Deal With Imbalanced Data

These include resizing the training dataset, cost-sensitive classifiers, and the snowball method. Recently, several methods have been proposed with good performance on imbalanced data, including modified SVMs, k-nearest neighbors (KNN), neural networks, genetic programming, rough-set-based algorithms, probabilistic decision trees, and related learning methods.

The next sections focus on some of the methods in detail.

1. Sampling Methods

The simplest data-level methods for balancing the classes resample the original dataset, either by over-sampling the minority class or by under-sampling the majority class until the classes are approximately equally represented.

Both strategies can be applied in any learning system since they act as a preprocessing phase.

Over-Sampling

The simplest method to increase the size of the minority class corresponds to random over-sampling, that is, a nonheuristic method that balances the class distribution through the random replication of positive examples. Nevertheless, since this method replicates existing examples in the minority class, overfitting is more likely to occur.

Two types of Over-Sampling

Random Oversampling: balances the data by randomly replicating observations from the minority class.
Informative Oversampling: uses a pre-specified criterion to synthetically generate minority-class observations.
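Random oversampling is easy to sketch with plain NumPy (a minimal illustration, not a production implementation; the function name and interface here are my own):

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Balance the classes by randomly replicating minority-class rows."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Sample minority indices with replacement to make up the difference.
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]
```

Because the extra rows are exact copies of existing minority samples, this is precisely where the overfitting risk mentioned above comes from.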

Undersampling

Under-sampling is an efficient method for class-imbalance learning. It uses a subset of the majority class to train the classifier. Since many majority-class examples are ignored, the training set becomes more balanced and the training process becomes faster. The most common preprocessing technique is random majority under-sampling (RUS), in which instances of the majority class are randomly discarded from the dataset.

Random Undersampling: randomly chooses observations from the majority class and eliminates them until the dataset is balanced.
Informative Undersampling: follows a pre-specified selection criterion to remove observations from the majority class.
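Random undersampling is likewise a few lines of NumPy (again an illustrative function of my own, not a library API):

```python
import numpy as np

def random_undersample(X, y, majority_label, rng=None):
    """Balance the classes by randomly discarding majority-class rows (RUS)."""
    rng = np.random.default_rng(rng)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Keep only as many majority rows as there are minority rows.
    kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept, minority_idx])
    return X[keep], y[keep]
```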

Two well-known informative undersampling algorithms are BalanceCascade and EasyEnsemble.

BalanceCascade: takes a supervised learning approach. It builds an ensemble of classifiers sequentially, and at each step removes the majority-class examples that the current ensemble already classifies correctly, so that later classifiers focus on the harder examples.
EasyEnsemble: first extracts several independent subsets (sampled with replacement) from the majority class. It then trains one classifier on the combination of each subset with the minority class and aggregates them. Because the subsets are drawn randomly and independently, this exploration of the majority class resembles unsupervised learning.

However, the main drawback of under-sampling is that potentially useful information contained in these ignored examples is neglected.
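EasyEnsemble mitigates this drawback: instead of discarding majority examples once and for all, it trains one classifier per balanced subset, so that across the whole ensemble most of the majority class is seen. A minimal sketch, assuming scikit-learn is available, labels are 0/1, and the function names are my own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble(X, y, majority_label, n_subsets=5, rng=None):
    """Train one classifier per balanced (majority subset + minority) split."""
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    models = []
    for _ in range(n_subsets):
        # Independently sample a majority subset the size of the minority class.
        sub = rng.choice(maj, size=len(mino), replace=True)
        idx = np.concatenate([sub, mino])
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        clf.fit(X[idx], y[idx])
        models.append(clf)
    return models

def ensemble_predict(models, X):
    # Majority vote across the per-subset classifiers (binary 0/1 labels).
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

The base learner here is a shallow decision tree purely for illustration; the original EasyEnsemble uses boosted learners.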

2. Synthetic Data Generation

This handles class imbalance by generating artificial data rather than replicating observations from the minority class. It is a type of oversampling technique.

In synthetic data generation, the synthetic minority oversampling technique (SMOTE) is a powerful and widely used method. The SMOTE algorithm creates artificial data based on feature-space (rather than data-space) similarities among minority samples. In other words, it generates a synthetic set of minority-class observations to shift the classifier's learning bias towards the minority class.

What SMOTE does is simple. First, for each sample in the minority class, it finds the k nearest neighbors within that class. Then it draws a line between the sample and each of those neighbors and generates random points on the lines.

In other words, for each minority sample SMOTE finds (say) its 5 nearest minority-class neighbors, draws a line to each of them, and then creates new samples along those lines, labeled with the minority class.
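The core interpolation step can be sketched in a few lines of NumPy (a simplified SMOTE operating only on the minority samples; the function name and interface are my own):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating between each chosen
    minority sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every other minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(d)[:k]
        j = rng.choice(neighbors)
        gap = rng.random()                 # random point on the line segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies in feature space.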

3. Cost-Sensitive Learning (CSL)

In regular learning, we treat all misclassifications equally, which causes issues in imbalanced classification problems, as there is no extra reward for identifying the minority class over the majority class. Cost-sensitive learning changes this and uses a function C(p, t) (usually represented as a matrix) that specifies the cost of misclassifying an instance of class t as class p.

This allows us to penalize misclassifications of the minority class more heavily than misclassifications of the majority class, in the hope that this increases the true positive rate. A common scheme is to set the cost equal to the inverse of the proportion of the dataset that the class makes up, so the penalty grows as the class size shrinks.

Sample cost-function matrix
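The inverse-proportion scheme takes only a few lines to compute (a sketch; the resulting weights could then be fed to a learner's class-weighting mechanism, e.g. scikit-learn's class_weight parameter):

```python
import numpy as np

def inverse_frequency_costs(y):
    """Cost of misclassifying class c = inverse of its share of the dataset."""
    classes, counts = np.unique(y, return_counts=True)
    return {c: len(y) / n for c, n in zip(classes, counts)}

# Example: 990 majority vs. 10 minority labels ->
# the minority class costs 100x per mistake, the majority ~1.01x.
y = np.array([0] * 990 + [1] * 10)
costs = inverse_frequency_costs(y)
```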

SVM And Imbalanced Datasets

The success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances. Even though undersampling the majority class does improve SVM performance, there is an inherent loss of valuable information in this process.
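One way to keep all the data while still countering the bias is to reweight the classes instead of undersampling. scikit-learn's SVC supports this via class_weight='balanced', which scales the misclassification penalty C for each class by the inverse of its frequency. A toy sketch on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy imbalanced data: 200 negatives around (0, 0), 10 positives around (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 0.5, (10, 2))])
y = np.array([0] * 200 + [1] * 10)

# 'balanced' weights errors on the rare positive class ~20x more heavily,
# so the decision boundary is not pushed onto the minority class.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
```

With the default (unweighted) objective, the optimizer can afford to sacrifice the ten positives; the balanced weighting removes that incentive without throwing away any negative examples.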

Practical Implementation on an Imbalanced Dataset

Here, I will use the “Credit Card Fraud” dataset.

The Credit Card Fraud Detection Problem includes modeling past credit card transactions with the knowledge of the ones that turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications.

Observations 

  1. The dataset is highly skewed, consisting of 492 frauds in a total of 284,807 observations, i.e., only 0.172% of the transactions are fraudulent. Such skew is expected, given how rare fraudulent transactions are in practice.
  2. The dataset consists of numerical values from 28 features transformed by Principal Component Analysis (PCA), namely V1 to V28. Furthermore, no metadata about the original features is provided, so pre-analysis or feature study cannot be done.
  3. The ‘Time’ and ‘Amount’ features are not PCA-transformed.
  4. There are no missing values in the dataset.

Inferences Drawn

  1. Owing to this imbalance, an algorithm that does no feature analysis and predicts every transaction as non-fraud will still achieve an accuracy of 99.828%. Therefore, accuracy is not a correct measure of performance in our case; we need some other standard of correctness when classifying transactions as fraud or non-fraud.
  2. The ‘Time’ feature does not indicate the actual time of the transaction and is more of a list of the data in chronological order. So we assume that the ‘Time’ feature has little or no significance in classifying a fraud transaction. Therefore, we eliminate this column from further analysis.
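Inference 1 can be verified with simple arithmetic: a model that labels every transaction as non-fraud gets almost perfect accuracy yet catches no fraud at all, which is why recall on the fraud class matters more than accuracy here.

```python
# Baseline: predict "non-fraud" for every one of the 284,807 transactions.
total, frauds = 284_807, 492

accuracy = (total - frauds) / total  # all 492 frauds are misclassified
fraud_recall = 0 / frauds            # true positives / actual frauds

print(f"accuracy = {accuracy:.3%}, fraud recall = {fraud_recall:.0%}")
```

The accuracy printed matches the figure quoted above up to rounding, while fraud recall is exactly zero.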

I have created “Jupyter notebook” on “credit fraud detection” using unbalanced data set.

For the complete solution, please visit my GitHub repo: Code-never ends
