# Outlier Detection Techniques Using R

In any analytical problem we first have to prepare the data, because the data we get is rarely easy to analyze. During preparation we face many problems, such as missing values and outliers. In this article I will describe what an outlier is and how to deal with it.

Suppose two teams are playing a football match. All the players of team A are under age 19, and all the players of team B are under age 19 except one, who is 24 and has district-level football experience. Everybody knows that team B is going to win, so this will not be a fair game. The player aged 24 is unusual compared to the others, and that is the concept of an outlier. An outlier is an unusual or suspicious value that does not behave like the other data points. Consider a variable whose values mostly lie between 1 and 49; a value like 94 would be quite unnatural, so it is an outlier. Real-life data contains many outliers, and we have to know how to deal with them, otherwise they may misrepresent the entire dataset.

Whether a data point counts as an outlier also depends on your purpose. Suppose a company has noticed that its yearly sales are around 10 – 20 lakhs, and the owner decides that any sale below 5 lakhs or above 25 lakhs will be considered unusual. For this company, then, an outlier is a value of more than 25 lakhs or less than 5 lakhs.

Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations.

Now the question is: how do we detect whether there is an outlier in the data? As I have said, it depends on your purpose, but there are some standard methods. The classic one is Tukey's method, discussed below.

Suppose we have a variable assuming the values X_{1}, X_{2}, X_{3}, …, X_{n}. From these values we first determine the first quartile (Q1), the third quartile (Q3) and the inter-quartile range (IQR = Q3 – Q1) based on the sample observations. The values outside the range (Q1 – 1.5 × IQR, Q3 + 1.5 × IQR) are considered *outliers*. Values outside the wider range (Q1 – 3 × IQR, Q3 + 3 × IQR) are known as *extreme* outliers, and values outside the first range but inside the second are called *mild* outliers.

Suppose we have the values 1, 60, 2, 1, 4, 4, 1, 1, 6, -30. Here Q1 = 1, Q3 = 4 and IQR = 3, so the values outside the range (-3.5, 8.5) are the outliers: -30 and 60.
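These fences can be checked directly in R. This is a quick sketch of the arithmetic above; the variable names are mine:

```r
x <- c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30)

q1  <- quantile(x, 0.25)   # first quartile: 1
q3  <- quantile(x, 0.75)   # third quartile: 4
iqr <- q3 - q1             # inter-quartile range: 3

lower <- q1 - 1.5 * iqr    # lower fence: -3.5
upper <- q3 + 1.5 * iqr    # upper fence: 8.5

x[x < lower | x > upper]   # the outliers: 60 and -30
```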

R provides many functions for this. I prefer *boxplot.stats(x)$out* from the *grDevices* package (part of base R). Let's look at the code to understand how it works.

```r
library(grDevices)
x <- c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30, 70)
x[which(x %in% boxplot.stats(x)$out)]
```

```
[1]  60 -30  70
```

Now you may ask why we are using a box-plot function here. A box-plot is a visualization that shows how the data is distributed and whether there are any outliers. Let's look at the box-plot below to understand that.

The red circles are the outliers in this data.

We also have formal tests to detect outliers, such as *dixon.test()* and *chisq.out.test()* in the **outliers** package. But I prefer the test *rosnerTest()* in the **EnvStats** package. Let's see how it works.

From the box-plot we found 3 outliers. A parameter of *rosnerTest()* is *k*, the number of outliers you suspect. I suggest setting it to what you get from the box-plot plus 1 or 2. Here we saw 3 outliers in the box-plot, so we put k = 4. Now let's see how the test performs.

```r
library(EnvStats)
rosnerTest(x, k = 4, warn = F)
```

```
Results of Outlier Test
-------------------------
Test Method:                 Rosner's Test for Outliers
Hypothesized Distribution:   Normal
Data:                        x
Sample Size:                 11
Test Statistics:             R.1 = 2.067720
                             R.2 = 2.508654
                             R.3 = 2.630493
                             R.4 = 1.816061
Test Statistic Parameter:    k = 4
Alternative Hypothesis:      Up to 4 observations are not
                             from the same Distribution.
Type I Error:                5%
Number of Outliers Detected: 3

  i     Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
1 0  10.909091 28.577804    70      11 2.067720   2.354730    TRUE
2 1   5.000000 21.924112    60       2 2.508654   2.289954    TRUE
3 2  -1.111111 10.982309   -30      10 2.630493   2.215004    TRUE
4 3   2.500000  1.927248     6       9 1.816061   2.126645   FALSE
```

As you can see, 3 outliers are detected, and we also get their values from the rows of the last table where Outlier is TRUE.

So, these are the methods to detect, visualize and test for outliers in data.

Now the main question is: how should we deal with these outliers?

The presence of outliers may lead to bad results, so we either have to remove them or replace them with some representative values.

There are several statistical methods to deal with outliers. Let's discuss them.

**Trimming:**

Suppose we have 1003 observations on the height (in ft) of adult people in a village, and we want to find the mean height.

```r
height <- c(sample(seq(4, 8, 0.001), 1000, replace = T), 101.51, -0.2346, 601)
boxplot(height, outcol = "red", outcex = 1.5)
```

Here we can see from the box-plot that there are three outliers, so next we check which values they are. Since we have 1003 observations, removing 3 of them doesn't affect the data much, so we can remove/trim them. Notice that one unusual value is negative; a height cannot be negative, so we must remove it first. Then we perform the trimming.

```r
height <- height[-which(height < 0)]  # first drop the impossible negative height
rosnerTest(height, k = 4)
```

```
Results of Outlier Test
-------------------------
Test Method:                 Rosner's Test for Outliers
Hypothesized Distribution:   Normal
Data:                        height
Sample Size:                 1002
Test Statistics:             R.1 = 31.163267
                             R.2 = 29.440518
                             R.3 = 1.694513
                             R.4 = 1.696956
Test Statistic Parameter:    k = 4
Alternative Hypothesis:      Up to 4 observations are not
                             from the same Distribution.
Type I Error:                5%
Number of Outliers Detected: 2

  i   Mean.i      SD.i   Value Obs.Num     R.i+1 lambda.i+1 Outlier
1 0 6.687901 19.070918 601.000    1002 31.163267   4.040471    TRUE
2 1 6.094183  3.240969 101.510    1001 29.440518   4.040225    TRUE
3 2 5.998767  1.179827   7.998     838  1.694513   4.039978   FALSE
4 3 5.996766  1.178719   7.997     919  1.696956   4.039731   FALSE
```

```r
height <- height[-c(1001, 1002)]  # remove the two detected outliers
boxplot(height, main = "After removing the outliers")
```

Now we can see the outliers are removed and we can perform analysis as required to get a better result.

You can easily visualize the problem of outliers by comparing the histograms of height before and after removing the outliers.
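The comparison can be sketched as follows. This uses a fresh simulated `height` vector, since the original sample is random, and the 0–100 filter is just an illustrative shortcut for removing the three planted outliers:

```r
set.seed(7)  # any seed; the sample itself is random
height <- c(sample(seq(4, 8, 0.001), 1000, replace = TRUE), 101.51, -0.2346, 601)
clean  <- height[height >= 0 & height <= 100]  # drops exactly the three planted outliers

par(mfrow = c(1, 2))
hist(height, main = "Before trimming", xlab = "height (ft)")
hist(clean,  main = "After trimming",  xlab = "height (ft)")
```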

So, this is the concept of trimming.

**Replacing with some representative values:**

Now, if there are only a few observations and each data point is as important as the others, we don't use trimming. Instead, we replace the outliers with some representative values such as the mean, median, minimum or maximum.

Let's replace the outliers with the minimum and maximum values; this method is also known as Tukey's method. First we identify the outliers. The minimum and maximum of the remaining values are then assigned to the outliers: the minimum to those below the lower fence (Q1 – 1.5 × IQR) and the maximum to those above the upper fence (Q3 + 1.5 × IQR). This way we can replace the values of the outliers. Let's check how it helps to get rid of them, using the same example as before but with 13 samples.

```r
height <- c(sample(seq(4, 8, 0.001), 10, replace = T), 0.00125, 25.48, 60)
boxplot(height, outcol = "red", outcex = 1.5)
```

Here we have three suspicious values. Let’s replace them with minimum and maximum values.

```r
rosnerTest(height, k = 4, warn = F)
```

```
Results of Outlier Test
-------------------------
Test Method:                 Rosner's Test for Outliers
Hypothesized Distribution:   Normal
Data:                        height
Sample Size:                 13
Test Statistics:             R.1 = 3.092005
                             R.2 = 2.959174
                             R.3 = 2.589846
                             R.4 = 1.962536
Test Statistic Parameter:    k = 4
Alternative Hypothesis:      Up to 4 observations are not
                             from the same Distribution.
Type I Error:                5%
Number of Outliers Detected: 3

  i    Mean.i      SD.i    Value Obs.Num    R.i+1 lambda.i+1 Outlier
1 0 11.617635 15.647570 60.00000      13 3.092005   2.462033    TRUE
2 1  7.585771  6.047034 25.48000      12 2.959174   2.411560    TRUE
3 2  5.959023  2.300435  0.00125      11 2.589846   2.354730    TRUE
4 3  6.554800  1.241659  4.11800       8 1.962536   2.289954   FALSE
```

```r
height <- sort(height)
for (i in 1:length(height)) {
  if (height[i] > quantile(height, 0.75) + 1.5 * IQR(height)) {
    height[i] <- max(height[1:(i - 1)])  # replace a high outlier with the running maximum
  }
}
height <- sort(height, decreasing = T)
for (i in 1:length(height)) {
  if (height[i] < quantile(height, 0.25) - 1.5 * IQR(height)) {
    height[i] <- min(height[1:(i - 1)])  # replace a low outlier with the running minimum
  }
}
boxplot(height, main = "After replacing the outliers")
```


Now the dataset is free from the outliers’ effects, and we can perform the required analysis.

Let’s visualize the effect of replacing the outliers.

So this is the method of replacing.
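The two loops above can also be written as one small vectorized function. This is a sketch (the function name is mine), and it differs slightly from the loops in that it computes the fences once, on the original data:

```r
replace_with_extremes <- function(x) {
  lower <- quantile(x, 0.25) - 1.5 * IQR(x)   # Tukey's lower fence
  upper <- quantile(x, 0.75) + 1.5 * IQR(x)   # Tukey's upper fence
  inside <- x[x >= lower & x <= upper]        # the non-outlying values
  x[x < lower] <- min(inside)                 # low outliers -> minimum
  x[x > upper] <- max(inside)                 # high outliers -> maximum
  x
}

replace_with_extremes(c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30))  # -30 becomes 1, 60 becomes 6
```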

Note that replacing with the minimum and maximum values results in a heavy-tailed distribution, while replacing with the mean or median increases the number of observations near the centre of the dataset. So if your data is bell-shaped you can use the mean or median to concentrate values at the centre, and if your data is U-shaped you can use the minimum and maximum. The choice depends on how your data is distributed.

A better approach is *capping*. What do we do here? We replace the outliers below the lower fence with the 5^{th} percentile and the outliers above the upper fence with the 95^{th} percentile of the data.
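Capping can be sketched in the same style (the function name is mine; the fences are Tukey's):

```r
cap_outliers <- function(x) {
  lower <- quantile(x, 0.25) - 1.5 * IQR(x)   # lower fence
  upper <- quantile(x, 0.75) + 1.5 * IQR(x)   # upper fence
  caps  <- quantile(x, c(0.05, 0.95))         # 5th and 95th percentiles
  x[x < lower] <- caps[1]                     # cap low outliers
  x[x > upper] <- caps[2]                     # cap high outliers
  x
}
```

Keep in mind that in a very small sample the 5th and 95th percentiles are themselves pulled towards the outliers, so capping works best with a reasonable amount of data.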

**Outlier Analysis or Clustering:**

Now, suppose a company finds that a particular dealer makes a huge loss every month. The company can't neglect that; it has to analyze why he is making these losses. Similarly, if there are many outliers, a separate analysis should be done for them. One way to perform this is cluster analysis: we concentrate on each cluster and can run a separate analysis for the cluster of outliers.
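As a minimal sketch of this idea, with hypothetical data: k-means with two centres will usually put the extreme points into their own small cluster, which can then be analysed separately.

```r
set.seed(1)
sales <- c(rnorm(50, mean = 10), 100, 120)  # hypothetical: 50 ordinary values, 2 extremes
km <- kmeans(matrix(sales, ncol = 1), centers = 2, nstart = 10)
split(sales, km$cluster)                    # one bulk cluster, one small outlier cluster
```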

**Note:**

**Detection of outliers in multivariate setup and how to deal with them:**

What I discussed in the outlier detection section was for univariate data. What do we do in a bi-variate or multivariate setup? Let's take the simplest case, i.e. bi-variate data.

As we know, the simplest representation of bi-variate data is the scatter plot, where we can see a pattern. Suppose the pattern is linear (see the image below).

Here we can see that, except for the first few points, there is a moderate linear pattern. The first few points are outliers because they are far from the other data points. If you want to verify statistically whether they are outliers, you can use Cook's distance. R has a function *cooks.distance()* in the **stats** package.

```r
cd <- cooks.distance(lm(y ~ x))
plot(cd, pch = 1, cex = 1)
abline(h = 4 * mean(cd, na.rm = T), col = "blue")
text(x = 1:length(cd) + 2, y = cd,
     labels = ifelse(cd > 4 * mean(cd, na.rm = T), names(cd), ""),
     col = "blue")
```

Analysing the Cook's distance plot, we can see the outliers with their indexes (blue marks).

Now, to deal with these outliers we can replace the y values of the outlier pairs with representative values. Here a representative value is the predicted value of *y* at the corresponding *x*, where the prediction comes from a model fitted to the data excluding the outliers.
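This replacement idea can be sketched with simulated data; the data, the planted outliers and the 4 × mean cut-off are all illustrative:

```r
set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30)           # hypothetical linear data
y[c(1, 2)] <- c(80, 90)          # plant two outlying y values

cd  <- cooks.distance(lm(y ~ x))
out <- which(cd > 4 * mean(cd))  # indexes flagged by Cook's distance

fit <- lm(y ~ x, subset = setdiff(seq_along(x), out))     # refit without the outliers
y[out] <- predict(fit, newdata = data.frame(x = x[out]))  # replace with predictions
```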

We can also use imputation methods such as kNN (k-Nearest Neighbours) and mice (Multivariate Imputation by Chained Equations) to deal with outliers in bi-variate, and indeed multivariate, data.

I hope you now have a clear idea of what an outlier is, how to detect one and how to deal with them. It's very important to take outliers into account at the time of data preparation; otherwise, they will lead you far from a good interpretation.