# Missing Value Imputation Techniques In R

Missing values occur when no data is available for a column of an observation. In R they are represented by the symbol "NA", which stands for "Not Available". Missing values introduce ambiguity and make any statistical analysis harder to interpret. In this article, I will take you through missing value imputation techniques in R with sample data. There are several types of missing values in a dataset.

Let's look at the three main types of missing values, classified by their pattern of occurrence in a data set.

- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)

**Missing Completely at Random (MCAR)**

It occurs when the missingness is entirely random and independent of the other variables in the observation. Here we assume that the variable with missing data is completely unrelated to the other variables or columns in the data. For example:

Suppose we have a database of school students with four columns: **Student.Id**, **Name**, **Gender**, and **Number of Subjects**. With the data available, we cannot predict the number of subjects for an observation where it is missing, because the missingness is completely independent of the other variables in the data.

**Missing at Random (MAR)**

An alternative assumption to MCAR is MAR or Missing at Random. It assumes that we can predict the missing value on the basis of other available data.

From the given data we can build a predictive model in which the number of subjects is predicted from independent variables such as class and age. In such cases, we can use advanced imputation techniques to estimate the missing values.

MAR is always a safer assumption than MCAR. This is because any statistical analysis which is performed under the assumption of MCAR is also valid for MAR, but the reverse is not true.

**Not Missing at Random (NMAR)**

NMAR is also known as *nonignorable missing data*. It is completely different from MCAR and MAR: it is the case where the missing values cannot be recovered by any of the advanced imputation techniques, because the missingness depends on the unobserved value itself. For example, a very sensitive question in a questionnaire is likely to be skipped by the people filling it out. This is known as *missing not at random* data.

In the present study, I have used the **iris data set**, which ships with R. Though the dataset does not have any missing values, I have introduced missing values randomly into the data set to demonstrate the most popular methods of missing value treatment.

```r
D <- iris
```

Saving the dataset in a data frame "D".

The data frame has five columns

- Sepal.Length
- Sepal.Width
- Petal.Length
- Petal.Width
- Species

Of these variables, the first four are numeric and the fifth variable “Species” is a factor with three levels.

**NOTE**: I have used only the first column i.e. **Sepal.Length** to explain the imputation techniques.

```r
str(D)
```

**Method 1: Deleting The Observations**

This method is best avoided unless the data is MCAR. We have to check whether deleting the data will affect the statistical analyses performed on it. Moreover, it should only be done if a sufficient amount of data remains after deleting the observations with "NA" values, and if deleting them does not bias the sample or leave any variable unrepresented.

Creating another data set from the original dataset “D”

```r
df <- D
```

Introducing “NA” values randomly into the dataset.

```r
df <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(cc), replace = TRUE)]))
```

Determining the number of “NA” values in the data set.

```r
sapply(df, function(x) sum(is.na(x)))
```

Deleting the observations or rows which have “NA” values.

```r
df <- na.omit(df)
```

There is also an alternative to this: passing **na.action = na.omit** directly in the model call.
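A minimal sketch of that alternative (the model formula is purely illustrative and not part of the article's analysis):

```r
df <- iris
df$Sepal.Length[c(5, 50, 100)] <- NA   # a few NAs for illustration

# na.action = na.omit drops incomplete rows inside the model call,
# leaving the data frame itself untouched
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length,
          data = df, na.action = na.omit)
nobs(fit)   # 147: the 3 rows with NA were dropped
```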

```r
sapply(df, function(x) sum(is.na(x)))
```

**Method 2: Deleting The Columns Or Variables**

This method is used only when a certain variable has a very high number of NA values compared to the other variables, so that using the previous method would lead to the loss of too many observations. We also need to check whether the given variable is an important predictor of the dependent variable before deciding on the better approach.

Creating another data set from the original dataset "D"

```r
df1 <- D
```

Introducing NA values to the first column of the dataset

```r
df1$Sepal.Length[20:140] <- NA
```

Determining the number of “NA” values in the data set.

```r
sapply(df1, function(x) sum(is.na(x)))
```

We can see that, out of 150 observations, 121 values in the Sepal.Length column are missing.

Deleting the variable Sepal.Length from the dataset.

```r
df1$Sepal.Length <- NULL
# df1 <- df1[, -1]   # another way to do the same thing (run one or the other, not both)
sapply(df1, function(x) sum(is.na(x)))
```

**Method 3: Imputing The Missing Values With Means And Medians**

This is a very common technique for replacing NA values. It is often used when there is not much variation in the data, or when the variable is not an important predictor of the dependent variable. Though one can easily calculate the mean or median to impute the missing values, this method artificially reduces the variation in the dataset.

Moreover, it shrinks the standard error, which invalidates most hypothesis tests. It also misrepresents the variable's relationship with the other variables in the dataset.
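A quick sketch of that artificial variance reduction (the positions of the NAs are arbitrary here):

```r
set.seed(1)
x <- iris$Sepal.Length
x_missing <- x
x_missing[sample(length(x), 40)] <- NA      # knock out 40 of the 150 values

x_imputed <- x_missing
x_imputed[is.na(x_imputed)] <- mean(x_missing, na.rm = TRUE)

sd(x)          # spread of the complete variable
sd(x_imputed)  # smaller, since every imputed value sits exactly at the mean
```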

Creating another data set from the original dataset “D”

```r
df2 <- D
```

Introducing NA values randomly into the dataset.

```r
set.seed(123)
df2 <- as.data.frame(lapply(df2, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE)]))
```

Here I am saving a copy of the true values of Sepal.Length that were replaced by NA, as the vector "**original**". This is done so that later we can calculate the MAE, RMSE, and MAPE to assess the accuracy of each imputation method.


```r
fn <- ifelse(is.na(df2$Sepal.Length), df2$Sepal.Length, 0)
original <- D$Sepal.Length[is.na(fn)]
```

**Imputation With Mean**

Calculating the mean of the variable Sepal.Length and saving it as **predictmean**

```r
predictmean <- round(mean(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df21 <- df2
```

Replacing the missing values in the Sepal.Length column with the mean value

```r
df21$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmean
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**, saving the output in **Accuracy_mean**

```r
library(Metrics)
mae_mean <- mae(original, predictmean)
rmse_mean <- rmse(original, predictmean)
mape_mean <- mape(original, predictmean)
Accuracy_mean <- cbind(mae = mae_mean, rmse = rmse_mean, mape = mape_mean)
Accuracy_mean
```

**Imputation With Median**

Calculating the median of the variable Sepal.Length and saving it as **predictmedian**

```r
predictmedian <- round(median(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df22 <- df2
```

Replacing the missing values in the Sepal.Length column with the median value

```r
df22$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmedian
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**

Saving the output of accuracy in **Accuracy_median**

```r
library(Metrics)
mae_median <- mae(original, predictmedian)
rmse_median <- rmse(original, predictmedian)
mape_median <- mape(original, predictmedian)
Accuracy_median <- cbind(mae = mae_median, rmse = rmse_median, mape = mape_median)
Accuracy_median
```

**Method 4: Imputing The Missing Values With RPART**

This method of imputation is used when the missing data is of MAR type. We build a decision tree model, trained on the rows where the value is present, to predict the missing values. This method can be used to predict both numeric and factor variables. Here we are predicting a numeric variable, so we choose **method = "anova"**; for factor variables we would write **method = "class"**. It is also important to exclude the missing values from the model fit by using **na.action = na.omit**.

We have saved the model by the name **fitrpart**

```r
library(rpart)  # tree-based model
fitrpart <- rpart(Sepal.Length ~ ., data = df2[!is.na(df2$Sepal.Length), ],
                  method = "anova", na.action = na.omit)
```

Here we are predicting the missing values of Sepal.Length with the model and saving it in **predictrpart**

```r
predictrpart <- predict(fitrpart, df2[is.na(df2$Sepal.Length), ])
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**

Saving the output of accuracy in **Accuracy_rpart**

```r
library(Metrics)
mae_rpart <- mae(original, predictrpart)
rmse_rpart <- rmse(original, predictrpart)
mape_rpart <- mape(original, predictrpart)
Accuracy_rpart <- cbind(mae = mae_rpart, rmse = rmse_rpart, mape = mape_rpart)
Accuracy_rpart
```

**Method 5: Imputing The Missing Values With kNN**

The kNN algorithm computes the distance between the data point and its k nearest neighbors using the Euclidean distance in multidimensional space and imputes the missing values with the weighted average of the values taken by the k nearest neighbors.

Things to remember:

1. **k** is the number of nearest neighbors used to estimate the missing data points.

2. **variable** is the variable with missing values that we choose to impute. We can choose more than one variable with **variable = c("a", "b", "c")**, where a, b, and c are the variables containing missing values.

Disadvantages of using kNN:

- Time-consuming in case of large datasets
- The choice of k is critical: higher values may include attributes we do not need, while lower values may miss important attributes of the data
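To make the idea concrete, here is a hand-rolled sketch of the neighbour-averaging step. It uses a plain unweighted mean of the k nearest rows; VIM's kNN differs in its distance scaling and aggregation choices, so treat this only as an illustration of the principle:

```r
d <- iris[, 1:4]
d$Sepal.Length[10] <- NA                 # pretend row 10 is missing

others <- c("Sepal.Width", "Petal.Length", "Petal.Width")
obs    <- d[!is.na(d$Sepal.Length), ]    # rows with the value observed
target <- as.numeric(d[10, others])

# Euclidean distance from the incomplete row to every complete row
dists <- sqrt(rowSums(sweep(as.matrix(obs[, others]), 2, target)^2))

k <- 6
nearest <- order(dists)[1:k]
mean(obs$Sepal.Length[nearest])          # simple average of the k neighbours
```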

Executing the algorithm

```r
library(VIM)
df23 <- kNN(df2, variable = "Sepal.Length", k = 6)
df23$Sepal.Length_imp <- NULL
```

Saving the values of variable Sepal.Length that was imputed using kNN in a vector **predictkNN**

```r
predictkNN <- df23[is.na(df2$Sepal.Length), "Sepal.Length"]
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**, saving the output in **Accuracy_kNN**

```r
library(Metrics)
mae_kNN <- mae(original, predictkNN)
rmse_kNN <- rmse(original, predictkNN)
mape_kNN <- mape(original, predictkNN)
Accuracy_kNN <- cbind(mae = mae_kNN, rmse = rmse_kNN, mape = mape_kNN)
Accuracy_kNN
```

**NOTE:** There is another package, **DMwR**, whose function **knnImputation()** can also be used for the imputation.

**Method 6: Imputing The Missing Values With MICE**

mice (Multivariate Imputation by Chained Equations) is a package that uses multiple imputation for missing data treatment. Since multiple imputation creates several predictions for each missing value, it takes the uncertainty of the imputation into account and gives better standard errors. If there is not much information in the data used to build the model, the imputations will be highly variable, leading to high standard errors in the analysis.

Things to remember:-

- **m** is the number of imputations. More imputations give better estimates of the standard errors; the default value of m is **5**.
- **maxit** is the number of iterations for each imputation. mice uses an iterative algorithm that imputes all variables repeatedly until it reaches convergence.
- **method = "pmm"** refers to the imputation method; here we use **predictive mean matching**.
- **methods(mice)** lists the imputation methods that are available. Continuous data is imputed by predictive mean matching by default, but we could also use **Bayesian linear regression** (**method = "norm"**) for this dataset. Other methods can be chosen according to the data: for example, **method = "logreg"** for binary categorical data (2 levels) and **method = "polyreg"** for categorical data with more than 2 levels.

```r
library(mice)
fitmice <- mice(df2, m = 10, maxit = 30, method = "pmm")
df24 <- complete(fitmice)
```
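Since mice keeps all m completed datasets, the analysis can also be run on each one and the estimates pooled with Rubin's rules, instead of working with a single completed copy. A small self-contained sketch, using the built-in airquality data and an illustrative model:

```r
library(mice)
set.seed(123)
# a short run purely for illustration (fewer iterations than in the article)
imp <- mice(airquality, m = 5, maxit = 5, method = "pmm", printFlag = FALSE)
fit <- with(imp, lm(Ozone ~ Wind + Temp))   # fit the model on each imputed set
summary(pool(fit))                          # combine the m sets of estimates
```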

Saving the values of variable Sepal.Length that was imputed using Mice in a vector **predictmice**

```r
predictmice <- df24[is.na(df2$Sepal.Length), "Sepal.Length"]
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**, saving the output in **Accuracy_mice**

```r
library(Metrics)
mae_mice <- mae(original, predictmice)
rmse_mice <- rmse(original, predictmice)
mape_mice <- mape(original, predictmice)
Accuracy_mice <- cbind(mae = mae_mice, rmse = rmse_mice, mape = mape_mice)
Accuracy_mice
```

**Method 7: Imputing The Missing Values With missForest**

The missForest function is used particularly in the case of mixed-type data. It can be used to impute continuous and categorical data including complex interactions and non-linear relations. It uses the given data in the data frame to train the random forest model and then uses the model to predict the missing values. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation.

```r
library(missForest)
# Executing the algorithm
fitmissforest <- missForest(df2)
# Saving the output in df25
df25 <- fitmissforest$ximp
```
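The OOB error estimate mentioned above is returned alongside the imputed data. A small self-contained sketch (prodNA ships with missForest and is used here only to create some NAs):

```r
library(missForest)
set.seed(123)
d   <- prodNA(iris, noNA = 0.1)   # add roughly 10% NAs across the data frame
fit <- missForest(d)
fit$OOBerror                      # NRMSE for the numeric columns, PFC for Species
```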

Saving the values of variable Sepal.Length that was imputed using missForest in a vector **predictmissforest**

```r
predictmissforest <- round(df25[is.na(df2$Sepal.Length), "Sepal.Length"], digits = 1)
```

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library **Metrics**, saving the output in **Accuracy_missforest**

```r
library(Metrics)
mae_missforest <- mae(original, predictmissforest)
rmse_missforest <- rmse(original, predictmissforest)
mape_missforest <- mape(original, predictmissforest)
Accuracy_missforest <- cbind(mae = mae_missforest, rmse = rmse_missforest, mape = mape_missforest)
Accuracy_missforest
```

Creating a data frame **Accuracy** consisting of the accuracy outputs of the last **six** imputations

```r
Accuracy <- cbind(Methods = c("Mean", "Median", "rpart", "kNN", "mice", "missForest"),
                  rbind.data.frame(Accuracy_mean, Accuracy_median, Accuracy_rpart,
                                   Accuracy_kNN, Accuracy_mice, Accuracy_missforest))
library(ggplot2)
```

Creating a scatter plot to visualize the method with the least **MAPE** (mean absolute percentage error). Similarly, you can create the plots for **MAE** and **RMSE**.

Scatter plot

```r
# Scatter plot
ggplot(data = Accuracy, aes(x = Methods, y = mape)) + geom_point()
```

From the plot above we can see that, for this dataset, mice and missForest give the best output. But we cannot conclude that mice or missForest will be the best method for every dataset, because missing values come in different types and patterns. So the best approach is to test the models on the given data and then use the best one to impute the missing values.

**Other Missing Values Treatment Methods**

- Missing value imputation with **missMDA**
- Missing value imputation with **Fuzzy K-means Clustering**
- Missing value imputation with **bpca (Bayesian Principal Component Analysis)**