Missing Value Imputation Techniques In R

Missing values occur when no data is available for a column of an observation. In R, they are represented by the symbol “NA”, which stands for “Not Available”. Missing values introduce vagueness and reduce the interpretability of any statistical data analysis. In this article, I will take you through missing value imputation techniques in R with sample data. There are several types of missing values in a dataset.

Let’s see the three main types of missing values according to their pattern of occurrence in a data set.

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Not missing at random (NMAR)

Missing Completely at Random (MCAR)

MCAR occurs when the missing values arise entirely at random and are independent of the other variables in the observation. Here we assume that the variable with missing data is completely unrelated to the other variables or columns in the data. For example:

Suppose we have a database of school students with four columns: Student.Id, Name, Gender, and Number of Subjects. With the data available, we cannot determine the number of subjects for an observation where it is missing, because the missing data is completely independent of the other observations in the data.

Missing at Random (MAR)

An alternative assumption to MCAR is MAR or Missing at Random. It assumes that we can predict the missing value on the basis of other available data.

From the given data we can build a predictive model in which Number of Subjects is predicted on the basis of independent variables such as class and age. So in these cases we can use advanced imputation techniques to determine the missing values.

MAR is always a safer assumption than MCAR. This is because any statistical analysis which is performed under the assumption of MCAR is also valid for MAR, but the reverse is not true.

Not Missing at Random (NMAR)

NMAR is also known as nonignorable missing data. It is completely different from MCAR and MAR: it is the case where we cannot recover the missing values with any of the advanced imputation techniques. For example, a very sensitive question in a questionnaire is likely to be skipped by the people filling it out, and we cannot infer their answers from the rest of the data. This is known as not-missing-at-random data.

In this article, I have used the iris data set, which ships with R. Though the dataset does not have any missing values, I have introduced missing values randomly into the data set to demonstrate the most popular methods of missing value treatment.

Saving the dataset in a dataframe “D”.
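The original code is not shown, so here is a minimal sketch of this step (the name D follows the text; the exact call is assumed):

```r
# Save the built-in iris data set in a data frame "D"
D <- iris

# Inspect the structure: 150 observations of 5 variables
str(D)
```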

The data frame has five columns

  • Sepal.Length
  • Sepal.Width
  • Petal.Length
  • Petal.Width
  • Species

Of these variables, the first four are numeric and the fifth variable “Species” is a factor with three levels.
NOTE: I have used only the first column i.e. Sepal.Length to explain the imputation techniques.

Method 1: Deleting The Observations

This method is best avoided unless the missing data is of the MCAR type. We have to check whether deleting the data will affect any of the statistical analysis done with the data. Moreover, it should only be performed if a sufficient amount of data remains after deleting the observations with “NA” values, and if deleting them does not create any bias or remove the representation of any variable.

Creating another data set from the original dataset “D”

Introducing “NA” values randomly into the dataset.

Determining the number of “NA” values in the data set.

Deleting the observations or rows which have “NA” values.

Now, there is also an alternative to this, which is using na.action = na.omit directly in the model.
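Since the original code appeared only as screenshots, the whole Method 1 workflow can be sketched as follows; the copy name D1, the random seed, and the number of NAs introduced (30) are my own choices:

```r
# Create another data set from the original data frame D
D <- iris
D1 <- D

# Introduce "NA" values at random positions in Sepal.Length
set.seed(123)
D1$Sepal.Length[sample(1:nrow(D1), 30)] <- NA

# Determine the number of "NA" values in the data set
sum(is.na(D1))            # 30

# Delete the observations (rows) which have "NA" values
D1_complete <- na.omit(D1)
nrow(D1_complete)         # 120

# Alternative: drop the NA rows inside the model call itself
# lm(Petal.Length ~ Sepal.Length, data = D1, na.action = na.omit)
```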

Method 2: Deleting The Columns Or Variables

This method is used only when a certain variable has a very high number of NA values in comparison to the other variables, so that using the previous method would lead to the loss of too many observations from the dataset. Here we also need to check whether the given variable is an important predictor of the dependent variable or not, and then decide on the better approach to deal with it.

 Creating another data set from the original dataset “D”

Introducing NA values to the first column of the dataset

Determining the number of “NA” values in the data set.

We can see that out of 150 observations, 121 values in the Sepal.Length column are missing.

Deleting the variable Sepal.Length from the dataset.
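A sketch of the Method 2 steps above (again, the copy name D2, the seed, and the 121 NA positions are my own choices, matching the count mentioned in the text):

```r
# Create another data set from the original data frame D
D2 <- iris

# Introduce a very high number of NA values into the first column
set.seed(123)
D2$Sepal.Length[sample(1:nrow(D2), 121)] <- NA

# Determine the number of "NA" values in the data set
sum(is.na(D2$Sepal.Length))   # 121 of 150 values are missing

# Delete the variable Sepal.Length from the data set
D2$Sepal.Length <- NULL
names(D2)
```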

Method 3: Imputing The Missing Values With Means And Medians

This is a very common technique of replacing the NA values. It is often used when there is not much variation in the data or the variable is not that important predictor of the dependent variable. Though one can easily calculate the mean or median value to impute the missing values, this method leads to an artificial reduction of the variation in the dataset.

Moreover, it reduces the standard error which invalidates most hypothesis tests. Also, it introduces a wrong representation of the relationship of the variable with other variables in the dataset. 

Creating another data set from the original dataset “D”

Introducing NA values randomly into the dataset.

Here I am saving a copy of the variable Sepal.Length as “original”, consisting of the values which have been replaced as NA values in the data set. This is done so that later one can calculate the MSE, RMSE, and MAPE to see the accuracy of the imputation method.


Imputation With Mean

Calculating the value of mean for the variable Sepal.Length and saving it as predictmean

Replacing the missing values in the Sepal.Length column with the mean value

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”, saving the output in Accuracy_mean.
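The mean-imputation steps can be sketched as below. The seed and variable names are assumptions; the accuracy measures are computed with base R formulas, which give the same quantities as mse(), rmse(), and mape() in the Metrics package:

```r
D3 <- iris

# Introduce NA values at random, keeping the original values aside so we
# can check the accuracy of the imputation later
set.seed(123)
D3$Sepal.Length[sample(1:nrow(D3), 30)] <- NA
miss <- is.na(D3$Sepal.Length)
original <- iris$Sepal.Length[miss]

# Impute the NA values with the mean of the observed values
predictmean <- mean(D3$Sepal.Length, na.rm = TRUE)
D3$Sepal.Length[miss] <- predictmean

# Accuracy: MSE, RMSE, and MAPE (Metrics::mse(), rmse(), and mape()
# compute the same quantities)
err <- original - D3$Sepal.Length[miss]
Accuracy_mean <- c(MSE  = mean(err^2),
                   RMSE = sqrt(mean(err^2)),
                   MAPE = mean(abs(err / original)))
Accuracy_mean
```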

Imputation With Median

Calculating the value of the median for the variable Sepal.Length and saving it as predictmedian

Replacing the missing values in the Sepal.Length column with the median value

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”

Saving the output of accuracy in Accuracy_median.
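Median imputation follows the same pattern; a sketch under the same assumptions (seed and names are my own):

```r
D3b <- iris
set.seed(123)
D3b$Sepal.Length[sample(1:nrow(D3b), 30)] <- NA
miss <- is.na(D3b$Sepal.Length)
original <- iris$Sepal.Length[miss]

# Impute the NA values with the median of the observed values
predictmedian <- median(D3b$Sepal.Length, na.rm = TRUE)
D3b$Sepal.Length[miss] <- predictmedian

# Accuracy: MSE, RMSE, and MAPE
err <- original - D3b$Sepal.Length[miss]
Accuracy_median <- c(MSE  = mean(err^2),
                     RMSE = sqrt(mean(err^2)),
                     MAPE = mean(abs(err / original)))
Accuracy_median
```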

Method 4: Imputing The Missing Values With RPART

This method of imputation is used when the missing data is of the MAR type. We create a decision tree model, trained on the observations where the value is present, to predict the values of the missing data. This method can predict both numeric and factor variables. Here we are predicting a numeric variable, so we choose method = “anova”; for factor variables we would write method = “class”. It is also important to exclude the missing values from the model fitting by using na.action = na.omit.

We have saved the model by the name fitrpart

Here we are predicting the missing values of Sepal.Length with the model and saving it in predictrpart

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”

Saving the output of accuracy in Accuracy_rpart.
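A sketch of the rpart workflow described above (the rpart package ships with R; the seed and variable names are my own assumptions):

```r
library(rpart)

D4 <- iris
set.seed(123)
D4$Sepal.Length[sample(1:nrow(D4), 30)] <- NA
miss <- is.na(D4$Sepal.Length)

# Train a decision tree on the rows where Sepal.Length is present;
# method = "anova" because the target is numeric (use "class" for factors)
fitrpart <- rpart(Sepal.Length ~ ., data = D4,
                  method = "anova", na.action = na.omit)

# Predict the missing values and fill them in
predictrpart <- predict(fitrpart, D4[miss, ])
D4$Sepal.Length[miss] <- predictrpart
```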

Method 5: Imputing The Missing Values With kNN

The kNN algorithm computes the distance between the data point and its k nearest neighbors using the Euclidean distance in multidimensional space and imputes the missing values with the weighted average of the values taken by the k nearest neighbors.

Things to remember:

  1. k is the number of nearest neighbors used to find the values of the missing data points.
  2. variable is the variable consisting of missing values that we choose to impute. More than one variable can be chosen with variable = c(“a”, “b”, “c”), where a, b, and c are the variables that contain missing values.

Disadvantages of using kNN

  • Time-consuming in case of large datasets
  • The choice of the value of k is critical: higher values may include attributes we do not need, while lower values may miss important attributes of the data

Executing the algorithm

Saving the values of the variable Sepal.Length that were imputed using kNN in a vector predictkNN.

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”, saving the output in Accuracy_kNN.
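A sketch of the kNN imputation using the kNN() function from the VIM package (which the choice of arguments above suggests; the seed, k = 5, and names are my own choices):

```r
library(VIM)  # provides the kNN() imputation function

D5 <- iris
set.seed(123)
D5$Sepal.Length[sample(1:nrow(D5), 30)] <- NA
miss <- is.na(D5$Sepal.Length)

# Impute Sepal.Length from its k = 5 nearest neighbours
D5imp <- kNN(D5, variable = "Sepal.Length", k = 5)

# Save the imputed values in a vector predictkNN
predictkNN <- D5imp$Sepal.Length[miss]
```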

NOTE: There is another package, DMwR, whose function knnImputation() can also be used to do the imputation.

Method 6: Imputing The Missing Values With MICE

MICE, or Multivariate Imputation by Chained Equations, is a package that uses multiple imputation for missing data treatment. Because multiple imputation creates several predictions for each missing value, it takes the uncertainty of the imputation into account and yields more reliable standard errors. If there is not much information in the data used to fit the model, the imputations will be highly variable, leading to high standard errors in the analysis.

Things to remember:-

  1. m is the number of imputations. More imputations give better estimates of the standard errors; the default value of m is 5.
  2. maxit is the number of iterations for each imputation. mice uses an iterative algorithm that imputes all variables repeatedly until it reaches a point of convergence.
  3. method = “pmm” refers to the imputation method; here we use predictive mean matching. methods(mice) lists the available imputation methods. Continuous data is imputed by predictive mean matching by default, but we could also use Bayesian linear regression (method = “norm”) for this dataset. Other methods can be chosen according to the requirements of the data, e.g. method = “logreg” for binary categorical data (2 levels) and method = “polyreg” for categorical data with more than 2 levels.

Saving the values of the variable Sepal.Length that were imputed using mice in a vector predictmice.

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”, saving the output in Accuracy_mice.
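A sketch of the mice workflow with the parameters discussed above (the seed, maxit value, and names are my own choices):

```r
library(mice)

D6 <- iris
set.seed(123)
D6$Sepal.Length[sample(1:nrow(D6), 30)] <- NA
miss <- is.na(D6$Sepal.Length)

# m = 5 imputations, 10 iterations each, predictive mean matching
imputed <- mice(D6, m = 5, maxit = 10, method = "pmm",
                seed = 500, printFlag = FALSE)

# complete() returns one filled-in data set (here the first of the five)
D6complete <- complete(imputed, 1)
predictmice <- D6complete$Sepal.Length[miss]
```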

Method 7: Imputing The Missing Values With missForest

The missForest function is used particularly in the case of mixed-type data. It can be used to impute continuous and categorical data including complex interactions and non-linear relations. It uses the given data in the data frame to train the random forest model and then uses the model to predict the missing values. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation.

Saving the values of the variable Sepal.Length that were imputed using missForest in a vector predictmissforest.

For checking the accuracy, we calculate MSE, RMSE, and MAPE with the help of the library “Metrics”, saving the output in Accuracy_missforest.
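A sketch of the missForest workflow (the seed and names are my own choices; missForest() returns the imputed data in $ximp and the OOB error in $OOBerror):

```r
library(missForest)

D7 <- iris
set.seed(123)
D7$Sepal.Length[sample(1:nrow(D7), 30)] <- NA
miss <- is.na(D7$Sepal.Length)

# missForest imputes every missing value and reports an out-of-bag
# (OOB) imputation error estimate
fitforest <- missForest(D7)
fitforest$OOBerror

D7complete <- fitforest$ximp
predictmissforest <- D7complete$Sepal.Length[miss]
```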

Creating a data frame Accuracy consisting of the accuracy outputs of all the imputation methods above.

Creating a scatter plot to visualize the best method, i.e. the one with the least MAPE (mean absolute percentage error). Similarly, you can draw the plots for MSE and RMSE.
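A self-contained sketch of this comparison, illustrated with just two of the methods (mean and median imputation) so it runs on its own; the same pattern extends to the accuracy vectors of the other methods:

```r
# Reintroduce the same NA pattern and keep the true values aside
set.seed(123)
D8 <- iris
D8$Sepal.Length[sample(1:nrow(D8), 30)] <- NA
miss <- is.na(D8$Sepal.Length)
original <- iris$Sepal.Length[miss]

# MAPE helper (same quantity as Metrics::mape)
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))

# One row per method, with its MAPE on the held-out true values
Accuracy <- data.frame(
  Method = c("Mean", "Median"),
  MAPE = c(mape(original, rep(mean(D8$Sepal.Length, na.rm = TRUE), sum(miss))),
           mape(original, rep(median(D8$Sepal.Length, na.rm = TRUE), sum(miss))))
)

# Scatter plot of MAPE by method
plot(Accuracy$MAPE, xaxt = "n", xlab = "Method", ylab = "MAPE", pch = 19)
axis(1, at = seq_along(Accuracy$Method), labels = Accuracy$Method)
```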

Scatter plot

So, from the above diagram we can see that for the given data set, mice and missForest give the best output. But we cannot conclude that mice or missForest will be the best method for every dataset, because missing values have different types and patterns. The best way is to test the methods on the given data and then use the best one to impute the missing values in the data set.

Other Missing Values Treatment Methods

Missing values imputation with missMDA
Missing values imputation with Fuzzy K-means Clustering
Missing values imputation with bpca (Bayesian Principal Component Analysis)
