Data Science: Python for Data Preprocessing

In the last article, we had a look at how the raw vegetables are obtained for preparing a dish, i.e. raw data extraction through web scraping from various websites. Now it’s time to clean these vegetables and process them to make them fit for various dishes.

Similarly, we will be preprocessing the data by cleaning it, removing insignificant features and then performing data exploration. These steps comprise data preprocessing/data wrangling, which is mandatory before data visualization and feature engineering.

Continuing from the first part of this series, we will be looking at the different techniques involved in preprocessing data. This series of articles covers the following topics:-

  1. Web Scraping
  2. Data Preprocessing
  3. Feature Engineering and Model Selection

We will look into the following topics which are covered under preprocessing of raw data:-

  1. Data Preprocessing
    • Data Exploration
      • Identifying variables
    • Data Cleaning
      • Dropping Features
      • Missing Values Identification
      • Formatting the data.
    • Data Visualization/ Exploratory Data Analysis
  2. Text data preprocessing

Data Preprocessing

Data preprocessing involves a collection of steps which help to purify the data, extract the useful information and remove the insignificant information. Data obtained from the real world is incomplete, inconsistent and contains numerous errors. To counter these issues, we use data preprocessing, which helps in removing discrepancies in names and related problems.

In the real world, we generally have two broad categories of data: the first kind has continuous and categorical features, while the second kind is text data. These two types of data require different preprocessing steps. First, let’s have a look at the steps for continuous and categorical features.

To understand the following steps of data handling and visualization, we will be using two different datasets which are available on Kaggle. The first dataset contains information about the cities of India, and the other dataset is related to Online Shopping.

Data Exploration

Initially, after loading the dataset, we have to study it to gather general insights about the data it contains. To gather these insights, we first look at the continuous features. For this, we will use the head(), info() and describe() functions of pandas.

The head() function helps us to know the columns and rows contained in the dataset. By default, the head() function displays the first 5 rows of the dataset.
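A minimal sketch of this first look, assuming the cities data has been saved as a CSV file named cities.csv (a hypothetical filename):

```python
import pandas as pd

# Load the cities dataset; the filename is hypothetical.
cities = pd.read_csv("cities.csv")

# Display the first 5 rows to get a feel for the columns.
print(cities.head())
```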

After looking at some initial values, we can determine the shape, i.e. the number of rows and columns in the dataset, by using the shape attribute of the dataframe.
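Continuing the sketch:

```python
# (number of rows, number of columns)
print(cities.shape)

# Non-null counts and data types of each column.
cities.info()
```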

The info() function tells us the count of non-null values in each column. By comparing the row count from the shape attribute with the counts shown by info(), we can tell whether there are any cells with NaN values. In this example, we have no empty cells, as all 493 rows contain values. info() also shows the data type of every column in the dataset.

Identifying variables

We can look at the variables/columns of the data which can act as predictor variables and target variables. Identifying them helps to build a connection between such variables. In this example, state_name and name_of_city are the predictor variables, whereas the other columns can act as target variables.

Moreover, we can also remove the columns which are of no significance and cannot provide any insight into the dataset. For example, in this dataset, we will remove state_code, dist_code and location, as they are not of any use. We will remove them from our dataframe so that these columns do not hinder the results of other operations.

The drop() function removes the columns from the dataframe, and the axis = 1 parameter specifies that we want to remove columns rather than rows.
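For instance, the three columns mentioned above can be dropped like this (continuing the sketch from earlier):

```python
# Drop insignificant columns; axis=1 tells pandas we are dropping columns, not rows.
cities = cities.drop(['state_code', 'dist_code', 'location'], axis=1)
print(cities.columns)
```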

Univariate Analysis

In univariate analysis, we analyse variables based on their type, i.e. whether they are continuous or categorical. If we have continuous variables, which is the case with our dataset, we can use the describe() function. Through this function, we can discern the different measures which reveal the central tendency and spread of a variable.
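For the cities dataframe from the sketch above, this looks like:

```python
# Summary statistics of the continuous columns:
# count, mean, std, min, 25%, 50%, 75% and max.
print(cities.describe())
```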

The values which the describe() function displays are the count, mean, standard deviation, minimum and maximum, along with the three quartile values. For categorical variables, we can use a frequency table for a better understanding of the different categories.

As the column ‘state_name’ is categorical, we can find the frequency of the different states in the dataset. This is done by using the value_counts() function.
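For example:

```python
# Frequency table of the categorical column state_name.
print(cities['state_name'].value_counts())
```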

Data Cleaning

Now, after having gone through the data, we will clean the dataset by removing erroneous values and redundant data. We would also want to handle the columns with missing values. Moreover, it is suggested to nullify the impact of outliers as well.

For performing data cleaning, we would be using a different dataset which is related to Online Shopping.

When it comes to data cleaning, we focus on three main steps, which are as follows:-

Dropping Features

As we saw earlier, we removed three columns from the dataset as they were of no significance. Similarly, if there are columns which have a large number of missing values, then it is recommended to drop those columns entirely. Along with this, at this stage we perform the changes which are necessary to make the data apt for the remaining preprocessing steps.

As we can see, the column names start with capital letters, so we would want them to be in lowercase so that they do not hinder further processing.

By using the rename() function, we can rename the columns as per our requirement.
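A minimal sketch, assuming the online-shopping data is in a CSV named online_shopping.csv and has original column names such as CustomerID and Description (all hypothetical):

```python
import pandas as pd

# Load the online-shopping dataset; the filename is hypothetical.
shop = pd.read_csv("online_shopping.csv")

# rename() takes a mapping from old column names to new ones.
shop = shop.rename(columns={'CustomerID': 'cust_id',
                            'Description': 'description'})
print(shop.columns)
```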

We can clearly see that the column names have been changed to lowercase.

Missing Value Treatment

Managing missing values in the dataset plays a crucial part in data preprocessing. If we do not handle them, we can get misleading results. First, to check for missing values, we can use the following code snippet.
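A minimal sketch of such a check, using the shop dataframe from the sketch above:

```python
# Count the missing values in each column, sorted in ascending order.
print(shop.isnull().sum().sort_values())

# Look at the rows that contain at least one missing value.
print(shop[shop.isnull().any(axis=1)])
```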

In the above code snippet, we look at the null values and sort them in ascending order. We can see that the cust_id column has the highest number of missing values.

We also look at the rows that contain missing values. After knowing this, we can decide how to handle them. The different ways of managing missing values are as follows:-

  • We can fill these missing values with a constant value such as ‘0’
  • Ignoring (dropping) the missing values, if they are few in number
  • Another way is to fill these missing values with the mean, mode or median of the column. This method is preferred over the other two (a sketch is shown after this list).
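A short sketch of the filling options, again using the cust_id column (in this example we will instead drop the rows, as shown next):

```python
# Fill the missing values with a constant such as 0 ...
shop_filled = shop.fillna({'cust_id': 0})

# ... or with a central tendency of the column, e.g. the median.
shop_filled = shop.fillna({'cust_id': shop['cust_id'].median()})
```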

By using the dropna() function, we can remove the rows with empty values and store the result in a new dataframe.
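For example:

```python
# Drop the rows that contain missing values and store the result in a new dataframe.
shop_new = shop.dropna()

# Re-check: every column should now report zero missing values.
print(shop_new.isnull().sum())
```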

When we check again for missing values, we can see that there are none left.

Formatting Data

Formatting data involves making data types compatible across columns, removing abbreviations and creating new columns from the values of existing columns. In this example, we will format the case of the description text and the data type of the cust_id column.

If we look at this dataset, the description column is in upper case. To change it, we will use the lower() function to convert the descriptions to lower case.
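A sketch of this step, using the pandas string accessor on the cleaned dataframe from above:

```python
# Convert the text in the description column to lower case.
shop_new['description'] = shop_new['description'].str.lower()
print(shop_new['description'].head())
```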

We can also see that the cust_id column is of float type; to change this, we will use the code shown below.
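A minimal sketch of the conversion, using astype() (one common way to change a column's data type):

```python
# cust_id was read as float; cast it to an integer type.
shop_new['cust_id'] = shop_new['cust_id'].astype(int)
print(shop_new['cust_id'].dtype)
```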

After this conversion, the cust_id column is of integer type. This shows how to change a data type as per our needs.

Data Visualization/ Exploratory Data Analysis

After completing the preprocessing of the data, the next step is data visualization. This is also known as Exploratory Data Analysis. We will use both datasets for visualization and for getting insights from them. First, let’s look at some visualizations from the cities dataset.

We can display the top 5 cities with the highest population, along with all their column values.
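A sketch of such a query, assuming the population column is named population_total (a hypothetical column name):

```python
# Top 5 cities by population, with all columns retained.
top_cities = cities.sort_values('population_total', ascending=False).head(5)
print(top_cities)
```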

For obtaining the states with the highest population, we will use the groupby() function and then plot from the resulting high_pop dataframe.
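A sketch of the grouping and plotting, again assuming a hypothetical population_total column:

```python
import matplotlib.pyplot as plt

# Total population per state, with the most populous states first.
high_pop = (cities.groupby('state_name')['population_total']
            .sum()
            .sort_values(ascending=False))

# Bar plot of the state-wise population.
high_pop.plot(kind='bar', figsize=(12, 6))
plt.ylabel('Total population')
plt.show()
```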

It is clearly discernible from this plot that states like Uttar Pradesh and Maharashtra have the highest population, whereas states like Meghalaya, Mizoram, and Nagaland have the lowest.

Now let’s look at some of the visualizations of the other dataset.

One useful view is the highest-spending customers from different countries; a sketch of how it could be computed is shown below.
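This assumes hypothetical quantity, unit_price and country columns in the online-shopping data:

```python
# Spend per order line.
shop_new['spend'] = shop_new['quantity'] * shop_new['unit_price']

# Highest-spending customers along with their country.
top_customers = (shop_new.groupby(['cust_id', 'country'])['spend']
                 .sum()
                 .sort_values(ascending=False)
                 .head(10))
print(top_customers)
```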

Another useful view is the number of orders placed on different days of the week; a sketch is shown below. There can be many more visualizations as well; it depends on our creativity and curiosity about what we want to learn from the dataset.
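This assumes hypothetical invoice_date and invoice_no columns:

```python
import matplotlib.pyplot as plt

# Number of distinct orders placed on each day of the week.
shop_new['invoice_date'] = pd.to_datetime(shop_new['invoice_date'])
orders_per_day = (shop_new.groupby(shop_new['invoice_date'].dt.day_name())['invoice_no']
                  .nunique())
orders_per_day.plot(kind='bar')
plt.ylabel('Number of orders')
plt.show()
```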

Text data preprocessing

Most of the data obtained from websites and other sources is text data, and thus it needs to be processed in a different manner. I have covered text data preprocessing in an earlier article on Natural Language Processing.

You can have a look at that article to learn the steps required for text data preprocessing. Generally, the steps involved in preprocessing text data are as follows:-

Converting text to lowercase: This is recommended because it removes anomalies between identical words which appear in both upper and lower case.

Removing numbers: It is a tedious task to process text data with numbers, so we remove them.

Removing punctuation and special characters: Punctuation and special characters provide no information, so we always drop them from the text.

Removing stop words: Stop words are common words like a, an, the, etc. which occur in abundance in the data but are of no use, so they are also removed.
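A minimal sketch of these steps, using Python's standard library plus NLTK's stop-word list (assuming NLTK and its stopwords corpus are installed):

```python
import re
import string

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

def preprocess_text(text):
    """Apply the basic text-cleaning steps described above."""
    # 1. Convert the text to lowercase.
    text = text.lower()
    # 2. Remove numbers.
    text = re.sub(r'\d+', '', text)
    # 3. Remove punctuation and special characters.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 4. Remove stop words.
    words = [w for w in text.split() if w not in stopwords.words('english')]
    return ' '.join(words)

print(preprocess_text("The 3 quick brown foxes jumped over 2 lazy dogs!"))
```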

After the above steps, there are some more specific steps which can optionally be taken as per our requirements. The Jupyter notebooks for this article can be referred to here.
