DATA HANDLING IN R

DATA HANDLING WITH THE ‘dplyr’ PACKAGE IN R

What is data handling?
The first skill a data analyst or data scientist needs is knowing how to handle data. So what is data handling?

Data handling means gathering and recording information and presenting it in a way that is meaningful to others.

Let us take an example. Think back to the early 2000s: do you remember the phone directory, which listed people’s names and their phone numbers? The names were arranged in alphabetical order, that is, in a systematic manner, which is why it was possible to find the number of a particular person. This is an example of data handling, as the data is arranged in a way that is meaningful to others.

There are two different approaches to data handling: the statistical approach and the non-statistical approach.

The non-statistical approach to data handling is simply arranging your data in a form that is meaningful to others. It can be as simple as arranging names in alphabetical order on a sheet of paper, so that when we want the information for a given person we can find it easily.

The statistical approach is arranging the data in a meaningful manner and extracting summary information from it. Suppose that we have observations on the weight of 1,000 students in a random sequence. Just by looking at the raw data, we can’t say anything about the distribution of the weights of the students. To learn something about the data, we have to arrange it in a given order and find quantities such as its mean and standard deviation. These are some of the points we have to keep in mind before starting the analysis of any data.
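As a rough illustration of the statistical approach, the sketch below orders a set of weights and computes their mean and standard deviation in base R. The weights are simulated purely for illustration (the values 55 and 8 and the name “weights” are assumptions, not real measurements).

```r
# Hypothetical data: 1,000 simulated student weights in kg (not real measurements)
set.seed(42)
weights <- rnorm(1000, mean = 55, sd = 8)

# Arrange the observations in ascending order
sorted_weights <- sort(weights)

# Summarise the distribution with its mean and standard deviation
weight_mean <- mean(weights)
weight_sd <- sd(weights)

head(sorted_weights)
weight_mean
weight_sd
```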

dplyr package in R programming

One of the most important packages in R programming is the dplyr package, which is used for handling and manipulating data frames. The d in the name reinforces that the package is meant to work with data.frames in R. The dplyr package can be used to extract columns (i.e. variables) from a data frame, extract rows from a data frame, add new variables to a data frame, apply functions to the variables of a data frame, and split the data according to a variable.

In this article, we will learn these features of the dplyr package through examples, using the “mtcars” data that ships with R.

About the data

The data frame consists of 32 observations on 11 variables. The dataset comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
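The dimensions and variables described above can be checked directly in base R; this short sketch needs no extra packages.

```r
# mtcars ships with base R (datasets package)
data(mtcars)

dim(mtcars)    # 32 rows, 11 columns
names(mtcars)  # the 11 variable names
str(mtcars)    # type and first values of each variable
```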

Selecting different columns (“select” function)

While analyzing data, we often need to select particular columns, i.e. extract particular variables from a data frame. The dplyr package is very handy for this task. Suppose we have the ‘mtcars’ data and we want to extract the “mpg” variable from this dataset; the “select” function in the dplyr package does exactly this.

Code:
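A minimal sketch of such a call, assuming the dplyr package is installed. The object names “mpg_only” and “mpg_cyl_hp” are illustrative choices only.

```r
library(dplyr)

# Keep only the mpg column; the result is still a data frame
mpg_only <- select(mtcars, mpg)
head(mpg_only)

# Several columns can be selected at once
mpg_cyl_hp <- select(mtcars, mpg, cyl, hp)
head(mpg_cyl_hp)
```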

Columns can also be selected by partial matching on the column names. This is done with the helper functions “starts_with”, “ends_with” and “contains”, which are available once “dplyr” is loaded.
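The sketch below shows one possible use of each helper on the “mtcars” columns; the search strings and object names are chosen only for illustration.

```r
library(dplyr)

# Columns whose names start with "d": disp, drat
d_cols <- select(mtcars, starts_with("d"))

# Columns whose names end with "t": drat, wt
t_cols <- select(mtcars, ends_with("t"))

# Columns whose names contain "a": drat, am, gear, carb
a_cols <- select(mtcars, contains("a"))

names(d_cols)
names(t_cols)
names(a_cols)
```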


Selecting different rows (“filter” and “slice” functions)

Many times we have to select rows of the data frame using logical expressions. This is done using the “filter” function of the “dplyr” package in R. Suppose we have to select all those automobiles which have 4 gears. This can be done using the “filter” function.

Code:
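A minimal sketch of such a filter, assuming dplyr is installed; “four_gears” is an illustrative name.

```r
library(dplyr)

# Keep only the rows where the gear column equals 4
four_gears <- filter(mtcars, gear == 4)

nrow(four_gears)
head(four_gears)
```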

Different logical operators can be used in the “filter” function. Suppose we have to extract the information for all the automobiles which have 4 gears and 2 carburettors. This can be done using the code:
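One way to write this condition is sketched here, assuming dplyr is installed. Conditions can be joined with “&”, or passed as separate arguments, which “filter” also combines with a logical AND.

```r
library(dplyr)

# Rows with 4 gears AND 2 carburettors, using the & operator
gear4_carb2 <- filter(mtcars, gear == 4 & carb == 2)

# Equivalent form: comma-separated conditions are combined with AND
gear4_carb2_alt <- filter(mtcars, gear == 4, carb == 2)

gear4_carb2
```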

 

While the filter command is used for specifying rows with logical expressions, the “slice” command is used for selecting rows by row numbers.
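A short sketch of “slice”, assuming dplyr is installed; the row positions are arbitrary examples.

```r
library(dplyr)

# Rows 1 to 5 by position
first_five <- slice(mtcars, 1:5)

# Any vector of positions works, e.g. rows 1, 10 and 32
picked <- slice(mtcars, c(1, 10, 32))

first_five
picked
```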

Adding new columns to the data (“mutate” function)

In most data analysis tasks, there is a need to modify existing columns or add new columns to the data frame. The “mutate” function of the “dplyr” package comes in very handy here. For example, if the data contains a “price” variable holding the price in dollars and we want to add another variable, “price2”, holding the price in rupees, we can calculate “price2” by multiplying “price” by a suitable conversion rate.

In such situations, the “mutate” function can be used. The “mutate” function can be understood using the code below:

Let us create a new variable “y” in the “mtcars” data frame as (mpg/cyl):
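A minimal sketch of this step, assuming dplyr is installed. The result is stored in a new object (“with_y”, an illustrative name) so the built-in “mtcars” data is left unchanged.

```r
library(dplyr)

# Add a new column y defined as mpg divided by cyl
with_y <- mutate(mtcars, y = mpg / cyl)

head(select(with_y, mpg, cyl, y))
```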

In a similar manner, an existing column can be modified with the “mutate” function of the “dplyr” package by assigning to its current name.
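As an illustration of modifying an existing column, the sketch below overwrites “wt”, which “mtcars” records in units of 1,000 lbs, with the weight in lbs. The choice of column and the name “wt_in_lbs” are assumptions made for this example.

```r
library(dplyr)

# Reusing an existing column name on the left-hand side overwrites that column
wt_in_lbs <- mutate(mtcars, wt = wt * 1000)

head(select(wt_in_lbs, mpg, wt))
```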

Applying functions to different columns of the data frame (“summarise” function):

One of the most important functions in the “dplyr” package is the “summarise” function, which is used to apply functions to a column in a data frame. The “summarise” function applies a function such as mean or median to a column of a data frame and returns a result of length one.

Suppose we have a data frame and we want to find the average value for a specific column, then, in this case, the “summarise” can be used.

Let us find the mean of the “mpg” variable in the “mtcars” data frame:
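A minimal sketch, assuming dplyr is installed; the output column name “mean_mpg” is an illustrative choice.

```r
library(dplyr)

# One-row data frame holding the mean of the mpg column
avg_mpg <- summarise(mtcars, mean_mpg = mean(mpg))

avg_mpg
```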

Multiple functions can be used in a single “summarise” call, each producing its own column in the result.
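For instance, several summaries of “mpg” could be computed in one call, as in this sketch (the output column names are illustrative):

```r
library(dplyr)

# Each named argument becomes one column of the one-row result
mpg_stats <- summarise(
  mtcars,
  mean_mpg   = mean(mpg),
  median_mpg = median(mpg),
  sd_mpg     = sd(mpg),
  n_cars     = n()   # n() counts the rows being summarised
)

mpg_stats
```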

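The “summarise” function gives the same value as calling the corresponding “base” R function directly on the column, for example mean(mtcars$mpg), but it returns the result as a data frame.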

Grouping data using a factor variable and then applying a function to a column (“group_by” function):

Sometimes we need to group the data using a factor variable present in the data and then apply a function to a column within each group. Consider the “mtcars” data frame: the variable “cyl” represents the number of cylinders in the automobile, and it can be used to partition the data. We can then find the average “mpg” for automobiles with each number of cylinders.
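A minimal sketch of this grouping, assuming dplyr is installed. It uses the “%>%” pipe that dplyr makes available; “mpg_by_cyl” and “mean_mpg” are illustrative names.

```r
library(dplyr)

# Group the rows by number of cylinders, then average mpg within each group
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

mpg_by_cyl
```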

The output gives the mean “mpg” separately for automobiles with 4, 6 and 8 cylinders.

The grouping can also be done using more than one factor. It can be better understood using the example below:
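A sketch of grouping by two variables, “cyl” and “vs”. It assumes dplyr 1.0.0 or later, where the “.groups” argument of “summarise” is available; the object and column names are illustrative.

```r
library(dplyr)

# Group by both cyl and vs, then average mpg within each combination
mpg_by_cyl_vs <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")  # drop grouping from the result

mpg_by_cyl_vs
```

Only the combinations of “cyl” and “vs” that actually occur in the data appear in the result.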

The command above splits the data using the two factor variables “cyl” and “vs” and then calculates the mean “mpg” for each combination.

Arranging the data frame according to a variable (the “arrange” function):

We can sort a data frame by a variable, i.e. reorder its rows according to the sorted values of that variable. Suppose we want the “mtcars” data arranged according to the “mpg” variable. This can be done using the “arrange” function of the “dplyr” package. An example of the “arrange” function is shown below:
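A minimal sketch, assuming dplyr is installed; “by_mpg” is an illustrative name.

```r
library(dplyr)

# Reorder the rows so that mpg runs from lowest to highest
by_mpg <- arrange(mtcars, mpg)

head(by_mpg)
```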

By default, sorting is done in ascending order. We can use the “desc()” function to arrange the data in descending order.
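A short sketch of descending order, assuming dplyr is installed; “by_mpg_desc” is an illustrative name.

```r
library(dplyr)

# Wrap the sorting variable in desc() to sort from highest to lowest
by_mpg_desc <- arrange(mtcars, desc(mpg))

head(by_mpg_desc)
```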

 

Conclusion:

From the functions used above, we can say that the “dplyr” package is easy to code with and fast to execute. The “dplyr” package makes data handling and manipulation much easier than using the “base” package in R alone.
