dplyr package in R

dplyr package in R

WHAT IS “dplyr”?

“dplyr” package, widely used in R, is basically a grammar of data manipulation. It is written and maintained by Hadley Wickham.

WHY “dplyr”?

The package helps in transformation and summarization of data frames (i.e., data recorded in tabular form with rows and columns). It provides the most important verbs available to the users to work on R. Besides, it also allows the users to use the same interface while working data in different forms, be it in data frame or a table or from a database itself.

INSTALLATION AND USE

The code to install this package:

Now we will discuss about a set of functions in the package, which performs common data manipulation operations.

The functions discussed below are:
1. Filter ()
2. Select ()
3. Mutate ()
4. Arrange ()
5. Summarize ()
6. Group_by ()
7. Piping ()

IMPORT DATA

We explain the above functions using a data set available in R – “flights”.
To get this data set we have to install and then call two packages,

Storing the data set ‘flights’ with the name ‘data’,

This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.

DATA CLEANING

This is a vital step for all the analytical models. The data imported is not clean, as we can see huge amounts of missing values. So we have to do missing value treatment.
To check if there are missing values,

We see there are 46595 NA values, which are considered missing. We remove these and store the rest dataset in a new ‘data1’,

Now we are left with 327346 records to work on.
Secondly, we see that the variables like, “dep_delay” and “arr_delay” has negetive values, which ultimately makes no sense. The logic being,
‘dep_delay’ = ‘dep_time’ – ‘sched_dep_time’ , should be the case.
But if we have ‘dep_delay’ negetive means that ‘sched_dep_time’ is more than ‘dep_time’, which means that the flight departed before time, so it is ethically not delayed. Hence negetive means nothing relevant under “dep_delay”.
Same logic goes for “arr_delay”. Hence negetive means nothing relevant for this as well.
So we replace the negative values with 0 for both the fields, showing that there is no delay in those flights.

Thirdly, we see the structure of the data is not appropriate,

We have to correct the data types of few fields,

Fourthly, we convert the data set into data frame, to work easily with it,

Now the data looks fine to be used for our analysis. Data cleaning part is now over.

Starting with the function of ‘dplyr’

Filter ()
This functions returns only the rows that matches with the condition entered by the user. It is called the filtering process, where the rows returned as output holds the given condition true. There can be one or more than one condition given by the user at a time.
Some useful filter functions are – . == , > , >= , < , <= . & , | , ! . xor() , is.na() . between() , near()
The syntax being: filter (dataset name, conditions)
For example, if we want the records of students whose age is more than 15, from student dataset: filter (student, age > 15)
For multiple conditions, we mention all the conditions using filter functions in between them. Example, if we want the records of female students having age more than 15: filter (student, sex == “F” & age > 15)
Working with our data set, first showing with a single condition,

Here we get only the flight records whose ‘origin’ is ‘EWR’. there are 117127 such records.
For multiple conditions,

Here there are two conditions mentioned, and the output records must fulfil both the conditions simultaneously. That is, the flight ‘origin’ has to be ‘EWR’ and simultaneously the flight ‘dest’ has to be ‘IAH’. We get 3923 such records.
Also we notice that the number of output records for multiple conditions is less than that for single condition. Hence the simplest observation is that more the conditions imposed to come true, lesser is the number of output records.

Select ()

This function is usually called when there is a large data set, i.e., the number of variables considered and the observations are both huge. It often happens that when we work on any data set, we are not interested in the whole set of observations, instead we want to work on a particular set of observations only. Hence it helps us to extract a part of the original large data set of our interest and work on it.
There are few functions / arguments which work only inside select (). These are: . starts_with () , ends_with () , contains () . matches () , num_range ()
The syntax being: select (table name, the columns we want to display separated by commas)
For example, if we want to extract name, sex and age of the students, select (student, name, sex, age)
We can also use the functions inside the select statement to extract the desired records. Example, select (student, starts_with (“total”)) Here we get the records of all the students whose column name starts with “total”. Example, it can be the column “total marks”.
Now if we use a minus sign before the column names then it means that we want to drop those particular columns from the extracted table. For example, select (student, -name)
It will extract a table from the data frame student without the column “name”. In other words it will extract the whole of the student table, after dropping the column “name”.
Showing with our present data set,

Here we extract records of only three columns from the dataset- sched_dep_time, sched_arr_time and flight. There shows 327346 records with only 3 variables.
Now if we want to extract all the columns that has column name containing “arr”, then,

This will show the 4 columns that starts with “arr”- arr_time, sched_arr_time, arr_delay and carrier.
Similarly we can use other embedded functions also as mentioned above.
Next if we want to extract the whole data set, except the column ‘year’, ‘month’, and ‘day’ in it,

We see that these three columns does not show up in the output. Hence we can work with the rest of the dataset. So we see only 16 variables are there now, as desired.

Mutate ()

This function creates a new column to the existing data frame. The column thus created should essentially be the function of the existing variables in the concerned data frame.
There are few useful functions which are used in mutate (): . Arithmetical operators like, – , + , * , / . log () . Cumulative functions like, cumsum (), cummin (), cummax (), etc. . if_else (), etc…
The syntax being: mutate (table name, derived column name = the calculations with the existing column)
For example, if we want to find the average marks of all the students, mutate (student, avg_marks = (maths_marks + eng_marks)/2)
This will eventually create a new column “avg_marks” containing marks of the individual students.
Another feature of this function is that we can drop a particular variable by setting its value as NULL.
Example, mutate (student, address = NULL)
This command will set the address column will NULL. In this way we can drop the unrequired variables.
Explaining this by using our dataset.
If we want to specify the flights whose ‘arr_delay’ is more than 100 to be “Bad rated flight” and the rest to be “Average rated flight”,

We see that a new column is added to the dataset, which shows ‘Flight_Remarks’ for individual records. Hence we now get 327346 records for 20 variables.
Now suppose if we want, we can drop the column “time_hour” and work with rest of the columns, as it is just the concatenation of ‘year’, ‘day’ and ‘month’ columns in the same data set. This can be done by,

As the value of the column “time_hour” is set NULL, it implies that the particular column is dropped. Hence the whole data set, except the column “time_hour” is show as an output. so, We see 18 variables now.

Arrange ()

This function is used to re-order rows according to the variable specified by the user. The default re-arranging pattern is ascending. To make it descending we need to mention desc (). This function also allows group_by () in it, for arranging records according to groups.
The syntax is: arrange (table name, col names by which we want to arrange separated by comas)
For example, if we want student records to be arranged in order of total marks, arrange (student, total_marks)
If we want to order the students according to highest to lowest marks, arrange (table name, desc (total_marks))
Explaining this by using our dataset.
If we want to arrange the data set in the order of column “distance”, so that it is easy for us to identify which flights are for short trips or which one is for long trips. Also we can easily determine the distance between the origin and destination countries sorted,

We see that the records are arranged according to “distance”, but by default in ascending order of the distance amount.
Now if we want the same records but by descending order of the “distance”, to identify the longest route flights easily,

Now we get our desired output.

Summarise ()

This function draws a summary statistics from a particular column in a data frame. In other words, it brings down to a single value from multiple values. The function works more significantly when used on group level data, created by the function group by (). The output thus formed after applying this function will be one row per group.
The aggregate functions used in summarise () are: . mean (), median () . max() , min() . n () , first (), last (), distinct (), etc.
The syntax is:
summarise (table name, aggregate functions function of the existing variables separated by commas)
For example, if we want to know the minimum, maximum and average marks of the student dataset,
summarise (student, min(total_marks), max(total_marks), mean(total_marks))
Explaining this by using our dataset.
Suppose we want to know the maximum, minimum and average ‘distance’ of the flights in 2013,

Here we see that three single values are derived from a whole data set, the three values showing the maximum, minimum and average ‘distance’ covered by all the flights in 2013, as desired.
Group_by ()
This is used when we want to group the dataset with respect to a particular attribute.
From our data set if we want to group the records according to the year first, then month and then day,

Here we get all the records but these are sorted and grouped according to the year first, then month and then by day.

Piping ()

This is used mainly when we have multiple operations to execute. Writing multiple commands in separate lines make the program look clumsy. So we can write multiple commands in a single line, connecting all with piping function. The pipe looks like, %>%.
Now working with our data set if we want to group by the data set on the basis of ‘year’, then ‘month’ and then ‘day’. Also we want to extract only certain fields like, ‘arr_delay’, ‘dep_delay’, ‘flight’, ‘origin’, ‘dest’ and ‘distance’. We can create this using a single command,

So we get 327346 records and only 9 variables, just like we wanted.

Combination of Group_by () and Piping ()

Suppose the task is to find the number of flights falling under each ‘carrier’.
So we can start it, by grouping the data set on the basis of ‘carrier’. Creating a dummy column ‘count’ and then adding the value with every occurrence of unique ‘carrier’ codes.

We get 327346 records showing 21 attributes. The two extra columns shown are for ‘count’ and the new column created by mutate function, ‘Num_of_flights’.
Now we see that a lot of repeatations in the ‘Num_of_flights’ column. It is not easy for us to understand, exactly which ‘carrier’ has how many flights in total.
So we extract two columns, ‘carrier’ and ‘Num_of_flights’ in a new data, ‘c2’.

We get all the records of only these two above mentioned columns.
Now we take only the unique codes under ‘carrier’ column, to make the look of what we want easier,

Finally we reach our destination. We see 16 different ‘carrier’ codes and the number of flights under them.
Now, suppose if we want to cross-check to whether the grouping is done properly or not, we can add the number of flights under each ‘carrier’ and see if we get our original number of records we had in data2,

So we see the value comes to 327346, exactly the same number of records we were originally working with. Hence the grouping done is correct.
Hence to conclude “dplyr” is a very powerful package that can make easy calculations and manipulations on data sets, which can actually make our life easier.

You might also like More from author

Leave A Reply

Your email address will not be published.