The post Functional Programming in R Using “Purrr” Package appeared first on StepUp Analytics.

“R is a functional programming language, which means R provides a set of tools to create and manipulate functions.”

Hadley Wickham and Lionel Henry are the authors of purrr, which is part of the tidyverse. Functions are also objects in R, treated in much the same way as vectors:

we can create them, assign them to objects, and later use those objects in other functions. Another important feature is passing functions as arguments to other functions: the core of the apply family of functions as well as of the purrr package.

What’s the point of using the purrr package when we already have the apply family of functions? By the way, if you are not familiar with the apply family, then here’s the link.

Passing functions as arguments makes life hassle-free for R users: less code, less verbosity. Now let’s look at the first function in the purrr package, called map.

The map function’s syntax, **map(.x, .f, …)**, is pretty much the same as that of the apply family functions, except mapply; in mapply, the function argument comes first, followed by the data arguments.

The first argument **.x** is always a vector (or list), and **.f** is always a function, which is applied to every element of **.x**. The map functions work on the same logic as the apply family functions.

Let’s create a data frame:

a <- sample(x = 1:10, size = 6) # A sample of 6 elements without replacement!
b <- rep(x = sample(x = 1:10, size = 3), each = 2) # "rep" repeats each element of the sample twice (each = 2).
c <- c("India","USA","UK","Australia","China","Canada")
d <- rnorm(n = 6) # rnorm generates 6 random numbers from a normal distribution with mean = 0 and sd = 1.
df <- data.frame(a, b, c, d, stringsAsFactors = F) # data.frame by default treats the character vector (here 'c') as a factor!
df

# Load the purrr library
library(purrr)
m_l <- map(df[, c("a","b","d")], mean) # map iterates over the columns of the df
# and applies the mean function to each column.
# map always returns a list!
m_l

# Base R's lapply is the closest function for the same operation.
l <- lapply(df[, c("a","b","d")], mean)
l
# If we had passed the entire df without dropping the character vector 'c', then
# both functions (map and lapply) would return a list with the 'c' component as NA.
# If the data argument is a vector:
m_v <- map(a, sqrt)
m_v
# map iterates over each ELEMENT of the vector and applies the function.

The **…** (dot-dot-dot) argument in map functions is used to pass additional arguments to the function **.f**. The dot **(.)** before x and f signals that these argument names are highly unlikely to clash with the argument names we pass through **…**. To avoid confusion: the first two arguments belong to the map function, and the rest belong to the function being mapped.

# If the data argument is a list:
l <- list(a = a, b = b)
m_l <- map(l, mean)
m_l
# map iterates over each COMPONENT of the list and applies the function.
# Significance of the ... argument in map functions:
df1 <- df
df1$a[1] <- NA
df1$a
m_ddd <- map(df1, mean, na.rm = TRUE)
m_ddd # The warning message is because of the character vector 'c' in the df.
m_ddd <- map(df1[, c("a","b","d")], mean, na.rm = TRUE) # Doesn't return a warning message!

Type-inconsistent or unstable functions: the type of the returned object depends on the input. sapply is a type-inconsistent function.

All the **purrr** map functions are type-consistent or stable, which means they will always return the type you expect, regardless of the input.
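To see the contrast, compare sapply and map_dbl on an ordinary and an empty data frame (the zero-column data frame is a standard edge case, not from the original post):

```r
library(purrr)

# sapply() simplifies only when it can, so its return type depends on the input:
class(sapply(mtcars, mean))        # "numeric" -- simplified to a named vector
class(sapply(data.frame(), mean))  # "list"    -- nothing to simplify

# map_dbl() is type-stable: it returns a double vector even for empty input
class(map_dbl(mtcars, mean))       # "numeric"
class(map_dbl(data.frame(), mean)) # "numeric"
```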

There are a plethora of map functions depending upon the return type of the object.

- map() returns a list
- map_lgl() returns a logical vector
- map_int() returns an integer vector
- map_dbl() returns a double vector
- map_chr() returns a character vector

# A plethora of map functions
m_dbl <- map_dbl(df[, c("a","b","d")], mean)
m_dbl
m_lgl <- map_lgl(df, is.character)
m_lgl
m_chr <- map_chr(df, class)
m_chr
# Choosing an appropriate map function is important:
m_int <- map_int(df[, c("a","b","d")], median)
m_int

# You see the above error because vector 'd' is of type double!
class(d)
typeof(d)
# All these map functions EITHER return the type we EXPECT or an ERROR!
# So map_int doesn't work here!
m_d1 <- map_dbl(df[, c("a","b","d")], median)
m_d1

# Ways of specifying .f in map functions:
# The function created below is called an anonymous function or lambda function.
# You may not use some functions very often, so creating an anonymous
# function on the fly saves time.
m1 <- map(b, function(x) x^2) # The conventional way of squaring every element of a vector.
m1
# The other cool way of defining an anonymous function is the "formula" shortcut.
m2 <- map(b, ~(.^2)) # Tilde (~) signifies a formula; the dot is a placeholder for the data argument.
m2
# purrr is a time saver!

# Shortcuts for subsetting
l <- list(list(a = a, b = b), list(a = c, b = d))
l
# We can subset all the 'a' elements from a list of lists.
# Using lapply
s_n <- lapply(l, function(x){ x[["a"]] }) # By name
s_n
s_p <- lapply(l, function(x){ x[[1]] })   # By position
s_p
# Using map
l_n <- map(l, "a") # By name
l_n
l_p <- map(l, 1)   # By position
l_p

This is important: these shortcuts really shine in the following scenario.

You have built several linear models and want to compare their R-squared values. To do that, save the **models** in a list and then use the shortcuts provided by the map functions.

# head(mtcars)
# Build 3 models using the mtcars dataset to compare R-squared
m  <- lm(mpg ~ wt, mtcars)
m1 <- lm(mpg ~ wt + gear, mtcars)        # Add an independent variable
m2 <- lm(mpg ~ wt + gear + disp, mtcars) # To compare R-squared
models <- list(m, m1, m2)                # Save them in a list
# The pipe with purrr saves a lot of time
models %>% map(summary) %>% map_dbl("r.squared")

# purrr's walk
library(purrr)
map(10, ~(plot(rnorm(.))))
# map returns a value, but plot doesn't return one, so we see NULL.
# This is called a "side effect"; purrr's walk function is designed for this purpose.
# walk works with functions that are called for their side effects rather than their return values.
walk(10, ~(plot(rnorm(.))))
# Now we only see the plot, which is what we wanted in the first place.

What if we want a function to iterate over two objects in parallel? map2 and walk2 to the rescue!

The syntax of **map2(.x, .y, .f, …)** is similar to map but has an additional argument **.y**, which is used to iterate over another object.

The function takes the first element of **.x** as its first argument and the first element of **.y** as its second argument, and so on through the last elements of **.x** and **.y**.
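The pairwise matching is easy to see with a typed variant such as map2_dbl (the numbers are made up for illustration):

```r
library(purrr)

# Elements are paired position by position: 1+10, 2+20, 3+30
map2_dbl(.x = c(1, 2, 3), .y = c(10, 20, 30), .f = ~ .x + .y)
# 11 22 33
```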

library(purrr)
# map2, mapply, walk and walk2
map2(list(mtcars$gear), list(mtcars$mpg), ~plot(.x, .y)) # Scatter plot of gear vs mpg

mapply(function(x, y){ plot(x, y) }, x = list(mtcars$gear), y = list(mtcars$mpg))
# mapply is the closest base function to map2.

# We can get rid of the NULL part by using walk2
walk2(list(mtcars$gear), list(mtcars$mpg), ~plot(.x, .y))

What if we want a function to iterate over three or more objects? For that, purrr provides a single function called **pmap** (also thought of as map_n), rather than map3, map4 and so on.

# pmap or map_n
# rnorm(n = 6, mean = 1, sd = 2)
# rnorm(n = 4, mean = 2, sd = 1)
# rnorm(n = 2, mean = 1, sd = 1)
# The above calls can be written using pmap as follows.
# Provide the arguments in a nested list (list of lists) format.
n  <- list(6, 4, 2)
mu <- list(1, 2, 1)
sd <- list(2, 1, 1)
p <- pmap(list(n, mu, sd), rnorm) # By default pmap matches the list elements to the function arguments by position.
pmap(list(mu, n, sd), rnorm)      # Different from the 'p' object!
# So it is always safer to provide the arguments by name in the list:
pmap(list(n = n, mean = mu, sd = sd), rnorm)

These are some of the functions of the purrr package; see R’s documentation on purrr for more.


The post Application of Chi-Square Distribution appeared first on StepUp Analytics.

Basically, a Chi-Square variable with one degree of freedom is the square of a standard normal variable, and the Chi-Square distribution has the additive property (the sum of two independent Chi-Square variables is also a Chi-Square variable). Let’s discuss the different uses of the Chi-Square distribution in hypothesis testing in real-life situations.
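Both facts are easy to check by simulation; the seed and sample size below are arbitrary:

```r
set.seed(42)
z <- rnorm(1e5)

# The square of a standard normal ~ chi-square with 1 df (mean = df = 1)
mean(z^2)

# Additivity: chi-square(2) + chi-square(3) ~ chi-square(5) (mean = df = 5)
s <- rchisq(1e5, df = 2) + rchisq(1e5, df = 3)
mean(s)
```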

Suppose we have two cricket teams, A and B. Team A has a good player, David; team B has a good player, Rohit. Now assume Rohit is better than David. From this alone we cannot conclude that team B’s chance of winning is greater than team A’s.

It may happen that the team understanding in team A is much better than in team B. Similarly, we cannot understand the underlying distribution of a population by testing only the mean; we have to test the variance as well.

Now, consider a random sample of size 10: 205, 203, 191, 196, 200, 201, 200, 200, 200, 198. We want to test whether the population variance differs from 6. Here the hypothesis to test is

**H_{0}: σ^{2} = σ_{0}^{2} (= 6)  vs  H_{1}: σ^{2} ≠ σ_{0}^{2}**

library(EnvStats)
x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
varTest(x, sigma.squared = 6, alternative = "two.sided", conf.level = 0.95)

> library(EnvStats)
> x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
> varTest(x, sigma.squared = 6, alternative = "two.sided", conf.level = 0.95)

Results of Hypothesis Test
--------------------------

Null Hypothesis:                 variance = 6

Alternative Hypothesis:          True variance is not equal to 6

Test Name:                       Chi-Squared Test on Variance

Estimated Parameter(s):          variance = 14.71111

Data:                            x

Test Statistic:                  Chi-Squared = 22.06667

Test Statistic Parameter:        df = 9

P-value:                         0.01734015

95% Confidence Interval:         LCL =  6.960081
                                 UCL = 49.029964

So, here the value of the test statistic is 22.06667. Now, we have to find the critical values of the χ^{2} distribution with df 9.
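The statistic itself is just (n - 1)s²/σ_{0}², which we can verify by hand before looking up the critical values:

```r
x <- c(205, 203, 191, 196, 200, 201, 200, 200, 200, 198)
(length(x) - 1) * var(x) / 6  # 22.06667, matching the varTest output
```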

Using the code **qchisq(0.05/2, 9, lower.tail = F)** we get the upper critical value 19.02277, and **qchisq(0.05/2, 9)** gives the lower critical value 2.700389. Since the test statistic 22.06667 lies outside the interval (2.700389, 19.02277), we reject the null hypothesis at the 5% level of significance.

We can also decide whether to reject the null hypothesis based on the p-value. Now, we can adjust the alternative hypothesis according to the question whether left-sided, right-sided or two-sided. We can also change the confidence level as required.

If we have two categorical variables and our interest is to find out whether there is any statistical dependency between them, we use the Chi-Square test. Suppose we want to check whether smoking causes cancer. We have the data in table format, with the following frequencies:

           Cancer  No Cancer
Smoker         35         40
Non-Smoker     30         60

Now, we have to check whether smoking and having cancer are statistically independent. Here the test statistic is

χ^{2} = Σ_{i} Σ_{j} (O_{ij} - E_{ij})^{2} / E_{ij}

which follows the Chi-Square distribution with df (l-1)(k-1) under the null hypothesis, where *O_{ij}* and *E_{ij}* are the observed and expected frequencies in cell (i, j) of the k × l table.

Using R we can easily perform the test with the function chisq.test() from the *stats* package as follows,

library(stats)
data <- as.data.frame(matrix(c(35,30,40,60), nrow = 2))
names(data) <- c("Cancer","No Cancer")
row.names(data) <- c("Smoker","Non-Smoker")
data
chisq.test(data, correct = T)

> library(stats)
> data <- as.data.frame(matrix(c(35,30,40,60), nrow = 2))
> names(data) <- c("Cancer","No Cancer")
> row.names(data) <- c("Smoker","Non-Smoker")
> data
           Cancer No Cancer
Smoker         35        40
Non-Smoker     30        60
> chisq.test(data, correct = T)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Pearson's Chi-squared test with Yates'
                                 continuity correction

Data:                            data

Test Statistic:                  X-squared = 2.513288

Test Statistic Parameter:        df = 1

P-value:                         0.1128901

Here the p-value is 0.1128901, which is greater than 0.05, so we can’t reject the null hypothesis (independence) at the 5% level of significance.

The above example uses a 2 × 2 contingency table, but we can also run the test on a k × l contingency table for two categorical variables.
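The call is identical for a larger table; the 2 × 3 counts below are invented purely for illustration:

```r
# Hypothetical 2 x 3 contingency table (counts are made up)
tab <- matrix(c(20, 30, 25, 25, 15, 35), nrow = 2)
chisq.test(tab)  # df = (2 - 1) * (3 - 1) = 2
```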

From the name of this test, one can easily understand its purpose. In this test there is one categorical variable, to which we can fit a model and check how good the fit is. Suppose someone has 3 dice and records how many sixes appear when throwing them simultaneously, repeated 100 times. Suppose the data obtained is:

No. of sixes     0   1   2   3
Frequency       37  29  24  10

Now we want to check whether the dice are fair. If the dice are fair, the number of sixes follows the binomial distribution with n = 3, p = 1/6; that is our null distribution. We use R to check whether Binomial(3, 1/6) is a good fit. Here the null hypothesis is that the dice are fair.

x <- c(37,29,24,10)
nullprob = dbinom(c(0,1,2,3), 3, 1/6)
chisq.test(x, p = nullprob)

The statistic we use here is

χ^{2} = Σ_{i} (O_{i} - E_{i})^{2} / E_{i}

which follows the Chi-Square distribution with df (n-1) under the null hypothesis, where *O_{i}* and *E_{i}* are the observed and expected frequencies of the i-th category.

> x <- c(37,29,24,10)
> nullprob = dbinom(c(0,1,2,3), 3, 1/6)
> chisq.test(x, p = nullprob)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Chi-squared test for given probabilities

Data:                            x

Test Statistic:                  X-squared = 246.8211

Test Statistic Parameter:        df = 3

P-value:                         3.186801e-53

Here the p-value of the test is less than 0.05, so we reject the null hypothesis at the 5% level of significance and conclude that the Binomial fit is not good. This is how a goodness-of-fit test is performed for a categorical variable.

Suppose there are TV channels for movies, sports, news, and music. We want to check whether the distribution of channel watching is identical for adult males and females, or differs. This is a test of the homogeneity of TV watching in the adult male and female populations.

So, basically, we have one categorical variable with several levels. We have samples from two or more populations, and we want to check whether the populations are distributed identically among the levels of the variable. The statistic for testing the null hypothesis, that the distribution of the levels of the variable is identical across the populations, is

χ^{2} = Σ_{i} Σ_{j} (O_{ij} - E_{ij})^{2} / E_{ij}

where *O_{ij}* and *E_{ij}* are the observed and expected frequencies; under the null hypothesis it follows the Chi-Square distribution with df (number of populations - 1)(number of levels - 1).

Consider the following data:

         Movies  Sports  News  Music
Males        28      20    39     13
Females      35      11    19     35

We want to check whether the channel preferences of males and females are identical or different. We have 100 males and 100 females in the sample. Here we perform the Chi-Square test using R.

males <- c(28,20,39,13)
females <- c(35,11,19,35)
data <- as.data.frame(rbind(males, females))
names(data) <- c("Movies","Sports","News","Music")
data
chisq.test(data)

Running the above code, we get the result:

> males <- c(28,20,39,13)
> females <- c(35,11,19,35)
> data <- as.data.frame(rbind(males, females))
> names(data) <- c("Movies","Sports","News","Music")
> data
        Movies Sports News Music
males       28     20   39    13
females     35     11   19    35
> chisq.test(data)

Results of Hypothesis Test
--------------------------

Alternative Hypothesis:

Test Name:                       Pearson's Chi-squared test

Data:                            data

Test Statistic:                  X-squared = 20.37057

Test Statistic Parameter:        df = 3

P-value:                         0.0001422209

Here the p-value is less than 0.01, so we can reject the null hypothesis at 1% level of significance. We can conclude that the male and the female populations are not homogeneous in watching TV channels.

So, these are the Chi-Square tests. In all the non-parametric tests that use a Chi-Square statistic, the testing procedure is more or less the same, with slight changes depending on the purpose of the hypothesis.


The post DATA HANDLING IN R appeared first on StepUp Analytics.

**What is data handling?**
The first and foremost skill needed by a data analyst or data scientist is knowing how to handle data. So what is data handling?

Data handling means gathering and recording information and presenting it in a way that is meaningful to others.

Let us take an example. Think back a few decades to the phone directory, which consisted of people’s names and their phone numbers. The names were arranged in alphabetical order, i.e. in a systematic manner, which is why it was possible to find the number of a particular person. This is an example of data handling: the data is arranged in a way that is meaningful to others.

Now we come to two different approaches to data handling: the statistical approach and the non-statistical approach.

The non-statistical approach to data handling is simply arranging your data in a form that is meaningful to others. It can be the simple arrangement of names in alphabetical order on a sheet of paper, so that when we want the information for a given person we can find it easily.

The statistical approach is arranging the data in a meaningful manner and extracting information from it that tells us about the data. Suppose we have observations on the weight of 1,000 students in a random sequence; just by looking at the data, we can’t say anything about the distribution of the weights. To learn something about the data we have to arrange it in a given order and find its mean and standard deviation. These are some of the points to keep in mind before starting the analysis of any data.
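In R that first summary takes one line; the weights below are simulated, not real student data:

```r
set.seed(1)
weights <- rnorm(1000, mean = 60, sd = 8)  # hypothetical weights (kg) of 1,000 students
c(mean = mean(weights), sd = sd(weights))
```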

One of the most important packages in R is **dplyr**, which is used for data handling and manipulation of data frames. The d in the name reinforces that the package is meant to work with data.frames. The dplyr package can be used to extract different columns (i.e. different variables) from a data frame, extract rows, add new variables, apply functions to different variables, and split the data according to a variable.

In this article we will learn these capabilities of the dplyr package using examples, working with the “mtcars” data present in R.

The data frame consists of 32 observations on 11 variables. The dataset comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

While analyzing data, there often comes a situation where we need to select particular columns, i.e. extract particular variables from a data frame. The dplyr package is very handy for this task. Suppose we have the ‘mtcars’ data and want to extract the “mpg” variable; the “select” command of the dplyr package comes into use.

**Code:**

> data <- mtcars
> data <- head(data)
> data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> library(dplyr)
> select(data, mpg)
                   mpg
Mazda RX4         21.0
Mazda RX4 Wag     21.0
Datsun 710        22.8
Hornet 4 Drive    21.4
Hornet Sportabout 18.7
Valiant           18.1

Different columns can also be selected using partial matching of column names, with the dplyr helpers “starts_with”, “ends_with” and “contains”.

> select(data, starts_with("m"))
                   mpg
Mazda RX4         21.0
Mazda RX4 Wag     21.0
Datsun 710        22.8
Hornet 4 Drive    21.4
Hornet Sportabout 18.7
Valiant           18.1
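The other two helpers work the same way; on the same data, for instance:

```r
library(dplyr)
data <- head(mtcars)

names(select(data, ends_with("t")))  # "drat" "wt"
names(select(data, contains("ar")))  # "gear" "carb"
```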

Selecting different rows (the “filter” and “slice” functions)

Many times we have to select rows of the data frame using logical expressions. This is done using the “filter” command of the “dplyr” package in R. Suppose we have to select all the automobiles with 4 gears; this can be done using the filter function.

**Code:**

> filter(mtcars, gear == 4)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
5  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
6  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
7  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
8  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
9  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
10 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
11 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
12 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Different logical operators can be used in the “filter” command. Just assume that we have to extract the information from all the automobiles which have 4 gears and 2 carburettors. This work can be done using the **code:**

> filter(mtcars, gear == 4, carb == 2)
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
2 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
3 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

> slice(mtcars, 1:5)
   mpg cyl disp  hp drat    wt qsec vs am gear carb
1 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
2 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
3 22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
4 21.4   6  258 110 3.08 3.22 19.4  1  0    3    1
5 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2

While the filter command specifies rows with logical expressions, the “slice” command selects rows by row number.

In most data analysis tasks there is a need to modify existing columns or add new columns to the data frame, and the “mutate” function of the “dplyr” package comes in very handy. For example, if the data contains a “price” column in dollars and we want another column “price2” with the price in rupees, we can compute “price2” by multiplying by a suitable value.

In such situations the “mutate” function can be used; the example below shows how.

Let us create a new variable y (= mpg/cyl) in the “mtcars” data frame:

> mutate(data, y = mpg/cyl)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb        y
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.500000
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.500000
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 5.700000
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.566667
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.337500
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.016667

In a similar manner, existing columns can also be modified with the “mutate” function of the “dplyr” package.
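For instance, assigning to an existing column name overwrites it; the unit conversion below (wt is in 1,000 lb) is only an illustration, with an approximate factor:

```r
library(dplyr)
data <- head(mtcars)

# Overwrite the existing 'wt' column: 1000 lb -> kg (factor is approximate)
mutate(data, wt = wt * 1000 * 0.4536)
```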

One of the most important functions in the “dplyr” package is “summarise”, which applies a function to a column of a data frame and returns a result of length one, such as a mean, median or similar summary.

Suppose we have a data frame and want to find the average value of a specific column; in this case “summarise” can be used.

Let us find the mean of the “mpg” variable in the “mtcars” data frame :

> data <- mtcars
> summarise(data, mean(mpg))
  mean(mpg)
1  20.09062

Multiple functions can be used in the “summarise” command.

> data <- mtcars
> summarise(data, mean(mpg), median(disp))
  mean(mpg) median(disp)
1  20.09062        196.3

The “summarise” function is dplyr’s counterpart to computing the same summaries with base R functions.

Sometimes we need to group the data using a factor variable and apply a function to a column within each group. Consider the “mtcars” data frame: the variable “cyl” represents the number of cylinders in the automobile and can be used to partition the data. We can then find the average “mpg” for automobiles with different numbers of cylinders.

> data <- mtcars
> summarise(group_by(data, cyl), mean(mpg))
    cyl `mean(mpg)`
1     4        26.7
2     6        19.7
3     8        15.1

The output gives the mean “mpg” for automobiles having the number of cylinders as 4,6,8 separately.

The grouping can be done using different factors. It can be better understood using the example below:

> summarise(group_by(data, cyl, vs), mean(mpg))
    cyl    vs `mean(mpg)`
1     4     0        26.0
2     4     1        26.7
3     6     0        20.6
4     6     1        19.1
5     8     0        15.1

The command above splits the data using the two factor variables “cyl” and “vs” and then calculates the mean “mpg” for each combination.

We can use a variable for sorting a data frame, i.e. the data frame is arranged according to that variable. Suppose in the “mtcars” data we want the rows arranged according to the “mpg” variable. This can be done using the “arrange” function of the “dplyr” package, as shown below:

> data <- head(mtcars)
> arrange(data, mpg)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
3 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
4 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
5 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
6 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

The default sorting is always in ascending order. We can use the “desc()” function to arrange the data in descending order.

> arrange(data, desc(mpg))
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
2 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
3 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
4 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

From the functions used above, we can say that the “dplyr” package is easy to code and fast to execute. “dplyr” makes data handling and manipulation much easier than using the “base” package in R alone.


The post Different Ways Of Variable Reduction appeared first on StepUp Analytics.

Similarly, in the data science field we find relationships between independent variables and a dependent variable; we should give less importance to the less important independent variables and consider only the important ones that have an effect on the regression analysis with the dependent variable, using the different methods below.

Basically, the variable reduction process can be done in two ways:

- **Feature selection**
- **Feature extraction**

In Feature selection, we discuss

- **Backward elimination**
- **Forward elimination**
- **Bidirectional elimination**

and in Feature extraction, we discuss

- **Correlation analysis**
- **PCA**
- **Exploratory factor analysis**
- **Multicollinearity**
- **Linear discriminant analysis**
- **Wald chi-square method**

The variable reduction is a crucial step for accelerating model building without losing the potential predictive power of the data. With the advent of Big Data and sophisticated data mining techniques, the number of variables encountered is often tremendous making variable selection or dimension reduction techniques imperative to produce models with acceptable accuracy and generalization.

It may be noted that the following techniques are not used in the given order. Moreover, before taking up variable reduction we should emphasize univariate analysis of the variables: checking the frequency distribution summary and, most importantly, checking for missing values in each variable.

In **BACKWARD ELIMINATION**, the regression starts with all candidate independent variables, and the least important ones are eliminated one at a time, working backwards.

In **FORWARD ELIMINATION** (forward selection), the regression starts with no variables, and the most important independent variables are added one at a time, working forwards.

In **BIDIRECTIONAL ELIMINATION** (stepwise selection), variables can be added and removed at each step, combining the two approaches above.
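All three strategies are available in base R through step() and its direction argument; a minimal sketch on mtcars (not from the original post):

```r
# Backward elimination: start from the full model and drop the least useful terms
full <- lm(mpg ~ ., data = mtcars)
backward <- step(full, direction = "backward", trace = 0)
formula(backward)  # the reduced model step() settles on

# direction = "forward" (starting from an intercept-only model, with a
# 'scope' argument) and direction = "both" give the other two strategies.
```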

Here we describe several methods of variable reduction, taking care of dimension reduction as well. The first is **CORRELATION ANALYSIS**. Correlation is the linear relationship between variables. Suppose we want to find the relationship between a hundred independent variables and one dependent variable.

For this, we create a correlation matrix. On the basis of the correlations, we take as explanatory variables those independent variables which are highly correlated with the dependent variable. The sign of the correlation coefficient indicates the direction of association, and it always lies between -1 (perfect negative linear association) and 1 (perfect positive linear association). A zero value of r indicates no linear relationship.
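A minimal sketch of this screening on mtcars, with mpg as the dependent variable (the 0.5 cutoff is arbitrary, chosen only for illustration):

```r
# Correlation of each candidate predictor with the response
r <- cor(mtcars[, -1], mtcars$mpg)

# Keep predictors whose absolute correlation with mpg exceeds the cutoff
keep <- rownames(r)[abs(r) > 0.5]
keep
```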

Now we discuss the correlation between two independent variables. A higher correlation coefficient (r) between two independent variables implies redundancy, indicating a possibility that they are measuring the same construct. In such a scenario, it would be prudent to select either of the two variables in the consideration or to adopt an alternative approach to selection which involves two most widely used techniques viz. Principal Component Analysis (PCA) and Exploratory Factor Analysis.

Now we discuss the next approach to variable reduction, **PRINCIPAL COMPONENT ANALYSIS (PCA)**. PCA is a variable reduction procedure that helps in obtaining a smaller number of variables, called principal components, which account for most of the variance in the observed variables, from a group of a large number of redundant (correlated) variables.

Suppose that among 100 explanatory variables, 44 are highly correlated. For example, the correlation between the 3^{rd} and 5^{th} variables is 0.87, and that between the 3^{rd} and 8^{th} variables is 0.85. With correlations this high, PCA is appropriate. Principal Component Analysis can be performed on the set of correlated variables to obtain new composite variables (principal components) which carry the information of all the variables in question.

Each principal component is a linear combination of the optimally-weighted variables under consideration and can be used for subsequent analysis. One can compute as many principal components as there are independent variables; these can be further analyzed and retained on the basis of the variability they explain.
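With base R's prcomp() both steps (computing the components and judging how many to retain) are immediate:

```r
# Principal components of the mtcars predictors, standardized first
pc <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
summary(pc)  # the "Proportion of Variance" row guides how many PCs to keep
```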

Now we discuss another important variable reduction approach. **Exploratory Factor Analysis** is also a variable reduction procedure, similar to Principal Component Analysis in many respects; the underlying computations are much the same, but the two methods are conceptually different, as explained below.

Factor analysis is a statistical technique concerned with the reduction of a set of observed variables in terms of a small number of latent factors. The underlying assumption of factor analysis is that there exists a number of unobserved latent variables (or “factors”) that account for the correlations among observed variables, such that if the latent variables are partialled out or held constant, the partial correlations among observed variables all become zero.

In other words, the latent factors determine the values of the observed variables. The term “common” in common factor analysis describes the variance that is analyzed. It is assumed that the variance of a single variable can be decomposed into common variance that is shared by other variables included in the model, and unique variance that is unique to a particular variable and includes the error component. **Common factor analysis (CFA)** analyzes only the common variance of the observed variables; principal component analysis considers the total variance and makes no distinction between common and unique variance. The selection of one technique over the other is based upon several criteria.

Next, we look at **MULTICOLLINEARITY**, which occurs when the independent variables are highly correlated among themselves. A common diagnostic is the variance inflation factor, computed with vif(model) in the code below; variables with a high VIF are candidates for removal.

Now we discuss another popular method of variable reduction, the **Wald Chi-Square** test. The Wald Chi-Square test statistic is the squared ratio of the estimate to the standard error of the respective predictor; predictors with small Wald statistics contribute little to the model and are candidates for removal.
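For a logistic model fitted with glm(), that ratio comes straight from the coefficient table; the model below is only an illustration:

```r
# Wald chi-square per predictor: (Estimate / Std. Error)^2
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
coefs <- summary(fit)$coefficients
round((coefs[, "Estimate"] / coefs[, "Std. Error"])^2, 3)
```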

Now we discuss another method of variable reduction, **Linear Discriminant Analysis (LDA)**, which also works as a dimensionality reduction algorithm: it reduces the number of dimensions from the original number of features to C - 1, where C is the number of classes. For example, with 3 classes and 18 features, LDA reduces the 18 features to only 2. After the reduction, a classifier such as a neural network can be applied to the classification task.

**Executable Code:**

library(faraway)
str(divusa)
df = data.frame(divusa[, 2:7])
head(df)
set.seed(1000)
round(cor(df), 2)
library(GGally)
library(caTools)
library(MASS)
split <- sample.split(rownames(df), SplitRatio = 0.80)
train <- df[split == T, ]; str(train)
test <- df[split == F, ]; str(test)
ggcorr(df)
model <- lm(divorce ~ ., data = train)
summary(model)
vif(model)

Now we will work with the Model

model1 <- lm(divorce ~ femlab + marriage + birth + military, data = train); summary(model1)
model2 <- lm(divorce ~ femlab + marriage + birth, data = train); summary(model2)
pc <- prcomp(train[, -1], center = T, scale. = T)
attributes(pc)
print(pc)
summary(pc)
library(MASS)
linear <- lda(divorce ~ ., data = train)
linear
attributes(linear)

library(faraway)
str(divusa)
'data.frame': 77 obs. of 7 variables:
 $ year      : int 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
 $ divorce   : num 8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
 $ unemployed: num 5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
 $ femlab    : num 22.7 22.8 22.9 23 23.1 ...
 $ marriage  : num 92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
 $ birth     : num 118 120 111 110 111 ...
 $ military  : num 3.22 3.56 2.46 2.21 2.29 ...

df = data.frame(divusa[, 2:7])
head(df)
  divorce unemployed femlab marriage birth military
1     8.0        5.2  22.70     92.0 117.9  3.2247
2     7.2       11.7  22.79     83.0 119.8  3.5614
3     6.6        6.7  22.88     79.7 111.2  2.4553
4     7.1        2.4  22.97     85.2 110.5  2.2065
5     7.2        5.0  23.06     80.3 110.9  2.2889
6     7.2        3.2  23.15     79.2 106.6  2.1735

set.seed(1000)
round(cor(df), 2)
           divorce unemployed femlab marriage birth military
divorce       1.00      -0.21   0.91    -0.53 -0.72     0.02
unemployed   -0.21       1.00  -0.26    -0.27 -0.31    -0.40
femlab        0.91      -0.26   1.00    -0.65 -0.60     0.05
marriage     -0.53      -0.27  -0.65     1.00  0.67     0.26
birth        -0.72      -0.31  -0.60     0.67  1.00     0.14
military      0.02      -0.40   0.05     0.26  0.14     1.00

library(GGally)
ggcorr(df)
library(caTools)
library(MASS)
split <- sample.split(rownames(df), SplitRatio = 0.80)
train <- df[split == T, ]; str(train)
'data.frame': 61 obs. of 6 variables:
 $ divorce   : num 8 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 7.5 ...
 $ unemployed: num 5.2 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 8.7 ...
 $ femlab    : num 22.7 22.9 23 23.1 23.1 ...
 $ marriage  : num 92 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 67.6 ...
 $ birth     : num 118 111 110 111 107 ...
 $ military  : num 3.22 2.46 2.21 2.29 2.17 ...

test <- df[split == F, ]; str(test)
'data.frame': 16 obs. of 6 variables:
 $ divorce   : num 7.2 6.1 7.5 9.4 13.6 9.9 9.2 9.2 10 21.9 ...
 $ unemployed: num 11.7 24.9 21.7 9.9 3.9 2.9 4.3 5.5 5.2 6.1 ...
 $ femlab    : num 22.8 24.9 25.3 28.5 31.8 ...
 $ marriage  : num 83 61.3 71.8 88.5 106.2 ...
 $ birth     : num 119.8 76.3 78.5 83.4 113.3 ...
 $ military  : num 3.56 1.94 1.95 13.5 10.98 ...
ggcorr(df)

model <- lm(divorce ~ ., data = train)
summary(model)
Call:
lm(formula = divorce ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7757 -0.8095  0.1004  0.8099  4.2183

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.33958    3.75962   1.154 0.253385
unemployed   -0.12820    0.06594  -1.944 0.057008 .
femlab        0.36660    0.03339  10.980 1.76e-15 ***
marriage      0.10869    0.02758   3.941 0.000231 ***
birth        -0.13537    0.01823  -7.426 7.55e-10 ***
military     -0.02294    0.01487  -1.543 0.128537
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.69 on 55 degrees of freedom
Multiple R-squared: 0.913, Adjusted R-squared: 0.9051
F-statistic: 115.4 on 5 and 55 DF, p-value: < 2.2e-16

vif(model)
unemployed     femlab   marriage      birth   military
  2.033954   3.072896   2.553974   2.411061   1.255901

model1 <- lm(divorce ~ femlab + marriage + birth + military, data = train); summary(model1)
Call:
lm(formula = divorce ~ femlab + marriage + birth + military, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5737 -1.0960  0.0504  1.0498  3.7809

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.86188    2.70601  -0.319    0.751
femlab       0.40512    0.02753  14.714  < 2e-16 ***
marriage     0.12694    0.02657   4.777 1.32e-05 ***
birth       -0.11922    0.01662  -7.172 1.80e-09 ***
military    -0.01585    0.01477  -1.074    0.288
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.731 on 56 degrees of freedom
Multiple R-squared: 0.907, Adjusted R-squared: 0.9004
F-statistic: 136.5 on 4 and 56 DF, p-value: < 2.2e-16

model2 <- lm(divorce ~ femlab + marriage + birth, data = train); summary(model2)
Call:
lm(formula = divorce ~ femlab + marriage + birth, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5830 -1.0663  0.0771  1.0841  4.0233

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03003    2.57878   0.012    0.991
femlab       0.39637    0.02633  15.052  < 2e-16 ***
marriage     0.11726    0.02503   4.685 1.78e-05 ***
birth       -0.11982    0.01664  -7.203 1.46e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.733 on 57 degrees of freedom
Multiple R-squared: 0.9051, Adjusted R-squared: 0.9001
F-statistic: 181.2 on 3 and 57 DF, p-value: < 2.2e-16

pc <- prcomp(train[, -1], center = T, scale. = T)
attributes(pc)
$names
[1] "sdev"     "rotation" "center"   "scale"    "x"
$class
[1] "prcomp"

print(pc)
Standard deviations (1, .., p=5):
[1] 1.5317557 1.2139813 0.8226453 0.5723273 0.4191302

Rotation (n x k) = (5 x 5):
                  PC1         PC2         PC3         PC4          PC5
unemployed  0.2590132 -0.64154699  0.53135939 -0.13974986  0.468462139
femlab      0.4691847  0.49679795 -0.22686585  0.04723785  0.692356829
marriage   -0.5858718 -0.02705203  0.08701897  0.70051315  0.397154149
birth      -0.5688098 -0.07288045 -0.31653177 -0.65389192  0.378651419
military   -0.2144470  0.57928049  0.74727801 -0.24483719 -0.008770997

summary(pc)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5
Standard deviation     1.5318 1.2140 0.8226 0.57233 0.41913
Proportion of Variance 0.4693 0.2948 0.1353 0.06551 0.03513
Cumulative Proportion  0.4693 0.7640 0.8993 0.96487 1.00000

library(MASS)
linear <- lda(divorce ~ ., data = train)
linear
Call:
lda(divorce ~ ., data = train)
Prior probabilities of groups:
       6.1        6.6        7.1        7.2        7.5        7.8          8        8.3        8.4        8.5        8.7        8.8        8.9        9.3
0.01639344 0.01639344 0.03278689 0.03278689 0.03278689 0.04918033 0.03278689 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.03278689
       9.4        9.5        9.6        9.9       10.1       10.3       10.6       10.9         11       11.2         12       12.4       13.4       14.4
0.03278689 0.01639344 0.03278689 0.01639344 0.03278689 0.01639344 0.03278689 0.01639344 0.01639344 0.03278689 0.01639344 0.01639344 0.01639344 0.01639344
      14.9       15.8         17       17.9       18.2       19.3       19.5       19.8       20.3       20.5       20.7       20.9       21.1       21.2
0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.03278689 0.03278689 0.03278689
      21.5       22.6       22.8
0.01639344 0.03278689 0.01639344

From the entire discussion and example above, we can appreciate the importance of doing variable reduction properly and applying it to real-life problems. In practice, to predict a single variable we may start with 80 to 90 independent variables; some of them mean the same thing, some are irrelevant for predicting the dependent variable, and some have no relationship with it at all. As a result, the predictions suffer and the expense grows. For all these reasons, variable reduction is very much necessary, and at the same time the different methods of reducing variables should be adopted carefully and stepwise.

The post Different Ways Of Variable Reduction appeared first on StepUp Analytics.

The post Outlier Detection Techniques Using R appeared first on StepUp Analytics.

Suppose two teams are playing a football match. All the players of team A are under age 19, and all the players of team B are under age 19 except one, aged 24, who has district-level football experience. Everybody knows that team B is going to be the winner, so this will not be a fair game. The player aged 24 is unusual compared to the others, and that is the concept of an outlier. An outlier is an unusual or suspicious value which does not behave like the other data points. Suppose we have a variable whose values mostly lie between 1 and 49; a value like 94 is quite unnatural, so it is an outlier. In real-life data we get many outliers. We have to know how to deal with them, otherwise they may cause a misrepresentation of the entire dataset.

Now, choosing a data point as an outlier depends on your purpose. Suppose a company had noticed that its yearly sale was around 10 – 20 Lakhs, and the owner decides that a sale below 5 Lakhs or above 25 Lakhs will be considered an unusual value. So, for this company an outlier is a value of more than 25 Lakhs or less than 5 Lakhs.

Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations.

Now, the question is how to detect if there is any outlier in a data? As I have said it depends on your purpose but there are some methods to detect outliers in a data. The standard method is Tukey’s method, discussed below.

Suppose we have a variable assuming the values X_{1}, X_{2}, X_{3}, …, X_{n}. From these values we first determine the first quartile (Q1), the third quartile (Q3) and the inter-quartile range (IQR = Q3 – Q1) based on the sample observations. The values outside the range (Q1 – 1.5 × IQR, Q3 + 1.5 × IQR) are considered *outliers*. The values outside the wider range (Q1 – 3 × IQR, Q3 + 3 × IQR) are known as *extreme* outliers, and the values outside (Q1 – 1.5 × IQR, Q3 + 1.5 × IQR) but inside (Q1 – 3 × IQR, Q3 + 3 × IQR) are called *mild* outliers.

Suppose we have the values 1, 60, 2, 1, 4, 4, 1, 1, 6, -30. Here Q1 = 1, Q3 = 4 and IQR = 3, so the values outside the range (-3.5, 8.5) are the outliers. Here, those are -30 and 60.
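The fences from this example can be checked directly in base R:

```r
x <- c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30)

q1  <- quantile(x, 0.25)  # 1
q3  <- quantile(x, 0.75)  # 4
iqr <- q3 - q1            # 3

lower <- q1 - 1.5 * iqr   # -3.5
upper <- q3 + 1.5 * iqr   #  8.5

x[x < lower | x > upper]  # the outliers: 60 and -30
```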

We have many functions for this in R. I prefer the function *boxplot.stats(x)$out* from the *grDevices* package. Let's see the code to understand how it works.

library(grDevices)
x <- c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30, 70)
x[which(x %in% boxplot.stats(x)$out)]

library(grDevices)
x <- c(1, 60, 2, 1, 4, 4, 1, 1, 6, -30, 70)
x[which(x %in% boxplot.stats(x)$out)]
[1] 60 -30 70

Now you may ask why we are using box-plot function here. Actually, box-plot is the visualization where we can see how the data is distributed along with if there are any outliers or not. Let’s see the box-plot below to understand that.

The red circles are the outliers in this data.

We also have some tests to detect outliers, like *dixon.test()* and *chisq.out.test()* in the **outliers** package. But I prefer Rosner's test, *rosnerTest()*, available in the **EnvStats** package.

From the box-plot we got 3 outliers. One parameter of *rosnerTest()* is *k*, the number of outliers you suspect are present. I prefer to set it to what you get from the box-plot plus 1 or 2. Here we can see 3 outliers in the box-plot, so we put k = 4. Now see how the test performs:

rosnerTest(x, k = 4, warn = F)

OUTPUT:
> rosnerTest(x, k = 4, warn = F)
Results of Outlier Test
-------------------------
Test Method: Rosner's Test for Outliers
Hypothesized Distribution: Normal
Data: x
Sample Size: 11
Test Statistics:
R.1 = 2.067720
R.2 = 2.508654
R.3 = 2.630493
R.4 = 1.816061
Test Statistic Parameter: k = 4
Alternative Hypothesis: Up to 4 observations are not from the same Distribution.
Type I Error: 5%
Number of Outliers Detected: 3
  i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
1 0 10.909091 28.577804    70      11 2.067720   2.354730    TRUE
2 1  5.000000 21.924112    60       2 2.508654   2.289954    TRUE
3 2 -1.111111 10.982309   -30      10 2.630493   2.215004    TRUE
4 3  2.500000  1.927248     6       9 1.816061   2.126645   FALSE

Here as you can see, 3 outliers are detected and we also get the values corresponding to the TRUE values of Outlier from the last table.

So, these are the methods to detect, visualize and test for outliers in data.

Now, the main question that should arise is, how should we deal with these outliers?

Presence of outliers may lead us to a bad result. So either we have to remove them or we have to replace them with some representative values.

There are several statistical methods to deal with outliers. Let’s discuss about the methods.

Suppose we have 1003 data on height (in ft) of adult people of a village. We are to find the mean height of the adult people.

height <- c((sample(seq(4, 8, 0.001), 1000, replace = T)), 101.51, -0.2346, 601)
boxplot(height, outcol = "red", outcex = 1.5)

Here we can see from the box-plot that there are three outliers, so we check which values they are. Since we have 1003 observations, removing 3 of them does not affect the data too much, so we remove/trim them from the data. Notice that one unusual value is negative; we must remove it first. Then we perform the trimming.

height <- height[-which(height < 0)]
rosnerTest(height, k = 4)
Results of Outlier Test
-------------------------
Test Method: Rosner's Test for Outliers
Hypothesized Distribution: Normal
Data: height
Sample Size: 1002
Test Statistics:
R.1 = 31.163267
R.2 = 29.440518
R.3 = 1.694513
R.4 = 1.696956
Test Statistic Parameter: k = 4
Alternative Hypothesis: Up to 4 observations are not from the same Distribution.
Type I Error: 5%
Number of Outliers Detected: 2
  i   Mean.i      SD.i   Value Obs.Num     R.i+1 lambda.i+1 Outlier
1 0 6.687901 19.070918 601.000    1002 31.163267   4.040471    TRUE
2 1 6.094183  3.240969 101.510    1001 29.440518   4.040225    TRUE
3 2 5.998767  1.179827   7.998     838  1.694513   4.039978   FALSE
4 3 5.996766  1.178719   7.997     919  1.696956   4.039731   FALSE

> height <- height[-c(1001, 1002)]
> boxplot(height, main = "After removing the outliers")

Now we can see the outliers are removed and we can perform analysis as required to get a better result.

You can easily visualize the problem of outliers comparing the histograms of height before and after removing the outliers.

So, this is the concept of trimming.

Now, if there is only a small number of observations and each data point is as important as the others, we don't use trimming. Instead, we replace the outliers with some representative values like the mean, median, minimum or maximum value.

Let's replace the outliers with the minimum and maximum values. This method is also known as Tukey's method. Here we first remove the outliers. After removing them, we have a minimum and a maximum value. We assign that minimum value to the outliers which were less than Q1 – 1.5 × IQR, and assign that maximum value to those which were greater than Q3 + 1.5 × IQR. This way we can replace the values of the outliers. Let's check how it helps to get rid of outliers, using the same example as before but with 13 samples.

height <- c((sample(seq(4, 8, 0.001), 10, replace = T)), 0.00125, 25.48, 60)
boxplot(height, outcol = "red", outcex = 1.5)

Here we have three suspicious values. Let’s replace them with minimum and maximum values.

rosnerTest(height, k = 4, warn = F)
height <- sort(height)
for (i in 1:length(height)) {
  if (height[i] > quantile(height, 0.75) + 1.5 * IQR(height)) {
    height[i] <- max(height[1:(i-1)])
  }
}
height <- sort(height, decreasing = T)
for (i in 1:length(height)) {
  if (height[i] < quantile(height, 0.25) - 1.5 * IQR(height)) {
    height[i] <- min(height[1:(i-1)])
  }
}
boxplot(height, main = "After replacing the outliers")

> rosnerTest(height, k = 4, warn = F)
Results of Outlier Test
-------------------------
Test Method: Rosner's Test for Outliers
Hypothesized Distribution: Normal
Data: height
Sample Size: 13
Test Statistics:
R.1 = 3.092005
R.2 = 2.959174
R.3 = 2.589846
R.4 = 1.962536
Test Statistic Parameter: k = 4
Alternative Hypothesis: Up to 4 observations are not from the same Distribution.
Type I Error: 5%
Number of Outliers Detected: 3
  i    Mean.i      SD.i    Value Obs.Num    R.i+1 lambda.i+1 Outlier
1 0 11.617635 15.647570 60.00000      13 3.092005   2.462033    TRUE
2 1  7.585771  6.047034 25.48000      12 2.959174   2.411560    TRUE
3 2  5.959023  2.300435  0.00125      11 2.589846   2.354730    TRUE
4 3  6.554800  1.241659  4.11800       8 1.962536   2.289954   FALSE

> height <- sort(height)
> for(i in 1:length(height)){
+   if(height[i] > quantile(height,0.75)+1.5*IQR(height)){
+     height[i] <- max(height[1:(i-1)])
+   }
+ }
> height <- sort(height,decreasing = T)
> for(i in 1:length(height)){
+   if(height[i] < quantile(height,0.25)-1.5*IQR(height)){
+     height[i] <- min(height[1:(i-1)])
+   }
+ }
> boxplot(height,main = "After replacing the outliers")

Now we have the dataset free from the outliers’ effects. We can perform required analysis.

Let’s visualize the effect of replacing the outliers.

So this is the method of replacing.

Now, replacing with the minimum and maximum values results in a heavy-tailed distribution, while replacing with the mean or median increases the number of observations near the centre of the dataset. So if your data is bell-shaped you can use the mean or median to concentrate values at the centre, and if your data is "U"-shaped you can use the maximum and minimum values. The choice is yours, according to how the data is distributed.

A better approach is *capping*. What do we do here? We replace the outliers below the 5^{th} percentile with the 5^{th} percentile, and the outliers above the 95^{th} percentile with the 95^{th} percentile of the data.
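A minimal capping sketch (the data here is simulated for illustration):

```r
set.seed(42)
height <- c(rnorm(100, mean = 5.5, sd = 0.5), 0.001, 60)  # two planted outliers

caps <- quantile(height, c(0.05, 0.95))
# Raise values below the 5th percentile, lower values above the 95th
height_capped <- pmin(pmax(height, caps[[1]]), caps[[2]])

range(height)         # still contains the extreme values
range(height_capped)  # now bounded by the 5th and 95th percentiles
```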

Now, suppose a company finds that a particular dealer continuously makes some huge loss in every month. So, the company can’t neglect that one. Then the company has to analyze why he is making these losses. Similarly, if there are many outliers then separate analysis should be done for outliers. One way to perform this is cluster analysis. Here we concentrate to each cluster and can have separate analysis for the outliers.

**Note:**

What I have discussed in the outlier detection section is for univariate data. What do we do in a bi-variate or multivariate setup? Let's take the simplest one, i.e. bi-variate data.

As we know the simple representation of bi-variate data is scatter plot. We can see a pattern in this plot. Suppose the pattern is linear (see the image below).

Here we can see that, except for the first few points, we have a moderate linear pattern. The first few points are outliers because they are far from the other data points. Statistically, we can verify whether they are outliers using Cook's distance. R provides the function *cooks.distance()* in the **stats** package.

cd <- cooks.distance(lm(y ~ x))
plot(cd, pch = 1, cex = 1)
abline(h = 4 * mean(cd, na.rm = T), col = "blue")
text(x = 1:length(cd) + 2, y = cd,
     labels = ifelse(cd > 4 * mean(cd, na.rm = T), names(cd), ""), col = "blue")

Analysing Cook's distance here, we can see the outliers with their indexes (blue marks).

Now, to deal with these outliers, we can replace the y values of the outlier pairs with some representative values. Here the representative value will be the predicted value of those *y*'s at the corresponding *x*'s, where the prediction is based on the other values, excluding the outliers.
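A sketch of that replacement step (x and y here are simulated stand-ins for the scatter-plot data):

```r
set.seed(7)
x <- 1:30
y <- 2 * x + rnorm(30)
y[1:3] <- y[1:3] + 40              # plant a few outliers at the start

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)
bad <- which(cd > 4 * mean(cd))    # the usual 4-times-mean rule of thumb

# Refit the line on the clean points only, then impute the flagged y values
refit  <- lm(y ~ x, subset = setdiff(seq_along(x), bad))
y[bad] <- predict(refit, newdata = data.frame(x = x[bad]))
```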

We can use several imputation methods, like kNN (k-Nearest Neighbours) and mice (Multivariate Imputation by Chained Equations), to deal with such outliers in bi-variate and, indeed, multivariate data.
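A hand-rolled sketch of the kNN idea in base R (packages such as VIM or mice provide production-quality versions; the data here is made up): the flagged y value is replaced by the mean y of its k nearest neighbours in x.

```r
k <- 3
x <- 1:10
y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 500)  # 500 is the planted outlier

# Flag outliers with the same Tukey fence used earlier
bad <- which(y > quantile(y, 0.75) + 1.5 * IQR(y))

for (i in bad) {
  neighbours <- order(abs(x - x[i]))           # closest points in x first
  neighbours <- setdiff(neighbours, bad)[1:k]  # skip other flagged points
  y[i] <- mean(y[neighbours])                  # impute from the k neighbours
}
y[10]  # 16: the mean of its three nearest clean neighbours (18, 16, 14)
```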

I hope you have got a clear idea about what an outlier is, how to detect them and how to deal with them. It's very important to take outliers into account at the time of data preparation; otherwise, they will mislead you far from a good interpretation.

The post Outlier Detection Techniques Using R appeared first on StepUp Analytics.

The post 11 Reasons Why You Should Learn R Programming appeared first on StepUp Analytics.

R is a **statistical** programming language developed by **scientists**, with open-source libraries for statistics, machine learning, and data science. **R lends itself well to business because of its depth of topic-specific packages and its communication infrastructure.** R has packages covering a wide range of topics such as econometrics, finance, and time series.

R has best-in-class tools for visualization, reporting, and interactivity, which are as important to business as they are to science. Because of this, R is well-suited for scientists, engineers and business professionals.

- Business Capability (1 = Low, 10 = High)
- Ease of Learning (1 = Difficult, 10 = Easy)
- Cost (Free/Minimal, Low, High)
- Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth)

**REASON 01: R IS AN OPEN-SOURCE, FREELY AVAILABLE TOOL.**

Unlike SAS and Matlab, one can freely install, use, update, clone, modify, redistribute and resell R. This saves lots of money, and it also allows for easy upgrades, which is useful for a statistical programming language.

**REASON 02: R IS A CROSS-PLATFORM, OS-COMPATIBLE TOOL.**
R can be run on Windows, Mac OS X and Linux. It can also import data from Microsoft Excel, Microsoft Access, MySQL, SQLite, Oracle and many other programs as well.

**REASON 03: R IS A POWERFUL SCRIPTING LANGUAGE.**

As such, R can handle large, complex data sets. R is also the best language to use for heavy, resource-intensive simulations and it can be used on high-performance computer clusters.

**REASON 04: R HAS WIDESPREAD ACCLAIM.**

With an estimated 2 million users, R is one of the top programming languages of 2017.

**REASON 05: R IS HIGHLY FLEXIBLE AND EVOLVED.**

Many new developments in statistics first appear as R packages.

**REASON 06: PUBLISHERS LOVE R.**
R integrates easily with document preparation systems like LaTeX. That means statistical output and graphics from R can be embedded into word-processing documents.

**REASON 07: R HAS A HUGE, VIBRANT COMMUNITY AND RESOURCE BANK**

R has a global community of passionate users who regularly interact on discussion forums and attend conferences. In addition, about 2,000 free libraries are available for your unlimited use, covering statistical areas of finance, cluster analysis, high-performance computing and more.

**REASON 08: LEARNING R IS EASY WITH THE TIDYVERSE.**
Learning R used to be a major challenge. Base R was a complex and inconsistent programming language, where structure and formality were not the top priority as in other programming languages. This all changed with the "tidyverse", a set of packages and tools that share a consistently structured programming interface.

**REASON 09: R COMMUNITY SUPPORT **

Being a powerful language alone is not enough. To be successful, a language needs community support. We'll hit on two ways that R excels in this respect: CRAN and the R community.

**REASON 10: R HAS HEART **

We already talked about the infrastructure, the tidyverse, that enables the ecosystem of applications to be built using a consistent approach. It's this infrastructure that brings life into your data analysis. The tidyverse enables:

- Data manipulation (dplyr, tidyr)
- Working with data types (stringr for strings, lubridate for date/datetime, forcats for categorical/factors)
- Visualization (ggplot2)
- Programming (purrr, tidyeval)
- Communication (Rmarkdown, shiny)

When tools such as dplyr and ggplot2 came to fruition, it made the learning curve much easier by providing a consistent and structured approach to working with data. As Hadley Wickham and many others continued to evolve R, the tidyverse came to be, which includes a series of commonly used packages for data manipulation, visualization, iteration, modelling, and communication. The end result is that R is now much easier to learn (we’ll show you in our next article!)

**R FOR BUSINESS**

**RMARKDOWN**

Rmarkdown is a framework for creating reproducible reports that have since been extended to building blogs, presentations, websites, books, journals, and more. It’s the technology that’s behind this blog, and it allows us to include the code with the text so that anyone can follow the analysis and see the output right with the explanation. What’s really cool is that the technology has evolved so much. Here are a few examples of its capability:

- rmarkdown for generating HTML, Word and PDF reports
- rmarkdown for generating presentations
- flexdashboard for creating web apps via the user-friendly Rmarkdown format.
- blogdown for building blogs and websites
- bookdown for creating online books
- Interactive documents
- Parameterized reports for generating custom reports (e.g. reports for a specific geographic segment, department, or segment of time)

**REASON 11: THE R COMMUNITY IS AWESOME!**


**How companies are using R**

- Ford uses R to improve the design of its vehicles.
- Twitter uses R to monitor user experience.
- The US National Weather Service uses R to predict severe flooding.
- The Human Rights Data Analysis Group uses R to quantify the impact of war.
- R is being used by The New York Times to create infographics.
- Google uses R to calculate the ROI of advertising campaigns.
- Facebook uses R to analyze status updates and its social network graph.

**WHAT SHOULD YOU DO?**
Don't make the decision tougher than it is. Think about where you are coming from:

**Are you a computer scientist or software engineer?** If yes, choose Python.

**Are you an analytics professional or mechanical/industrial/chemical engineer looking to get into data science?** If yes, choose R.

**Think about what you are trying to do:**

**Are you trying to build a self-driving car?** If yes, choose Python.

**Are you trying to communicate business analytics throughout your organization?** If yes, choose R.

**R can also be used in a big data context.** You often hear that Scala and Python are great, and that is true, but you could also consider R when you're working on visualization or data exploration. See this question and its answers for more information: Is R considered unsuitable for Big Data when compared to Python?

Of course, tools like Mahout will always be worth your time, and for the professional goals you have in mind, it's an "and-and" story. My advice would be to check some companies and/or industries that you would like to work for, and then see how much Mahout is actually used versus R, so you can prioritize your learning.

The post 11 Reasons Why You Should Learn R Programming appeared first on StepUp Analytics.

The post stringR package in R for Handling Strings appeared first on StepUp Analytics.

]]>**Suppose we want to count the length of the individual states**

Mainly four kinds of string manipulations can be performed by the functions incorporated in this package:

- Allow us to manipulate individual characters within a string in character vectors.
- Whitespace tools to add, remove, and manipulate whitespace.
- Locale-sensitive operations whose operations will vary from locale to locale.
- Pattern matching functions can recognize four engines of pattern description.

**Installation:**

install.packages("stringr")

library(stringr)

There are a large number of functions incorporated in this package. Few important of them are discussed below.

There are three functions that are used to convert the case of strings.

**str_to_upper()** – This converts the entered string into upper case. The syntax is:

**str_to_upper(string, locale = “en”)**

Here “en” stands for English, which is by default.

str_to_upper("I am a good girl") ## [1] "I AM A GOOD GIRL"

We see immediately all the characters in the strings turn into upper case.

**str_to_lower()** – This converts the entered string into lower case. The syntax is:

**str_to_lower(string, locale = “en”)**

Here “en” stands for English, which is by default.

str_to_lower("I am a good girl") ## [1] "i am a good girl"

**str_to_title()** – This converts the entered string into the proper case, in the sense that the first character of each word in the string is capitalized and the rest are in lower case. The syntax is:

**str_to_title(string, locale = “en”)**

Here “en” stands for English, which is by default.

str_to_title("I am a good girl") ## [1] "I Am A Good Girl"

**str_c()** – This is used to concatenate multiple strings into a single string. The syntax is:

**str_c(…, sep = “”)**

The “sep” stands for the separator. It is used when we want to concatenate the strings keeping any kind of separator between each term.

str_c("I", "am", "a", "good", "girl") ## [1] "Iamagoodgirl"

Here we see that output is shown but it is too congested to read. No space is there between words. Now, to make it look much more readable, we include a separator.

str_c("I", "am", "a", "good", "girl", sep = " ") ## [1] "I am a good girl"

Now, this looks good.

**str_length()** – This is used to find out the length of the input string, or in other words, the total number of characters in the string. Note that the spaces between words are also counted. The syntax is:

**str_length(string)**

str_length("I am a good girl") ## [1] 16

This works well as the number of characters in the above string is 16, which includes spaces also.

**str_count()** – This is used to count the number of occurrences of the specified pattern in the given string. The syntax is:

**str_count(string, pattern = “”)**

The pattern to be mentioned can be anything, characters, numbers or special characters.

str_count(c("apple", "pears", "orange", "banana"), pattern = "p")## [1] 2 1 0 0

Here we have entered names of four fruits and we have asked to count the number of times “p” occurs in these individual fruit names. The output shows there are 2 times “p” occurred in “apple”, 1 time in “pears” and did not occur anytime in the rest of the names, which is true.

**str_detect()** – Similar to the above function, except that it returns output of boolean type. This function tells us whether the given strings contain the given pattern or not, so the output is "TRUE" or "FALSE". The syntax is:

**str_detect(string, pattern)**

The syntax remains the same as the previous one. Hence we keep the same example as before and try to identify the difference between both the cases.

str_detect(c("apple", "pears", "orange", "banana"), pattern = "p") ## [1] TRUE TRUE FALSE FALSE

Here the output comes to be "TRUE" or "FALSE": the first two strings contain the pattern and the rest don't.

**str_split()** – This does exactly the opposite work of str_c(). It splits the given string by the separator given. The syntax is:

**str_split(string, pattern, n = Inf, simplify = FALSE)**

str_split("I am a good girl", " ") ## [[1]] ## [1] "I" "am" "a" "good" "girl"

Here we split the string by the space. So we get all the individual words.

- str_replace() – This replaces the first occurrence of the pattern by some other given pattern. All the other occurrences remain as it is. The syntax is:

**str_replace(string, pattern, replacement)**

str_replace("apple", "p", "b") ## [1] "abple"

We see only the first occurrence of “p” gets replaced by “b”, the next one remains.

- str_replace_all() – This replaces all the occurrence of the pattern by some other given pattern. The syntax is:

**str_replace_all(string, pattern, replacement)**

str_replace_all("apple", "p", "b") ## [1] "abble"

So we see that both the “p” got replaced by “b”.

**str_order()** – This orders the given strings, either increasing or decreasing, returning the index numbers in order of occurrence. The syntax is:

**str_order(x, decreasing = FALSE, na_last = TRUE, locale = “”, …)**

Here decreasing = FALSE means that the order should not be decreasing, i.e., it should be increasing order.

str_order(c("apple", "pears", "orange", "banana"), decreasing = T) ## [1] 2 3 4 1

As the inputs are strings, the arrangement is alphabetical. Since decreasing is set to TRUE, "pears" comes first (it is last alphabetically), then "orange", and so on. Accordingly, the output shows the 2nd element first, then the 3rd, then the 4th, and lastly the 1st — the result is given index-wise.

It does the same work as str_order(), but the difference is that it returns the strings themselves instead of their indexes. This is more convenient to read, as it is difficult to track indexes when the input set is large.

str_sort(c("apple", "pears", "orange", "banana"), decreasing = T) ## [1] "pears" "orange" "banana" "apple"
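The index/value relationship between the two functions can be checked in base R, whose order() and sort() behave analogously — a sketch with the same toy fruits vector (for plain ASCII strings the base and stringr orderings agree; stringr uses ICU collation, so results can differ for accented characters):

```r
fruits <- c("apple", "pears", "orange", "banana")
idx <- order(fruits, decreasing = TRUE)  # indices, like str_order()
idx           # 2 3 4 1
fruits[idx]   # the same strings that sort() / str_sort() would return
```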

It pads the string with the given padding character, up to the total width mentioned, on the side of the string specified. The syntax is:

**str_pad(string, width, side = c(“left”, “right”, “both”), pad = ” “)**

Here "width" denotes the total width of the padded string (not the number of pad characters added), and "side" denotes on which side the padding is incorporated.

str_pad("abc", width = 5, side = "left", pad = " ") ## [1] " abc"

So here we see that "abc" is padded with spaces on the left up to a total width of 5, i.e., 2 spaces are added, which was desired.

It does exactly the opposite of str_pad(): it removes the extra whitespace around the string. The syntax is:

**str_trim(string, side = c(“both”, “left”, “right”))**

str_trim(" abc ", side = "both") ## [1] "abc"

So here we remove all the extra spaces that were present on both the sides of abc, which was desired.
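Base R has rough counterparts for both of these too — a sketch, assuming only base functions:

```r
# formatC() pads to a total width (on the left by default), roughly like
# str_pad(); trimws() strips surrounding whitespace, like str_trim().
padded  <- formatC("abc", width = 5)          # "  abc"
trimmed <- trimws("  abc  ", which = "both")  # "abc"
```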

Now let us import a data set and imply these string operations on the fields.

setwd("C:/Users/Prithac/Desktop/step up analytics")
cust <- read.csv("Customer_Info.csv")

In this dataset, there is personal information of 50 customers across 5 fields. One by one, I apply the above-mentioned functions to this dataset.

1. Suppose we want to keep the State names in all capital letters,

cust$State <- str_to_upper(cust$State) head(cust) ## Customer State Education Gender ## 1 BU79786 WASHINGTON Bachelor F ## 2 QZ44356 ARIZONA Bachelor F ## 3 AI49188 NEVADA Bachelor F ## 4 WW63253 CALIFORNIA Bachelor M ## 5 HB64268 WASHINGTON Bachelor M ## 6 OC83172 OREGON Bachelor F ## EmploymentStatus ## 1 Employed ## 2 Unemployed ## 3 Employed ## 4 Unemployed ## 5 Employed ## 6 Employed

2. Suppose we want to concatenate the Education and Employment fields into one whole field, separated by “-”,

cust$Concat <- str_c(cust$Education, cust$EmploymentStatus, sep = "-") head(cust) ## Customer State Education Gender ## 1 BU79786 WASHINGTON Bachelor F ## 2 QZ44356 ARIZONA Bachelor F ## 3 AI49188 NEVADA Bachelor F ## 4 WW63253 CALIFORNIA Bachelor M ## 5 HB64268 WASHINGTON Bachelor M ## 6 OC83172 OREGON Bachelor F ## EmploymentStatus ## 1 Employed ## 2 Unemployed ## 3 Employed ## 4 Unemployed ## 5 Employed ## 6 Employed ## Concat ## 1 Bachelor- Employed ## 2 Bachelor- Unemployed ## 3 Bachelor- Employed ## 4 Bachelor- Unemployed ## 5 Bachelor- Employed ## 6 Bachelor- Employed

3. Suppose we want to count the length of the individual states,

cust$Leng <- str_length(cust$State) head(cust) ## Customer State Education Gender ## 1 BU79786 WASHINGTON Bachelor F ## 2 QZ44356 ARIZONA Bachelor F ## 3 AI49188 NEVADA Bachelor F ## 4 WW63253 CALIFORNIA Bachelor M ## 5 HB64268 WASHINGTON Bachelor M ## 6 OC83172 OREGON Bachelor F ## EmploymentStatus ## 1 Employed ## 2 Unemployed ## 3 Employed ## 4 Unemployed ## 5 Employed ## 6 Employed ## Concat Leng ## 1 Bachelor- Employed 10 ## 2 Bachelor- Unemployed 7 ## 3 Bachelor- Employed 6 ## 4 Bachelor- Unemployed 10 ## 5 Bachelor- Employed 10 ## 6 Bachelor- Employed 6

4. Suppose we want to count the pattern "M" in the Gender field, to flag which customers are male,

cust$count <- str_count(cust$Gender, pattern = "M") head(cust) ## Customer State Education Gender ## 1 BU79786 WASHINGTON Bachelor F ## 2 QZ44356 ARIZONA Bachelor F ## 3 AI49188 NEVADA Bachelor F ## 4 WW63253 CALIFORNIA Bachelor M ## 5 HB64268 WASHINGTON Bachelor M ## 6 OC83172 OREGON Bachelor F ## EmploymentStatus ## 1 Employed ## 2 Unemployed ## 3 Employed ## 4 Unemployed ## 5 Employed ## 6 Employed ## Concat Leng ## 1 Bachelor- Employed 10 ## 2 Bachelor- Unemployed 7 ## 3 Bachelor- Employed 6 ## 4 Bachelor- Unemployed 10 ## 5 Bachelor- Employed 10 ## 6 Bachelor- Employed 6 ## count ## 1 0 ## 2 0 ## 3 0 ## 4 1 ## 5 1 ## 6 0

All the "1"s indicate that the customer is male; "0" means not male.

5. Suppose we want to replace the first occurrence of "W" in the customers' IDs with "Y" (str_replace() replaces only the first match),

cust$replace <- str_replace(cust$Customer, pattern = "W", replacement = "Y")

6. Suppose we want to sort the customers in ascending order,

cust1 <- str_sort(cust$Customer, decreasing = F)
head(cust1)
## [1] "AI49188" "AO98601" "BQ94931" "BU27331" "BU79786" "BW63560"

7. We can see there are lots of unwanted extra spaces in EmploymentStatus column. To remove them,

cust$EmploymentStatus <- str_trim(cust$EmploymentStatus, "both")

Hence we can see it's easy to manipulate strings in R using the **stringr** package.

The post stringR package in R for Handling Strings appeared first on StepUp Analytics.

The post Naive Bayes Classifier and Its Application Using R appeared first on StepUp Analytics.

There is a probability of a single event, calculated as the proportion of cases in which that particular event happens. Similarly, the probability of a group of events is the proportion of cases in which those events occur together.

Another is conditional probability: if it is known that one event has already happened, what is the probability that another event happens given that? For example, if A is the first event and B is the second event, then P(B|A) is the probability of event B occurring given that event A has occurred.

The equation goes like,

P(B | A) = P(B) * P(A | B) / P(A)

where,

A is the first event

B is the second event

P(B|A) is the probability of event B occurring given that event A has occurred

P(A|B) is the probability of event A occurring given that event B has occurred

P(A) is the probability of event A taking place

P(B) is the probability of event B taking place

This result is called Bayes' Theorem in probability. The Naive Bayes Classifier is built on this theorem.
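As a quick numeric check of the equation, here is a small worked example with made-up probabilities (the numbers are purely illustrative):

```r
# Hypothetical values: P(B) is the prior, P(A|B) the likelihood of A given B,
# and P(A) the overall probability of A.
p_B         <- 0.01
p_A_given_B <- 0.95
p_A         <- 0.05
p_B_given_A <- p_B * p_A_given_B / p_A  # Bayes' theorem
p_B_given_A  # 0.19
```

So observing A raises the probability of B from the prior 0.01 to 0.19.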

The algorithm assumes that the attributes of the data are independent of each other, although in reality they may be dependent in some way. When this independence assumption roughly holds, Naive Bayes performs very well. It works especially well when all the attributes are categorical, though it can also be used with continuous attributes; in the case of numeric attributes, it makes the further assumption that each numeric variable is normally distributed within each class.
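To make the independence assumption concrete, here is a toy, hand-rolled sketch of the naive computation on made-up categorical data (an illustration of the idea only, not the e1071 implementation used later):

```r
# Made-up categorical data: predict "play" from "outlook" and "windy".
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "rain"),
  windy   = c("yes", "no", "yes", "no", "no", "yes"),
  play    = c("no", "yes", "yes", "yes", "yes", "no")
)
prior <- prop.table(table(toy$play))  # a-priori class probabilities
cond  <- lapply(toy[c("outlook", "windy")], function(col)
  prop.table(table(toy$play, col), margin = 1))  # P(attribute | class)
# Score a new case (sunny and windy) for each class by multiplying the
# class-conditional probabilities -- the "naive" independence step:
new <- c(outlook = "sunny", windy = "yes")
score <- sapply(names(prior), function(cl)
  prior[[cl]] * prod(sapply(names(new), function(v) cond[[v]][cl, new[[v]]])))
posterior <- score / sum(score)
posterior  # no: 0.8, yes: 0.2
```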

In our example case, we will work on data containing 9134 customer records. The attributes provided for each customer are "**Customer ID**", "**State**", "**Education**", "**Employment Status**", "**Gender**", "**Location**", "**Marital Status**", "**Vehicle**" and "**Income**".

The objective of building our model is to predict the income level of customers. The income is divided into two levels, high and low: customers with income below 35000 are considered low-income, and those with income of 35000 or more are high-income.

The steps to be followed for the model building :

1. Import the data.

2. Data cleaning is an important part.

3. Creating a derived column with respect to the income column. The new column indicates only the income levels (high or low), based on the assumption made above.

4. Divide the data in a 7:3 ratio. The first part is the training data, used to make the model learn the data trend. The second part is the test data, on which we predict the income levels.

5. Then comes the step to see the predictions made by the model and check how accurate these predictions are.

So as explained above we start with our model building from the first step onwards.

setwd("C:/Users/Prithac/Desktop/step up analytics")
data <- read.csv("NB_data.csv")

The link to download the data set is given: **Click here**

As mentioned above, our target is to predict the income levels of customers. So we create a column stating the income levels, i.e., high and low, according to the income mentioned.

Let us set that if a customer has an income of 35000 or more, we put him in the "High" slot; otherwise we set him "Low".

data1 <- data
data1$inc <- ifelse(data$Income >= 35000, "High", "Low")

Now, we remove the 9th variable (Income), as the income level we derived from it already captures the information we need.

data1 <- data1[, -9]

A few variables irrelevant to this model should be removed: "Customer", "Gender" and "Marital Status". These variables should have no direct bearing on the income of the customers.

data1 <- data1[, c(-1,-5,-7)]

Checking the structure of the variables,

str(data1)

We see there are 9134 records and 6 variables structured in a data frame. The only problem is that the variable "inc" is of the character data type. As this variable has only two levels, High and Low, we convert it into the factor data type.

data1$inc <- as.factor(data1$inc)

Now, finally, the data looks good.

Installing the libraries,

install.packages('e1071', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/e1071_1.6-8.zip')
install.packages('caret', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/caret_6.0-79.zip')

Loading them,

library(e1071)
library(caret)

Now dividing the dataset into training and testing set, keeping the ratio as 7:3,

set.seed(2)
random <- sample(2, nrow(data1), prob = c(0.7, 0.3), replace = T)
data_train <- data1[random == 1, ]
data_test <- data1[random == 2, ]

Running the naiveBayes function, keeping "inc" as the dependent variable and all other 5 variables as independent variables (indicated by the "." sign). We fit the model on the training set first.

data_nb <- naiveBayes(inc ~ ., data = data_train)
data_nb

On running “data_nb” we get to see the summary of the model run. We read it as,

Under the heading "A-priori probabilities", we see that about 49% of the training-set customers have low income and about 51% have high income.

Under the heading “Conditional probabilities”, we get the conditional probabilities of all the variables individually.

If the State is "Arizona", the probability of the income being high is greater than the probability of it being low. Similarly, if the State is "California", the probability of the income being low is greater than the probability of it being high. We read the rest in this manner.

Next, if the Education is "Bachelor", the probability of the income being low is greater than the probability of it being high. For "Master", in contrast, the probability of the income being high is much greater than the probability of it being low, which is logical.

We can read the other observations in the same way.

Now running the model on the test data and getting the predictions,

pred_nb <- predict(data_nb, data_test)

The variable “pred_nb” stores the high and low levels corresponding to all the records. To read it properly let’s create a confusion matrix out of it,

confusionMatrix(table(pred_nb, data_test$inc))

The matrix shows a very good result.

- The diagonal values are the numbers of correct predictions and the off-diagonal values the wrong ones. We see there are far fewer wrong predictions (352 + 0) than correct ones (1382 + 1047).
- The accuracy is high (about 87%), which is a good indication.
- The **p-value** is much lower than 0.05 (< 2.2e-16), which is desired.
- The Kappa statistic is also high (around 75%), indicating a large gap between the model's accuracy and the accuracy expected from random guessing.
- Sensitivity and Specificity are also close to each other.

Hence with all these observations, we can say it is a good model.
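The headline accuracy can be recomputed directly from the counts quoted above (the exact row/column layout of the matrix is assumed here for illustration):

```r
# Correct predictions on the diagonal (1382, 1047), wrong ones off it (352, 0).
cm <- matrix(c(1382,    0,
                352, 1047),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("High", "Low"),
                             actual    = c("High", "Low")))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 2)  # 0.87
```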

**1.** For State, customers living in "Arizona", "Nevada", "Oregon" and "Washington" have a higher probability of income being high than low. Those living in "California" show the opposite result. But we see that the gap between these two levels is small for all the States.

**2.** For Education, customers who hold the degrees of "Bachelor", "Doctor" and "High School or Below" have a higher probability of income being low than high. Those holding "Master" show the opposite insight. Also, "College" customers have an income of a middle standard — neither high nor low — which is meaningful.

**3.** For Employment Status, customers who are "Disabled", "Retired", "Unemployed", or on "Medical Leave" have low income, and their probability of a high income is exactly 0. The extreme opposite is the case for "Employed" customers: their probability of a high income is nearly 1, and that of a low income is correspondingly small. This makes sense if we compare it with real life.

**4.** For Location, customers living in "Rural" and "Urban" areas have a much higher probability of income being high than low. On the other hand, the ones living in "Suburban" areas show the opposite result: the probability of income being low is greater than being high. The gaps between the levels are much larger here.

**5.** For Vehicle, customers having a "Four-Door Car", "Luxury Car" or "Two-Door Car" have a higher probability of income being high than low. Customers having a "Luxury SUV", "Sports Car" or "SUV" show just the opposite result. But we see that the gap between these two levels is small for all the Vehicles.


The post dplyr package in R With Implementation appeared first on StepUp Analytics.

The dplyr package, widely used in R, is essentially a grammar of data manipulation. It is written and maintained by Hadley Wickham.

The package helps in the transformation and summarization of data frames (i.e., data recorded in tabular form with rows and columns). It provides the most important verbs available to users working in R. Besides, it allows users to work through the same interface with data in different forms, be it a data frame, a table, or a database itself.

The code to install this package:

install.packages('dplyr', repos ='https://cran.rstudio.com/bin/windows/contrib/3.4/dplyr_0.7.4.zip')

library(dplyr)

Now we will discuss a set of functions in the package, which performs common data manipulation operations.

1. Filter ()

2. Select ()

3. Mutate ()

4. Arrange ()

5. Summarize ()

6. Group_by ()

7. The pipe operator (%>%)

We explain the above functions using a data set available in R – “flights”.

To get this data set we have to install and then call two packages,

install.packages('nycflights13', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/nycflights13_0.2.2.zip')

install.packages('tidyverse', repos = 'https://cran.rstudio.com/bin/windows/contrib/3.4/tidyverse_1.2.1.zip')

library(nycflights13)

library(tidyverse)

Storing the dataset ‘flights’ with the name ‘data’,

data <- nycflights13::flights

This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.

This is a vital step for all analytical models. The imported data is not clean; we can see a large number of missing values, so we have to perform missing value treatment.

To check if there are missing values,

sum(is.na(data)) ## [1] 46595

We see there are 46595 NA values, which are considered missing. We remove the rows containing them and store the remaining dataset in a new object 'data1',

data1 <- na.omit(data)

Now we are left with 327346 records to work on.

Secondly, we see that variables like "dep_delay" and "arr_delay" have negative values, which ultimately make no sense. The logic being,

‘dep_delay’ = ‘dep_time’ – ‘sched_dep_time’ , should be the case.

But a negative 'dep_delay' means that 'sched_dep_time' is greater than 'dep_time', i.e., the flight departed before time, so it is effectively not delayed. Hence negative values mean nothing relevant under "dep_delay".

Same logic goes for “arr_delay”. Hence negative means nothing relevant for this as well.

So we replace the negative values with 0 for both the fields, showing that there is no delay in those flights.

data1$dep_delay <- ifelse(data1$dep_delay < 0, 0, data1$dep_delay)
data1$arr_delay <- ifelse(data1$arr_delay < 0, 0, data1$arr_delay)

Thirdly, we see the structure of the data is not appropriate,

str(data1)

## Classes 'tbl_df', 'tbl' and 'data.frame': 327346 obs. of 19 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ... ## $ dep_delay : num 2 4 2 0 0 0 0 0 0 0 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ... ## $ arr_delay : num 11 20 33 0 0 12 19 0 0 8 ... ## $ carrier : chr "UA" "UA" "AA" "B6" ... ## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... ## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... ## $ origin : chr "EWR" "LGA" "JFK" "JFK" ... ## $ dest : chr "IAH" "IAH" "MIA" "BQN" ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 6 5 6 6 6 6 ... ## $ minute : num 15 29 40 45 0 58 0 0 0 0 ... ## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ... ## - attr(*, "na.action")=Class 'omit' Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ... ## .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...

We have to correct the data types in a few fields,

data1$carrier <- as.factor(data1$carrier)
data1$flight <- as.factor(data1$flight)
data1$tailnum <- as.factor(data1$tailnum)
data1$origin <- as.factor(data1$origin)
data1$dest <- as.factor(data1$dest)

Fourthly, we convert the data set into a data frame, to work easily with it,

data2 <- as.data.frame(data1)
str(data2)

## 'data.frame': 327346 obs. of 19 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ... ## $ dep_delay : num 2 4 2 0 0 0 0 0 0 0 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ... ## $ arr_delay : num 11 20 33 0 0 12 19 0 0 8 ... ## $ carrier : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ... ## $ flight : Factor w/ 3835 levels "1","2","3","4",..: 1381 1544 1041 676 424 1526 468 3693 69 265 ... ## $ tailnum : Factor w/ 4037 levels "D942DN","N0EGMQ",..: 180 524 2400 3201 2660 1141 1828 3297 2206 1177 ... ## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ... ## $ dest : Factor w/ 104 levels "ABQ","ACK","ALB",..: 44 44 58 13 5 69 36 43 54 69 ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 6 5 6 6 6 6 ... ## $ minute : num 15 29 40 45 0 58 0 0 0 0 ... ## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ... ## - attr(*, "na.action")=Class 'omit' Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ... ## .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...

Now the data looks fine to be used for our analysis. Data cleaning part is now over.

This function returns only the rows that match with the condition entered by the user. It is called the filtering process, where the rows returned as output holds the given condition true. There can be one or more than one condition given by the user at a time.

Some useful functions and operators inside filter() are: **==, >, >=, <, <=, &, |, !, xor(), is.na(), between(), near()**

The syntax being: **filter (dataset name, conditions)**

For example, if we want the records of students whose age is more than 15, from student dataset: **filter (student, age > 15)**

For multiple conditions, we mention all the conditions joined by logical operators. For example, if we want the records of female students older than 15: **filter (student, sex == "F" & age > 15)**

Working with our data set, first showing with a single condition,

f1 <- filter(data2, origin == 'EWR')
head(f1)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 554 558 0 740 728 ## 3 2013 1 1 555 600 0 913 854 ## 4 2013 1 1 558 600 0 923 937 ## 5 2013 1 1 559 600 0 854 902 ## 6 2013 1 1 601 600 1 844 850 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 ## 2 12 UA 1696 N39463 EWR ORD 150 719 5 ## 3 19 B6 507 N516JB EWR FLL 158 1065 6 ## 4 0 UA 1124 N53441 EWR SFO 361 2565 6 ## 5 0 UA 1187 N76515 EWR LAS 337 2227 6 ## 6 0 B6 343 N644JB EWR PBI 147 1023 6 ## minute time_hour ## 1 15 2013-01-01 05:00:00 ## 2 58 2013-01-01 05:00:00 ## 3 0 2013-01-01 06:00:00 ## 4 0 2013-01-01 06:00:00 ## 5 0 2013-01-01 06:00:00 ## 6 0 2013-01-01 06:00:00

Here we get only the flight records whose 'origin' is 'EWR'. There are 117127 such records.

For multiple conditions,

f2 <- filter(data2, origin == 'EWR' & dest == 'IAH')
head(f2)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 739 739 0 1104 1038 ## 3 2013 1 1 908 908 0 1228 1219 ## 4 2013 1 1 1044 1045 0 1352 1351 ## 5 2013 1 1 1205 1200 5 1503 1505 ## 6 2013 1 1 1356 1350 6 1659 1640 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 ## 2 26 UA 1479 N37408 EWR IAH 249 1400 7 ## 3 9 UA 1220 N12216 EWR IAH 233 1400 9 ## 4 1 UA 455 N667UA EWR IAH 229 1400 10 ## 5 0 UA 1461 N39418 EWR IAH 221 1400 12 ## 6 19 UA 1258 N26906 EWR IAH 218 1400 13 ## minute time_hour ## 1 15 2013-01-01 05:00:00 ## 2 39 2013-01-01 07:00:00 ## 3 8 2013-01-01 09:00:00 ## 4 45 2013-01-01 10:00:00 ## 5 0 2013-01-01 12:00:00 ## 6 50 2013-01-01 13:00:00

Here there are two conditions mentioned, and the output records must fulfil both the conditions simultaneously. That is, the flight ‘origin’ has to be ‘EWR’ and simultaneously the flight ‘dest’ has to be ‘IAH’. We get 3923 such records.

Also, we notice that the number of output records for multiple conditions is smaller than for a single condition. Hence the simple observation: the more conditions joined with & that must come true, the fewer the output records.
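The same two filters can be reproduced in base R with subset() — a sketch on a small made-up stand-in for the flights data, so it runs without any packages:

```r
toy <- data.frame(origin = c("EWR", "LGA", "EWR"),
                  dest   = c("IAH", "IAH", "ORD"))
f1 <- subset(toy, origin == "EWR")                  # single condition
f2 <- subset(toy, origin == "EWR" & dest == "IAH")  # both must hold
nrow(f1)  # 2
nrow(f2)  # 1
```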

This function is usually called when there is a large data set, i.e., the number of variables and observations are both huge. It often happens that when we work on a data set, we are not interested in the whole set of variables; instead, we want to work on a particular set of columns only. Hence it helps us extract the part of the original large data set that is of interest and work on it.

There are a few helper functions which work only inside select(). These are: **starts_with(), ends_with(), contains(), matches(), num_range()**

The syntax being: select (table name, the columns we want to display separated by commas)

For example, if we want to extract the name, sex and age of the students, select (student, name, sex, age)

We can also use the functions inside the select statement to extract the desired records.

Example, select (student, starts_with ("total")). Here we get the columns whose names start with "total" — for example, the column "total marks".

Now if we use a minus sign before the column names then it means that we want to drop those particular columns from the extracted table. For example, select (student, -name)

It will extract a table from the data frame student without the column “name”. In other words, it will extract the whole of the student table, after dropping the column “name”. Showing with our present data set,

s1 <- select(data2, sched_dep_time, sched_arr_time, flight)
head(s1)

## sched_dep_time sched_arr_time flight ## 1 515 819 1545 ## 2 529 830 1714 ## 3 540 850 1141 ## 4 545 1022 725 ## 5 600 837 461 ## 6 558 728 1696

Here we extract records of only three columns from the dataset- sched_dep_time, sched_arr_time and flight. There show 327346 records with only 3 variables.

Now if we want to extract all the columns that have column name containing “arr”, then,

s2 <- select(data2, contains("arr"))
head(s2)

## arr_time sched_arr_time arr_delay carrier ## 1 830 819 11 UA ## 2 850 830 20 UA ## 3 923 850 33 AA ## 4 1004 1022 0 B6 ## 5 812 837 0 DL ## 6 740 728 12 UA

This shows the 4 columns that contain "arr": arr_time, sched_arr_time, arr_delay and carrier (note that "carrier" matches too, because it contains the substring "arr").

Similarly, we can use other embedded functions also as mentioned above.

Next, if we want to extract the whole data set, except for the column ‘year’, ‘month’, and ‘day’ in it,

s3 <- select(data2, -year, -month, -day)
head(s3)

## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay ## 1 517 515 2 830 819 11 ## 2 533 529 4 850 830 20 ## 3 542 540 2 923 850 33 ## 4 544 545 0 1004 1022 0 ## 5 554 600 0 812 837 0 ## 6 554 558 0 740 728 12 ## carrier flight tailnum origin dest air_time distance hour minute ## 1 UA 1545 N14228 EWR IAH 227 1400 5 15 ## 2 UA 1714 N24211 LGA IAH 227 1416 5 29 ## 3 AA 1141 N619AA JFK MIA 160 1089 5 40 ## 4 B6 725 N804JB JFK BQN 183 1576 5 45 ## 5 DL 461 N668DN LGA ATL 116 762 6 0 ## 6 UA 1696 N39463 EWR ORD 150 719 5 58 ## time_hour ## 1 2013-01-01 05:00:00 ## 2 2013-01-01 05:00:00 ## 3 2013-01-01 05:00:00 ## 4 2013-01-01 05:00:00 ## 5 2013-01-01 06:00:00 ## 6 2013-01-01 05:00:00

We see that these three columns do not show up in the output. Hence we can work with the rest of the dataset. So we see only 16 variables are there now, as desired.
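The three select() patterns above all have base-R equivalents — a sketch on a tiny made-up frame:

```r
toy <- data.frame(arr_time = 830, sched_arr_time = 819,
                  dep_time = 517, year = 2013)
by_name <- toy[, c("arr_time", "dep_time")]    # named columns, like select(a, b)
by_pat  <- toy[, grepl("arr", names(toy))]     # like select(contains("arr"))
dropped <- toy[, setdiff(names(toy), "year")]  # like select(-year)
```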

This function creates a new column to the existing data frame. The column thus created should essentially be the function of the existing variables in the concerned data frame.

There are a few useful functions often used inside mutate(): arithmetic operators (-, +, *, /), log(), cumulative functions such as cumsum(), cummin(), cummax(), and if_else(), etc.

The syntax being: mutate (table name, derived column name = the calculations with the existing column)

For example, if we want to find the average marks of all the students, mutate (student, avg_marks = (maths_marks + eng_marks)/2)

This will eventually create a new column "avg_marks" containing the average marks of the individual students.

Another feature of this function is that we can drop a particular variable by setting its value as NULL.

Example, mutate (student, address = NULL)

This command will set the address column to NULL, thereby dropping it. In this way, we can drop unrequired variables.
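Both behaviours — deriving a column and dropping one — also exist in plain base R; a sketch with a made-up student frame:

```r
student <- data.frame(maths_marks = c(80, 60), eng_marks = c(70, 90),
                      address = c("x", "y"))
# Derive a column, like mutate(student, avg_marks = ...):
student$avg_marks <- (student$maths_marks + student$eng_marks) / 2
# Assigning NULL drops the column, like mutate(student, address = NULL):
student$address <- NULL
student
```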

Explaining this by using our dataset.

If we want to specify the flights whose ‘arr_delay’ is more than 100 to be “Bad rated flight” and the rest to be “Average rated flight”,

m1 <- mutate(data2, Flight_Remarks = ifelse(arr_delay > 100, "Bad rated flight", "Average rated flight"))
head(m1)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 0 1004 1022 ## 5 2013 1 1 554 600 0 812 837 ## 6 2013 1 1 554 558 0 740 728 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 ## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 ## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 ## 4 0 B6 725 N804JB JFK BQN 183 1576 5 ## 5 0 DL 461 N668DN LGA ATL 116 762 6 ## 6 12 UA 1696 N39463 EWR ORD 150 719 5 ## minute time_hour Flight_Remarks ## 1 15 2013-01-01 05:00:00 Average rated flight ## 2 29 2013-01-01 05:00:00 Average rated flight ## 3 40 2013-01-01 05:00:00 Average rated flight ## 4 45 2013-01-01 05:00:00 Average rated flight ## 5 0 2013-01-01 06:00:00 Average rated flight ## 6 58 2013-01-01 05:00:00 Average rated flight

We see that a new column is added to the dataset, which shows ‘Flight_Remarks’ for individual records. Hence we now get 327346 records for 20 variables.

Now suppose if we want, we can drop the column “time_hour” and work with rest of the columns, as it is just the concatenation of ‘year’, ‘day’ and ‘month’ columns in the same data set. This can be done by,

m2 <- mutate(data2, time_hour = NULL)
head(m2)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 0 1004 1022 ## 5 2013 1 1 554 600 0 812 837 ## 6 2013 1 1 554 558 0 740 728 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 ## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 ## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 ## 4 0 B6 725 N804JB JFK BQN 183 1576 5 ## 5 0 DL 461 N668DN LGA ATL 116 762 6 ## 6 12 UA 1696 N39463 EWR ORD 150 719 5 ## minute ## 1 15 ## 2 29 ## 3 40 ## 4 45 ## 5 0 ## 6 58

As the value of the column “time_hour” is set NULL, it implies that the particular column is dropped. Hence the whole data set, except the column “time_hour” is shown as an output. so, We see 18 variables now.

This function is used to re-order rows according to the variable specified by the user. The default re-arranging pattern is ascending. To make it descending we need to mention desc (). This function also allows group_by () in it, for arranging records according to groups.

The syntax is: arrange (table name, column names by which we want to arrange separated by commas)

For example, if we want student records to be arranged in order of total marks, arrange (student, total_marks)

If we want to order the students according to highest to lowest marks, arrange (table name, desc (total_marks))

Explaining this by using our dataset.

If we arrange the data set in order of the column "distance", it becomes easy to identify which flights are short trips and which are long trips. Also, we can easily see the distances between the origin and destination airports in sorted order,

a1 <- arrange(data2, distance)
head(a1)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 3 2127 2129 0 2222 2224 ## 2 2013 1 4 1240 1200 40 1333 1306 ## 3 2013 1 4 1829 1615 134 1937 1721 ## 4 2013 1 4 2128 2129 0 2218 2224 ## 5 2013 1 5 1155 1200 0 1241 1306 ## 6 2013 1 6 2125 2129 0 2224 2224 ## arr_delay carrier flight tailnum origin dest air_time distance hour ## 1 0 EV 3833 N13989 EWR PHL 30 80 21 ## 2 27 EV 4193 N14972 EWR PHL 30 80 12 ## 3 136 EV 4502 N15983 EWR PHL 28 80 16 ## 4 0 EV 4645 N27962 EWR PHL 32 80 21 ## 5 0 EV 4193 N14902 EWR PHL 29 80 12 ## 6 0 EV 4619 N22909 EWR PHL 22 80 21 ## minute time_hour ## 1 29 2013-01-03 21:00:00 ## 2 0 2013-01-04 12:00:00 ## 3 15 2013-01-04 16:00:00 ## 4 29 2013-01-04 21:00:00 ## 5 0 2013-01-05 12:00:00 ## 6 29 2013-01-06 21:00:00

We see that the records are arranged according to “distance”, but by default in ascending order of the distance amount.

Now if we want the same records but by descending order of the “distance”, to identify the longest route flights easily,

a2 <- arrange(data2, desc(distance))
head(a2)

## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 857 900 0 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 0 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1 0 HA 51 N380HA JFK HNL 659 4983 9
## 2 0 HA 51 N380HA JFK HNL 638 4983 9
## 3 0 HA 51 N380HA JFK HNL 616 4983 9
## 4 0 HA 51 N384HA JFK HNL 639 4983 9
## 5 0 HA 51 N381HA JFK HNL 635 4983 9
## 6 28 HA 51 N385HA JFK HNL 611 4983 9
## minute time_hour
## 1 0 2013-01-01 09:00:00
## 2 0 2013-01-02 09:00:00
## 3 0 2013-01-03 09:00:00
## 4 0 2013-01-04 09:00:00
## 5 0 2013-01-05 09:00:00
## 6 0 2013-01-06 09:00:00

Now we get our desired output.

The summarise() function derives summary statistics from a particular column in a data frame. In other words, it collapses multiple values down to a single value. The function is most useful on grouped data created by group_by(); the output then contains one row per group.

The aggregate functions used in summarise() are: mean(), median(), max(), min(), n(), first(), last(), n_distinct(), etc.

The syntax is:

summarise(table name, aggregate functions of the existing variables, separated by commas)

For example, if we want to know the minimum, maximum and average marks of the student dataset,

summarise (student, min(total_marks), max(total_marks), mean(total_marks))
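Since the student table exists only in prose, here is a self-contained sketch of that call on a hypothetical 'student' data frame (values invented for illustration):

```r
library(dplyr)

# Hypothetical 'student' data frame for illustration
student <- data.frame(total_marks = c(310, 287, 342, 298))

# Collapses the whole column down to one row of three summary values
summarise(student,
          minimum = min(total_marks),
          maximum = max(total_marks),
          average = mean(total_marks))
```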

Explaining this by using our dataset.

Suppose we want to know the maximum, minimum and average ‘distance’ of the flights in 2013,

su1 <- summarise(data2, minimum = min(distance), maximum = max(distance), average = mean(distance))
su1

## minimum maximum average
## 1 80 4983 1048.371

Here we see that three single values are derived from the whole data set, showing the minimum, maximum and average ‘distance’ covered by all the flights in 2013, as desired.

The group_by() function is used when we want to group the dataset with respect to a particular attribute.

From our data set if we want to group the records according to the year first, then the month and then day,

g1 <- group_by(data2, year, month, day)
head(g1)

## # A tibble: 6 x 19
## # Groups: year, month, day [1]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 0. 1004
## 5 2013 1 1 554 600 0. 812
## 6 2013 1 1 554 558 0. 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <fct>, flight <fct>, tailnum <fct>, origin <fct>, dest <fct>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>

Here we get all the records but these are sorted and grouped according to the year first, then the month and then by day.

The pipe operator, %>%, is used mainly when we have multiple operations to execute. Writing each command as a separate statement makes the program look clumsy, so instead we can chain the commands together, connecting them with pipes.

Now working with our data set if we want to group by the data set on the basis of ‘year’, then ‘month’ and then ‘day’. Also, we want to extract only certain fields like, ‘arr_delay’, ‘dep_delay’, ‘flight’, ‘origin’, ‘dest’ and ‘distance’. We can create this using a single command,

p1 <- data2 %>% group_by(year, month, day) %>% select(arr_delay, dep_delay, flight, origin, dest, distance)

head(p1)

## # A tibble: 6 x 9
## # Groups: year, month, day [1]
## year month day arr_delay dep_delay flight origin dest distance
## <int> <int> <int> <dbl> <dbl> <fct> <fct> <fct> <dbl>
## 1 2013 1 1 11. 2. 1545 EWR IAH 1400.
## 2 2013 1 1 20. 4. 1714 LGA IAH 1416.
## 3 2013 1 1 33. 2. 1141 JFK MIA 1089.
## 4 2013 1 1 0. 0. 725 JFK BQN 1576.
## 5 2013 1 1 0. 0. 461 LGA ATL 762.
## 6 2013 1 1 12. 0. 1696 EWR ORD 719.

So we get 327346 records and only 9 variables, just like we wanted.

Suppose the task is to find the number of flights falling under each ‘carrier’.

We can start by grouping the data set on the basis of ‘carrier’, creating a dummy column ‘count’, and then summing that column over every occurrence of each unique ‘carrier’ code.

data3 <- data2
data3$count <- 1
c1 <- data3 %>% group_by(carrier) %>% mutate(Num_of_flights = sum(count))

We get 327346 records showing 21 attributes. The two extra columns shown are for ‘count’ and the new column created by mutate function, ‘Num_of_flights’.

Now we see a lot of repetition in the ‘Num_of_flights’ column. It is not easy to tell exactly how many flights each ‘carrier’ has in total.

So we extract two columns, ‘carrier’ and ‘Num_of_flights’ in a new data, ‘c2’.

c2 <- c1[, c(10, 21)] # column 10 is 'carrier', column 21 is 'Num_of_flights'

We get all the records of only these two above mentioned columns.

Now we keep only the unique rows, so that each ‘carrier’ code appears once,

c2 <- c2[!duplicated(c2),]

Finally, we reach our destination. We see 16 different ‘carrier’ codes and the number of flights under them.

Now, suppose we want to cross-check whether the grouping was done properly. We can add up the number of flights under each ‘carrier’ and see if we get back the original number of records in data2,

sum(c2$Num_of_flights)

## [1] 327346

So the total comes to 327346, exactly the number of records we were originally working with. Hence the grouping is correct.
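The count-per-carrier result above can also be reached in a single pipeline with group_by() followed by summarise(n()), without the dummy ‘count’ column or the duplicate removal. A minimal self-contained sketch, using a toy carrier column invented for illustration (with data2 from above, the same pattern would be `data2 %>% group_by(carrier) %>% summarise(Num_of_flights = n())`):

```r
library(dplyr)

# Toy carrier column standing in for data2$carrier
toy <- data.frame(carrier = c("AA", "UA", "AA", "DL", "UA", "AA"))

# n() counts the rows in each group, so no dummy column is needed
counts <- toy %>%
  group_by(carrier) %>%
  summarise(Num_of_flights = n())
counts
```

The group totals always add back up to the number of input rows, which is exactly the cross-check performed above.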

To conclude, “dplyr” is a very powerful package for calculations and manipulations on data sets, and it can actually make our lives easier.

The post dplyr package in R With Implementation appeared first on StepUp Analytics.

The post Introduction to ggplot2 appeared first on StepUp Analytics.

- It is a package in R.
- An implementation of the grammar of graphics.
- It is a ‘third’ graphics system for R (along with base and lattice), built on top of the grid system.
- Available from CRAN via install.packages().

**Grammar of graphics**: The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn in a specific coordinate system.

- Split the difference between base and lattice
- Automatically deals with spacings, text and titles, but also allows you to annotate by adding to a plot.
- Superficial similarity to lattice but generally easier/more intuitive to use.
- Default mode makes many choices for you(you can customise).

- Works much like plot() in the base graphics system.
- Looks for data in a data frame, similar to lattice, or in the parent environment.
- Plots are made up of aesthetics (size, color, shape) and geoms (points, lines).
- Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled.
- qplot() hides what goes on underneath, which is okay for most operations.
- ggplot() is the core function and is very flexible for doing things which qplot() cannot do.
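To make the relationship between the two concrete, here is a hedged sketch of the same scatterplot built with the core ggplot() call instead of qplot(); the mpg data set ships with ggplot2, so this is self-contained:

```r
library(ggplot2)

# Equivalent of qplot(displ, hwy, data = mpg, color = drv):
# aes() declares the data-to-aesthetic mappings, geoms add the layers
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()
p
```

With ggplot() each layer is added explicitly, which is exactly the flexibility the last bullet refers to.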

library(ggplot2)
str(mpg)

Here is the structure of the built-in data. In this data we will examine the relationship between three variables: “hwy”, “displ” and “drv”. The “drv” variable has three levels: “f”, “r” and “4”. The other variables are characteristics and specifications of the cars. Let’s make a basic qplot() of the “mpg” data between “displ” and “hwy”.

qplot(displ, hwy, data=mpg, color=mpg$displ) # displ = x coordinate, hwy = y coordinate, mpg is the data frame; here I am differentiating the displ variable with color.

Here is the plot: In the above graph we gave the color to *mpg$displ*, which is not valid; since the data argument is given, we should refer to the column name directly. The main purpose with this data is to find the miles per gallon on the highway for each drive type, against the displ variable.

qplot(displ, hwy, data=mpg, color=drv) # color=drv gives a different color to each factor level.

Here is the plot: In the above graph we have separated the **drv factors** by using color. Another thing we can do is add statistics to the plot, by giving a “geom” argument.

qplot(displ, hwy, data=mpg, color=drv, geom=c("point","smooth"))

Here is the plot: We have seen the graphs both with and without the “geom” parameter, and we can clearly decide which one is more understandable. We know that both graphs show the same data, but the graph with the smooth geom makes the trend within each group easier to see. We can also look at a single variable, filled by drive type:

qplot(hwy, data=mpg,fill=drv)

Have a look below: Another feature of ggplot is called **facets**. **Facets are like panels in lattice.** We can create separate plots for subsets of the data indicated by a factor variable, and arrange them as a panel of plots so the subsets can be viewed together. One alternative is to color-code the subsets, like we did before.

qplot(displ, hwy,data=mpg, facets=.~drv)

Here is the plot:

But if we have lots of data points, that can be tricky to look at: the color-coded groups can overlap and may be difficult to tell apart. An easier way is to split the three groups into separate panels, making three separate plots.

qplot(displ, hwy, data=mpg, facets=drv~., binwidth=2, color=drv)

Here is the plot, let’s have a look. **One important thing to remember is the syntax of facets.**

- facets = .~drv
- facets = drv~.

Can you tell the difference between the two forms above? Don’t worry, I will tell you; if you know already, then please ignore it. **facets = .~drv** means there is only one row and more than one column in your graph, where the number of columns depends on the number of factor levels you give to the function. **facets = drv~.** means there is only one column and more than one row, where the number of rows depends on the number of factor levels. I hope you got the basic idea of ggplot2. Please keep in touch; in a few days I will write about advanced ggplot2. This article was originally posted here.
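The two facet orientations can be sketched with the underlying ggplot() calls as well; facet_grid() is ggplot2’s direct counterpart of qplot’s facets argument, and mpg ships with the package:

```r
library(ggplot2)

# facets = . ~ drv : one row, one column per level of drv
p_cols <- ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(. ~ drv)

# facets = drv ~ . : one column, one row per level of drv
p_rows <- ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(drv ~ .)

p_cols
p_rows
```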
