The post Web Scraping Using R From Amazon appeared first on StepUp Analytics.

Web scraping is a technique for converting data present in an unstructured format (HTML tags) on the web to a structured format that can easily be accessed and used.

Almost all the main languages provide ways of performing web scraping. In this article, we’ll use R to scrape data for the most popular feature smartphones of 2019 from the Amazon website.

We’ll get a number of features for each of the 15 popular feature smartphones released in 2019. We’ll also look at the most common problems one might face while scraping data from the internet because of inconsistencies in website code, and at how to solve them. If you are more comfortable using Python, I recommend going through this website to get started with web scraping using Python.

There are several ways of scraping data from the web. Some of the popular ways are:

- **Human copy-paste:** This is a slow and inefficient way of scraping data from the web. It involves humans themselves analyzing the data and copying it to local storage.
- **Text pattern matching:** Another simple yet powerful approach to extracting information from the web is using the regular-expression matching facilities of programming languages. You can learn more about regular expressions on this website.
- **API interface:** Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or private APIs which can be called using standard code to retrieve data in a prescribed format.
- **DOM parsing:** Using web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, from which programs can retrieve parts of these pages.

- **Title:** Name, storage, and color of the top 15 smartphones.
- **Price:** Price of the smartphones.
- **Rating:** Ratings of the smartphones.

Now, let’s get started with scraping the Amazon website for the 15 most popular feature smartphones released in 2019. You can access them on this website.

rvest:- Hadley Wickham authored the rvest package for web scraping in R. rvest is useful in extracting the information you need from web pages.

- In RStudio, click the Tools menu and choose Install Packages.
- Type rvest and click the Install button.

rvest contains the basic web scraping functions, which are quite effective. Using the following functions, we will try to extract the data from websites.

- read_html(url) : scrape HTML content from a given URL
- html_nodes(): identifies HTML wrappers.
- html_nodes(".class"): calls nodes based on CSS class
- html_nodes("#id"): calls nodes based on id
- html_nodes(xpath="xpath"): calls nodes based on an XPath expression (we’ll cover this later)
- html_attrs(): identifies attributes (useful for debugging)
- html_table(): turns HTML tables into data frames
- html_text(): strips the HTML tags and extracts only the text

Let’s implement it and see how it works. We will scrape the Amazon website for a comparison of the top 15 smartphones.

#loading the package:

library('rvest')

#Specifying the URL for the desired website to be scrapped

url <- "https://www.amazon.in/s?k=top+smartphones&dc&crid=3JN9QKV0R5211&sprefix=top+smart%2Caps%2C376&ref=a9_sc_1"

#Reading the html content from Amazon

webpage <- read_html(url)

In this code, we read the HTML content from the given URL and assign it to the `webpage` variable.

Now, as the next step, we will extract the following information from the website:

- **Title:** The title of the product.
- **Price:** The price of the product.
- **Rating:** The user rating of the product.
- **Size:** The size of the product.
- **Color:** The color of the product.

Next, we will make use of HTML tags, like the title of the product and price, for extracting data using Inspect Element. In order to find out the class of the HTML tag, use the following steps:

**Go to the Chrome browser => open the URL => right-click the element => Inspect**

NOTE: If you are not using the Chrome browser, check out this article.

Based on CSS selectors such as class and id, we will scrape the data from the HTML. To find the CSS class for the product title, we need to right-click on the title and select “Inspect” or “Inspect Element”.

As you can see below, I extracted the title of the product with the help of **html_nodes**, to which I passed the CSS class of the title —

#scrape title of the product

title_html <- html_nodes(webpage, '.a-size-medium')

title <- html_text(title_html)

head(title)

The extracted titles still contain extra spaces and newline (\n) characters.

The next step is to remove the newlines with the help of the **gsub** function.

# remove all space and newlines

title <- gsub("\n", "", title)

head(title)

**Price of the product:**

price_html <- html_nodes(webpage, ‘.a-price-whole’)

price <- html_text(price_html)

head(price)

**Output:**

**Rating of the products:**

rating_html <- html_nodes(webpage, ‘.a-icon-alt’)

rating <- html_text(rating_html)

head(rating)

**Output:**

Now we have successfully scraped these features for the 15 most popular feature smartphones released in 2019 from Amazon. Let’s combine them to create a data frame and inspect its structure.
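The combining step might look like the sketch below. Since the live Amazon page changes, the vectors here are hypothetical placeholders standing in for the scraped `title`, `price`, and `rating`:

```r
# Placeholder vectors standing in for the scraped results (hypothetical values)
title  <- c("Phone A (Black, 64 GB)", "Phone B (Blue, 128 GB)")
price  <- c("14,999", "10,999")
rating <- c("4.5 out of 5 stars", "4.2 out of 5 stars")

# Clean the price strings and combine everything into one data frame
price <- as.numeric(gsub(",", "", price))
smartphones <- data.frame(Title = title, Price = price, Rating = rating)
str(smartphones)
```

In practice you may need to trim each scraped vector to the same length (e.g. the first 15 items) before calling `data.frame()`, since `html_nodes()` can match a different number of elements per selector.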

I believe this article has given you a complete understanding of web scraping in R. You now also have a fair idea of the problems you might come across and how to work around them. As most of the data on the web is present in an unstructured format, web scraping is a really handy skill for any data scientist.

Did you enjoy reading this article? Do share your views with me. If you have any doubts or questions, feel free to drop them in the comments below.

To read more about R and its implementation, click here.


The post Missing Value Imputation Techniques In R appeared first on StepUp Analytics.

Let’s look at the three main types of missing values, classified by their pattern of occurrence in a data set.

- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)

**Missing completely at random (MCAR)** occurs when the missing values occur entirely at random and are independent of the other variables in the observation. Here we assume that the variable with missing data is completely unrelated to the other variables or columns in the data. For example:

Suppose that we have a database of school students with 4 columns **Student.Id**, **Name**, **Gender**, and **Number of Subjects**. With the data available we cannot determine the number of subjects for the given missing observation because the missing data is completely independent of the other observations in the data.

An alternative assumption to MCAR is MAR or Missing at Random. It assumes that we can predict the missing value on the basis of other available data.

From the given data we can build a predictive model in which the number of subjects is predicted on the basis of other variables like class and age. So in these cases, we can use some advanced imputation techniques to determine the missing values.

MAR is always a safer assumption than MCAR. This is because any statistical analysis which is performed under the assumption of MCAR is also valid for MAR, but the reverse is not true.

NMAR is also known as *nonignorable missing data*. It is completely different from MCAR or MAR. It is a case where we cannot determine the value of the missing data with any of the advanced imputation techniques. For example, a question in a questionnaire may touch a very sensitive issue and is therefore likely to be skipped by the people filling out the questionnaire, for reasons we don’t know. This is *missing not at random* data.

In the present study, I have used the **iris data set** which is already present in the R software. Though the dataset does not have any missing values, I have introduced missing values randomly into the data set to execute the six most popular methods of missing value treatment.

D <- iris

Saving the dataset in a dataframe “D”.

The data frame has five columns

- Sepal.Length
- Sepal.Width
- Petal.Length
- Petal.Width
- Species

Of these variables, the first four are numeric and the fifth variable “Species” is a factor with three levels.

**NOTE**: I have used only the first column i.e. **Sepal.Length** to explain the imputation techniques.

str(D)

This method is best avoided unless the data is MCAR. We have to check whether deleting the data will affect any of the statistical analysis performed on it. Moreover, it should only be performed if a sufficient amount of data remains after deleting the observations with “NA” values, and deleting them neither introduces bias nor removes a representative part of any variable.

Creating another data set from the original dataset “D”

df <- D

Introducing “NA” values randomly into the dataset.

df <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(cc), replace = TRUE)]))

Determining the number of “NA” values in the data set.

sapply(df,function(x)sum(is.na(x)))

Deleting the observations or rows which have “NA” values.

df <- na.omit(df)

Now, there is also an alternative to this, which is using **na.action = na.omit** directly in the model.
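As a sketch of that alternative, using the built-in iris data with two NA values injected by hand (illustrative only), the model call itself drops the incomplete rows:

```r
# Inject a couple of NA values into a copy of iris (illustrative only)
df_na <- iris
df_na$Sepal.Width[c(3, 7)] <- NA

# na.action = na.omit drops the incomplete rows inside the model call
fit <- lm(Sepal.Length ~ Sepal.Width, data = df_na, na.action = na.omit)
nobs(fit)  # 148 of the 150 rows were used
```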

sapply(df,function(x)sum(is.na(x)))

This method is used only when a certain variable has a very high number of NA values in comparison to other variables. So using the previous method would lead to a loss of too many observations from the dataset. Now here we also need to see whether the given variable is an important predictor of the dependent variable or not. Then decide the better approach to deal with it.

Creating another data set from the original dataset “D”

df1 <- D

Introducing NA values to the first column of the dataset

df1$Sepal.Length[20:140] <- NA

Determining the number of “NA” values in the data set.

sapply(df1,function(x)sum(is.na(x)))

We can see that out of 150 observations, 121 values in the Sepal.Length column are missing.

Deleting the variable Sepal.Length from the dataset.

df1$Sepal.Length <- NULL
# df1 <- df1[, -1]   # another way to do it
sapply(df1, function(x) sum(is.na(x)))

This is a very common technique for replacing NA values. It is often used when there is not much variation in the data or the variable is not an important predictor of the dependent variable. Though one can easily calculate the mean or median value to impute the missing values, this method leads to an artificial reduction of the variation in the dataset.

Moreover, it reduces the standard error which invalidates most hypothesis tests. Also, it introduces a wrong representation of the relationship of the variable with other variables in the dataset.
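A tiny illustration of that artificial variance reduction, on toy numbers rather than the iris data:

```r
x <- c(1, 2, 3, NA, NA, 9)

# Replace the NAs with the mean of the observed values
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var(x, na.rm = TRUE)  # variance of the observed values: ~12.92
var(x_imp)            # smaller (~7.75): the imputed copies add no spread
```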

Creating another data set from the original dataset “D”

df2<-D

Introducing NA values randomly into the dataset.

set.seed(123)
df2 <- as.data.frame(lapply(df2, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE)]))

Here I am saving a copy of the variable Sepal.Length as “**original**”, consisting of the values which have been replaced with NA values in the data set. This is done so that later one can calculate the MAE, RMSE, and MAPE to see the accuracy of the imputation method.

For more details about **MAE**, **RMSE**, and **MAPE**, please open this link.

fn <- ifelse(is.na(df2$Sepal.Length), df2$Sepal.Length, 0)
original <- D$Sepal.Length[is.na(fn)]

Calculating the value of the mean for the variable Sepal.Length and saving it as **predictmean**

predictmean <- round(mean(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df21 <- df2

Replacing the missing values in the Sepal.Length column with the mean value

df21$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmean

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library, saving the accuracy output in **Accuracy_mean**.

library(Metrics)
mae_mean <- mae(original, predictmean)
rmse_mean <- rmse(original, predictmean)
mape_mean <- mape(original, predictmean)
Accuracy_mean <- cbind(mae = mae_mean, rmse = rmse_mean, mape = mape_mean)
Accuracy_mean

Calculating the value of the median for the variable Sepal.Length and saving it as **predictmedian**

predictmedian <- round(median(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df22 <- df2

Replacing the missing values in the Sepal.Length column with the median value

df22$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmedian

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library.

Saving the output of accuracy in **Accuracy_median**

library(Metrics)
mae_median <- mae(original, predictmedian)
rmse_median <- rmse(original, predictmedian)
mape_median <- mape(original, predictmedian)
Accuracy_median <- cbind(mae = mae_median, rmse = rmse_median, mape = mape_median)
Accuracy_median

This method of imputation is used when the missing data is of the MAR type. We build a decision tree model, trained on the observations where the value is present, to predict the missing values. This method can be used to predict both numeric and factor variables. Here we are predicting a numeric variable, so we choose **method = "anova"**; we would write **method = "class"** for a factor variable. It is also important to exclude the missing values from the model by using **na.action = na.omit**.

We have saved the model by the name **fitrpart**

library(rpart)  # tree-based model
fitrpart <- rpart(Sepal.Length ~ ., data = df2[!is.na(df2$Sepal.Length), ],
                  method = "anova", na.action = na.omit)

Here we are predicting the missing values of Sepal.Length with the model and saving it in **predictrpart**

predictrpart<- predict(fitrpart, df2[is.na(df2$Sepal.Length),])

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library.

Saving the output of accuracy in **Accuracy_rpart**

library(Metrics)
mae_rpart <- mae(original, predictrpart)
rmse_rpart <- rmse(original, predictrpart)
mape_rpart <- mape(original, predictrpart)
Accuracy_rpart <- cbind(mae = mae_rpart, rmse = rmse_rpart, mape = mape_rpart)
Accuracy_rpart

The kNN algorithm computes the distance between the data point and its k nearest neighbors using the Euclidean distance in multidimensional space and imputes the missing values with the weighted average of the values taken by the k nearest neighbors.

Things to remember:

1. **k** is the number of nearest neighbors used to find the values of the missing data points.

2. **variable** is the variable consisting of missing values that we choose to impute. We can choose more than one variable with **variable = c("a", "b", "c")**, where a, b, and c are the variables which contain missing values.

Disadvantages of using kNN

- Time-consuming in case of large datasets
- The choice of the value of k is very critical. Higher values may include attributes we don’t need; lower values may miss important attributes of the data.

Executing the algorithm

library(VIM)
df23 <- kNN(df2, variable = "Sepal.Length", k = 6)
df23$Sepal.Length_imp <- NULL

Saving the values of variable Sepal.Length that was imputed using kNN in a vector **predictkNN**

predictkNN <- df23[is.na(df2$Sepal.Length), "Sepal.Length"]

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library, saving the accuracy output in **Accuracy_kNN**.

library(Metrics)
mae_kNN <- mae(original, predictkNN)
rmse_kNN <- rmse(original, predictkNN)
mape_kNN <- mape(original, predictkNN)
Accuracy_kNN <- cbind(mae = mae_kNN, rmse = rmse_kNN, mape = mape_kNN)
Accuracy_kNN

**NOTE:** There is another package, **DMwR**, whose function **knnImputation()** can also be used to do the imputation.

mice, or Multivariate Imputation by Chained Equations, is a package that uses multiple imputation for missing data treatment. As multiple imputation creates multiple predictions for each missing value, it takes into account the uncertainty in the imputation and gives better standard errors. If there is not much information in the data used to fit the model, the imputations will be highly variable, leading to high standard errors in the analysis.

Things to remember:

- **m** is the number of imputations. In order to achieve better estimates of standard errors, more imputations are better; the default value of m is **5**.
- **maxit** is the number of iterations for each imputation. mice uses an iterative algorithm that repeatedly imputes all variables until it reaches convergence.
- **method = "pmm"** refers to the imputation method. In this case we are using **predictive mean matching**.
- **methods(mice)** can be used to see the list of available imputation methods. Continuous data is imputed by predictive mean matching by default, but we could also use **Bayesian linear regression** (**method = "norm"**) for this dataset. Other methods can be used according to the requirements of the data: for example, **method = "logreg"** for binary categorical data (2 levels) and **method = "polyreg"** for categorical data with more than 2 levels.

library(mice)
fitmice <- mice(df2, m = 10, maxit = 30, method = "pmm")
df24 <- complete(fitmice)

Saving the values of variable Sepal.Length that was imputed using Mice in a vector **predictmice**

predictmice <- df24[is.na(df2$Sepal.Length), "Sepal.Length"]

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library, saving the accuracy output in **Accuracy_mice**.

library(Metrics)
mae_mice <- mae(original, predictmice)
rmse_mice <- rmse(original, predictmice)
mape_mice <- mape(original, predictmice)
Accuracy_mice <- cbind(mae = mae_mice, rmse = rmse_mice, mape = mape_mice)
Accuracy_mice

The missForest function is used particularly in the case of mixed-type data. It can impute continuous and categorical data, including complex interactions and non-linear relations. It uses the given data frame to train a random forest model and then uses that model to predict the missing values. It yields an out-of-bag (OOB) imputation error estimate without the need for a test set or elaborate cross-validation.

# Executing the algorithm and saving the output in df25
library(missForest)
fitmissforest <- missForest(df2)
df25 <- fitmissforest$ximp

Saving the values of variable Sepal.Length that was imputed using missForest in a vector **predictmissforest**

predictmissforest <- round(df25[is.na(df2$Sepal.Length), "Sepal.Length"], digits = 1)

For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the **Metrics** library, saving the accuracy output in **Accuracy_missforest**.

library(Metrics)
mae_missforest <- mae(original, predictmissforest)
rmse_missforest <- rmse(original, predictmissforest)
mape_missforest <- mape(original, predictmissforest)
Accuracy_missforest <- cbind(mae = mae_missforest, rmse = rmse_missforest, mape = mape_missforest)
Accuracy_missforest

Creating a data frame **Accuracy** consisting of the accuracy outputs of all **six** methods

Accuracy <- cbind(Methods = c("Mean", "Median", "rpart", "kNN", "mice", "missForest"),
                  rbind.data.frame(Accuracy_mean, Accuracy_median, Accuracy_rpart,
                                   Accuracy_kNN, Accuracy_mice, Accuracy_missforest))
library(ggplot2)

Creating a scatter plot to visualize the best method, i.e. the one with the least **MAPE** (mean absolute percentage error). Similarly, you can draw the plots for **MAE** and **RMSE**.

Scatter plot

# Scatter plot
ggplot(data = Accuracy, aes(x = Methods, y = mape)) + geom_point()

So, from the above diagram, we can see that for the given data set, mice and missForest give us the best output. But we cannot conclude that mice or missForest will be the best method of missing-value treatment for every dataset, because missing values have different types and patterns. So I think the best way is to test the models on the given data and then use the best one to impute the missing values in the data set.

- Missing values imputation with **missMDA**
- Missing values imputation with **Fuzzy K-means Clustering**
- Missing values imputation with …


The post Web Scraping And Analyzing Data Using R appeared first on StepUp Analytics.

If I had been present at home, I could have translated the information delivered in Chinese and then interpreted and clarified his concerns. On a similar note, web scraping plays the same role in the present online, social, and digital world.

On websites and other online sources, different types of commodity information are available, but sometimes we cannot use that information directly for analysis and for drawing valuable insights from the raw data. Web scraping comes to the rescue here: it lets us read the ungrouped data in a grouped format, so that we can build graphical views from the available information and make comparisons across different websites.

Now we will discuss what web scraping is. **Web scraping is a technique for converting data in an unstructured format with HTML tags on the web to a structured format that can easily be accessed, used, and analyzed.**

Often it is not possible to collect such information physically, but we can get it easily through internet surveys or social media. So it is necessary to access the data from the internet.

Some applications of web scraping in real life:

- Comparing different movies, medicines, etc. to analyze the data
- Scraping images from different websites to train image classification
- Scraping data from social media sites such as Facebook and Twitter for sentiment analysis
- Scraping user reviews and feedback from Flipkart, Amazon, etc. for e-commerce purposes
- Using different reviews and web scraping to compare which laptop is better among HP, Dell, etc.
- Reviewing which bus is better for transportation or which hotel is cheaper for a tourist

**Human Copy-Paste:** This is a slow and inefficient way of scraping data from the web. It involves humans copying and pasting the data from different websites.

**Text pattern matching:** Another simple yet powerful approach to extract information from the web is by using regular expression matching facilities of programming languages.

**API Interface:** Many websites like Facebook, Twitter, LinkedIn, etc. provides public and/ or private APIs which can be called using the standard code for retrieving the data in the prescribed format.

**DOM Parsing: **By using the web browsers, programs can retrieve the dynamic content generated by client-side scripts.

There are many software tools for web scraping. Here we will discuss how to scrape data from a website using R and RStudio.

Suppose we search **“mobile”** on **flipkart.com** and get the search results. Now, I want to collect data about the mobiles on that page: their names, prices, and ratings. First we copy the link of the webpage we want to scrape, and then run the R code below,

library(rvest)

url <- 'https://www.flipkart.com/search?q=Mobile&marketplace=FLIPKART&otracker=start&as-show=on&as=off'
webpage <- read_html(url)

name_html <- html_nodes(webpage, '._3wU53n')
names <- html_text(name_html)

price_html <- html_nodes(webpage, '._2rQ-NK')
price <- html_text(price_html)
price <- as.numeric(gsub(x = gsub(x = price, pattern = "\u20b9", replacement = ""),
                         pattern = ",", replacement = ""))

rating_html <- html_nodes(webpage, '._2beYZw')
rating <- html_text(rating_html)[1:24]
rating <- as.numeric(sub(x = rating, pattern = " ★", replacement = ""))

data <- data.frame(Product.description = names, Price = price, Rating = rating)
data

Running the above codes we get the following output:

   Product.description                          Price  Rating
1  Redmi Note 5 Pro (Black, 64 GB)              14999  4.5
2  Asus Zenfone Max Pro M1 (Black, 32 GB)       10999  4.2
3  Redmi Note 5 Pro (Gold, 64 GB)               14999  4.5
4  Samsung Galaxy J6 (Black, 32 GB)             12990  4.4
5  Asus Zenfone Max Pro M1 (Black, 64 GB)       12999  4.3
6  Infinix HOT 6 Pro (Sandstone Black, 32 GB)    7999  4.3
7  Honor 7A (Black, 32 GB)                       8999  4.2
8  Honor 7A (Blue, 32 GB)                        8999  4.2
9  Honor 7A (Gold, 32 GB)                        8999  4.2
10 Redmi Y1 (Grey, 32 GB)                        8999  4.2
11 Redmi 5A (Rose Gold, 16 GB)                   5999  4.4
12 Redmi 5A (Grey, 16 GB)                        5999  4.4
13 Redmi 5A (Gold, 16 GB)                        5999  4.4
14 Samsung Galaxy On6 (Blue, 64 GB)             14490  4.3
15 Redmi 5A (Gold, 32 GB)                        6999  4.4
16 Redmi 5A (Blue, 16 GB)                        5999  4.4
17 Redmi 5A (Rose Gold, 32 GB)                   6999  4.4
18 Redmi 5A (Blue, 32 GB)                        6999  4.4
19 Redmi 5A (Grey, 32 GB)                        6999  4.4
20 Samsung Galaxy J8 (Blue, 64 GB)              18990  4.4
21 Asus Zenfone Max Pro M1 (Grey, 64 GB)        14999  4.5
22 Redmi Note 5 (Gold, 32 GB)                    9999  4.4
23 Asus Zenfone Max Pro M1 (Grey, 32 GB)        10999  4.2
24 Samsung Galaxy On6 (Black, 64 GB)            14490  4.3

Now, one should ask what the second argument of **html_nodes()** is. It is the CSS selector of the element you want to extract. You may think of it as the address of the elements you want to select in the webpage. You can find these CSS selectors by the following method:

- Install “selector gadget” extension to your Google Chrome.
- Now, go to the webpage.
- Then click on the extension.
- Now, just select the one you are interested in.
- You will get the CSS selector in a tab below.

You can visit https://selectorgadget.com/ to understand how “selector gadget” works.

Another question that may arise is why I have used some replacements for the variables “price” and “rating”. Basically, when we scrape data from a webpage, it may not come out in the form we need, so we have to modify it to serve our purpose.
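For instance, the price clean-up above can be seen in isolation, with made-up strings mimicking what the page returns:

```r
# Raw strings as scraped: rupee symbol plus thousands separators
price_raw <- c("\u20b914,999", "\u20b95,999")

# Strip the currency symbol, then the commas, and convert to numeric
price <- as.numeric(gsub(",", "", gsub("\u20b9", "", price_raw)))
price  # 14999 5999
```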

So, this is how we can easily scrape data from a webpage using R. After getting the data, we can analyze it as we need.


The post Beginners Guide to Statistical Cluster Analysis in Detail Part-1 appeared first on StepUp Analytics.

Cluster Analysis can be done by two methods:

- Hierarchical cluster analysis.
- Non-Hierarchical cluster analysis.

**Hierarchical Cluster Analysis(HCA):**

- In HCA, the observation vectors (cases) are grouped together on the basis of their mutual distances.
- An HCA is usually visualised through a hierarchical tree called a dendrogram. This hierarchical tree is a nested set of partitions represented by a tree diagram.

- Sectioning the tree at a particular level produces a partition into **g** disjoint groups.
- If 2 groups are chosen from different partitions, then either the groups are disjoint or one group is totally contained within the other.
- A numerical value is associated with each partition of the tree where branches join together. This value is a measure of distance or dissimilarity between two merged clusters.
- Different distance measures give rise to different hierarchical cluster structures.

**There are two types of approaches for HCA: **

- Agglomerative HCA
- Divisive HCA

**Agglomerative HCA: **

- Operates by successive merges of cases.
- Begins with n clusters, each containing a single case.
- At each stage, merges the 2 most similar groups to form a new cluster, thus reducing the number of clusters by one.
- Continues (as similarity decreases) until all subgroups are fused into one single cluster.

**Divisive HCA: **

- The divisive method operates by successive splitting of groups.
- It initially starts with a single group (i.e. one cluster containing all the objects).
- At each stage a group is divided into 2 subgroups such that the objects in one subgroup are as far as possible from the objects in the other.
- This continues until there are n groups, each containing a single object.

**Note:** The results of both approaches are displayed through the dendrogram.

**Steps Involved in Agglomerative HCA: **

- Start with n clusters, each containing a single object, and an **n×n** symmetric matrix of distances (or similarities) **D = ((d[i×j]))**.
- Search the distance matrix **D** for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters, say U and V, be denoted by **d[u×v]**.
- Merge clusters U and V into a new cluster (U, V) (producing an **(n-1)×(n-1)** matrix), and update the distance matrix by:
  - deleting the rows and columns corresponding to clusters U and V;
  - adding a row and a column giving the distances between the newly formed cluster (U, V) and the remaining clusters.
- Repeat the **second and third** steps a total of **(n-1) times**. Record the identity of the clusters that are merged and the level (distance or dissimilarity) at which they are merged.
- Construct the dendrogram from the information on mergers and merger levels.

**Possible distance measures between two clusters:**

- Single linkage: minimum distance, or the nearest-neighbour approach. The distance between two clusters is the smallest distance d(i, j) with i ∈ k1 and j ∈ k2. For example, the single-linkage distance between clusters (1, 2) and (3, 4, 5) is the minimum of d(1,3), d(1,4), d(1,5), d(2,3), d(2,4), d(2,5).

**Under the single linkage approach, min d[(1,2),(3,4,5)] = d(2,5)**

Here is an example of single linkage, attached as a PDF.


- Complete linkage: maximum distance between clusters. For the same clusters (1, 2) and (3, 4, 5), we compare d(1,3), d(1,4), d(1,5) | d(2,3), d(2,4), d(2,5), and the complete-linkage distance is the largest of them: here, d(1,4).

Here is the complete linkage example, attached as a PDF.


- Average linkage: average distance between clusters. For clusters of sizes n1 and n2, the average-linkage distance is (1/(n1·n2)) ∑ d(i,j), summed over all pairs with i in one cluster and j in the other; for the clusters (1, 2) and (3, 4, 5) above, this is (1/6) ∑ d(i,j) over the six pairwise distances.

- Centroid Linkage: Distance between centroids of two clusters.
- Median linkage: Distance between the median of two clusters.
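The linkage options above map directly onto the method argument of R’s built-in hclust() function. Here is a minimal sketch on the iris measurements (my own illustrative choice of data, not from the article):

```r
# Euclidean distances between the 150 iris observations
d <- dist(iris[, 1:4])

# Agglomerative HCA under three of the linkage rules discussed above
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

plot(hc_complete)                      # draw the dendrogram
groups <- cutree(hc_complete, k = 3)   # section the tree into g = 3 disjoint groups
table(groups)
```

Different method values generally produce different trees, which is exactly the point made above about distance measures.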

Hierarchical cluster analysis ends here; in the next tutorial article, I will explain non-hierarchical cluster analysis.

Till then stay tuned and keep visiting for learning tutorials which you won’t get anywhere.

If you have any doubts, please mention them in the comments or shoot me an email at irrfankhann29@gmail.com.

This article was originally posted here.


The post 50 R Language and R Studio Tips appeared first on StepUp Analytics.

Here are 50 R and RStudio tips that we hope will be useful throughout your journey in R. We hope some of these are new to you and will enhance your R skills.

If you wish to keep up with more of these, follow us on **Facebook**, where you will find more of these tips, along with blogs on analytics and data-science-related updates.

**1.** Never use as.numeric() to convert a factor variable to numeric, instead use as.numeric(as.character(myFactorVar)).

**2.** options(show.error.messages=F) turns printing error messages off.

**3.** Use file.path() to create file paths. It works independently of the OS platform.

**4.** mixedsort() from gtools package sorts strings with embedded numbers so even the numbers are in correct order. This is not achieved by regular sort() function.

**5.** Use ylim = range(myNumericData) + 10 as an argument in the plot() function to set and adjust the Y-axis limits in your plot.

**6.** Use the las parameter in your plot() to customise the orientation of axis labels. Accepted values are {0, 1, 2, 3} for {parallel to axis, horizontal, perpendicular to axis, vertical}.

**7.** Use memory.limit (size=2500), where the size is in MB, to manage the maximum memory allocated for R on a Windows machine.

**8.** Use alarm() to produce a short beep sound at the end of your script to notify that the run has completed.

**9.** eval(parse(text = paste("a <- 10"))) will create a new variable ‘a’ and assign the value 10 to it. It executes your strings as if they were R statements.

**10.** sessionInfo() gets the version information about current R session and attached or loaded packages.

**11.** Compute the number of changes in characters required to convert one word to another using adist(word1, word2).
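For example, on the classic edit-distance pair:

```r
# Levenshtein distance: two substitutions (k->s, e->i) plus one insertion
adist("kitten", "sitting")  # 3
```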

**12.** options(max.print=1000000) sets the max no. of lines printable in console. Adjust this if you want to see more lines.

**13.** Introducing practical and robust anomaly detection in a time series: https://blog.twitter.com/2015/introducing-practical-and-robust-anomalydetection-in-a-time-series

**14.** Two R sessions running simultaneously are guaranteed to have unique IDs. Get the ID of the current R session using Sys.getpid().

**15.** Remove the names attributes from an R object using the unname() function.

**16.** Check if two R objects are same with identical(x,y). Use all.equal() to test if values are equal.
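A quick illustration of the difference, using floating-point arithmetic:

```r
x <- 0.1 + 0.2
y <- 0.3
identical(x, y)          # FALSE: bit-for-bit comparison
isTRUE(all.equal(x, y))  # TRUE: equal within a numeric tolerance
```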

**17.** Use withTimeout() function from R.utils package to interrupt functions if run time exceeds a preset time limit and move to next step.

**18.** Use dist() to compute the distance between rows of a matrix.

**19.** Use diff() to calculate lagged and iterated differences of a numeric vector.

**20.** Turn off printing scientific notation, such as 1e-5, in output using options(scipen=999)

**21.** bagEarth() from the earth package performs bagged MARS (Multivariate Adaptive Regression Splines)

**22.** setClass('myClass') will define a new user-defined class called 'myClass'. Use setAs() for further customisation.

**23.** assign("varName", 10) is a convenient way to create numerous variables, as the variable name can be passed as a programmable string.

**24.** dim(matrix) returns the number of rows and columns.

**25.** data.matrix() converts a data frame to a numeric matrix. Factors will be converted to appropriate numeric values.

**26.** Use invisible(..) to suppress printing the output to console. Widely used from within functions.

**27.** cat(“\014”) clears the R Console in Windows.

**28.** dir('folder path') shows the files in 'folder path'. Works much the same way as in the Windows cmd prompt.

**29.** Make missing values in a factor variable as another category in one line using: levels(Var) <- c(levels(Var), "UNKNOWN")

**30.** Initialise all required packages in one line: lapply(x, require, character.only=T), where x is a character vector of all required package names

**31.** rev(x) reverses the elements of x

**32.** Use complete.cases() to get the rows which are complete (with no missing values)
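A small sketch on a hypothetical data frame:

```r
d <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(d)       # TRUE FALSE FALSE
d[complete.cases(d), ]  # keeps only the fully observed first row
```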

**33.** avNNet() from the caret pkg to implement a Model Averaged Neural Network

**34.** file.remove('filepath') removes the file from the directory. Use this wisely to delete multiple files, especially in repetitive tasks.

**35.** Use ada() in the ada pkg to implement Boosted classification trees.

**36.** Use unclass() on objects like ‘lm’ to break it down to a ‘list’. Makes it easier to access un-printed elements this way.

**37.** Sort a data frame based on 2 columns together: df[order(df$col1, df$col2), ]

**38.** Convert One ‘N-level factor var’ to N ‘binary-predictor-vars’ with model.matrix(~as.factor(Data)+0).
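A sketch with a hypothetical 3-level variable: dropping the intercept (`+ 0`) makes model.matrix() emit one 0/1 column per level.

```r
size <- c("S", "M", "S", "L")
bins <- model.matrix(~ as.factor(size) + 0)
dim(bins)      # 4 rows, one binary column per level (3 columns)
rowSums(bins)  # exactly one 1 per row
```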

**39.** Use seasadj() to de-seasonalize a time series. http://goo.gl/Oio7s2

**40.** Use the <<- operator instead of <- to assign a value to a variable that exists outside the function from which it is called.

**41.** Set the memory size R uses with memory.limit(size=desired-size) on the Windows platform. On other platforms, use mem.limits()

**42.** Use file.copy(from=fromFile, to=toFile, overwrite=TRUE) to copy files with R; works even between connected servers.

**43.** Use debugonce() to run through debug step only once, instead of debug() which requires undebug() to come out of it.

**44.** Convert an R Factor Variable To A Collection of Multiple 1/0 Binary Vars: bins <- model.matrix(~ 0 + varName, data). Highly useful in regression modelling.

**45.** discretize() from arules pkg is a convenient function to convert continuous variables to categorical. It has convenient split criteria options.

**46.** NROW() is similar to the nrow() function but works even on a vector, treating it as a 1-column matrix. You can safely use it in place of the length() function.
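The difference in one glance:

```r
v <- 1:5
nrow(v)  # NULL -- a plain vector has no dim attribute
NROW(v)  # 5   -- treated as a 1-column matrix
```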

**47.** commandArgs() returns the cmd line arguments passed with R script run from command. http://bit.ly/1yARCWj

**48.** Use attr(myFunc, "AttrName") <- myVal within the function; it remembers the "AttrName" attribute in the next call.

**49.** Use object.size() to estimate the memory a given R object consumes in bytes.

**50.** Use ls.str() (over ls()) to see structural details of objects when working on large R projects.

Follow us on **YouTube** for more updates.


The post Data Mining: Market Basket Analysis in R appeared first on StepUp Analytics.

]]>The term ‘E-commerce’ is well known to all of us. Well, it means trade and business through the means of the internet, popularly known as ‘online shopping’. Nowadays, retailers who traditionally used to sell their products strictly in ‘Brick-And-Mortar’ stores, resort to the online display of their products and hence facilitate online purchase of their products through various platforms.

In doing so, both the customers and the sellers benefit. The customers can search for their desired products and compare prices online, whereas the sellers can effortlessly conduct their merchandise trade in a cost-effective and intelligent manner.

The biggest perk of having an online presence for a seller is that it enables them to correct past mistakes in their business policies, merely by looking at the recorded sales data and understanding customer behaviour at large.

However, generating this data and delving deep into it for useful insights is a non-trivial task that requires a scientific algorithm. One such widely used algorithm is the 'Apriori' algorithm. The catch is that such algorithms require trained marketing analysts to execute them and draw inferences from them.

This is where the term 'Market Basket Analysis' comes in. Nowadays it is a very common procedure, performed not only by online retailers but also by sellers who prefer to sell in physical 'Brick-and-Mortar' stores.

The term ‘market-basket’ implies any consumption bundle taken up by the customers for final purchase.

However, such bundles do not necessarily consist of multiple units of the same product; they also cover the case of a customer buying several different product items in the same go, which together make up his 'market-basket'.

Market basket analysis may provide the retailer with information to understand the purchase behaviour of a buyer. This information will enable the retailer to understand the buyer’s needs and rewrite the store’s layout accordingly, develop cross-promotional programs, or even capture new buyers, all of which are necessary to survive in the market.

Most relevant and well-known examples include ‘Amazon’, ‘Flipkart’, ‘Ebay’, etc.

**Terminology:**

**Items** are the objects that we are identifying associations between. For an online retailer, each item is a product in the shop. For a publisher, each item might be an article, a blog post, a video etc. A group of items is an item set.

**Transactions** are instances of groups of items occurring together. For an online retailer, a transaction is generally a monetary transaction. For a publisher, a transaction might be the group of articles read in a single visit to the website. (It is up to the analyst to define over what period to measure a transaction.) For each transaction, then, we have an item set.

**Rules** are statements of the form {i_1, i_2, …} => {i_k}, i.e. if a transaction contains the items in the item set on the left-hand side (LHS) of the rule, {i_1, i_2, …}, then it is likely that it will also contain the item on the right-hand side (RHS), {i_k}.

The output of a market basket analysis is generally a set of rules, that we can then exploit to make business decisions (related to marketing or product placement, for example).

The **support** of an item or an item set is the fraction of transactions in our data set that contain that item or item set. In general, it is nice to identify rules that have a high support, as these will be applicable to a large number of transactions. For supermarket retailers, this is likely to involve basic products that are popular across an entire user base (e.g. bread, milk).

A printer cartridge retailer, for example, may not have products with a high support, because each customer only buys cartridges that are specific to his / her own printer.

The **confidence** of a rule is the likelihood that it holds for a new transaction that contains the items on the LHS of the rule (i.e. it is the probability that the transaction also contains the item(s) on the RHS). Formally: confidence(LHS => RHS) = support(LHS ∪ RHS) / support(LHS).

The **lift** of a rule is the ratio of the observed support of the items on the LHS co-occurring with the items on the RHS, to the support expected if the two were independent. Formally: lift(LHS => RHS) = support(LHS ∪ RHS) / (support(LHS) × support(RHS)).

If the lift is greater than 1, it suggests that the presence of the items on the LHS has increased the probability that the items on the RHS will occur on this transaction. If the lift is below 1, it suggests that the presence of the items on the LHS make the probability that the items on the RHS will be part of the transaction, lower.

If the lift is 1, it suggests that the presence of items on the LHS and RHS really are independent, i.e. knowing that the items on the LHS are present makes no difference to the probability that items will occur on the RHS.

When we perform market basket analysis, then, we are looking for rules with a lift of more than 1.
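To make the three measures concrete, here is a worked sketch on a hypothetical five-transaction data set (the item names and counts are invented for illustration):

```r
# Hypothetical toy data: five transactions
trans <- list(c("bread", "milk"),
              c("bread", "butter"),
              c("bread", "milk", "butter"),
              c("milk"),
              c("bread", "milk"))
n <- length(trans)
# number of transactions containing every item in `items`
freq <- function(items) sum(sapply(trans, function(t) all(items %in% t)))

support    <- freq(c("bread", "milk")) / n              # 3/5 = 0.6
confidence <- freq(c("bread", "milk")) / freq("bread")  # 3/4 = 0.75
lift       <- confidence / (freq("milk") / n)           # 0.75 / 0.8 = 0.9375
```

Here {bread} => {milk} has fairly high confidence, yet its lift is below 1: milk is so common on its own that seeing bread actually lowers the chance of milk slightly.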

Rules with higher confidence are ones where the probability of an item appearing on the RHS is high given the presence of the items on the LHS. It is also preferable to action rules that have a high support – as these will be applicable to a larger number of transactions.

However, in the case of long-tail retailers, this may not be possible. In practice, maximizing support and confidence at the same time is often not possible. In businesses dealing with products in relatively low demand, it is advisable to maximize confidence while maintaining support at an acceptable threshold level.

The following steps take us through the analytical process of Market Basket Analysis using R:

At first, we read the data set on transactions.

The name of the required data set in my analysis is “AprioriTransactionsReduced.csv”, i.e. a CSV file.

If anyone needs to get access to this data set, get it from the link below.

Data Set – AprioriTransactionsReduced.csv

We now set the file path and then import the csv file in R. After importing the data file we look at its initial structure.

setwd("G:/StepUpAnalytics.com/Arka")
df <- read.csv("AprioriTransactionsReduced.csv")
str(df)

## 'data.frame': 40236 obs. of 2 variables:
## $ Invoice: Factor w/ 10246 levels "110002","110003",..: 294 637 637 637 822 2814 2814 2965 3173 5232 ...
## $ Item : Factor w/ 6822 levels "001-0012","001-0013",..: 4579 4719 4827 5217 23 4719 4827 23 4 23 ...

We sort the data set by the ascending order of the ‘Invoice’ and have a brief look at the sorted data set.

df_sorted <- df[order(df$Invoice),]

head(df_sorted)

## Invoice Item
## 24 110002 340-7800
## 25 110003 086-0604
## 26 110003 086-1568
## 27 110003 086-3126
## 28 110003 138-0116
## 29 110003 352-4505

tail(df_sorted)

## Invoice Item
## 40059 41102 084-3926
## 40060 41102 404-0420
## 40061 41102 404-0438
## 40062 41102 436-1265
## 40063 41102 436-1275
## 40064 INVOICE Part

We convert ‘Invoice’ to numeric and check its data nature.

df_sorted$Invoice <- as.numeric(df_sorted$Invoice)
class(df_sorted$Invoice)

## [1] "numeric"

We then convert Item to categorical format and look at its data type and structure.

df_sorted$Item <- as.factor(df_sorted$Item)
class(df_sorted$Item)

## [1] "factor"

str(df_sorted)

## 'data.frame': 40236 obs. of 2 variables:
## $ Invoice: num 1 2 2 2 2 2 2 2 3 4 ...
## $ Item : Factor w/ 6822 levels "001-0012","001-0013",..: 4863 2019 2042 2107 2935 4935 5974 5987 6041 1 ...

Now, we convert the data frame to transaction format using ddply(), grouping all the items that were bought together by the same customer under the same invoice.

library(plyr)
df_itemList <- ddply(df, 'Invoice', function(df1) paste(df1$Item, collapse = ","))
head(df_itemList)

## Invoice V1
## 1 110002 340-7800
## 2 110003 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3 110004 471-2812
## 4 110005 001-0012
## 5 110006 053-9626,110-4368
## 6 110007 148-0294,148-0736,452-1691,462-3070,462-5406,462-5726
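The same grouping can also be done without plyr; a base-R sketch using aggregate(), shown on a tiny hypothetical data frame with the same two-column layout:

```r
# Hypothetical two-row-per-invoice data, mimicking the article's layout
df_small <- data.frame(Invoice = c(1, 1, 2),
                       Item = c("086-0604", "086-1568", "471-2812"))
# collapse each invoice's items into one comma-separated string
itemList <- aggregate(Item ~ Invoice, data = df_small,
                      FUN = function(x) paste(x, collapse = ","))
itemList$Item  # "086-0604,086-1568" "471-2812"
```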

Now, we remove the column ‘Invoice’.

df_itemList$Invoice <- NULL
head(df_itemList)

## V1
## 1 340-7800
## 2 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3 471-2812
## 4 001-0012
## 5 053-9626,110-4368
## 6 148-0294,148-0736,452-1691,462-3070,462-5406,462-5726

Next, we have to rename the only column head left in the data set.

colnames(df_itemList) <- c("Item List")
head(df_itemList)

## Item List
## 1 340-7800
## 2 086-0604,086-1568,086-3126,138-0116,352-4505,462-5602,462-5724
## 3 471-2812
## 4 001-0012
## 5 053-9626,110-4368
## 6 148-0294,148-0736,452-1691,462-3070,462-5406,462-5726

We now export the data set to be worked upon to a csv format file, for a back up.

write.csv(df_itemList,"ItemList.csv", quote = FALSE, row.names = TRUE)

We load the packages required.

library(arules)

## Warning: package 'arules' was built under R version 3.4.2

## Loading required package: Matrix

##
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
##
## abbreviate, write

We now convert the csv file to basket format and inspect it.

txn <- read.transactions(file = "ItemList.csv", rm.duplicates = FALSE,
                         format = "basket", sep = ",", cols = 1)

## Warning in asMethod(object): removing duplicated items in transactions

inspect(head(txn))

## items transactionID
## [1] {Item List}
## [2] {340-7800} 1
## [3] {086-0604,
## 086-1568,
## 086-3126,
## 138-0116,
## 352-4505,
## 462-5602,
## 462-5724} 2
## [4] {471-2812} 3
## [5] {001-0012} 4
## [6] {053-9626,
## 110-4368} 5

For our convenience, we remove the quotes from transactions.

txn@itemInfo$labels <- gsub("\"", "", txn@itemInfo$labels)

Finally, the next step is to run the apriori algorithm.

basket_rules1 <- apriori(txn,parameter = list(sup = 0.001, conf = 0.8, target="rules"))

## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 10
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6823 item(s), 10247 transaction(s)] done [0.01s].
## sorting and recoding items ... [855 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(txn, parameter = list(sup = 0.001, conf = 0.8, target =
## "rules")): Mining stopped (maxlen reached). Only patterns up to a length of
## 10 returned!

## done [0.00s].
## writing ... [37190 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].

basket_rules2 <- apriori(txn,parameter = list(sup = 0.01, conf = 0.25, target="rules"))

## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 102
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6823 item(s), 10247 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].

Note: Here we have created 2 different basket rules. One with high confidence and low support and the other with high support and low confidence.

Now we view the rules that we have created.

summary(basket_rules1)

## set of 37190 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 15 866 3630 7246 9433 8422 5126 2002 450
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 5.00 6.00 6.25 7.00 10.00
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001073 Min. :0.8000 Min. : 43.72 Min. :11.00
## 1st Qu.:0.001073 1st Qu.:0.9167 1st Qu.: 86.18 1st Qu.:11.00
## Median :0.001073 Median :1.0000 Median :121.99 Median :11.00
## Mean :0.001150 Mean :0.9569 Mean :134.11 Mean :11.78
## 3rd Qu.:0.001171 3rd Qu.:1.0000 3rd Qu.:167.40 3rd Qu.:12.00
## Max. :0.004294 Max. :1.0000 Max. :640.44 Max. :44.00
##
## mining info:
## data ntransactions support confidence
## txn 10247 0.001 0.8

summary(basket_rules2)

## set of 2 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift count
## Min. :0.0122 Min. :0.6477 Min. :35.87 Min. :125
## 1st Qu.:0.0122 1st Qu.:0.6547 1st Qu.:35.87 1st Qu.:125
## Median :0.0122 Median :0.6617 Median :35.87 Median :125
## Mean :0.0122 Mean :0.6617 Mean :35.87 Mean :125
## 3rd Qu.:0.0122 3rd Qu.:0.6687 3rd Qu.:35.87 3rd Qu.:125
## Max. :0.0122 Max. :0.6757 Max. :35.87 Max. :125
##
## mining info:
## data ntransactions support confidence
## txn 10247 0.01 0.25

We now convert the basket rules to data frames and view them. Also, we apply suitable transformations to the 'confidence' and 'support' columns.

df_basket1 <- as(basket_rules1, "data.frame")
head(df_basket1)

## rules support confidence lift count
## 1 {300-3512} => {300-3535} 0.001073485 1.0000000 365.9643 11
## 2 {036-8303} => {036-8304} 0.001073485 1.0000000 640.4375 11
## 3 {300-3532} => {300-3535} 0.001366254 0.8750000 320.2188 14
## 4 {161-1271} => {339-6516} 0.001366254 0.8235294 122.3001 14
## 5 {338-4760} => {338-4715} 0.001366254 0.9333333 222.4155 14
## 6 {338-4760} => {339-6516} 0.001268664 0.8666667 128.7063 13

tail(df_basket1)

## rules
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-2684}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7603}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-0526}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-1480}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510} => {552-9860}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7620}
## support confidence lift count
## 37185 0.001073485 1.0000000 121.9881 11
## 37186 0.001073485 1.0000000 129.7089 11
## 37187 0.001073485 1.0000000 133.0779 11
## 37188 0.001073485 1.0000000 150.6912 11
## 37189 0.001073485 0.9166667 138.1336 11
## 37190 0.001073485 1.0000000 204.9400 11

df_basket1$confidence <- df_basket1$confidence * 100
df_basket1$support <- df_basket1$support * nrow(df)
head(df_basket1)

## rules support confidence lift count
## 1 {300-3512} => {300-3535} 43.19274 100.00000 365.9643 11
## 2 {036-8303} => {036-8304} 43.19274 100.00000 640.4375 11
## 3 {300-3532} => {300-3535} 54.97258 87.50000 320.2188 14
## 4 {161-1271} => {339-6516} 54.97258 82.35294 122.3001 14
## 5 {338-4760} => {338-4715} 54.97258 93.33333 222.4155 14
## 6 {338-4760} => {339-6516} 51.04596 86.66667 128.7063 13

tail(df_basket1)

## rules
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-2684}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7603}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-0526}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-1480}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510} => {552-9860}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860} => {092-7620}
## support confidence lift count
## 37185 43.19274 100.00000 121.9881 11
## 37186 43.19274 100.00000 129.7089 11
## 37187 43.19274 100.00000 133.0779 11
## 37188 43.19274 100.00000 150.6912 11
## 37189 43.19274 91.66667 138.1336 11
## 37190 43.19274 100.00000 204.9400 11

df_basket2 <- as(basket_rules2, "data.frame")
head(df_basket2)

## rules support confidence lift count
## 1 {462-5406} => {462-5726} 0.01219869 0.6756757 35.87383 125
## 2 {462-5726} => {462-5406} 0.01219869 0.6476684 35.87383 125

tail(df_basket2)

## rules support confidence lift count
## 1 {462-5406} => {462-5726} 0.01219869 0.6756757 35.87383 125
## 2 {462-5726} => {462-5406} 0.01219869 0.6476684 35.87383 125

df_basket2$confidence <- df_basket2$confidence * 100
df_basket2$support <- df_basket2$support * nrow(df)
head(df_basket2)

## rules support confidence lift count
## 1 {462-5406} => {462-5726} 490.8266 67.56757 35.87383 125
## 2 {462-5726} => {462-5406} 490.8266 64.76684 35.87383 125

tail(df_basket2)

## rules support confidence lift count
## 1 {462-5406} => {462-5726} 490.8266 67.56757 35.87383 125
## 2 {462-5726} => {462-5406} 490.8266 64.76684 35.87383 125

We split lhs and rhs into two columns.

library(reshape2)

df_basket_split1 <- transform(df_basket1,
                              rules = colsplit(rules, pattern = " => ", names = c("lhs", "rhs")))
head(df_basket_split1)

## rules.lhs rules.rhs support confidence lift count
## 1 {300-3512} {300-3535} 43.19274 100.00000 365.9643 11
## 2 {036-8303} {036-8304} 43.19274 100.00000 640.4375 11
## 3 {300-3532} {300-3535} 54.97258 87.50000 320.2188 14
## 4 {161-1271} {339-6516} 54.97258 82.35294 122.3001 14
## 5 {338-4760} {338-4715} 54.97258 93.33333 222.4155 14
## 6 {338-4760} {339-6516} 51.04596 86.66667 128.7063 13

tail(df_basket_split1)

## rules.lhs
## 37185 {092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37186 {092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37187 {092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37188 {092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860}
## 37189 {092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510}
## 37190 {092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860}
## rules.rhs support confidence lift count
## 37185 {092-2684} 43.19274 100.00000 121.9881 11
## 37186 {092-7603} 43.19274 100.00000 129.7089 11
## 37187 {092-0526} 43.19274 100.00000 133.0779 11
## 37188 {092-1480} 43.19274 100.00000 150.6912 11
## 37189 {552-9860} 43.19274 91.66667 138.1336 11
## 37190 {092-7620} 43.19274 100.00000 204.9400 11

df_basket_split2 <- transform(df_basket2,
                              rules = colsplit(rules, pattern = " => ", names = c("lhs", "rhs")))
head(df_basket_split2)

## rules.lhs rules.rhs support confidence lift count
## 1 {462-5406} {462-5726} 490.8266 67.56757 35.87383 125
## 2 {462-5726} {462-5406} 490.8266 64.76684 35.87383 125

tail(df_basket_split2)

## rules.lhs rules.rhs support confidence lift count
## 1 {462-5406} {462-5726} 490.8266 67.56757 35.87383 125
## 2 {462-5726} {462-5406} 490.8266 64.76684 35.87383 125

Next, we remove curly brackets around the rules.

df_basket_split1$rules$lhs <- gsub("\\{|\\}", "", df_basket_split1$rules$lhs)
df_basket_split1$rules$rhs <- gsub("\\{|\\}", "", df_basket_split1$rules$rhs)
head(df_basket_split1)

## rules.lhs rules.rhs support confidence lift count
## 1 300-3512 300-3535 43.19274 100.00000 365.9643 11
## 2 036-8303 036-8304 43.19274 100.00000 640.4375 11
## 3 300-3532 300-3535 54.97258 87.50000 320.2188 14
## 4 161-1271 339-6516 54.97258 82.35294 122.3001 14
## 5 338-4760 338-4715 54.97258 93.33333 222.4155 14
## 6 338-4760 339-6516 51.04596 86.66667 128.7063 13

tail(df_basket_split1)

## rules.lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
## rules.rhs support confidence lift count
## 37185 092-2684 43.19274 100.00000 121.9881 11
## 37186 092-7603 43.19274 100.00000 129.7089 11
## 37187 092-0526 43.19274 100.00000 133.0779 11
## 37188 092-1480 43.19274 100.00000 150.6912 11
## 37189 552-9860 43.19274 91.66667 138.1336 11
## 37190 092-7620 43.19274 100.00000 204.9400 11

df_basket_split2$rules$lhs <- gsub("\\{|\\}", "", df_basket_split2$rules$lhs)
df_basket_split2$rules$rhs <- gsub("\\{|\\}", "", df_basket_split2$rules$rhs)
head(df_basket_split2)

## rules.lhs rules.rhs support confidence lift count
## 1 462-5406 462-5726 490.8266 67.56757 35.87383 125
## 2 462-5726 462-5406 490.8266 64.76684 35.87383 125

tail(df_basket_split2)

## rules.lhs rules.rhs support confidence lift count
## 1 462-5406 462-5726 490.8266 67.56757 35.87383 125
## 2 462-5726 462-5406 490.8266 64.76684 35.87383 125

Next, we convert the rules to character format to make them presentable.

df_basket_split1$rules$lhs <- as.character(df_basket_split1$rules$lhs)
df_basket_split1$rules$rhs <- as.character(df_basket_split1$rules$rhs)
head(df_basket_split1)

## rules.lhs rules.rhs support confidence lift count
## 1 300-3512 300-3535 43.19274 100.00000 365.9643 11
## 2 036-8303 036-8304 43.19274 100.00000 640.4375 11
## 3 300-3532 300-3535 54.97258 87.50000 320.2188 14
## 4 161-1271 339-6516 54.97258 82.35294 122.3001 14
## 5 338-4760 338-4715 54.97258 93.33333 222.4155 14
## 6 338-4760 339-6516 51.04596 86.66667 128.7063 13

tail(df_basket_split1)

## rules.lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
## rules.rhs support confidence lift count
## 37185 092-2684 43.19274 100.00000 121.9881 11
## 37186 092-7603 43.19274 100.00000 129.7089 11
## 37187 092-0526 43.19274 100.00000 133.0779 11
## 37188 092-1480 43.19274 100.00000 150.6912 11
## 37189 552-9860 43.19274 91.66667 138.1336 11
## 37190 092-7620 43.19274 100.00000 204.9400 11

str(df_basket_split1)

## 'data.frame': 37190 obs. of 5 variables:
## $ rules :'data.frame': 37190 obs. of 2 variables:
## ..$ lhs: chr "300-3512" "036-8303" "300-3532" "161-1271" ...
## ..$ rhs: chr "300-3535" "036-8304" "300-3535" "339-6516" ...
## $ support : num 43.2 43.2 55 55 55 ...
## $ confidence: num 100 100 87.5 82.4 93.3 ...
## $ lift : num 366 640 320 122 222 ...
## $ count : num 11 11 14 14 14 13 27 20 18 23 ...

df_basket_split2$rules$lhs <- as.character(df_basket_split2$rules$lhs)
df_basket_split2$rules$rhs <- as.character(df_basket_split2$rules$rhs)
head(df_basket_split2)

## rules.lhs rules.rhs support confidence lift count
## 1 462-5406 462-5726 490.8266 67.56757 35.87383 125
## 2 462-5726 462-5406 490.8266 64.76684 35.87383 125

tail(df_basket_split2)

str(df_basket_split2)

## 'data.frame': 2 obs. of 5 variables:
## $ rules :'data.frame': 2 obs. of 2 variables:
## ..$ lhs: chr "462-5406" "462-5726"
## ..$ rhs: chr "462-5726" "462-5406"
## $ support : num 491 491
## $ confidence: num 67.6 64.8
## $ lift : num 35.9 35.9
## $ count : num 125 125

Now, we create a copy of the basket outputs.

df_basket_output1 <- df_basket_split1
head(df_basket_output1)

## rules.lhs rules.rhs support confidence lift count
## 1 300-3512 300-3535 43.19274 100.00000 365.9643 11
## 2 036-8303 036-8304 43.19274 100.00000 640.4375 11
## 3 300-3532 300-3535 54.97258 87.50000 320.2188 14
## 4 161-1271 339-6516 54.97258 82.35294 122.3001 14
## 5 338-4760 338-4715 54.97258 93.33333 222.4155 14
## 6 338-4760 339-6516 51.04596 86.66667 128.7063 13

str(df_basket_output1)

## 'data.frame': 37190 obs. of 5 variables:
## $ rules :'data.frame': 37190 obs. of 2 variables:
## ..$ lhs: chr "300-3512" "036-8303" "300-3532" "161-1271" ...
## ..$ rhs: chr "300-3535" "036-8304" "300-3535" "339-6516" ...
## $ support : num 43.2 43.2 55 55 55 ...
## $ confidence: num 100 100 87.5 82.4 93.3 ...
## $ lift : num 366 640 320 122 222 ...
## $ count : num 11 11 14 14 14 13 27 20 18 23 ...

nrow(df_basket_output1)

## [1] 37190

df_basket_output2 <- df_basket_split2
head(df_basket_output2)

str(df_basket_output2)

## 'data.frame': 2 obs. of 5 variables:
## $ rules :'data.frame': 2 obs. of 2 variables:
## ..$ lhs: chr "462-5406" "462-5726"
## ..$ rhs: chr "462-5726" "462-5406"
## $ support : num 491 491
## $ confidence: num 67.6 64.8
## $ lift : num 35.9 35.9
## $ count : num 125 125

nrow(df_basket_output2)

## [1] 2

We change the variable heads for rules for simplicity.

str(df_basket_output1$rules)

## 'data.frame': 37190 obs. of 2 variables:
## $ lhs: chr "300-3512" "036-8303" "300-3532" "161-1271" ...
## $ rhs: chr "300-3535" "036-8304" "300-3535" "339-6516" ...

colnames(df_basket_output1$rules) <- c("Lhs", "Rhs")
head(df_basket_output1$rules)

## Lhs Rhs
## 1 300-3512 300-3535
## 2 036-8303 036-8304
## 3 300-3532 300-3535
## 4 161-1271 339-6516
## 5 338-4760 338-4715
## 6 338-4760 339-6516

str(df_basket_output1$rules)

## 'data.frame': 37190 obs. of 2 variables:
## $ Lhs: chr "300-3512" "036-8303" "300-3532" "161-1271" ...
## $ Rhs: chr "300-3535" "036-8304" "300-3535" "339-6516" ...

str(df_basket_output2$rules)

## 'data.frame': 2 obs. of 2 variables:
## $ lhs: chr "462-5406" "462-5726"
## $ rhs: chr "462-5726" "462-5406"

colnames(df_basket_output2$rules) <- c("Lhs", "Rhs")
head(df_basket_output2$rules)

## Lhs Rhs
## 1 462-5406 462-5726
## 2 462-5726 462-5406

str(df_basket_output2$rules)

## 'data.frame': 2 obs. of 2 variables:
## $ Lhs: chr "462-5406" "462-5726"
## $ Rhs: chr "462-5726" "462-5406"

Next, we go for creating the final output and look into its structure.

So, first, we create an empty data frame with the suitable number of rows and columns and then copy and paste the required variable columns from the ‘df_basket_output1’ data frame.

We first do this for the 1st basket output and then repeat the process for the 2nd basket output.

output1 <- data.frame(Lhs = character(nrow(df_basket_output1)),
                      Rhs = character(nrow(df_basket_output1)),
                      Support = double(nrow(df_basket_output1)),
                      Confidence = double(nrow(df_basket_output1)),
                      Lift = double(nrow(df_basket_output1)),
                      Count = double(nrow(df_basket_output1)),
                      stringsAsFactors = FALSE)

str(output1)

## 'data.frame': 37190 obs. of 6 variables:
## $ Lhs : chr "" "" "" "" ...
## $ Rhs : chr "" "" "" "" ...
## $ Support : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Confidence: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Lift : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Count : num 0 0 0 0 0 0 0 0 0 0 ...

output1$Lhs <- df_basket_output1$rules$Lhs
output1$Rhs <- df_basket_output1$rules$Rhs
output1$Support <- df_basket_output1$support
output1$Confidence <- df_basket_output1$confidence
output1$Lift <- df_basket_output1$lift
output1$Count <- df_basket_output1$count

str(output1)

## 'data.frame': 37190 obs. of 6 variables:
##  $ Lhs       : chr "300-3512" "036-8303" "300-3532" "161-1271" ...
##  $ Rhs       : chr "300-3535" "036-8304" "300-3535" "339-6516" ...
##  $ Support   : num 43.2 43.2 55 55 55 ...
##  $ Confidence: num 100 100 87.5 82.4 93.3 ...
##  $ Lift      : num 366 640 320 122 222 ...
##  $ Count     : num 11 11 14 14 14 13 27 20 18 23 ...

head(output1)

##        Lhs      Rhs  Support Confidence     Lift Count
## 1 300-3512 300-3535 43.19274  100.00000 365.9643    11
## 2 036-8303 036-8304 43.19274  100.00000 640.4375    11
## 3 300-3532 300-3535 54.97258   87.50000 320.2188    14
## 4 161-1271 339-6516 54.97258   82.35294 122.3001    14
## 5 338-4760 338-4715 54.97258   93.33333 222.4155    14
## 6 338-4760 339-6516 51.04596   86.66667 128.7063    13

tail(output1)

## Lhs
## 37185 092-0526,092-1480,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37186 092-0526,092-1480,092-2684,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37187 092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37188 092-0526,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510,552-9860
## 37189 092-0526,092-1480,092-2684,092-7603,092-7620,552-0298,552-0847,552-1972,552-6510
## 37190 092-0526,092-1480,092-2684,092-7603,552-0298,552-0847,552-1972,552-6510,552-9860
##            Rhs  Support Confidence     Lift Count
## 37185 092-2684 43.19274  100.00000 121.9881    11
## 37186 092-7603 43.19274  100.00000 129.7089    11
## 37187 092-0526 43.19274  100.00000 133.0779    11
## 37188 092-1480 43.19274  100.00000 150.6912    11
## 37189 552-9860 43.19274   91.66667 138.1336    11
## 37190 092-7620 43.19274  100.00000 204.9400    11

Doing the same with the 2nd basket output.

output2 <- data.frame(Lhs=character(nrow(df_basket_output2)),
                      Rhs=character(nrow(df_basket_output2)),
                      Support=double(nrow(df_basket_output2)),
                      Confidence=double(nrow(df_basket_output2)),
                      Lift=double(nrow(df_basket_output2)),
                      Count=double(nrow(df_basket_output2)),
                      stringsAsFactors=FALSE)

str(output2)

## 'data.frame': 2 obs. of 6 variables:
##  $ Lhs       : chr "" ""
##  $ Rhs       : chr "" ""
##  $ Support   : num 0 0
##  $ Confidence: num 0 0
##  $ Lift      : num 0 0
##  $ Count     : num 0 0

output2$Lhs <- df_basket_output2$rules$Lhs
output2$Rhs <- df_basket_output2$rules$Rhs
output2$Support <- df_basket_output2$support
output2$Confidence <- df_basket_output2$confidence
output2$Lift <- df_basket_output2$lift
output2$Count <- df_basket_output2$count

str(output2)

## 'data.frame': 2 obs. of 6 variables:
##  $ Lhs       : chr "462-5406" "462-5726"
##  $ Rhs       : chr "462-5726" "462-5406"
##  $ Support   : num 491 491
##  $ Confidence: num 67.6 64.8
##  $ Lift      : num 35.9 35.9
##  $ Count     : num 125 125

head(output2)

##        Lhs      Rhs  Support Confidence     Lift Count
## 1 462-5406 462-5726 490.8266   67.56757 35.87383   125
## 2 462-5726 462-5406 490.8266   64.76684 35.87383   125

tail(output2)

##        Lhs      Rhs  Support Confidence     Lift Count
## 1 462-5406 462-5726 490.8266   67.56757 35.87383   125
## 2 462-5726 462-5406 490.8266   64.76684 35.87383   125

Next, we write the outputs to CSV files.

write.csv(output1,"Output1.csv")

write.csv(output2,"Output2.csv")

Thank you! For further studies, the latest updates, or interview tips on data science and machine learning, subscribe to our emails.

The post Data Mining: Market Basket Analysis in R appeared first on StepUp Analytics.

The post Step by step implementation of Grammar of Graphics in R appeared first on StepUp Analytics.


- A data frame: stores all of the data that will be displayed on the plot.
- Aesthetic mappings: how data are mapped to colour, size, etc.
- Geoms: geometric objects such as points, lines, and shapes.
- Facets: used for conditional plots, producing multiple panels so you can see how a relationship changes with a third variable.
- Stats: statistical transformations such as binning, quantiles, smoothing, and regression.
- Scales: how the different variables are coded on the plot, e.g. which colours represent a binary variable such as gender (male, female).
- Coordinate system: how numerical values get translated onto the plot.

When building a plot with ggplot2, if you are not going to use qplot(), the basic idea is to build it up piece by piece. You can think of the “artist’s palette” model.

This is similar to the “base” plotting system, where you start with something and add to it piece by piece, unlike the lattice plotting system, where the whole plot has to be created in a single call. In the ggplot2 plotting system you can add things piece by piece: if you want a change, you simply make it on the next line.

The idea is that the plots are built up in layers:

- The most fundamental layer is the data.
- Overlay a summary, such as a smoothing or regression line.

- geom_point
- geom_smooth
- aesthetics
- facet_grid
- annotations
- labels
- customising geom_smooth

Let’s use the built-in mtcars dataset to understand the features of **ggplot2**.

**Building up in layers:**

g <- ggplot(mtcars, aes(wt, mpg))
print(g)  # no layers in the plot yet

It will look like this

**Adding geom_points layer**

p <- g + geom_point()
print(p)  # explicitly save and print the ggplot object
g + geom_point()  # auto-prints the plot object without saving

Here is how the plot looks after adding the geom_point layer:

**Adding geom_smooth() layers:**

g + geom_point() + geom_smooth(method="lm")  # the default smoother is loess; method="lm" fits a linear model

The plot looks like this after adding geom_smooth():

**Adding another layer facet:**

g + geom_point() + geom_smooth(method="lm") + facet_grid(. ~ cyl)
# here we condition on cyl
# facets are useful for conditioning on categorical variables

Here is the plot for facet_grid:

**Adding Annotations:**

- There are different kinds of labels: xlab(), ylab(), labs(), ggtitle()
- Each geom() has options to modify it
- theme() for global options

There are also built-in themes which we can use:

- theme_gray() #default theme
- theme_bw() #more stark/plain
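As a hedged sketch of how these pieces combine (the title and axis labels below are illustrative, not from the article), labels and a built-in theme can be added just like any other layer:

```r
# Minimal sketch: labels plus a built-in theme (assumes ggplot2 is
# installed; mtcars ships with R). theme_gray() would restore the default.
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Motor Trend cars",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon") +
  theme_bw()
print(p)
```

Each `+` adds one layer or setting, which is exactly the piece-by-piece model described above.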

Let’s talk about modifying some aesthetic features, for example changing the colour of the points.

g+geom_point(color="green",size=4,alpha=1/2)

Here is how the plot looks:

g + geom_point(aes(color=factor(cyl)), size=4, alpha=1/2) + geom_smooth(method="lm")
## mapping cyl as a factor gives each cylinder count its own colour

Here is the plot

**Modifying Label:**

g + geom_point(aes(color=cyl)) + geom_smooth(method="lm") + facet_grid(. ~ cyl) + labs(title="Cars Cylinder", x="weight", y="mpg")

Here is the plot

**Customising the Smooth**

g + geom_point(aes(color=cyl), size=4, alpha=1/2) + geom_smooth(size=4, linetype=3, method="lm", se=FALSE)
## se=FALSE removes the confidence band around the smooth

Here is how the plot looks:

That’s it, folks, about ggplot2. If something is missed, please mention it in the comments or shoot an email to irrfankhann29@gmail.com.


The post How to do cox(PH) regression modelling using R? appeared first on StepUp Analytics.

Because Cox regression assumes that the effects of the predictor variables on survival are constant over time and are additive on one scale.

The formula is expressed as the product of the baseline hazard function of time and an exponential function of covariates. The baseline hazard is an unspecified form of the Cox model, and the distribution of the outcome (survival time) is unknown. This makes Cox (PH) regression a semi-parametric model. The semi-parametric property of the Cox (PH) model makes it a robust model which can closely approximate parametric models.

If the below assumptions of Cox regression are met, this function will provide better estimates of survival probabilities and cumulative hazard than those provided by the Kaplan-Meier function.

- Hazard and hazard ratios.
- Time-dependent and fixed covariates.
- Model analysis and deviance.
- Survival and cumulative hazard rates.

The Cox PH assumption states that the hazard ratio (HR) for any two individuals in the same study is constant over time.

In other words, the hazard for a subject is proportional to the hazard for another subject in the same study where the proportionality constant, say b is independent of time.

When the HR varies with time, for example where hazards cross or when time-varying confounding variables are present, the PH assumption may be violated, making it inappropriate to use the Cox PH model. Where the Cox PH assumption is not met, variations of the Cox model can be used, for example extended Cox regression or stratified Cox regression, depending on the context.
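In practice the PH assumption can be checked with `cox.zph()` from the ‘survival’ package, which tests it via scaled Schoenfeld residuals. The sketch below is illustrative only: it uses the package’s built-in `lung` dataset rather than the colon data analysed later in this article.

```r
# Hedged sketch: checking the PH assumption with cox.zph()
# (lung dataset from the 'survival' package, for illustration only)
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
zp  <- cox.zph(fit)   # tests PH via scaled Schoenfeld residuals
print(zp)             # small p-values suggest the PH assumption is violated
```

`plot(zp)` would additionally show the smoothed residuals over time for each covariate.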

When individuals are followed over time, the values of covariates may change with time.

Covariates can thus be divided into fixed and time-dependent.

A covariate is time-dependent if the difference between its values for two different subjects changes with time (e.g. blood pressure, cholesterol, smoking).

A covariate is fixed if its values cannot change with time (e.g. gender or ethnicity).

What do you mean by “model analysis”? Yes, you guessed it right: it is the overall analysis of your model, e.g. statistical significance (covering hypothesis tests such as the t-test, chi-square, and Fisher’s exact test). Here the model chi-square is calculated by comparing deviances.

Deviance is minus twice the log of the likelihood ratio for models fitted by maximum likelihood (-2*log(likelihood)).

The value of adding a parameter to a Cox model is tested by subtracting the deviance of the model with the new parameter from the deviance of the model without it. The difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models.
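This deviance comparison can be sketched with `anova()` on nested `coxph` models. The example below uses the built-in `lung` dataset purely for illustration, not the data analysed in this article:

```r
# Hedged sketch: testing the value of adding a parameter by
# comparing deviances of nested Cox models (lung data, illustrative)
library(survival)

fit0 <- coxph(Surv(time, status) ~ sex,       data = lung)
fit1 <- coxph(Surv(time, status) ~ sex + age, data = lung)

# The deviance difference, -2*(logLik(fit0) - logLik(fit1)),
# is compared against a chi-square with 1 degree of freedom
anova(fit0, fit1)
```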

The survival function and the cumulative hazard function are calculated relative to the baseline (lowest value of covariates) at each time point.

<strong>If you have binary/dichotomous predictors in your model you are given the option to calculate survival and cumulative hazards for each variable separately.</strong>

- **Censored (*)** individuals have the same prospect of survival as those who continue to be followed. This cannot be tested for and can lead to a **bias (**)** that artificially reduces S.
- Survival prospects are the same for early as for late recruits to the study (can be tested for).
- The event studied (e.g. death) happens at the specified time. Late recording of the event studied will cause artificial inflation of S.

* Censorship in survival-time (time-to-event, failure-time) studies refers to incomplete data.

** Bias is a systematic error that leads to an incorrect estimate of effect or association.

First, install and load the “mgcv” package, and get its description.

library(mgcv)
library(survival)  ## to load the data

‘**mgcv**‘ is a package which provides **gam()** (Generalised Additive Models), similar to **glm()**. For more details on the ‘mgcv’ package, follow these steps:

open R -> type ??mgcv

This command will open the help page from the R documentation.

We are using the colon survival data from the ‘**survival**‘ package, which is built into R. Now move on:

col1 <- colon[colon$etype==1,]  ## concentrate on a single event
col1$differ <- as.factor(col1$differ)
col1$sex <- as.factor(col1$sex)

We convert the differ and sex variables to factors for convenience.

model <- gam(time~s(age,by=sex)+sex+s(nodes)+perfor+rx+obstruct+adhere, family=cox.ph(),data=col1,weights=status)

Above is the model formula for gam().

summary(model)

Summary of the above model looks like this

Let’s take a look at the plot effects of model

plot(model,pages=1,all.terms=TRUE)

Here is the plot effect of the given data

By now you should have an idea of the data’s behaviour and what kind of data we have.

Now let’s produce a survival plot for a particular subject; in this dataset the subject is a patient, and we calculate the survival curve for patient ‘**j**‘:

Below is the function for survival analysis.

numP <- 300; j <- 6
newd <- data.frame(time=seq(0,3000,length=numP))
dname <- names(col1)
for (n in dname) newd[[n]] <- rep(col1[[n]][j],numP)
newd$time <- seq(0,3000,length=numP)
fv <- predict(model,newdata=newd,type="response",se=TRUE)
plot(newd$time,fv$fit,type="l",ylim=c(0,1),xlab="time",ylab="survival")
lines(newd$time,fv$fit+2*fv$se.fit,col=2)
lines(newd$time,fv$fit-2*fv$se.fit,col=2)

Crude plot of baseline survival

plot(model$family$data$tr,exp(-model$family$data$h),type="l",ylim=c(0,1),
     xlab="time",ylab="survival")
lines(model$family$data$tr,exp(-model$family$data$h + 2*model$family$data$q^.5),col=2)
lines(model$family$data$tr,exp(-model$family$data$h - 2*model$family$data$q^.5),col=2)
lines(model$family$data$tr,exp(-model$family$data$km),lty=2)

Here is the plot for above code

If you want me to explain the graphics please shoot me a mail to irrfankhann29@gmail.com. I will get back to you.

Please give your feedback in comments or shoot me a mail

Source:Click Here.


The post Time series Forecasting using simple exponential smoothing method using R appeared first on StepUp Analytics.

In the *HoltWinters()* function, to perform simple exponential smoothing we have to set the parameters beta=FALSE and gamma=FALSE.

Values of alpha that are close to 0 mean that little weight is placed on the most recent observations when making forecasts of future values.

We will use the data provided by Rob J Hyndman, which contains the total annual rainfall in inches for London from 1813 to 1912.

As we discussed in earlier articles how to read and plot time series data, let’s read and plot this data using R:

data <- scan("http://robjhyndman.com/tsdldata/hurst/precip1.dat", skip=1)
datatimeseries <- ts(data, start=c(1813))
datatimeseries
plot.ts(datatimeseries)

Here is the R console output

and here is the data graph

From the plot the mean stays constant at about 25 inches. The random fluctuations in the time series seem to be roughly constant in size over time, so it is probably appropriate to describe the data using an additive model. Thus, we can make forecasts using simple exponential smoothing.

To make forecasts using simple exponential smoothing in R, we can fit a simple exponential smoothing predictive model using the “*HoltWinters()*” function. The beta and gamma parameters are used for Holt’s exponential smoothing and Holt-Winters exponential smoothing, as described below, so for simple exponential smoothing we set both to FALSE.

The HoltWinters() function returns a list variable, that contains several named elements.

Let’s make a forecast for the time series of annual rainfall in the provided data.

datatimeseriesforecast <- HoltWinters(datatimeseries, beta=FALSE, gamma=FALSE)
datatimeseriesforecast

Here is the R console output

You can see that the estimated value of alpha is 0.02412, which is very close to zero. As mentioned above, alpha lies between 0 and 1. With this value of alpha, the forecasts are based on both recent and less recent observations, although somewhat more weight is placed on the recent observations.

The output of the HoltWinters() function is saved in the list variable “datatimeseriesforecast“. The forecasts made by HoltWinters() are stored in a named element of this list variable called “fitted”.

Let’s have a look at the observations of **datatimeseriesforecast**.

datatimeseriesforecast$fitted
plot(datatimeseriesforecast)

Let’s have a look at the plot.

The plot shows the original time series in black, and the forecasts as a red line.

As a measure of the accuracy of the forecasts, we calculate the sum of squared errors. The sum of squared errors is stored in a named element of the list variable “datatimeseriesforecast” called “SSE”.

datatimeseriesforecast$SSE

It is common in simple exponential smoothing to use the first value in the time series as the initial value for the level.

HoltWinters(datatimeseries,beta=FALSE, gamma=FALSE, l.start=23.56)

HoltWinters() just makes forecasts for the time period covered by the original data. We can make forecasts for further time points by using the “forecast.HoltWinters()” function in the R “forecast” package.

To use the forecast.HoltWinters() function, we first need to install the “forecast” R package (for instructions on how to install an R package, see How to install an R package).

Let’s install the **forecast R package**:

install.packages("forecast")

When using the forecast.HoltWinters() function, as its first argument (input), you pass it the predictive model that you have already fitted using the HoltWinters() function. For example, in the case of the rainfall time series, we stored the predictive model made using HoltWinters() in the variable “**datatimeseriesforecast**”. You specify how many further time points you want to make forecasts for by using the “h” parameter in forecast.HoltWinters(). For example, to make a forecast of rainfall for the years 1913-1920 (8 more years) using forecast.HoltWinters(). See below

datatimeseriesforecastpackage <- forecast.HoltWinters(datatimeseriesforecast, h=8)
datatimeseriesforecastpackage

R console output

The forecast.HoltWinters() function gives you the forecast for a year, a 80% prediction interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted rainfall for 1920 is about 24.68 inches, with a 95% prediction interval of (16.24, 33.11).

To plot the predictions made by forecast.HoltWinters(), we can use the “plot.forecast()” function:

plot.forecast(datatimeseriesforecastpackage)

Here the rainfall forecast for 1913-1920 is plotted as a blue line; the 80% prediction interval is the dark shaded area and the 95% prediction interval is the light shaded area.

The ‘*forecast errors*’ are calculated as the observed values minus the predicted values.

The in-sample forecast errors are stored in the named element “residuals” of the list variable returned by forecast.HoltWinters().

If there are correlations between successive forecast errors, it is likely that the simple exponential smoothing forecasts could be improved upon by another forecasting technique.

Let’s check: we can obtain a correlogram of the in-sample forecast errors for lags 1-20 using the acf() function.

acf(datatimeseriesforecastpackage$residuals, lag.max=20)

If you get an na.fail.default(ts(x)) error, use this command to skip the NAs:

acf(datatimeseriesforecastpackage$residuals, lag.max=20, na.action=na.pass)

Here is the plot

You can see from the sample correlogram that the autocorrelation at **lag 3** is just touching the significance bounds. To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a Ljung-Box test.

This can be done in R using **Box.test() function**

To test, if there is any non-zero autocorrelation in between lag(1-20)

Box.test(datatimeseriesforecastpackage$residuals, lag=20, type="Ljung-Box")

Here is the R output

Here the Ljung-Box test statistic is 17.4, and the p-value is 0.6, so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-20.

To be sure that the predictive model cannot be improved upon, it is also a good idea to check whether the forecast errors are normally distributed with mean zero and constant variance. To check whether the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:

plot.ts(datatimeseriesforecastpackage$residuals)

here is the plot

The plot shows that the in-sample forecast errors seem to have roughly constant variance over time, although the size of the fluctuations in the start of the time series (1820-1830) may be slightly less than that at later dates (eg.1840-1850).
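Normality of the forecast errors can also be checked with a histogram and an overlaid normal curve. The sketch below uses the built-in Nile series so it is self-contained; applying the same steps to the rainfall series would work identically (this is an illustrative addition, not part of the original analysis):

```r
# Hedged sketch: histogram of in-sample forecast errors with an
# overlaid normal density (Nile series used for self-containment)
fit <- HoltWinters(Nile, beta = FALSE, gamma = FALSE)
err <- residuals(fit)                      # in-sample forecast errors

hist(err, breaks = 20, freq = FALSE,
     main = "Forecast errors", xlab = "error")
curve(dnorm(x, mean = mean(err), sd = sd(err)),
      add = TRUE, col = "red")             # reference normal curve
```

If the histogram is roughly bell-shaped and centred on zero, the normality assumption is plausible.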

Forecasting the rainfall time series is done. If you have any doubts, please share your views in the comment section or shoot me an email at irrfankhann29@gmail.com.

Source : Article originally posted here.


The post Mathematical Modelling Functions In R – Part 2 appeared first on StepUp Analytics.

**Polynomial functions** are functions in which x appears several times, each time raised to a different power. They are useful for describing curves with humps, inflexions or local maxima.

See the graphs below:

x <- seq(0, 10, 0.1)
y1 <- 2 + 5*x - 0.2*x^2
y2 <- 2 + 5*x - 0.4*x^2
y3 <- 2 + 4*x - 0.6*x^2 + 0.04*x^3
y4 <- 2 + 4*x + 2*x^2 - 0.6*x^3 + 0.04*x^4
par(mfrow=c(2,2))
plot(x, y1, type="l", ylab="y", main="decelerating")
plot(x, y2, type="l", ylab="y", main="humped")
plot(x, y3, type="l", ylab="y", main="inflection")
plot(x, y4, type="l", ylab="y", main="local maximum")

Inverse polynomials are an important class of functions which are suitable for setting up generalized linear models with gamma errors and inverse link functions:

**1/y = a+b*x+c*x^2 +d*x^3 +…+z*x^n**

Various shapes of function are produced, depending on the order of the polynomial (the maximum power) and the signs of the parameters:

par(mfrow=c(2,2))
y1 <- x/(2+5*x)
y2 <- 1/(x-2+4/x)
y3 <- 1/(x^2-2+4/x)
plot(x, y1, type="l", ylab="y", main="Michaelis-Menten")
plot(x, y2, type="l", ylab="y", main="shallow hump")
plot(x, y3, type="l", ylab="y", main="steep hump")

There are two ways of parameterizing the **Michaelis–Menten** equation:

**y = a*x/(1+b*x)   and   y = x/(c+d*x)**

In the first case, the asymptotic value of y is a/b, and in the second it is 1/d.
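The two parameterizations describe the same curve when c = 1/a and d = b/a, which is easy to verify numerically (the parameter values below are illustrative):

```r
# Hedged check: the two Michaelis-Menten parameterizations agree
# when c = 1/a and d = b/a (values here are illustrative)
a <- 400; b <- 4
c <- 1/a; d <- b/a

x  <- seq(0.1, 5, 0.1)
y1 <- a*x / (1 + b*x)   # first parameterization, asymptote a/b
y2 <- x / (c + d*x)     # second parameterization, asymptote 1/d

stopifnot(isTRUE(all.equal(y1, y2)))  # identical curves
```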

**Gamma function:** The gamma function Γ(t) is an extension of the factorial function, **t!**, to positive real numbers:

**Γ(t) = ∫ x^(t−1)*e^(−x) dx,   0 < x < ∞**

It looks like this:

t <- seq(0.2, 4, 0.01)
plot(t, gamma(t), type="l")
abline(h=1, lty=2)

Note that Γ(t) is equal to 1 at both t=1 and t=2. For integer values of t, Γ(t+1) = t!.
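These properties can be checked directly in R with the built-in gamma() and factorial() functions:

```r
# Quick check of the gamma-factorial relationship:
# gamma(t) equals 1 at t = 1 and t = 2, and gamma(t + 1) = t!
stopifnot(abs(gamma(1) - 1) < 1e-12)
stopifnot(abs(gamma(2) - 1) < 1e-12)
stopifnot(abs(gamma(5) - factorial(4)) < 1e-9)  # 4! = 24
```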

**Asymptotic functions:** Much the most commonly used asymptotic function is

**y = a*x/(1+b*x)**

which has a different name in almost every scientific discipline. For example, in **biochemistry it is called Michaelis–Menten**, and shows reaction rate as a function of enzyme concentration; in **ecology, it is called Holling’s disc equation** and shows predator feeding rate as a function of prey density. The other common function is the **asymptotic exponential**

**y=a*(1−e^(−b*x))**

This, too, is a two-parameter model, and in many cases, the two functions would describe data equally well.

Let’s examine the behaviour at the limits of our two asymptotic functions, starting with the asymptotic exponential.

For x=0 we have

**y = a*(1−e^(−b×0)) = a*(1−e^0) = a*(1−1) = a×0 = 0**

so the graph goes through the origin. At the other extreme, for x=∞, we have

**y=a*(1−e^(−b×∞))=a*(1−e^(−∞))=a*(1−0)=a*1=a**

The plot looks just like this:

which demonstrates that the relationship is asymptotic and that the asymptotic value of y is a. For the Michaelis–Menten equation, determining the behaviour at the limits is somewhat more difficult, because for x=∞ we end up with y=∞/∞ which you might imagine is always going to be 1 no matter what the values of a and b. In fact, there is a special mathematical rule for this case, called **l’Hospital’s rule: when you get a ratio of infinity to ****infinity, you work out the ratio of the derivatives to obtain the behaviour at the limit**.

For x=0 the limit is easy:

**y = a×0/(1+b×0) = 0/(1+0) = 0/1 = 0**

For x=∞ we get

**y = ∞/(1+∞) = ∞/∞**

The numerator is **a*x** so its derivative with respect to **x is a**. The denominator is **1+b*x** so its derivative with respect to **x is 0 +b = b**.

So the ratio of the derivatives is **a/b**, and this is the asymptotic value of the Michaelis–Menten equation.
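The limiting behaviour derived above can be sketched numerically in R (the parameter values are illustrative): the Michaelis–Menten function passes through the origin and tends to a/b, while the asymptotic exponential tends to a.

```r
# Hedged numeric check of the limiting behaviour (a, b illustrative)
a <- 2; b <- 0.5
mm   <- function(x) a*x / (1 + b*x)        # Michaelis-Menten
asym <- function(x) a * (1 - exp(-b*x))    # asymptotic exponential

stopifnot(mm(0) == 0, asym(0) == 0)        # both pass through the origin
stopifnot(abs(mm(1e8)  - a/b) < 1e-6)      # MM tends to a/b
stopifnot(abs(asym(50) - a)   < 1e-6)      # exponential tends to a
```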

Parameter estimation in asymptotic functions: there is no way of linearizing the asymptotic exponential model, so we must resort to nonlinear least squares (NLS) to estimate its parameter values (p. 662). One of the advantages of the Michaelis–Menten function is that it is easy to linearize. We use the reciprocal transformation

**1/y = (1+b*x)/(a*x)**

which, at first glance, isn’t a big help. But we can separate the terms on the right because they have a common denominator. Then we can cancel the xs, like this:

**1/y = 1/(a*x) + (b*x)/(a*x) = 1/(a*x) + b/a**

If we simplify the above equation by putting *Y=1/y, X=1/x, A=1/a, and C=b/a*, we see that

**Y =AX +C**

which is linear: **C is the intercept** and **A is the slope**. So to estimate the values of a and b from data, we would transform both x and y to reciprocals, plot a graph of 1/y against 1/x, carry out a linear regression, then back-transform to get:

**a= 1/A**

**b=a*C**

Suppose, the graph passed through the two points (x1,y1) as (0.2, 44.44) and (x2,y2) as (0.6,70.59). How do we work out the values of the parameters a and b?

**slope A = (Y2−Y1)/(X2−X1), where Y=1/y and X=1/x**

First, we calculate the four reciprocals. The slope of the linearized function is

**A = (1/70.59 − 1/44.44)/(1/0.6 − 1/0.2) = 0.002500781**

so a = 1/A = 1/0.0025 = 400. Now we rearrange the equation and use one of the points (say x=0.2, y=44.44) to get the value of b:

**b = (1/x)*[(a*x/y)−1] = (1/0.2)*[(400×0.2/44.44)−1] = 4**
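The back-transformation steps above can be sketched in R using the two points given in the text, recovering values close to a = 400 and b = 4:

```r
# Hedged sketch: recovering a and b by the reciprocal transformation
# from the two points used in the text
x <- c(0.2, 0.6)
y <- c(44.44, 70.59)

X <- 1/x; Y <- 1/y                    # transform to reciprocals
A <- (Y[2] - Y[1]) / (X[2] - X[1])    # slope of the linearized function
C <- Y[1] - A * X[1]                  # intercept

a <- 1/A                              # back-transform
b <- a * C
c(a = a, b = b)                       # close to a = 400, b = 4
```

With more than two data points, `lm(Y ~ X)` would give the slope and intercept by least squares instead.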

The remaining functions we will discuss in a later part.

If you have any doubts please put a comment below and we will try to address those.

