Data Mining: Market Basket Analysis in R

Market Basket Analysis in R, From Sellers to Intelligent Sellers: Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.

The term ‘E-commerce’ is well known to all of us. It means trade and business conducted over the internet, popularly known as ‘online shopping’. Nowadays, retailers who traditionally sold their products strictly in ‘brick-and-mortar’ stores also display their products online, which lets customers purchase them through various platforms.

In doing so, both customers and sellers benefit. Customers can search for the products they want and compare prices online, whereas sellers can conduct their trade in a cost-effective and better-informed manner.

The biggest perk of an online presence for a seller is that it enables them to correct past mistakes in their business policies simply by examining the recorded sales data and understanding customer behaviour.

However, generating this data and delving into it to extract useful insights is an analytical task that requires a systematic algorithm. One widely used algorithm is ‘Apriori’. Such algorithms generally require trained marketing analysts to run them and interpret the results.

This is where the term ‘Market Basket Analysis’ comes from. It is now a very common procedure, performed not only by online retailers but also by sellers who prefer physical ‘brick-and-mortar’ stores.
The term ‘market basket’ refers to any consumption bundle that a customer takes up for final purchase.

Such a bundle need not consist of several units of the same product; it also covers the case of a customer buying multiple different items in one go, which together make up his or her ‘market basket’.

How such an analysis helps:

Market basket analysis may provide the retailer with information about the purchase behaviour of a buyer. This information enables the retailer to understand the buyer’s needs and rearrange the store’s layout accordingly, develop cross-promotional programs, or even capture new buyers, all of which are necessary to survive in the market.

The most relevant and well-known examples include ‘Amazon’, ‘Flipkart’ and ‘eBay’.

Market Basket Analysis: The Basics

Items are the objects that we are identifying associations between. For an online retailer, each item is a product in the shop. For a publisher, each item might be an article, a blog post, a video etc. A group of items is an item set.
Transactions are instances of groups of items occurring together. For an online retailer, a transaction is generally a monetary transaction. For a publisher, a transaction might be the group of articles read in a single visit to the website. (It is up to the analyst to define over what period to measure a transaction.) For each transaction, then, we have an item set.

Rules are statements of the form

$$\{i_1, i_2, \dots\} \Rightarrow \{i_k\}$$

i.e. if a transaction contains the items in the item set on the left-hand side (LHS) of the rule, {i_1, i_2, …}, then it is likely that the visitor will also be interested in the item on the right-hand side (RHS), {i_k}. A hypothetical rule for a grocery retailer, for instance, might be {bread, butter} ⇒ {milk}.

The output of a market basket analysis is generally a set of rules that we can then exploit to make business decisions (related to marketing or product placement, for example).

The support of an item or an item set is the fraction of transactions in our data set that contain that item or item set. In general, it is useful to identify rules that have a high support, as these will be applicable to a large number of transactions. For supermarket retailers, this is likely to involve basic products that are popular across the entire customer base (e.g. bread, milk).

A printer cartridge retailer, on the other hand, may not have products with a high support, because each customer only buys cartridges that are specific to his or her own printer.

The confidence of a rule is the likelihood that it is true for a new transaction that contains the items on the LHS of the rule, i.e. it is the probability that the transaction also contains the item(s) on the RHS. Formally:

$$\mathrm{confidence}(\mathrm{LHS} \Rightarrow \mathrm{RHS}) = \frac{\mathrm{support}(\mathrm{LHS} \cup \mathrm{RHS})}{\mathrm{support}(\mathrm{LHS})}$$

The lift of a rule is the ratio of the support of the items on the LHS co-occurring with the items on the RHS to the probability that the LHS and RHS would co-occur if the two were independent:

$$\mathrm{lift}(\mathrm{LHS} \Rightarrow \mathrm{RHS}) = \frac{\mathrm{support}(\mathrm{LHS} \cup \mathrm{RHS})}{\mathrm{support}(\mathrm{LHS}) \times \mathrm{support}(\mathrm{RHS})}$$

If the lift is greater than 1, it suggests that the presence of the items on the LHS has increased the probability that the items on the RHS will occur in the transaction. If the lift is below 1, it suggests that the presence of the items on the LHS lowers the probability that the items on the RHS will be part of the transaction.

If the lift is 1, it suggests that the presence of items on the LHS and RHS really are independent, i.e. knowing that the items on the LHS are present makes no difference to the probability that items will occur on the RHS.
When we perform market basket analysis, then, we are looking for rules with a lift of more than 1.
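
To make the three measures concrete, here is a minimal sketch in base R that computes support, confidence and lift by hand. The five baskets and the helper function `support()` are made up for illustration and are not part of the data set used later.

```r
# Toy transactions (made up for illustration): each element is one basket.
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Hypothetical helper: fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("bread")
rhs <- c("milk")

supp_lhs  <- support(lhs, transactions)          # 4 of 5 baskets
supp_rhs  <- support(rhs, transactions)          # 4 of 5 baskets
supp_both <- support(c(lhs, rhs), transactions)  # 3 of 5 baskets

# Confidence and lift of the rule {bread} => {milk}.
confidence <- supp_both / supp_lhs               # about 0.75
lift       <- supp_both / (supp_lhs * supp_rhs)  # about 0.94

print(c(support = supp_both, confidence = confidence, lift = lift))
```

In this toy example the lift comes out slightly below 1, so buying bread makes milk marginally less likely than its overall popularity would suggest, even though the confidence of the rule looks high.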

Rules with higher confidence are ones where the probability of an item appearing on the RHS is high given the presence of the items on the LHS. It is also preferable to act on rules that have a high support, as these will be applicable to a larger number of transactions.

However, in the case of long-tail retailers, this may not be possible. In practice, it is often not possible to maximize support and confidence at the same time. For businesses that deal in products with relatively low demand, it is advisable to maximize confidence while keeping support at or above an acceptable threshold.

The following steps take us through the analytical process of performing Market Basket Analysis in R:

Implementing Market Basket Analysis using Apriori Algorithm

At first, we read the data set on transactions.

The data set used in this analysis is “AprioriTransactionsReduced.csv”, a CSV file.
Anyone who needs access to this data set can get it from the link below.

Data Set – AprioriTransactionsReduced.csv
We now set the file path and then import the CSV file into R. After importing the data file we look at its initial structure.

We sort the data set in ascending order of ‘Invoice’ and take a brief look at the sorted data set.
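
The import and sort steps can be sketched as follows. This is only an illustration: a tiny temporary file stands in for AprioriTransactionsReduced.csv, the column names ‘Invoice’ and ‘Item’ are assumed from the text, and the object names `df_items` and `df_sorted` are hypothetical.

```r
# Stand-in for AprioriTransactionsReduced.csv: a tiny made-up file with the
# assumed columns Invoice and Item, written to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Invoice,Item",
  "1003,milk",
  "1001,bread",
  "1002,bread",
  "1001,milk",
  "1002,butter"
), csv_path)

# Import the file and look at its initial structure.
df_items <- read.csv(csv_path, stringsAsFactors = FALSE)
str(df_items)

# Sort by Invoice in ascending order and look at the first rows.
df_sorted <- df_items[order(df_items$Invoice), ]
head(df_sorted)
```

With the real file, `csv_path` would simply point to wherever the CSV has been saved.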


We convert ‘Invoice’ to numeric and check its data type.


We then convert Item to categorical format and look at its data type and structure.
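
A minimal sketch of these two conversions, using a small made-up data frame in place of the sorted data (the name `df_sorted` is an assumption):

```r
# Small made-up data frame standing in for the sorted data set.
df_sorted <- data.frame(
  Invoice = c("1001", "1001", "1002"),
  Item    = c("bread", "milk", "bread"),
  stringsAsFactors = FALSE
)

# Convert Invoice to numeric and check the result.
df_sorted$Invoice <- as.numeric(df_sorted$Invoice)
class(df_sorted$Invoice)

# Convert Item to a categorical variable (a factor) and check it.
df_sorted$Item <- as.factor(df_sorted$Item)
class(df_sorted$Item)
levels(df_sorted$Item)

# Look at the structure of the whole data frame.
str(df_sorted)
```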


Now, we convert the data frame to transaction format using ddply, grouping all the items that were bought together by the same customer on the same date.
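
One way this grouping might look with `ddply` from the plyr package is sketched below. It assumes that ‘Invoice’ alone identifies a basket, which may differ from the grouping keys used in the original analysis; the data and the name `df_itemlist` are made up.

```r
library(plyr)

# Made-up data: one row per item bought, keyed by invoice.
df_sorted <- data.frame(
  Invoice = c(1001, 1001, 1002, 1002, 1003),
  Item    = factor(c("bread", "milk", "bread", "butter", "milk"))
)

# For every invoice, paste its items into one comma-separated string.
# ddply splits the data frame by Invoice, applies the function to each
# piece, and binds the results back into a data frame.
df_itemlist <- ddply(
  df_sorted, c("Invoice"),
  function(d) paste(d$Item, collapse = ",")
)

df_itemlist
```

The result has one row per invoice: the grouping column plus a second column holding the item list.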


Now, we remove the column ‘Invoice’.


Next, we rename the only column header left in the data set.


We now export the data set to a CSV file as a backup.
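
The three steps above (dropping ‘Invoice’, renaming the remaining column and exporting) can be sketched as follows. The column name `itemList` and the file name `ItemList.csv` are hypothetical, and the file is written to a temporary directory so that the sketch has no side effects.

```r
# Made-up grouped data: one row per invoice with its comma-separated items.
df_itemlist <- data.frame(
  Invoice = c(1001, 1002, 1003),
  V1      = c("bread,milk", "bread,butter", "milk"),
  stringsAsFactors = FALSE
)

# Remove the Invoice column; only the item lists are needed from here on.
df_itemlist$Invoice <- NULL

# Rename the single remaining column (hypothetical name).
colnames(df_itemlist) <- c("itemList")

# Export to CSV as a backup. quote = FALSE keeps the commas between items
# unquoted, so every line of the file is one basket.
out_path <- file.path(tempdir(), "ItemList.csv")
write.csv(df_itemlist, out_path, quote = FALSE, row.names = FALSE)

readLines(out_path)
```

Leaving out the row names keeps row numbers from being read back later as if they were items.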


We bring in the association rule mining algorithm: ‘apriori’

We load the packages required.


We now convert the csv file to basket format and inspect it.


For our convenience, we remove the quotes from transactions.
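
A sketch of loading the package, reading the file in basket format and cleaning the item labels is shown below. It assumes the arules package, a file with a single header line followed by one basket per line, and made-up contents; the object name `txn` is hypothetical.

```r
library(arules)

# Made-up basket file: a header line followed by one basket per line.
basket_path <- tempfile(fileext = ".csv")
writeLines(c(
  "itemList",
  "bread,milk",
  "bread,butter",
  "bread,milk,butter",
  "milk"
), basket_path)

# Read the file in basket format; skip = 1 drops the header line so that
# it is not treated as a transaction.
txn <- read.transactions(basket_path, format = "basket", sep = ",", skip = 1)
inspect(txn)

# Strip any stray double quotes from the item labels. For this toy file
# there are none, so the labels are left unchanged.
txn@itemInfo$labels <- gsub("\"", "", txn@itemInfo$labels)
itemLabels(txn)
```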

Finally, the next step is to run the apriori algorithm.



Note: Here we have created 2 different basket rules. One with high confidence and low support and the other with high support and low confidence.
Now we view the rules that we have created.
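
The two rule sets might be generated and viewed roughly as follows. The threshold values are illustrative assumptions, not the ones used in the original analysis, and the toy transactions and object names are made up.

```r
library(arules)

# Made-up transactions, coerced from a list of item vectors.
txn <- as(list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
), "transactions")

# Rule set 1: high confidence, low support (illustrative thresholds).
basket_rules1 <- apriori(
  txn,
  parameter = list(support = 0.15, confidence = 0.9, minlen = 2, target = "rules"),
  control   = list(verbose = FALSE)
)

# Rule set 2: high support, low confidence (illustrative thresholds).
basket_rules2 <- apriori(
  txn,
  parameter = list(support = 0.35, confidence = 0.4, minlen = 2, target = "rules"),
  control   = list(verbose = FALSE)
)

# View the rules, strongest lift first.
inspect(sort(basket_rules1, by = "lift"))
inspect(sort(basket_rules2, by = "lift"))
```

Setting `minlen = 2` excludes rules with an empty left-hand side, which would otherwise only restate how popular single items are.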

We now convert the basket rules to data frames and view them. We also apply suitable transformations to the ‘confidence’ and ‘support’ parameters.
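
A sketch of the conversion for one rule set is given below. The text does not say which transformations were applied, so the two shown here (confidence as a percentage, support as a transaction count) are assumptions, as are the data and the thresholds.

```r
library(arules)

# Made-up transactions and one illustrative rule set.
txn <- as(list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
), "transactions")

basket_rules1 <- apriori(
  txn,
  parameter = list(support = 0.15, confidence = 0.9, minlen = 2, target = "rules"),
  control   = list(verbose = FALSE)
)

# Convert the rules object to an ordinary data frame: one row per rule,
# with the rule text and its quality measures as columns.
df_basket1 <- as(basket_rules1, "data.frame")

# Assumed transformations: confidence as a percentage, and support as the
# number of transactions instead of a fraction.
df_basket1$confidence    <- round(df_basket1$confidence * 100, 1)
df_basket1$support_count <- round(df_basket1$support * length(txn))

df_basket1
```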


Mining Rules for Recommendations:

We split lhs and rhs into two columns.



Next, we remove curly brackets around the rules.



Next, we convert the rules to character format to make them presentable.
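
These three clean-up steps can be done in base R, as sketched below on a made-up data frame of rule strings. It assumes the rules are rendered in the usual `{lhs} => {rhs}` form; the original analysis may have used a different helper for the split.

```r
# Made-up rules data frame, with rule text in the form "{lhs} => {rhs}".
df_basket1 <- data.frame(
  rules      = factor(c("{butter} => {bread}", "{butter,milk} => {bread}")),
  support    = c(0.4, 0.2),
  confidence = c(1, 1)
)

# Convert the rule text to character so it can be handled as plain strings.
df_basket1$rules <- as.character(df_basket1$rules)

# Split each rule at " => " into its left-hand and right-hand side.
parts <- strsplit(df_basket1$rules, " => ", fixed = TRUE)
df_basket1$lhs <- vapply(parts, `[`, character(1), 1)
df_basket1$rhs <- vapply(parts, `[`, character(1), 2)

# Remove the curly brackets around both sides.
df_basket1$lhs <- gsub("[{}]", "", df_basket1$lhs)
df_basket1$rhs <- gsub("[{}]", "", df_basket1$rhs)

df_basket1
```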



Now, we create a copy of the basket outputs.




We rename the rule columns for simplicity.
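
A small sketch of the copy and rename steps follows. The new column names `item_bought` and `item_recommended` are hypothetical, chosen only to show the mechanics, and the data is made up.

```r
# Made-up cleaned rules data frame.
df_basket1 <- data.frame(
  lhs        = c("butter", "butter,milk"),
  rhs        = c("bread", "bread"),
  support    = c(0.4, 0.2),
  confidence = c(100, 100),
  stringsAsFactors = FALSE
)

# Work on a copy so that the original rules data frame stays untouched.
df_basket_output1 <- df_basket1

# Rename the rule columns (hypothetical, more readable names).
names(df_basket_output1)[names(df_basket_output1) == "lhs"] <- "item_bought"
names(df_basket_output1)[names(df_basket_output1) == "rhs"] <- "item_recommended"

names(df_basket_output1)
```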




Next, we create the final output and look at its structure.
First, we create an empty data frame with a suitable number of rows and columns, and then copy the required columns across from the ‘df_basket_output1’ data frame.
We do this for the 1st basket output and then repeat the process for the 2nd basket output.
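
The construction of the final output might look like this for the 1st basket output. The contents of ‘df_basket_output1’, the column names and the choice of which columns to keep are all assumptions made for illustration.

```r
# Made-up stand-in for df_basket_output1 (column names are hypothetical).
df_basket_output1 <- data.frame(
  item_bought      = c("butter", "butter,milk"),
  item_recommended = c("bread", "bread"),
  support          = c(0.4, 0.2),
  confidence       = c(100, 100),
  stringsAsFactors = FALSE
)

# Create an empty data frame with one row per rule and three columns.
n_rules <- nrow(df_basket_output1)
final_output1 <- data.frame(matrix(NA, nrow = n_rules, ncol = 3))
names(final_output1) <- c("item_bought", "item_recommended", "confidence")

# Copy the required columns across from the basket output.
final_output1$item_bought      <- df_basket_output1$item_bought
final_output1$item_recommended <- df_basket_output1$item_recommended
final_output1$confidence       <- df_basket_output1$confidence

# Look at the structure of the final output.
str(final_output1)
```

Selecting the columns directly, e.g. `df_basket_output1[, c("item_bought", "item_recommended", "confidence")]`, would give the same result in one step.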





We do the same with the 2nd basket output.



Next, we write the outputs to csv format files.
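
A minimal sketch of the export step, with made-up final outputs and hypothetical file names, written to a temporary directory:

```r
# Made-up final outputs for the two rule sets.
final_output1 <- data.frame(
  item_bought      = c("butter", "butter,milk"),
  item_recommended = c("bread", "bread"),
  confidence       = c(100, 100),
  stringsAsFactors = FALSE
)
final_output2 <- data.frame(
  item_bought      = c("bread", "milk"),
  item_recommended = c("milk", "bread"),
  confidence       = c(75, 75),
  stringsAsFactors = FALSE
)

# Hypothetical output file names, placed in a temporary directory.
out1 <- file.path(tempdir(), "basket_output1.csv")
out2 <- file.path(tempdir(), "basket_output2.csv")

# Write both outputs without row names. Fields are quoted by default, so
# an item list such as "butter,milk" stays in a single column.
write.csv(final_output1, out1, row.names = FALSE)
write.csv(final_output2, out2, row.names = FALSE)

# Read one file back to confirm that it round-trips.
read.csv(out1, stringsAsFactors = FALSE)
```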
