SEGMENTATION OF CUSTOMERS USING RFM MODEL

 

Customer Lifetime Value

Customer Lifetime Value (CLV), also known as Life-time Value (LTV), is the present value of the future cash flows from the customer during his or her entire relationship with the company. In other words, the dollar value of a customer relationship which is based on the present value of the estimated future cash flows from that particular customer’s relationship with the company.

Why Customer Lifetime Value ?

CLV is mainly used by companies for customer segmentation. Because each and every customer is not important for a company, hence CLV-based segmentation model allows the company to predict its most profitable group of customers, understand their nature, and focus more on them rather than wasting time, money and resource on other less profitable customers.

Introducing RFM-model

Recency-Frequency-Monetary (RFM) model is one of the predictive models to calculate CLV. By segmenting customers using RFM model, we can analyze each group of customers individually and determine which set of customers has the highest CLV, hence contributing to the profitability of the company.
RFM analysis determines quantitatively which customers are the best ones by examining the following factors :

  1. How recently a customer has purchased, also known as Recency,
  2. How often a customer purchases, also known as Frequency, and
  3. How much a customer spends during his/her purchase, also known as Monetary value.

It follows the axiom that 80% of business comes from 20% of customers (set that are most important).

RFM Model

To start with the RFM model, install the following packages first.

Now loading these above packages,

Now, importing the dataset in R,

The link to view the whole dataset “Online Retail” in csv format is,
Workable Data: Download
To have a look at the preview of the dataset and see its structure,

It is an invoice dataset of online retail. It has 541909 customer records taking into account 8 variables.
We see there are few misinterpreted datatypes of the variables. Also we see in the data that there are negetive values for the variables, Quantity and Unit Price. This logically cannot happen. So we need to do treatment for all these.
Firstly, correcting the datatypes of required variables,

Also, to treat the illogically placed negetive values in Quantity and Unit Price,

What we do is, we replace the negetive values with NA, and them omitting the NA’s.
Checking the structure again,

Now the records decreased to 397884 observations, explained by the same 8 variables. And this data is a cleaned data, including no illogical records or datatypes in it.
Let us now create a calculated column showing total price, so that it is easy for us to calculate the monetary value of customers.

Now firstly, we calculate Recency of a customer in number of days,

Here we take the difference in days between customer’s purchase and today, and finally rounding the figure. This shows the recency of the purchase done by the customer.
Next calculating Frequency of a customer,

To find how frequent a customer is, we create a dummy column ‘Count’ giving a fixed value of 1. So we group by customer ID and sum the number of ‘Count’(s) in each group. This will give us the total number of occurances of purchase for a particular customer. Hence, keeping this as Frequency.
Next calculating Monetary Value of a customer,

To find monetary value of a customer, we group by customer ID and sum the Total Price in each group. This will give us the total money spent on purchase, for a particular customer. Hence, keeping this as Monetary Value.
Now viewing the preview of the data,

So we get 397884 records with its 13 variables. But we just need only three variables to do the process of segmentation with respect to each customer, Recency, Frequency and Monetary value. So separating these three variables and creating another dataset with these.

Extracting the Customer ID, Recency, Frequency and Monetary Value in data2.
Checking its structure now,
We find that it is in table form. Also the variable datatypes are not as is desired to be. So doing the necessary treatment on it,

Checking the structure again

So now we get a data frame structure consisting of 397884 records taking into consideration 4 variables only, that we require to make segmentation.

Clustering Model

To start with segmentation modelling, we have to remove the duplicate combinations of recency, frequency and monetary value, so that we get an unique set of records to do our segmentation. We keep these unique records in another new data set.

Now we have to create a new dataset with only the scaled down values of Recency, Frequency and Monetary values, so that there exist no bias in the unit of measurement. So three columns are present here storing it in another new dataset,

To convert dataset into a dataframe structure, for our ease in work, we convert it,

Checking the structure of the dataset now,

Now we get the required structure on which we can now segment our customers, using the algorithm of clustering.
Firstly, we install and run few libraries to work on clustering,

To start with segementation, we use the very famous K-means clustering. We divide our customers into three groups, based on their three features- recency, frequency and monetary value.

Running this code we get to se that the data is being divided into three groups, each size of 8704, 623 and 7436, respectively. Their individual cluster means are also calculated with respect to each feature.Also a vector containing the cluster number for each record is provided. The accuracy comes about 64%, which is good.
To see the distance within points in a cluster and between two clusters are,

We see that between cluster distance is much more than within points distance in a each clusters.
We can plot he clusters,


Now, we export out outputs in csv files, to read it better,

My output file containing the customers segmented in three clusters is attached below. Click to the link to view the output,
Output file: Download
Customer ID in pink coloured cells are from Cluster 1, with blue cells are in Cluster 2 and in yellow cells are in Cluster 3.

Insights

For Cluster 1 – Recency is observed to be 2294 days to 2463 days. Variability of the customers is very very less.
Number of times the purchase is done ranges from 1 to 2700 times. Variability captured is very less.
Monetary value of the purchases made ranges from Rs.6.2 to Rs.91062 . Variability observed is very less.
For Cluster 2 – Recency is observed to be 2294 days to 2667 days. Hence it takes into account little more older customers. Variability of the customers is very very less.
Number of times the purchase is done ranges from 3 to 7847 times. Hence it takes into account customers who purchases more number of times. Variability captured is moderately higher than the previous cluster.
Monetary value of the purchases made ranges from Rs.33720 to Rs.280206 . Hence it takes into account customers who adds to lot more valuation to their purchase. Variability observed is moderately higher than the previous cluster.
For Cluster 3 – Recency is observed to be 2458 days to 2667 days. Hence it takes into account little more older customers. Variability of the customers is very very less.
Number of times the purchase is done ranges from 1 to 2700 times. Hence it takes into account customers who purchases moderate number of times. Variability captured is moderately higher than the previous cluster. Few outliers are observed in the higher end.
Monetary value of the purchases made ranges from Rs.3.75 to Rs.91062 . Hence it takes into account customers who adds to moderate valuation to their purchase. Variability observed is moderately higher than the previous cluster. High variability at the upper end. Infact lots of outliers could be identified in higher purchase amounts.
So we can conclude that:

  1. The customers who fall under Cluster 2, have a high contribution to the profitability of the company, as they have tendency of purchasing more and purchasing items of high amounts. These are the potential an good customers.
  2. The customers who fall under Cluster 1, has a moderate contribution to the profitability of the company, as there is no high variability among the customers purchases. All customers purchase frequently, but the total amount is not much high. So if they are more motivated or advertised from the company’s end, then they could increase their number of purchase and be potential customers.
  3. The customers who fall under Cluster 3, are less atrractive customers as they have high variablibity in their number of purchases, plus they contribute much of an amount. These customers can be least expected to contribute to profit.
  4. Apart from this, we can see that frequency mainly influence the monetary value contributed by the customers, hence determining their CLV. Recency is more or less same for all clusters and so cant be identified as an influencial attribute to CLV.

You might also like More from author

1 Comment

  1. Kumar says

    Hi Pritha,

    Thanks for your detailed RFM analysis but the output file you have provided does not give the correct Customer IDs. The customer IDs you have provided in it, does not exist in the input file at all. Can you pls check on it and respond with right consolidated file. Also pls clarify if the output vector of kmeans model contains the row-ids of the input file against the cluster number?.

    I mean how to map the output of the model with the customer ids in the input file. This is needed as without this mapping we cannot know which customer falls into which cluster.

Leave A Reply

Your email address will not be published.