Retailing is a dynamic business domain where commodities and goods are sold in small quantities directly to customers. It deals with the end-user customers of a supply-chain network and therefore has to accommodate the needs and desires of a large group of customers over varied utilities. The volume and volatility of the business make it one of the prospective fields for analytical study and data modeling. This is also why customer segmentation plays a key role in multiple retail business decisions such as marketing budgeting, customer targeting, customized offers and value proposition. Depending on the use case, the segmentation could be based on aspects such as demographics, historic behavior or preferences. In this paper, historic retail transactional data is used to segment customers using K-Means clustering, and the results are used to arrive at a transition matrix that predicts cluster movements over time using a Markov Model. This helps in calculating the future value that a segment or a customer brings to the business. Strategic marketing designs and budgeting can be implemented using these results. The study is specifically useful for large-scale marketing in domains such as e-commerce, insurance or retail, to segment, profile and measure the customer lifecycle value over a short period of time.

Retail business is one of the foremost domains in which marketing and customized customer targeting are of critical value. It is an industry with one of the largest and most dynamic customer bases, and it has to provide finished goods and services that satisfy the needs of every type of customer. This makes it a business with a large number of marketing challenges as well as an effective field for marketing explorations, and consequently one of the most sought-after fields for data analytics. Statistical and predictive algorithms are widely used in the retail industry for tracking purchase patterns, optimizing inventories, budgeting for sectors, forecasting sales, targeted marketing etc. [

Customer profiling or segmentation is one of the key models that play a critical role in any retail marketing decision. A retailer has to provide a

Transactional data is one such mode of data acquisition in the retail industry that provides a steady inflow of metrics which help us understand purchase patterns or preferences. It covers multiple aspects such as product preferences, association patterns, impact of sales offers, sales trends etc.

An effective methodology for customer segmentation would be clustering on transactional data [

Therefore, this study aims to cluster the sample dataset of retail transactional data by first deriving the RFM metrics (Recency, Frequency and Monetary) of the customers and repeating the same across t periods. Provided the cluster profiles are uniform and comparable, we are able to establish the Markov Transition Matrix and predict the movement of customers across clusters in

A Markov Matrix describes a stochastic system by listing all possible states, the transitions between them and the probability of each transition, which can be leveraged to model the resultant segment distribution. This gives a measurable spread and can therefore be used to make numeric decisions, specifically in areas such as budgeting or value propositions. Alternative predictive strategies include regression models or decision trees, which need extensive training and, although they may be useful for generic strategies, do not give a definitive mathematical result.

Unsupervised clustering algorithms are a major part of any large-scale data analysis and therefore have an undeniable role in the retail industry. They help to segment customers without requiring prior information on individual customers. Clustering often serves as the first step of an analysis. There are multiple clustering algorithms widely in use, with K-Means being the predominant type for larger numerical datasets. Although it has a weakness in accurately establishing the number of clusters, research has established that coupling it with the elbow method to determine the cluster count is reliable enough compared to other methods such as Hierarchical or Two-Step Clustering [

The steps in the K-Means method are as follows: [

1. Determine the number of clusters n, using a separate methodology

2. Select n number of initial centroids randomly

3. Calculate the distance of each data point to every centroid using the Euclidean distance formula and assign the point to its nearest centroid.

4. Update centroids by calculating the average value of metrics in each cluster

5. Return to step 3 as long as any data point moves between clusters or the value of any centroid changes
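The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function name `kmeans`, the `max_iter` cap and the fixed `seed` are assumptions introduced here, and step 1 (choosing n) is taken as given.

```python
import math
import random


def kmeans(points, n_clusters, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: pick n distinct data points at random as the initial centroids.
    centroids = rng.sample(points, n_clusters)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once neither the assignments nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids
```

In practice a library implementation with several random restarts would normally be preferred, since a single random initialization can settle in a poor local optimum.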

Although the initial cluster allocation plays a pivotal role in the performance of clustering algorithms such as K-Modes or K-Means [

Elbow Method is used to determine the number of ideal clusters for a dataset by plotting the explained variation as a function of the number of clusters [

It is based on the intuition that the fit improves as the number of clusters increases, but the improvement flattens beyond a certain point, after which adding more clusters amounts to over-fitting. This point is denoted graphically by the elbow of the curve.
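As a rough sketch of this idea, the within-cluster sum of squares (WCSS, the unexplained counterpart of the explained variation) can be computed for each candidate cluster count and the bend located numerically. The helper names and the "largest second difference" rule are assumptions made for illustration; the study itself reads the elbow off the plotted curve.

```python
def within_cluster_ss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for members in groups.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, mean))
    return total


def elbow_k(wcss_by_k):
    """Return the k with the sharpest bend in the WCSS curve.

    wcss_by_k maps consecutive cluster counts to their WCSS. The bend is
    measured as the drop in WCSS before k minus the drop after k, which is
    a heuristic stand-in for visual inspection.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (wcss_by_k[prev_k] - wcss_by_k[k]) - (wcss_by_k[k] - wcss_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

When the curve has no pronounced bend, a numeric rule like this can be unstable, so it is best treated as a cross-check on the plot rather than a replacement for it.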

The resultant elbow curve of the dataset taken for this study is shown below

Realistically, it may be challenging to establish the elbow of the curve accurately, as in this case where no prominent deviation in the curve is observed. However, the curve noticeably flattens beyond 6 clusters, which makes 6 the best k-value for our model.

The transactional data can be summarized in terms of three major metrics: Recency, Frequency and Monetary, together referred to as RFM metrics. Recency is an indicator of the date of a customer's last transaction, denoting how recently they were associated with the business. Frequency is a measure of the number of distinct purchases or visits made by a customer. Monetary is the amount of money that a customer has spent with the business during the analysis period. RFM clustering has proved to be an effective methodology to segregate and focus on customer groups in terms of retention, incentives, targeted offers etc. It is based on the simple theory that customers who are regular or recent visitors and who spend more money on average tend to bring more value to the business, or tend to revisit, when compared to others.

Despite the simplicity of RFM models, they have some limitations: customer scores can be predicted only for the given period; the model does not predict the precise behavior of a customer or how their needs may change with time; other customer variables are not taken into account, so accuracy may suffer; and the model does not output a measurable monetary value. To deal with these disadvantages, it is advised to mix this approach with stochastic ones (RFM with Markov Chain, for instance [

Cluster Profiling or Segment Profiling is the final stage of any clustering exercise and uses the same variables that were part of the cluster analysis. It can be done by identifying the centroid of each cluster obtained from the model, observing the metrics of the centroid and describing them in business-friendly terminology [

Profiling may sometimes be more complicated than this, since looking at the centroids alone may not be effective, especially in cases with multiple variables or higher cluster counts. In such cases, visualizing the cluster distribution across the key variables may come in handy, where the variables are identified based on the respective business scenarios.

A Markov Model is a stochastic method for any system that randomly changes states over a stipulated time period, with the assumption that the future state does not depend on past states but only on the present state [

In a Markov Model, the probabilities of moving from one state to another in a single period are called transition probabilities, and the S × S matrix representing them, where S denotes the number of states exhibited by the customer, is called the Transition Matrix [

In other words, this is the percentage of population from an initial state, being retained in same state or moving into other states. It measures the likelihood of transition of an entry into the finite set of states.
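In symbols, and using notation introduced here for illustration (X_t for the state occupied in period t, p_ij for the entries of the S × S matrix), the definition and the row-sum property implied by the description above can be written as:

```latex
p_{ij} = \Pr\left(X_{t+1} = j \mid X_t = i\right), \qquad
p_{ij} \ge 0, \qquad
\sum_{j} p_{ij} = 1 \quad \text{for every state } i .
```

Each row therefore describes how the population currently in state i is split across the states in the next period.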

In

Markov Models have multiple real-life use cases such as weather forecasting, customer behavior and brand loyalty. They are effective for larger datasets because their computational requirements are low to modest, and they adapt easily to any series of data as long as it is sequential. A common present-day application is text prediction, where word suggestions are based on the words most frequently associated with each word entered.

A Markov Chain is a specific case of a Markov Model which deals with a discrete set of states over sequential periods of time, according to the given transition probabilities. A notable property of the Markov Chain is its convergence to equilibrium [
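Convergence to equilibrium can be illustrated by repeatedly applying a transition matrix to a starting distribution until it stops changing. The sketch below is an assumption-laden illustration: the function name, tolerance and iteration cap are hypothetical, and convergence to a unique equilibrium is only guaranteed when the chain is irreducible and aperiodic.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate the equilibrium distribution of a row-stochastic matrix P.

    Starts from a uniform distribution and applies P until successive
    distributions differ by less than tol in every component.
    """
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        # One period forward: new share of state j is sum_i dist[i] * P[i][j].
        nxt = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist
```

For a two-state matrix such as `[[0.9, 0.1], [0.5, 0.5]]`, the balance condition gives an equilibrium of 5/6 and 1/6, which the iteration approaches within a few dozen steps.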

For this study, we obtained a transactional dataset from a retail departmental store, collected over a span of 12 months, and passed it through 4 phases of data preparation: data cleaning, feature generation, dimensionality reduction and pre-processing.

The resultant dataset, devoid of extreme outliers, has 375,617 records and 9 variables, out of which Invoice No acts as the primary key. The dataset also holds a Customer ID which helps to identify distinct customers and therefore establish their RFM metrics by observing their purchases throughout the time period. The dataset consists of invoice entries generated over the 12-month period from Dec-2010 to Nov-2011. The Invoice Date, Unit Price and Quantity fields are the key fields from which the Recency, Frequency and Monetary calculated fields are derived.

The dataset is then split into subsets of equal time intervals of 2 months each, resulting in 6 individual datasets. Each of these is then clustered using the K-Means algorithm where the

The first step of preparing any dataset is to explore the variables and ensure the nulls and outliers are handled appropriately. In this case, the invoices which have null Customer ID are removed as they cannot be associated with any individual customer’s behavior. The Amount or Monetary metric at customer level is calculated by summing up individual purchase amounts, which is the product of Unit Price and Quantity. Therefore, any extreme outliers in quantity are also removed to acquire a uniform dataset. The Invoice Date, Invoice No and Amount fields are then used to arrive at the RFM metrics.

Recency is calculated as the number of days since last purchase of the customer, Frequency is the number of distinct purchases done over the given period and Monetary is the total amount spent over the given period. This is done by aggregating the invoices at individual Customer ID level. These measures are then grouped based on percentiles or quintiles, Pareto rule or business acumen based on the use case [
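The aggregation described above can be sketched as follows. The tuple layout of `invoice_lines`, the function name `rfm_metrics` and the choice of `period_end` as the reference date for recency are assumptions made for illustration; the study does not state which reference date it used.

```python
from collections import defaultdict
from datetime import date


def rfm_metrics(invoice_lines, period_end):
    """Aggregate invoice lines into (recency_days, frequency, monetary) per customer.

    invoice_lines: iterable of
        (customer_id, invoice_no, invoice_date, unit_price, quantity)
    period_end: reference date from which recency is measured (assumed).
    """
    last_purchase = {}
    invoices = defaultdict(set)
    monetary = defaultdict(float)
    for customer_id, invoice_no, invoice_date, unit_price, quantity in invoice_lines:
        if customer_id is None:
            continue  # cannot be attributed to any individual customer
        if customer_id not in last_purchase or invoice_date > last_purchase[customer_id]:
            last_purchase[customer_id] = invoice_date
        invoices[customer_id].add(invoice_no)          # distinct purchases
        monetary[customer_id] += unit_price * quantity  # amount per line
    return {
        cid: ((period_end - last_purchase[cid]).days, len(invoices[cid]), monetary[cid])
        for cid in last_purchase
    }
```

Counting distinct invoice numbers rather than invoice lines keeps Frequency a measure of purchases, not of items bought.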

An alternate method of normalization would be to assign comparable recency and frequency scores based on the percentiles or quartiles of the values calculated using invoice dates and counts. This would imply that a customer who has visited the store quite recently gets a higher recency score, which requires a descending quartile assignment during calculation. A customer who is a frequent purchaser gets a higher score for Frequency, and this trend and distribution is observed in

The conventional method goes on to use the RFM metrics to derive RFM scores, which are based on assigning customers to relative buckets of recency, frequency and monetary as shown in

The RFM metrics are calculated separately for the six datasets split over the time periods t1-t6 of 2 months each. Calculating them separately is essential to capture comparable scales of recency and frequency for any given 2-month period, unbiased by how recent the data is. The datasets are then scaled before being passed through the K-Means algorithm, which is detailed in the subsequent section.
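The text does not specify which scaling was applied. One common choice, assumed here purely for illustration, is z-score standardization of each RFM column so that no single metric dominates the Euclidean distance in K-Means. The function name `standardize` is hypothetical.

```python
import statistics


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation.

    rows: list of equal-length numeric tuples, e.g. (recency, frequency, monetary).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead to avoid /0.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]
```

Applying the scaling separately to each 2-month dataset matches the per-period treatment described above, though it also means a given scaled value corresponds to slightly different raw values in different periods.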

The RFM metrics for each dataset, referred to as d1–d6 for the six time periods, are passed through the K-Means model to produce 6 clusters. The resultant centroids are observed and labeled individually. The centroids are comparable and similar in profile across all six periods, since the data comes from the same uniform source. This is essential for the resultant clusters to be treated as the states of a Markov Model, which is done in the next section. An additional cluster 0 is also added to all datasets for the missing customers who did not make a purchase in the given time period.

The scatter plots in

As important as it is that the profiles from each period are comparable to each other, it is also important to observe the density distribution across the clusters over time. Any extreme shift between clusters in a few periods would mean that external factors are influencing customer behavior over time, which may in turn affect the transition probabilities. Hence, we plot the cluster distribution for the six time periods as shown in

Another notable observation is that the customers in cluster 0, which is the churned or yet-to-be-acquired base, are the highest in count. This means that customer footfall is highly dispersed or a one-time occurrence, and that the business most often does not see repeat purchases within any 2–4 month period. Clusters 3 and 4, where the customers are average or infrequent visitors, form the segment with considerable density, while the golden segment is the smallest of all. It is also the customers in clusters 3 and 4 that have the most potential of transitioning to clusters 5 or 6, given the right marketing strategies.

The profile tags and the descriptions are provided in detail in

Cluster | Customer profile | Profile description |
---|---|---|
0 | Churned | Customers who have no spend in the period |
1 | Rare visitors | Low engagement & low value customer base |
2 | Recently acquired | Lower recency, but engagement and value metrics are lower |
3 | Big spenders | Lower engagement, but higher monetary value |
4 | Moderate opportunity customers | Customer with average engagement and value |
5 | High opportunity customers | Moderate engagement with high value and a high potential for revenue |
6 | Golden customers | Best customers with highest engagement and value, most loyal |

The study aims at measuring the likelihood of a customer to transition across the clusters across time period

The Markov Model is a stochastic algorithm that results in higher accuracy with a larger dataset. This is a key driving factor in partitioning the given dataset into 6 equal intervals, which yields a larger number of transitions from one segment to another. The Markov Model concerns the current state only and therefore we do not need to link the full series of a customer's transitions over time. The cluster transitions are then appended one below the other for the datasets d1 to d5. The final dataset d6 is retained as a test base, which helps us validate how effective the model metrics are in establishing the cluster movements.

The transition probabilities for each cluster are calculated by dividing the count of transitions from that cluster to a given cluster state by the total number of occurrences of the original cluster. The transition probabilities are then arranged in a 7 × 7 matrix, signifying the movement across Clusters 0 to 6, and this is called the Transition Matrix. For the sample scenario of a cluster 0 to cluster 1 transition, where n denotes a number of observations, the probability is

$$p_{01} = \frac{n_{0 \to 1}}{n_{0}}$$

with $n_{0 \to 1}$ the number of observed transitions from cluster 0 to cluster 1 and $n_{0}$ the total number of observations that start in cluster 0.
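A minimal sketch of this calculation is given below. The data layout (one `{customer_id: cluster}` dictionary per period) and both function names are assumptions introduced for illustration; customers absent from a period are mapped to cluster 0, following the convention described earlier.

```python
def stack_transitions(period_labels):
    """Turn per-period cluster labels into (state_t, state_t_plus_1) pairs.

    period_labels: list of {customer_id: cluster} dicts in chronological order.
    A customer missing from a period is treated as being in cluster 0.
    """
    customers = set().union(*period_labels)
    pairs = []
    for current, following in zip(period_labels, period_labels[1:]):
        for cid in customers:
            pairs.append((current.get(cid, 0), following.get(cid, 0)))
    return pairs


def transition_matrix(transitions, n_states=7):
    """Estimate transition probabilities as row-normalized transition counts."""
    counts = [[0] * n_states for _ in range(n_states)]
    for src, dst in transitions:
        counts[src][dst] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state never observed as an origin is left as a row of zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Because only consecutive pairs are counted, stacking the pairs from several periods simply enlarges the sample behind each row, which is what appending the transitions one below the other achieves.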

This implies that a customer in Cluster 1 is more likely to churn (67%) than to become a more valued customer, even with the probabilities of all higher clusters put together. However, the churn rate falls as we move up the segment ladder, with the chance of losing a customer being as low as 4% in the golden segment of highly engaged, high-value customers.

This summarizes the cluster distribution of customers in period t + 1, given the cluster distribution in period t. Churn increases incrementally, but clusters 4 and 6 also see a natural increase. To establish the accuracy of the above matrix, the actual cluster distribution in dataset d6 is depicted below.
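The one-period projection amounts to multiplying the current cluster counts, as a row vector, by the transition matrix. The sketch below uses a hypothetical function name and made-up numbers in the usage note; it does not reproduce the study's figures.

```python
def next_distribution(counts, P):
    """Project cluster counts one period ahead.

    counts: number of customers currently in each state.
    P: row-stochastic transition matrix, P[i][j] = probability of i -> j.
    Returns the expected number of customers in each state next period.
    """
    n = len(P)
    return [sum(counts[i] * P[i][j] for i in range(n)) for j in range(n)]
```

For example, with two states, counts of `[100, 50]` and `P = [[0.5, 0.5], [0.2, 0.8]]`, the expected counts for the next period are 60 and 90. Applying the function repeatedly projects further periods ahead.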

The D6 distribution predicted by the Markov model and the actual distribution in the test dataset are comparable at first glance. Note that the number of distinct customers in this study is a small sample and real use cases would be much larger, so obtaining comparable results at this scale is still valuable for business decisions. Calculating the numeric variance for each cluster shows that some cluster counts are predicted less accurately, such as cluster 6, which has a variance of around 35% with respect to the test base. On one hand, a sample of fewer than 100 records is bound to show high variance; on the other hand, the deviation could also stem from natural influences on customer purchase behavior such as seasonality or market-wide financial factors. Overall, the prediction shows an average variance of 15%, which is useful for planning future business strategies.
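The text does not define how the variance is computed. Assuming it means the absolute percentage difference between the predicted and actual customer counts per cluster, a comparison could be sketched as follows. Both function names and the numbers in the usage note are hypothetical.

```python
def percent_deviation(predicted, actual):
    """Absolute percentage difference per cluster, relative to the actual count.

    Clusters with an actual count of zero yield None, since the ratio is undefined.
    """
    return [
        100 * abs(p - a) / a if a else None
        for p, a in zip(predicted, actual)
    ]


def mean_percent_deviation(predicted, actual):
    """Average of the defined per-cluster percentage deviations."""
    devs = [d for d in percent_deviation(predicted, actual) if d is not None]
    return sum(devs) / len(devs)
```

With predicted counts `[110, 65]` against actual counts `[100, 100]`, the per-cluster deviations are 10% and 35% and their mean is 22.5%. An unweighted mean gives small clusters the same influence as large ones, which is worth keeping in mind when a cluster holds fewer than 100 customers.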

Clustering the customers using RFM metrics across the set time intervals has proven effective in establishing comparable customer profiles. The resulting profile is a useful tag with a finite number of segments, 7 in this case, which helps in establishing the future state relationship using the Markov Transition Matrix. It is observed that the resultant predicted distribution across clusters is closer to the test dataset retained for period

Although predictive models such as regression or decision trees can be used for scoring or state prediction, they require exhaustive training and a longer duration of study when compared to Markov Models. The Markov Model, being stochastic in nature, also gives a definitive probabilistic measure of future states, thereby favoring applications in broader business decisions such as budgeting or targeting. The transition matrix is also a valuable measure in cases where two independent customer bases are targeted with different offers or marketing methodologies. It would help the business establish which strategy proved more effective in transitioning customers towards a segment of higher engagement and value. Apart from segment prediction, the model can also be used effectively to predict purchase patterns in recommendation models [