Customer Cohort Analysis

21 minute read

Cohort Analysis with Python

What is cohort analysis? A cohort is a group of users who share something in common, be it their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping you spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor your customer and revenue retention.

It’s common for cohorts to be created based on a customer’s first usage of the platform, where “usage” is dependent on your business’ key metrics. For Uber or Lyft, usage would be booking a trip through one of their apps. For GrubHub, it’s ordering some food. For AirBnB, it’s booking a stay. With these companies, a purchase is at their core, be it taking a trip or ordering dinner — their revenues are tied to their users’ purchase behavior.

For other companies, a purchase is not central to the business model, and the business is more interested in “engagement” with the platform. Facebook and Twitter are examples of this - are you visiting their sites every day? Are you performing some action on them - maybe a “like” on Facebook or a “favorite” on a tweet?

When building a cohort analysis, it’s important to consider how the event or interaction you’re tracking relates to your business model.

Why is it valuable?

Cohort analysis can be helpful when it comes to understanding your business’ health and “stickiness” - the loyalty of your customers. Stickiness is critical since it’s far cheaper and easier to keep a current customer than to acquire a new one. For startups, it’s also a key indicator of product-market fit.

Additionally, your product evolves over time. New features are added and removed, the design changes, etc. Observing individual groups over time is a starting point to understanding how these changes affect user behavior.

It’s also a good way to visualize your user retention/churn, as well as to form a basic understanding of your customers’ lifetime value.

Cohort Analysis (Retention over User & Product Lifetime)

A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves across time and compare it to other cohorts. Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare differences across subjects) at intervals through time.

**Types of cohorts:**

  • Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers’ behavior depending on when they started using the company’s products or services. The time frame may be daily, monthly, or quarterly.
  • Behavior cohorts are customers who purchased a product or subscribed to a service in the past. They group customers by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
  • Size cohorts refer to the various sizes of the customers who purchase a company’s products or services. This categorization can be based on the amount of spending in some period after acquisition, or on the product category the customer spent most of their order amount on in some period of time. (A minimal labeling sketch for all three types follows this list.)
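
As a rough illustration only, here is a minimal sketch of how the three cohort types could be labeled on a transactions table. The DataFrame and its column names (customer_id, signup_date, plan, order_value) are hypothetical and are not part of the dataset analyzed below.

import pandas as pd

# Hypothetical transactions table (illustrative column names only)
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 3],
    'signup_date': pd.to_datetime(['2011-01-05', '2011-01-05', '2011-03-20', '2011-06-11']),
    'plan':        ['basic', 'basic', 'advanced', 'basic'],
    'order_value': [20.0, 35.0, 120.0, 15.0],
})

# Time cohort: the month in which the customer signed up
tx['time_cohort'] = tx['signup_date'].dt.to_period('M')

# Behavior cohort: the type of product/service the customer signed up for
tx['behavior_cohort'] = tx['plan']

# Size cohort: bucket each customer's total spend after acquisition
spend = tx.groupby('customer_id')['order_value'].sum()
size_cohort = pd.cut(spend, bins=[0, 50, 100, float('inf')],
                     labels=['small', 'medium', 'large'])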

**Import Libraries and Dataset**

# import library
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt

#For Data  Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#For Machine Learning Algorithm
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_excel("Online Retail.xlsx")


**Explore + Clean the Data**

Variables Description

**InvoiceNo**: Invoice number. Nominal, a six-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.

**StockCode**: Product (item) code. Nominal, a five-digit integral number uniquely assigned to each distinct product.

**Description**: Product (item) name. Nominal.

**Quantity**: The quantities of each product (item) per transaction. Numeric.

**InvoiceDate**: Invoice date and time. Numeric, the day and time when each transaction was generated.

**UnitPrice**: Unit price. Numeric, product price per unit in sterling.

**CustomerID**: Customer number. Nominal, a five-digit integral number uniquely assigned to each customer.

**Country**: Country name. Nominal, the name of the country where each customer resides.

df.head(5)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

Note that there are missing values in the Description and CustomerID columns; let’s check that.

Check and Clean Missing Data

df.isnull().sum()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
df= df.dropna(subset=['CustomerID'])
df.isnull().sum().sum()
0

Check & Clean Duplicate Data

df.duplicated().sum()

5225
df = df.drop_duplicates()
df.duplicated().sum()

0
df.describe()
Quantity UnitPrice CustomerID
count 401604.000000 401604.000000 401604.000000
mean 12.183273 3.474064 15281.160818
std 250.283037 69.764035 1714.006089
min -80995.000000 0.000000 12346.000000
25% 2.000000 1.250000 13939.000000
50% 5.000000 1.950000 15145.000000
75% 12.000000 3.750000 16784.000000
max 80995.000000 38970.000000 18287.000000

Note that the minimum UnitPrice is 0 and the minimum Quantity is negative; let’s keep only rows with positive values.

df=df[(df['Quantity']>0) & (df['UnitPrice']>0)]
df.describe() 
Quantity UnitPrice CustomerID
count 392692.000000 392692.000000 392692.000000
mean 13.119702 3.125914 15287.843865
std 180.492832 22.241836 1713.539549
min 1.000000 0.001000 12346.000000
25% 2.000000 1.250000 13955.000000
50% 6.000000 1.950000 15150.000000
75% 12.000000 3.750000 16791.000000
max 80995.000000 8142.750000 18287.000000
df.shape
(392692, 8)
# Let's create a feature with total cost of the transactions
df['Total_cost'] = df.Quantity * df.UnitPrice

EDA

Now let’s do some Exploratory Data Analysis on the processed dataset

# Check the oldest and latest date in the dataset.
print(f'Oldest date is - {df.InvoiceDate.min()}\n')
print(f'Latest date is - {df.InvoiceDate.max()}')
Oldest date is - 2010-12-01 08:26:00

Latest date is - 2011-12-09 12:50:00
 # Count of transactions in different years
df.InvoiceDate.dt.year.value_counts(sort=False).plot(kind='bar', rot=45);

png

# Let's visualize some top products from the whole range.
top_products = df['Description'].value_counts()[:20]
plt.figure(figsize=(10,6))
sns.set_context("paper", font_scale=1.5)
sns.barplot(y = top_products.index,
            x = top_products.values)
plt.title("Top selling products")
plt.show();

png

 #Count of transactions in different months within 2011 year.
df[df.InvoiceDate.dt.year==2011].InvoiceDate.dt.month.value_counts(sort=False).plot(kind='bar');

png

An increasing month-over-month pattern can be observed, with a sharp decline in December. This is expected: only the first 8-9 days of December 2011 are in the dataset, i.e. around 70% of that month’s transactions are not included. Taking this into account, the sales figures look legitimate.

# Let's visualize the top grossing months
monthly_gross = df[df.InvoiceDate.dt.year==2011].groupby(df.InvoiceDate.dt.month).Total_cost.sum()
plt.figure(figsize=(10,5))
sns.lineplot(y=monthly_gross.values,x=monthly_gross.index, marker='o');
plt.xticks(range(1,13))
plt.show()

png

# Let's visualize the Unit price distribution
plt.figure(figsize=(16,4))
sns.boxplot(y='UnitPrice', data=df, orient='h');

png

Let’s Build the Cohort Analysis

For cohort analysis, there are a few labels that we have to create:

  • Invoice period: A string representation of the year and month of a single transaction/invoice.
  • Cohort group: A string representation of the year and month of a customer’s first purchase. This label is common across all invoices for a particular customer.
  • Cohort period / Cohort index: An integer representation of a customer’s stage in their “lifetime”. The number represents the number of months since the first purchase, counting the month of the first purchase as 1.
def get_month(x) : return dt.datetime(x.year,x.month,1)
df['InvoiceMonth'] = df['InvoiceDate'].apply(get_month)
grouping = df.groupby('CustomerID')['InvoiceMonth']
df['CohortMonth'] = grouping.transform('min')
df.tail()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth CohortIndex
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 2011-12-09 12:50:00 0.85 12680.0 France 2011-12-01 2011-08-01 5
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011-12-09 12:50:00 2.10 12680.0 France 2011-12-01 2011-08-01 5
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011-12-09 12:50:00 4.15 12680.0 France 2011-12-01 2011-08-01 5
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011-12-09 12:50:00 4.15 12680.0 France 2011-12-01 2011-08-01 5
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011-12-09 12:50:00 4.95 12680.0 France 2011-12-01 2011-08-01 5

Calculate time offset in months

Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.

First, we will create some variables that capture the integer value of years and months for the invoice and cohort dates using the get_month_int() function below.

def get_month_int (dframe,column):
    year = dframe[column].dt.year
    month = dframe[column].dt.month
    day = dframe[column].dt.day
    return year, month , day 

# Get the integers for date parts from the `InvoiceMonth` column

invoice_year,invoice_month,_ = get_month_int(df,'InvoiceMonth')

# Get the integers for date parts from the `CohortMonth` column

cohort_year,cohort_month,_ = get_month_int(df,'CohortMonth')

# Calculate difference in months

year_diff = invoice_year - cohort_year 
month_diff = invoice_month - cohort_month 

# Calculate the cohort index: months since the cohort month, plus 1 so that the
# first-purchase month is index 1 (a worked example follows)

df['CohortIndex'] = year_diff * 12 + month_diff + 1 
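
For example (numbers chosen for illustration), an invoice from April 2011 for a customer whose CohortMonth is December 2010 gets:

year_diff = 2011 - 2010            # = 1
month_diff = 4 - 12                # = -8
cohort_index = 1 * 12 + (-8) + 1   # = 5, i.e. the 5th month of that customer's lifetime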

Calculate retention rate

Customer retention is a very useful metric for understanding how many of all your customers are still active. It gives you the percentage of active customers relative to the total number of customers.

#Count monthly active customers from each cohort
grouping = df.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)
# Return number of unique elements in the object.
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='CohortMonth',columns='CohortIndex',values='CustomerID')
cohort_counts

CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 948.0 362.0 317.0 367.0 341.0 376.0 360.0 336.0 336.0 374.0 354.0 474.0 260.0
2011-01-01 421.0 101.0 119.0 102.0 138.0 126.0 110.0 108.0 131.0 146.0 155.0 63.0 NaN
2011-02-01 380.0 94.0 73.0 106.0 102.0 94.0 97.0 107.0 98.0 119.0 35.0 NaN NaN
2011-03-01 440.0 84.0 112.0 96.0 102.0 78.0 116.0 105.0 127.0 39.0 NaN NaN NaN
2011-04-01 299.0 68.0 66.0 63.0 62.0 71.0 69.0 78.0 25.0 NaN NaN NaN NaN
2011-05-01 279.0 66.0 48.0 48.0 60.0 68.0 74.0 29.0 NaN NaN NaN NaN NaN
2011-06-01 235.0 49.0 44.0 64.0 58.0 79.0 24.0 NaN NaN NaN NaN NaN NaN
2011-07-01 191.0 40.0 39.0 44.0 52.0 22.0 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 167.0 42.0 42.0 42.0 23.0 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 298.0 89.0 97.0 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 352.0 93.0 46.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 321.0 43.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 41.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

**Retention Rate Table**

# Retention table
cohort_size = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_size,axis=0) #axis=0 to ensure the divide along the row axis 
retention.round(3) * 100 #to show the number as percentage 
CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 100.0 36.6 32.3 38.4 36.3 39.8 36.3 34.9 35.4 39.5 37.4 50.3 26.6
2011-01-01 100.0 22.1 26.6 23.0 32.1 28.8 24.7 24.2 30.0 32.6 36.5 11.8 NaN
2011-02-01 100.0 18.7 18.7 28.4 27.1 24.7 25.3 27.9 24.7 30.5 6.8 NaN NaN
2011-03-01 100.0 15.0 25.2 19.9 22.3 16.8 26.8 23.0 27.9 8.6 NaN NaN NaN
2011-04-01 100.0 21.3 20.3 21.0 19.7 22.7 21.7 26.0 7.3 NaN NaN NaN NaN
2011-05-01 100.0 19.0 17.3 17.3 20.8 23.2 26.4 9.5 NaN NaN NaN NaN NaN
2011-06-01 100.0 17.4 15.7 26.4 23.1 33.5 9.5 NaN NaN NaN NaN NaN NaN
2011-07-01 100.0 18.1 20.7 22.3 27.1 11.2 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 100.0 20.7 24.9 24.3 12.4 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 100.0 23.4 30.1 11.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 100.0 24.0 11.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 100.0 11.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#Build the heatmap
plt.figure(figsize=(15, 8))
plt.title('Retention rates')
sns.heatmap(data=retention,annot = True,fmt = '.0%',vmin = 0.0,vmax = 0.5,cmap="BuPu_r")
plt.show()

png

**Note that: the heatmap shows, for each cohort, the percentage of its original customers who are still active in each subsequent month.**

**Average quantity for each cohort**

#Average quantity for each cohort
grouping = df.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['Quantity'].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index='CohortMonth',columns='CohortIndex',values='Quantity')
average_quantity = average_quantity.round(1)
average_quantity.index = average_quantity.index.date

#Build the heatmap
plt.figure(figsize=(15, 8))
plt.title('Average quantity for each cohort')
sns.heatmap(data=average_quantity,annot = True,vmin = 0.0,vmax =20,cmap="BuGn_r")
plt.show()

png


Finally, let’s look at the average retention rate across all cohorts for each cohort index:

# Average retention rate (over all cohorts) for each cohort index
ax = retention.mean().plot(figsize=(11,6), marker='s')
plt.title("Average retention rate per CohortIndex", fontsize=20, fontweight='bold')

plt.xticks(np.arange(1, 14, 1), fontsize=10)
plt.yticks(np.arange(0, 1.1, 0.1), fontsize=10)
ax.set_xlabel("CohortIndex", fontsize=10)
ax.set_ylabel("Retention rate", fontsize=10)
plt.show()

Recency, Frequency and Monetary Value calculation

**What is RFM?**

  • RFM is an acronym for recency, frequency and monetary value. Recency measures when a customer last ordered: the number of days since their most recent purchase. For a website or an app, this could instead be interpreted as the last visit day or the last login time.

  • Frequency is the number of purchases in a given period, e.g. 3 months, 6 months or 1 year, so it tells us how often a customer used the company’s product. The bigger the value, the more engaged the customer. Does that alone make them a VIP? Not necessarily, because we also have to consider how much they actually paid per purchase, which is the monetary value.

  • Monetary value is the total amount of money a customer spent in that given period. It is what differentiates big spenders (MVPs or VIPs) from other customers.

**The RFM values can be grouped in several ways:**

1. Percentiles, e.g. quantiles

2. Pareto 80/20 cut

3. Custom, based on business knowledge

We are going to implement percentile-based grouping.

Process of calculating percentiles:

  1. Sort customers based on that metric
  2. Break customers into a pre-defined number of groups of equal size
  3. Assign a label to each group (a toy sketch using pd.qcut follows this list)
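
That three-step process is exactly what pandas’ qcut does under the hood; here is a tiny, purely illustrative sketch on made-up numbers:

recency = pd.Series([2, 19, 75, 310, 326])          # made-up recency values in days
# qcut sorts the values, splits them into 4 equal-sized groups and assigns one label per group
labels = pd.qcut(recency, q=4, labels=range(4, 0, -1))
print(labels.tolist())                              # the most recent customers get the highest label
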
#New Total Sum column (identical to the Total_cost column created earlier)
df['TotalSum'] = df['UnitPrice']* df['Quantity']

#Data preparation steps
print('Min Invoice Date:',df.InvoiceDate.dt.date.min(),'max Invoice Date:',
       df.InvoiceDate.dt.date.max())

df.head(3)
Min Invoice Date: 2010-12-01 max Invoice Date: 2011-12-09
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth CohortIndex Total_cost TotalSum
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 2010-12-01 2010-12-01 1 15.30 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1 20.34 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 2010-12-01 2010-12-01 1 22.00 22.00

In the real world, we would be working with the most recent snapshot of the data, as of today or yesterday.

snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)
snapshot_date
#The last day of purchase in the data is 09 Dec 2011. To calculate recency in days,
#let's set one day after that, 10 Dec 2011, as the snapshot_date.
#We will count the difference in days relative to snapshot_date.

Timestamp('2011-12-10 12:50:00')
# Calculate RFM metrics
rfm = df.groupby(['CustomerID']).agg({'InvoiceDate': lambda x : (snapshot_date - x.max()).days,
                                      'InvoiceNo':'count','TotalSum': 'sum'})
#The lambda gives the number of days between the hypothetical 'today' (snapshot_date) and a customer's last transaction

#Rename columns
rfm.rename(columns={'InvoiceDate':'Recency','InvoiceNo':'Frequency','TotalSum':'MonetaryValue'}
           ,inplace= True)

#Final RFM values
rfm.head()

Recency Frequency MonetaryValue
CustomerID
12346.0 326 1 77183.60
12347.0 2 182 4310.00
12348.0 75 31 1797.24
12349.0 19 73 1757.55
12350.0 310 17 334.40

**Note that:**

**We rate Recency so that customers who have been active more recently get a higher score than less recent customers, because every company wants its customers to be recent.**

**We rate Frequency and MonetaryValue so that higher values get higher scores, because we want customers to spend more and visit more often (note that this label order is the opposite of Recency’s).**

#Building RFM segments
r_labels =range(4,0,-1)
f_labels=range(1,5)
m_labels=range(1,5)
r_quartiles = pd.qcut(rfm['Recency'], q=4, labels = r_labels)
f_quartiles = pd.qcut(rfm['Frequency'],q=4, labels = f_labels)
m_quartiles = pd.qcut(rfm['MonetaryValue'],q=4,labels = m_labels)
rfm = rfm.assign(R=r_quartiles,F=f_quartiles,M=m_quartiles)

# Build RFM Segment and RFM Score
def add_rfm(x) : return str(x['R']) + str(x['F']) + str(x['M'])
rfm['RFM_Segment'] = rfm.apply(add_rfm,axis=1 )
rfm['RFM_Score'] = rfm[['R','F','M']].sum(axis=1)

rfm.head()
Recency Frequency MonetaryValue R F M RFM_Segment RFM_Score
CustomerID
12346.0 326 1 77183.60 1 1 4 114 6.0
12347.0 2 182 4310.00 4 4 4 444 12.0
12348.0 75 31 1797.24 2 2 4 224 8.0
12349.0 19 73 1757.55 3 3 4 334 10.0
12350.0 310 17 334.40 1 1 2 112 4.0

**The result is a table with one row per customer, showing their RFM values, quartile labels, segment, and score.**

Analyzing RFM Segments

**Largest RFM segments:** it is always best practice to investigate the size of the segments before you use them for targeting or other business applications.

rfm.groupby(['RFM_Segment']).size().sort_values(ascending=False)[:5]
RFM_Segment
444    450
111    381
344    217
122    206
211    179
dtype: int64

**Filtering on RFM segments**

#Select bottom RFM segment "111" and view top 5 rows
rfm[rfm['RFM_Segment']=='111'].head()
Recency Frequency MonetaryValue R F M RFM_Segment RFM_Score
CustomerID
12353.0 204 4 89.00 1 1 1 111 3.0
12361.0 287 10 189.90 1 1 1 111 3.0
12401.0 303 5 84.30 1 1 1 111 3.0
12402.0 323 11 225.60 1 1 1 111 3.0
12441.0 367 11 173.55 1 1 1 111 3.0

**Summary metrics per RFM Score**

rfm.groupby('RFM_Score').agg({'Recency': 'mean','Frequency': 'mean',
                             'MonetaryValue': ['mean', 'count'] }).round(1)


Recency Frequency MonetaryValue
mean mean mean count
RFM_Score
3.0 260.7 8.2 157.4 381
4.0 177.2 13.6 240.0 388
5.0 152.9 21.2 366.6 518
6.0 95.9 27.9 820.8 457
7.0 79.6 38.0 758.1 463
8.0 64.1 56.0 987.3 454
9.0 45.9 78.7 1795.1 414
10.0 32.4 110.5 2056.4 426
11.0 21.3 186.9 4062.0 387
12.0 7.2 367.8 9285.9 450

Use RFM score to group customers into Gold, Silver and Bronze segments:

def segments(df):
    if df['RFM_Score'] > 9:
        return 'Gold'
    elif (df['RFM_Score'] > 5) and (df['RFM_Score'] <= 9):
        return 'Silver'
    else:
        return 'Bronze'

rfm['General_Segment'] = rfm.apply(segments,axis=1)

rfm.groupby('General_Segment').agg({'Recency':'mean','Frequency':'mean',
                                    'MonetaryValue':['mean','count']}).round(1)

Recency Frequency MonetaryValue
mean mean mean count
General_Segment
Bronze 192.2 15.1 266.5 1287
Gold 20.1 225.6 5246.8 1263
Silver 72.0 49.4 1072.4 1788

Data Pre-Processing for Kmeans Clustering

K-Means clustering is an unsupervised learning algorithm that forms groups based on the distance between points. Two related distance concepts are the Within-Cluster Sum of Squares (WSS) and the Between-Cluster Sum of Squares (BSS).

We should check these key k-means assumptions before fitting our K-Means clustering model:

  • Symmetric distribution of variables (not skewed)
  • Variables with same average values
  • Variables with same variance
rfm_rfm = rfm[['Recency','Frequency','MonetaryValue']]
print(rfm_rfm.describe())


           Recency    Frequency  MonetaryValue
count  4338.000000  4338.000000    4338.000000
mean     92.536422    90.523744    2048.688081
std     100.014169   225.506968    8985.230220
min       1.000000     1.000000       3.750000
25%      18.000000    17.000000     306.482500
50%      51.000000    41.000000     668.570000
75%     142.000000    98.000000    1660.597500
max     374.000000  7676.000000  280206.020000

From this table, we find a problem: the variables do not have the same mean or variance (a quick check is sketched below).

Solution: scale the variables using a scaler from the scikit-learn library.
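
A quick way to quantify both the scale differences and the skewness (a minimal check using the rfm_rfm frame from above):

# Compare the scale and skewness of the three RFM variables
print(rfm_rfm.agg(['mean', 'std', 'skew']).round(1))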

# plot the distribution of RFM values
f,ax = plt.subplots(figsize=(10, 12))
plt.subplot(3, 1, 1); sns.distplot(rfm.Recency, label = 'Recency')
plt.subplot(3, 1, 2); sns.distplot(rfm.Frequency, label = 'Frequency')
plt.subplot(3, 1, 3); sns.distplot(rfm.MonetaryValue, label = 'Monetary Value')
plt.style.use('fivethirtyeight')
plt.tight_layout()
plt.show()


png

There is also another problem: the distributions of the variables are not symmetric (the data is skewed).

Solution: a logarithmic transformation (positive values only) will manage the skewness.

**We use this sequence of pre-processing steps:**

1. Unskew the data with a log transformation

2. Standardize to the same average values

3. Scale to the same standard deviation

4. Store as a separate array to be used for clustering

Why does the sequence matter?

  • Log transformation only works with positive data
  • Standardization produces negative values, on which the log will not work, so the log transformation must come first
#Unskew the data with log transformation
rfm_log = rfm[['Recency', 'Frequency', 'MonetaryValue']].apply(np.log, axis = 1).round(3)
#or rfm_log = np.log(rfm_rfm)


# plot the distribution of RFM values
f,ax = plt.subplots(figsize=(10, 12))
plt.subplot(3, 1, 1); sns.distplot(rfm_log.Recency, label = 'Recency')
plt.subplot(3, 1, 2); sns.distplot(rfm_log.Frequency, label = 'Frequency')
plt.subplot(3, 1, 3); sns.distplot(rfm_log.MonetaryValue, label = 'Monetary Value')
plt.style.use('fivethirtyeight')
plt.tight_layout()
plt.show()


png

Implementation of K-Means Clustering

Key steps

  1. Data pre-processing
  2. Choosing a number of clusters
  3. Running k-means clustering on pre-processed data
  4. Analyzing average RFM values of each cluster

**1. Data Pre-Processing**

#Normalize the variables with StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(rfm_log)
#Store it separately for clustering
rfm_normalized= scaler.transform(rfm_log)
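
As a quick, optional sanity check, each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# scaler.transform returns a NumPy array; wrap it to inspect per-column statistics
check = pd.DataFrame(rfm_normalized, columns=['Recency', 'Frequency', 'MonetaryValue'])
print(check.describe().round(2).loc[['mean', 'std']])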

2. Choosing a Number of Clusters

Methods to define the number of clusters

  • Visual methods - elbow criterion
  • Mathematical methods - silhouette coefficient
  • Experimentation and interpretation

**Elbow criterion method**

  • Plot the number of clusters against within-cluster sum-of-squared-errors (SSE) - sum of squared distances from every data point to their cluster center
  • Identify an “elbow” in the plot
  • Elbow - a point representing an “optimal” number of clusters
from sklearn.cluster import KMeans

#First : Get the Best KMeans 
ks = range(1,8)
inertias=[]
for k in ks :
    # Create a KMeans clusters
    kc = KMeans(n_clusters=k,random_state=1)
    kc.fit(rfm_normalized)
    inertias.append(kc.inertia_)

# Plot ks vs inertias
f, ax = plt.subplots(figsize=(15, 8))
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.style.use('ggplot')
plt.title('What is the Best Number for KMeans ?')
plt.show()


png

**Note that: based on the elbow in the plot above, we choose k = 3 (a silhouette cross-check is sketched below).**
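
The method list above also mentions the silhouette coefficient; as a rough cross-check of the elbow choice, here is a minimal sketch (scores range from -1 to 1, higher is better), reusing rfm_normalized from the pre-processing step:

from sklearn.metrics import silhouette_score

# Compare silhouette scores for a range of k values
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(rfm_normalized)
    print(k, round(silhouette_score(rfm_normalized, labels), 3))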

# clustering
kc = KMeans(n_clusters= 3, random_state=1)
kc.fit(rfm_normalized)

#Create a cluster label column in the original DataFrame
cluster_labels = kc.labels_

#Calculate average RFM values and size for each cluster:
rfm_rfm_k3 = rfm_rfm.assign(K_Cluster = cluster_labels)

#Calculate average RFM values and sizes for each cluster:
rfm_rfm_k3.groupby('K_Cluster').agg({'Recency': 'mean','Frequency': 'mean',
                                         'MonetaryValue': ['mean', 'count'],}).round(0)

Recency Frequency MonetaryValue
mean mean mean count
K_Cluster
0 171.0 15.0 293.0 1527
1 13.0 260.0 6574.0 953
2 69.0 65.0 1170.0 1858

Snake plots to understand and compare segments

  • Market research technique to compare different segments
  • Visual representation of each segment’s attributes
  • Need to first normalize data (center & scale)
  • Plot each cluster’s average normalized values of each attribute
rfm_normalized = pd.DataFrame(rfm_normalized,index=rfm_rfm.index,columns=rfm_rfm.columns)
rfm_normalized['K_Cluster'] = kc.labels_
rfm_normalized['General_Segment'] = rfm['General_Segment']
rfm_normalized.reset_index(inplace = True)

#Melt the data into a long format so RFM values and metric names are stored in 1 column each
rfm_melt = pd.melt(rfm_normalized,id_vars=['CustomerID','General_Segment','K_Cluster'],value_vars=['Recency', 'Frequency', 'MonetaryValue'],
var_name='Metric',value_name='Value')
rfm_melt.head()

CustomerID General_Segment K_Cluster Metric Value
0 12346.0 Silver 2 Recency 1.409982
1 12347.0 Gold 1 Recency -2.146578
2 12348.0 Silver 2 Recency 0.383648
3 12349.0 Gold 2 Recency -0.574961
4 12350.0 Bronze 0 Recency 1.375072

Snake Plot and Heatmap

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))

# a snake plot with the RFM-quantile segments
sns.lineplot(x='Metric', y='Value', hue='General_Segment', data=rfm_melt, ax=ax1)

# a snake plot with the K-Means clusters
sns.lineplot(x='Metric', y='Value', hue='K_Cluster', data=rfm_melt, ax=ax2)

plt.suptitle("Snake Plot of RFM", fontsize=24)  # overall title for both subplots
plt.show()


png

**Relative importance of segment attributes**

  • Useful technique to identify relative importance of each segment’s attribute
  • Calculate average values of each cluster
  • Calculate average values of population
  • Calculate importance score by dividing them and subtracting 1 (ensures 0 is returned when cluster average equals population average)

**Let’s try again with a heat map. Heat maps are a graphical representation of data in which larger values are shown in darker shades and smaller values in lighter shades, so we can compare the variation between groups quite intuitively by color.**

# The further a ratio is from 0, the more important that attribute is for a segment relative to the total population
cluster_avg = rfm_rfm_k3.groupby(['K_Cluster']).mean()
population_avg = rfm_rfm.mean()
relative_imp = cluster_avg / population_avg - 1
relative_imp.round(2)


Recency Frequency MonetaryValue
K_Cluster
0 0.85 -0.84 -0.86
1 -0.86 1.88 2.21
2 -0.25 -0.28 -0.43
# the mean value in total 
total_avg = rfm.iloc[:, 0:3].mean()
# calculate the proportional gap with total mean
cluster_avg = rfm.groupby('General_Segment').mean().iloc[:, 0:3]
prop_rfm = cluster_avg/total_avg - 1
prop_rfm.round(2)

Recency Frequency MonetaryValue
General_Segment
Bronze 1.08 -0.83 -0.87
Gold -0.78 1.49 1.56
Silver -0.22 -0.45 -0.48
# heatmaps of relative importance
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='Blues', ax=ax1)
ax1.set(title="Heatmap of K-Means")

# a heatmap of the RFM-quantile segments
sns.heatmap(prop_rfm, cmap='Oranges', fmt='.2f', annot=True, ax=ax2)
ax2.set(title="Heatmap of RFM quantile")

plt.suptitle("Heat Map of RFM", fontsize=20)  # overall title for both subplots

plt.show()


png

You can also extend the RFM data by adding a Tenure variable. Tenure is the time since the first transaction; it defines how long the customer has been with the company (a minimal sketch follows).
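
A minimal sketch of how Tenure could be added, reusing df and snapshot_date from earlier (the column name Tenure is our own choice):

# Tenure: days between a customer's first purchase and the snapshot date
first_purchase = df.groupby('CustomerID')['InvoiceDate'].min()
rfm['Tenure'] = (snapshot_date - first_purchase).dt.days
rfm.head()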

**Conclusion: we showed how to compute RFM values from customer purchase data, and we built two kinds of segmentation: one with RFM quantiles and one with K-Means clustering.**
