Customer Cohort Analysis
Cohort Analysis with Python
What is cohort analysis? A cohort is a group of users who share something in common, be it their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping you spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor your customer and revenue retention.
It’s common for cohorts to be created based on a customer’s first usage of the platform, where “usage” is dependent on your business’ key metrics. For Uber or Lyft, usage would be booking a trip through one of their apps. For GrubHub, it’s ordering some food. For AirBnB, it’s booking a stay. With these companies, a purchase is at their core, be it taking a trip or ordering dinner — their revenues are tied to their users’ purchase behavior.
In others, a purchase is not central to the business model and the business is more interested in “engagement” with the platform. Facebook and Twitter are examples of this - are you visiting their sites every day? Are you performing some action on them - maybe a “like” on Facebook or a “favorite” on a tweet?
When building a cohort analysis, it’s important to consider the relationship between the event or interaction you’re tracking and your business model.
Why is it valuable?
Cohort analysis can be helpful when it comes to understanding your business’ health and “stickiness” - the loyalty of your customers. Stickiness is critical since it’s far cheaper and easier to keep a current customer than to acquire a new one. For startups, it’s also a key indicator of product-market fit.
Additionally, your product evolves over time. New features are added and removed, the design changes, etc. Observing individual groups over time is a starting point to understanding how these changes affect user behavior.
It’s also a good way to visualize your user retention/churn as well as to form a basic understanding of their lifetime value.
Cohort Analysis (Retention over User & Product Lifetime)
A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves across time and compare it to other cohorts. Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare difference across subjects) at intervals through time.
**Types of cohorts:**
- Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows how customers behave depending on when they started using the company’s products or services. The time granularity may be daily, monthly, or quarterly.
- Behavior cohorts are customers who purchased a product or subscribed to a service in the past. They group customers by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
- Size cohorts refer to groupings of customers by how much they spend: for example, the amount spent in some period after acquisition, or the product category on which the customer spent most of their order value in a given period. A toy sketch of how all three cohort types could be labelled is shown below.
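To make the three cohort types concrete, here is a minimal sketch on a hypothetical customer table (the column names are illustrative and not part of the dataset used below):
# Toy example of labelling the three cohort types (hypothetical data)
import pandas as pd
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'SignupDate': pd.to_datetime(['2011-01-15', '2011-01-20', '2011-03-02', '2011-03-30']),
    'Plan': ['basic', 'advanced', 'basic', 'advanced'],
    'FirstMonthSpend': [20.0, 250.0, 75.0, 900.0],
})
customers['TimeCohort'] = customers['SignupDate'].dt.to_period('M')   # time cohort: signup month
customers['BehaviorCohort'] = customers['Plan']                       # behavior cohort: service tier
customers['SizeCohort'] = pd.qcut(customers['FirstMonthSpend'], q=2, labels=['low', 'high'])  # size cohort: spend bucket
print(customers)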
**Import Libraries and Dataset**
# import library
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
#For Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
#For Machine Learning Algorithm
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
df = pd.read_excel("Online Retail.xlsx")
**Explore + Clean the Data**
Variable descriptions:
- **InvoiceNo**: Invoice number. Nominal, a six-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘C’, it indicates a cancellation.
- **StockCode**: Product (item) code. Nominal, a five-digit integral number uniquely assigned to each distinct product.
- **Description**: Product (item) name. Nominal.
- **Quantity**: The quantities of each product (item) per transaction. Numeric.
- **InvoiceDate**: Invoice date and time. The day and time when each transaction was generated.
- **UnitPrice**: Unit price. Numeric, product price per unit in sterling.
- **CustomerID**: Customer number. Nominal, a five-digit integral number uniquely assigned to each customer.
- **Country**: Country name. Nominal, the name of the country where each customer resides.
df.head(5)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null datetime64[ns]
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
Note that there are missing values in the Description and CustomerID columns; let’s check them.
Check and Clean Missing Data
df.isnull().sum()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
df= df.dropna(subset=['CustomerID'])
df.isnull().sum().sum()
0
Check & Clean Duplicate Data
df.duplicated().sum()
5225
df = df.drop_duplicates()
df.duplicated().sum()
0
df.describe()
Quantity | UnitPrice | CustomerID | |
---|---|---|---|
count | 401604.000000 | 401604.000000 | 401604.000000 |
mean | 12.183273 | 3.474064 | 15281.160818 |
std | 250.283037 | 69.764035 | 1714.006089 |
min | -80995.000000 | 0.000000 | 12346.000000 |
25% | 2.000000 | 1.250000 | 13939.000000 |
50% | 5.000000 | 1.950000 | 15145.000000 |
75% | 12.000000 | 3.750000 | 16784.000000 |
max | 80995.000000 | 38970.000000 | 18287.000000 |
Note that the minimum UnitPrice is 0 and the minimum Quantity is negative, so we will keep only rows with positive quantity and price.
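Before dropping these rows, it can be worth checking how many there are and whether the negative quantities line up with cancellation invoices (InvoiceNo starting with ‘C’, as described above); a quick sketch:
# How many rows have a non-positive quantity or a zero price?
print(((df['Quantity'] <= 0) | (df['UnitPrice'] <= 0)).sum())
# What share of negative-quantity rows belong to cancellation invoices (InvoiceNo starting with 'C')?
print(df.loc[df['Quantity'] <= 0, 'InvoiceNo'].astype(str).str.startswith('C').mean())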
df=df[(df['Quantity']>0) & (df['UnitPrice']>0)]
df.describe()
Quantity | UnitPrice | CustomerID | |
---|---|---|---|
count | 392692.000000 | 392692.000000 | 392692.000000 |
mean | 13.119702 | 3.125914 | 15287.843865 |
std | 180.492832 | 22.241836 | 1713.539549 |
min | 1.000000 | 0.001000 | 12346.000000 |
25% | 2.000000 | 1.250000 | 13955.000000 |
50% | 6.000000 | 1.950000 | 15150.000000 |
75% | 12.000000 | 3.750000 | 16791.000000 |
max | 80995.000000 | 8142.750000 | 18287.000000 |
df.shape
(392692, 8)
# Let's create a feature with total cost of the transactions
df['Total_cost'] = df.Quantity * df.UnitPrice
EDA
Now let’s do some Exploratory Data Analysis on the processed dataset
# Check the oldest and latest date in the dataset.
print(f'Oldest date is - {df.InvoiceDate.min()}\n')
print(f'Latest date is - {df.InvoiceDate.max()}')
Oldest date is - 2010-12-01 08:26:00
Latest date is - 2011-12-09 12:50:00
# Count of transactions in different years
df.InvoiceDate.dt.year.value_counts(sort=False).plot(kind='bar', rot=45);
# Let's visualize some top products from the whole range.
top_products = df['Description'].value_counts()[:20]
plt.figure(figsize=(10,6))
sns.set_context("paper", font_scale=1.5)
sns.barplot(y = top_products.index,
x = top_products.values)
plt.title("Top selling products")
plt.show();
#Count of transactions in different months within 2011 year.
df[df.InvoiceDate.dt.year==2011].InvoiceDate.dt.month.value_counts(sort=False).plot(kind='bar');
An increasing pattern can be observed month over month, with a sharp decline in December. That decline is expected because only the first nine days of December 2011 are available in the dataset, i.e. roughly 70% of that month’s transactions are not included. Given this, the sales figures look legitimate.
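A quick check of how much of December 2011 is actually covered (a sketch):
# Last recorded day and number of distinct transaction days in December 2011
dec_2011 = df[(df.InvoiceDate.dt.year == 2011) & (df.InvoiceDate.dt.month == 12)]
print(dec_2011.InvoiceDate.dt.day.max(), dec_2011.InvoiceDate.dt.day.nunique())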
# Let's visualize the top grossing months
monthly_gross = df[df.InvoiceDate.dt.year==2011].groupby(df.InvoiceDate.dt.month).Total_cost.sum()
plt.figure(figsize=(10,5))
sns.lineplot(y=monthly_gross.values,x=monthly_gross.index, marker='o');
plt.xticks(range(1,13))
plt.show()
# Let's visualize the Unit price distribution
plt.figure(figsize=(16,4))
sns.boxplot(x='UnitPrice', data=df);  # a single numeric variable on x draws a horizontal boxplot
Let’s Build the Cohort Analysis
For cohort analysis, there are a few labels that we have to create:
- Invoice period: A string representation of the year and month of a single transaction/invoice.
- Cohort group: A string representation of the year and month of a customer’s first purchase. This label is common across all invoices for a particular customer.
- Cohort period / Cohort index: An integer representation of a customer’s stage in their “lifetime”. The number represents the number of months passed since the first purchase, starting at 1 for the month of the first purchase.
def get_month(x) : return dt.datetime(x.year,x.month,1)
df['InvoiceMonth'] = df['InvoiceDate'].apply(get_month)
grouping = df.groupby('CustomerID')['InvoiceMonth']
df['CohortMonth'] = grouping.transform('min')
df.tail()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | CohortIndex | |
---|---|---|---|---|---|---|---|---|---|---|---|
541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680.0 | France | 2011-12-01 | 2011-08-01 | 5 |
541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680.0 | France | 2011-12-01 | 2011-08-01 | 5 |
541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France | 2011-12-01 | 2011-08-01 | 5 |
541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France | 2011-12-01 | 2011-08-01 | 5 |
541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680.0 | France | 2011-12-01 | 2011-08-01 | 5 |
Calculate time offset in months
Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.
First, we will create some variables that capture the integer values of the year and month for the invoice and cohort dates using the get_month_int() function defined below.
def get_month_int(dframe, column):
    year = dframe[column].dt.year
    month = dframe[column].dt.month
    day = dframe[column].dt.day
    return year, month, day
# Get the integers for date parts from the `InvoiceMonth` column
invoice_year,invoice_month,_ = get_month_int(df,'InvoiceMonth')
# Get the integers for date parts from the `CohortMonth` column
cohort_year,cohort_month,_ = get_month_int(df,'CohortMonth')
# Calculate difference in months
year_diff = invoice_year - cohort_year
month_diff = invoice_month - cohort_month
# Cohort index: number of months since the first purchase, starting at 1 for the first month
df['CohortIndex'] = year_diff * 12 + month_diff + 1
Calculate retention rate
Customer retention is a very useful metric to understand how many of all the customers are still active. It gives you the percentage of active customers compared to the total number of customers
#Count monthly active customers from each cohort
grouping = df.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)
# Return number of unique elements in the object.
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='CohortMonth',columns='CohortIndex',values='CustomerID')
cohort_counts
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortMonth | |||||||||||||
2010-12-01 | 948.0 | 362.0 | 317.0 | 367.0 | 341.0 | 376.0 | 360.0 | 336.0 | 336.0 | 374.0 | 354.0 | 474.0 | 260.0 |
2011-01-01 | 421.0 | 101.0 | 119.0 | 102.0 | 138.0 | 126.0 | 110.0 | 108.0 | 131.0 | 146.0 | 155.0 | 63.0 | NaN |
2011-02-01 | 380.0 | 94.0 | 73.0 | 106.0 | 102.0 | 94.0 | 97.0 | 107.0 | 98.0 | 119.0 | 35.0 | NaN | NaN |
2011-03-01 | 440.0 | 84.0 | 112.0 | 96.0 | 102.0 | 78.0 | 116.0 | 105.0 | 127.0 | 39.0 | NaN | NaN | NaN |
2011-04-01 | 299.0 | 68.0 | 66.0 | 63.0 | 62.0 | 71.0 | 69.0 | 78.0 | 25.0 | NaN | NaN | NaN | NaN |
2011-05-01 | 279.0 | 66.0 | 48.0 | 48.0 | 60.0 | 68.0 | 74.0 | 29.0 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 235.0 | 49.0 | 44.0 | 64.0 | 58.0 | 79.0 | 24.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 191.0 | 40.0 | 39.0 | 44.0 | 52.0 | 22.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 167.0 | 42.0 | 42.0 | 42.0 | 23.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 298.0 | 89.0 | 97.0 | 36.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 352.0 | 93.0 | 46.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 321.0 | 43.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 41.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
**Retention Rate Table**
# Retention table
cohort_size = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_size,axis=0) #axis=0 to ensure the divide along the row axis
retention.round(3) * 100 #to show the number as percentage
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortMonth | |||||||||||||
2010-12-01 | 100.0 | 36.6 | 32.3 | 38.4 | 36.3 | 39.8 | 36.3 | 34.9 | 35.4 | 39.5 | 37.4 | 50.3 | 26.6 |
2011-01-01 | 100.0 | 22.1 | 26.6 | 23.0 | 32.1 | 28.8 | 24.7 | 24.2 | 30.0 | 32.6 | 36.5 | 11.8 | NaN |
2011-02-01 | 100.0 | 18.7 | 18.7 | 28.4 | 27.1 | 24.7 | 25.3 | 27.9 | 24.7 | 30.5 | 6.8 | NaN | NaN |
2011-03-01 | 100.0 | 15.0 | 25.2 | 19.9 | 22.3 | 16.8 | 26.8 | 23.0 | 27.9 | 8.6 | NaN | NaN | NaN |
2011-04-01 | 100.0 | 21.3 | 20.3 | 21.0 | 19.7 | 22.7 | 21.7 | 26.0 | 7.3 | NaN | NaN | NaN | NaN |
2011-05-01 | 100.0 | 19.0 | 17.3 | 17.3 | 20.8 | 23.2 | 26.4 | 9.5 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 100.0 | 17.4 | 15.7 | 26.4 | 23.1 | 33.5 | 9.5 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 100.0 | 18.1 | 20.7 | 22.3 | 27.1 | 11.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 100.0 | 20.7 | 24.9 | 24.3 | 12.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 100.0 | 23.4 | 30.1 | 11.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 100.0 | 24.0 | 11.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 100.0 | 11.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 100.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
#Build the heatmap
plt.figure(figsize=(15, 8))
plt.title('Retention rates')
sns.heatmap(data=retention,annot = True,fmt = '.0%',vmin = 0.0,vmax = 0.5,cmap="BuPu_r")
plt.show()
**Note that customer retention is a very useful metric for understanding how many of all the customers are still active: it gives the percentage of active customers relative to the size of each cohort.**
**Average quantity for each cohort**
#Average quantity for each cohort
grouping = df.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['Quantity'].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index='CohortMonth',columns='CohortIndex',values='Quantity')
average_quantity = average_quantity.round(1)
average_quantity.index = average_quantity.index.date
#Build the heatmap
plt.figure(figsize=(15, 8))
plt.title('Average quantity for each cohort')
sns.heatmap(data=average_quantity,annot = True,vmin = 0.0,vmax =20,cmap="BuGn_r")
plt.show()
# Average retention rate per cohort period across all cohorts
# (the original cell referenced an undefined `Retention_rates`; we use the `retention` table computed above)
ax = retention.mean().plot(figsize=(11,6), marker='s')
plt.title("Average retention rate per cohort period", fontsize=20, fontweight='bold')
plt.xticks(np.arange(1, 14, 1), fontsize=10)
plt.yticks(np.arange(0, 1.1, 0.1), fontsize=10)
ax.set_xlabel("CohortIndex", fontsize=10)
ax.set_ylabel("Retention rate", fontsize=10)
plt.show()
Recency, Frequency and Monetary Value calculation
**What is RFM?**
RFM is an acronym for recency, frequency, and monetary value.
- Recency captures when a customer’s last order was: the number of days since the most recent purchase. For a website or an app, this could be interpreted as the last visit day or the last login time.
- Frequency is the number of purchases in a given period, for example 3 months, 6 months, or 1 year, i.e. how often a customer used the company’s product. The bigger the value, the more engaged the customer. Could we call them VIPs? Not necessarily, because we also have to consider how much they actually paid for each purchase, which is where the monetary value comes in.
- Monetary value is the total amount of money a customer spent in that given period, so big spenders can be differentiated from other customers, for example as MVPs or VIPs.
**The RFM values can be grouped in several ways:**
1. Percentiles, e.g. quantiles
2. Pareto 80/20 cut
3. Custom, based on business knowledge
We are going to implement percentile-based grouping.
Process of calculating percentiles:
- Sort customers based on that metric
- Break customers into a pre-defined number of groups of equal size
- Assign a label to each group
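As a minimal sketch of these three steps on a toy series (the values are hypothetical), pd.qcut sorts the values, splits them into equal-sized groups, and attaches the labels:
# Toy example: split 8 hypothetical spend values into labelled quartiles
spend = pd.Series([10, 40, 90, 250, 600, 1200, 3000, 9000])
groups = pd.qcut(spend, q=4, labels=[1, 2, 3, 4])   # 1 = lowest quartile, 4 = highest
print(pd.DataFrame({'spend': spend, 'group': groups}))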
#New Total Sum Column
df['TotalSum'] = df['UnitPrice']* df['Quantity']
#Data preparation steps
print('Min Invoice Date:',df.InvoiceDate.dt.date.min(),'max Invoice Date:',
df.InvoiceDate.dt.date.max())
df.head(3)
Min Invoice Date: 2010-12-01 max Invoice Date: 2011-12-09
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | CohortIndex | Total_cost | TotalSum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 15.30 | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 20.34 | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 22.00 | 22.00 |
In the real world, we would be working with the most recent snapshot of the data, i.e. today’s or yesterday’s; here we simulate that by taking the day after the last invoice date.
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)
snapshot_date
# The last day of purchase in the dataset is 09 Dec 2011. To calculate the day periods,
# we set the snapshot_date to one day after the last one, i.e. 10 Dec 2011,
# and count the difference in days from snapshot_date.
Timestamp('2011-12-10 12:50:00')
# Calculate RFM metrics
rfm = df.groupby(['CustomerID']).agg({'InvoiceDate': lambda x : (snapshot_date - x.max()).days,
'InvoiceNo':'count','TotalSum': 'sum'})
# The lambda gives the number of days between the hypothetical today (snapshot_date) and the last transaction
#Rename columns
rfm.rename(columns={'InvoiceDate':'Recency','InvoiceNo':'Frequency','TotalSum':'MonetaryValue'}
,inplace= True)
#Final RFM values
rfm.head()
Recency | Frequency | MonetaryValue | |
---|---|---|---|
CustomerID | |||
12346.0 | 326 | 1 | 77183.60 |
12347.0 | 2 | 182 | 4310.00 |
12348.0 | 75 | 31 | 1797.24 |
12349.0 | 19 | 73 | 1757.55 |
12350.0 | 310 | 17 | 334.40 |
**Note that:**
- We rate “Recency” so that customers who have been active more recently score better than less recent customers, because every company wants its customers to be recent.
- We rate “Frequency” and “Monetary Value” so that higher values get higher labels, because we want customers to spend more money and visit more often (note that this is the opposite label order to recency).
#Building RFM segments
r_labels =range(4,0,-1)
f_labels=range(1,5)
m_labels=range(1,5)
r_quartiles = pd.qcut(rfm['Recency'], q=4, labels = r_labels)
f_quartiles = pd.qcut(rfm['Frequency'],q=4, labels = f_labels)
m_quartiles = pd.qcut(rfm['MonetaryValue'],q=4,labels = m_labels)
rfm = rfm.assign(R=r_quartiles,F=f_quartiles,M=m_quartiles)
# Build RFM Segment and RFM Score
def add_rfm(x) : return str(x['R']) + str(x['F']) + str(x['M'])
rfm['RFM_Segment'] = rfm.apply(add_rfm,axis=1 )
rfm['RFM_Score'] = rfm[['R','F','M']].sum(axis=1)
rfm.head()
Recency | Frequency | MonetaryValue | R | F | M | RFM_Segment | RFM_Score | |
---|---|---|---|---|---|---|---|---|
CustomerID | ||||||||
12346.0 | 326 | 1 | 77183.60 | 1 | 1 | 4 | 114 | 6.0 |
12347.0 | 2 | 182 | 4310.00 | 4 | 4 | 4 | 444 | 12.0 |
12348.0 | 75 | 31 | 1797.24 | 2 | 2 | 4 | 224 | 8.0 |
12349.0 | 19 | 73 | 1757.55 | 3 | 3 | 4 | 334 | 10.0 |
12350.0 | 310 | 17 | 334.40 | 1 | 1 | 2 | 112 | 4.0 |
**The result is a table with one row per customer, containing their RFM values, quartile labels, segment, and score.**
Analyzing RFM Segments
**Largest RFM segments.** It is always best practice to investigate the size of the segments before you use them for targeting or other business applications.
rfm.groupby(['RFM_Segment']).size().sort_values(ascending=False)[:5]
RFM_Segment
444 450
111 381
344 217
122 206
211 179
dtype: int64
**Filtering on RFM segments**
#Select bottom RFM segment "111" and view top 5 rows
rfm[rfm['RFM_Segment']=='111'].head()
Recency | Frequency | MonetaryValue | R | F | M | RFM_Segment | RFM_Score | |
---|---|---|---|---|---|---|---|---|
CustomerID | ||||||||
12353.0 | 204 | 4 | 89.00 | 1 | 1 | 1 | 111 | 3.0 |
12361.0 | 287 | 10 | 189.90 | 1 | 1 | 1 | 111 | 3.0 |
12401.0 | 303 | 5 | 84.30 | 1 | 1 | 1 | 111 | 3.0 |
12402.0 | 323 | 11 | 225.60 | 1 | 1 | 1 | 111 | 3.0 |
12441.0 | 367 | 11 | 173.55 | 1 | 1 | 1 | 111 | 3.0 |
**Summary metrics per RFM Score**
rfm.groupby('RFM_Score').agg({'Recency': 'mean','Frequency': 'mean',
'MonetaryValue': ['mean', 'count'] }).round(1)
Recency | Frequency | MonetaryValue | ||
---|---|---|---|---|
mean | mean | mean | count | |
RFM_Score | ||||
3.0 | 260.7 | 8.2 | 157.4 | 381 |
4.0 | 177.2 | 13.6 | 240.0 | 388 |
5.0 | 152.9 | 21.2 | 366.6 | 518 |
6.0 | 95.9 | 27.9 | 820.8 | 457 |
7.0 | 79.6 | 38.0 | 758.1 | 463 |
8.0 | 64.1 | 56.0 | 987.3 | 454 |
9.0 | 45.9 | 78.7 | 1795.1 | 414 |
10.0 | 32.4 | 110.5 | 2056.4 | 426 |
11.0 | 21.3 | 186.9 | 4062.0 | 387 |
12.0 | 7.2 | 367.8 | 9285.9 | 450 |
Use RFM score to group customers into Gold, Silver and Bronze segments:
def segments(df):
    if df['RFM_Score'] > 9:
        return 'Gold'
    elif (df['RFM_Score'] > 5) and (df['RFM_Score'] <= 9):
        return 'Silver'
    else:
        return 'Bronze'
rfm['General_Segment'] = rfm.apply(segments,axis=1)
rfm.groupby('General_Segment').agg({'Recency':'mean','Frequency':'mean',
'MonetaryValue':['mean','count']}).round(1)
Recency | Frequency | MonetaryValue | ||
---|---|---|---|---|
mean | mean | mean | count | |
General_Segment | ||||
Bronze | 192.2 | 15.1 | 266.5 | 1287 |
Gold | 20.1 | 225.6 | 5246.8 | 1263 |
Silver | 72.0 | 49.4 | 1072.4 | 1788 |
Data Pre-Processing for K-Means Clustering
K-Means clustering is a type of unsupervised learning algorithm that forms groups based on the distance between points. How? There are two distance-based quantities in K-Means clustering: the Within-Cluster Sum of Squares (WSS) and the Between-Cluster Sum of Squares (BSS).
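As a rough illustration of what k-means minimizes, the WSS can be computed by hand and compared with the inertia_ attribute that scikit-learn reports; a sketch on random toy data:
# Compare a hand-computed WSS with KMeans.inertia_ on toy data
import numpy as np
from sklearn.cluster import KMeans
X = np.random.RandomState(0).rand(100, 3)
km = KMeans(n_clusters=3, random_state=1).fit(X)
wss = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()   # within-cluster sum of squares
print(wss, km.inertia_)                                    # the two values agree (up to rounding)
tss = ((X - X.mean(axis=0)) ** 2).sum()                    # total sum of squares
print(tss - wss)                                           # between-cluster sum of squares (BSS)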
We must check these key k-means assumptions before we fit our K-Means clustering model (a quick check is sketched after the list):
- Symmetric distribution of variables (not skewed)
- Variables with same average values
- Variables with same variance
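A quick way to check these assumptions on the RFM table built above (a sketch):
# Skewness per variable (values far from 0 indicate a skewed distribution)
print(rfm[['Recency', 'Frequency', 'MonetaryValue']].skew())
# Compare means and variances across the three variables
print(rfm[['Recency', 'Frequency', 'MonetaryValue']].agg(['mean', 'var']))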
rfm_rfm = rfm[['Recency','Frequency','MonetaryValue']]
print(rfm_rfm.describe())
Recency Frequency MonetaryValue
count 4338.000000 4338.000000 4338.000000
mean 92.536422 90.523744 2048.688081
std 100.014169 225.506968 8985.230220
min 1.000000 1.000000 3.750000
25% 18.000000 17.000000 306.482500
50% 51.000000 41.000000 668.570000
75% 142.000000 98.000000 1660.597500
max 374.000000 7676.000000 280206.020000
From this table we can see the first problem: the means and variances of the three variables are far from equal.
Solution: scale the variables using a scaler from the scikit-learn library.
# plot the distribution of RFM values
f,ax = plt.subplots(figsize=(10, 12))
plt.subplot(3, 1, 1); sns.distplot(rfm.Recency, label = 'Recency')
plt.subplot(3, 1, 2); sns.distplot(rfm.Frequency, label = 'Frequency')
plt.subplot(3, 1, 3); sns.distplot(rfm.MonetaryValue, label = 'Monetary Value')
plt.style.use('fivethirtyeight')
plt.tight_layout()
plt.show()
There is also another problem: the distributions of the variables are not symmetric (the data are skewed).
Solution: a logarithmic transformation (positive values only) will manage the skewness.
**We use the following sequence of pre-processing steps:**
1. Unskew the data with a log transformation
2. Standardize to the same average values
3. Scale to the same standard deviation
4. Store as a separate array to be used for clustering
Why does the sequence matter?
- A log transformation only works with positive data
- Normalization (centering) produces negative values, on which the log is undefined
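A quick illustration of the second point, with hypothetical values: standardized data contains negative numbers, and the log of a negative number is undefined:
# Hypothetical standardized values: the log of the negative entry comes back as NaN
import numpy as np
standardized = np.array([-1.2, 0.3, 0.9])
with np.errstate(invalid='ignore'):
    print(np.log(standardized))   # [nan, ...] -> the log transform must come before scaling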
#Unskew the data with log transformation
rfm_log = rfm[['Recency', 'Frequency', 'MonetaryValue']].apply(np.log, axis = 1).round(3)
#or rfm_log = np.log(rfm_rfm)
# plot the distribution of RFM values
f,ax = plt.subplots(figsize=(10, 12))
plt.subplot(3, 1, 1); sns.distplot(rfm_log.Recency, label = 'Recency')
plt.subplot(3, 1, 2); sns.distplot(rfm_log.Frequency, label = 'Frequency')
plt.subplot(3, 1, 3); sns.distplot(rfm_log.MonetaryValue, label = 'Monetary Value')
plt.style.use('fivethirtyeight')
plt.tight_layout()
plt.show()
Implementation of K-Means Clustering
Key steps
- Data pre-processing
- Choosing a number of clusters
- Running k-means clustering on pre-processed data
- Analyzing average RFM values of each cluster
**1. Data Pre-Processing**
#Normalize the variables with StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(rfm_log)
#Store it separately for clustering
rfm_normalized= scaler.transform(rfm_log)
2. Choosing a Number of Clusters
Methods to define the number of clusters
- Visual methods - elbow criterion
- Mathematical methods - silhouette coefficient
- Experimentation and interpretation
**Elbow criterion method**
- Plot the number of clusters against within-cluster sum-of-squared-errors (SSE) - sum of squared distances from every data point to their cluster center
- Identify an “elbow” in the plot
- Elbow - a point representing an “optimal” number of clusters
from sklearn.cluster import KMeans
#First : Get the Best KMeans
ks = range(1,8)
inertias=[]
for k in ks:
    # Fit a KMeans model with k clusters
    kc = KMeans(n_clusters=k, random_state=1)
    kc.fit(rfm_normalized)
    inertias.append(kc.inertia_)
# Plot ks vs inertias
f, ax = plt.subplots(figsize=(15, 8))
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.style.use('ggplot')
plt.title('What is the Best Number for KMeans ?')
plt.show()
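The silhouette coefficient mentioned above can serve as a cross-check on the elbow plot; a minimal sketch, assuming the rfm_normalized array from the pre-processing step:
# Silhouette score for each candidate k (higher is better); it needs at least 2 clusters
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(rfm_normalized)
    print(k, round(silhouette_score(rfm_normalized, labels), 3))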
**Note that we choose k = 3, based on the elbow in the plot.**
# clustering
kc = KMeans(n_clusters= 3, random_state=1)
kc.fit(rfm_normalized)
# Extract the cluster label of each customer
cluster_labels = kc.labels_
# Assign the cluster label to each customer
rfm_rfm_k3 = rfm_rfm.assign(K_Cluster = cluster_labels)
#Calculate average RFM values and sizes for each cluster:
rfm_rfm_k3.groupby('K_Cluster').agg({'Recency': 'mean','Frequency': 'mean',
'MonetaryValue': ['mean', 'count'],}).round(0)
Recency | Frequency | MonetaryValue | ||
---|---|---|---|---|
mean | mean | mean | count | |
K_Cluster | ||||
0 | 171.0 | 15.0 | 293.0 | 1527 |
1 | 13.0 | 260.0 | 6574.0 | 953 |
2 | 69.0 | 65.0 | 1170.0 | 1858 |
Snake plots to understand and compare segments
- Market research technique to compare different segments
- Visual representation of each segment’s attributes
- Need to first normalize data (center & scale)
- Plot each cluster’s average normalized values of each attribute
rfm_normalized = pd.DataFrame(rfm_normalized,index=rfm_rfm.index,columns=rfm_rfm.columns)
rfm_normalized['K_Cluster'] = kc.labels_
rfm_normalized['General_Segment'] = rfm['General_Segment']
rfm_normalized.reset_index(inplace = True)
#Melt the data into a long format so RFM values and metric names are stored in 1 column each
rfm_melt = pd.melt(rfm_normalized,id_vars=['CustomerID','General_Segment','K_Cluster'],value_vars=['Recency', 'Frequency', 'MonetaryValue'],
var_name='Metric',value_name='Value')
rfm_melt.head()
CustomerID | General_Segment | K_Cluster | Metric | Value | |
---|---|---|---|---|---|
0 | 12346.0 | Silver | 2 | Recency | 1.409982 |
1 | 12347.0 | Gold | 1 | Recency | -2.146578 |
2 | 12348.0 | Silver | 2 | Recency | 0.383648 |
3 | 12349.0 | Gold | 2 | Recency | -0.574961 |
4 | 12350.0 | Bronze | 0 | Recency | 1.375072 |
Snake Plot and Heatmap
f, (ax1, ax2) = plt.subplots(1,2, figsize=(15, 8))
sns.lineplot(x = 'Metric', y = 'Value', hue = 'General_Segment', data = rfm_melt,ax=ax1)
# a snake plot with K-Means
sns.lineplot(x = 'Metric', y = 'Value', hue = 'K_Cluster', data = rfm_melt,ax=ax2)
plt.suptitle("Snake Plot of RFM",fontsize=24) #make title fontsize subtitle
plt.show()
**Relative importance of segment attributes**
- Useful technique to identify relative importance of each segment’s attribute
- Calculate average values of each cluster
- Calculate average values of population
- Calculate importance score by dividing them and subtracting 1 (ensures 0 is returned when cluster average equals population average)
Let’s try again with a heat map. Heat maps are a graphical representation of data where larger values are colored in darker shades and smaller values in lighter ones, so we can compare the variation between the groups quite intuitively by color.
# The further a ratio is from 0, the more important that attribute is for a segment relative to the total population
cluster_avg = rfm_rfm_k3.groupby(['K_Cluster']).mean()
population_avg = rfm_rfm.mean()
relative_imp = cluster_avg / population_avg - 1
relative_imp.round(2)
Recency | Frequency | MonetaryValue | |
---|---|---|---|
K_Cluster | |||
0 | 0.85 | -0.84 | -0.86 |
1 | -0.86 | 1.88 | 2.21 |
2 | -0.25 | -0.28 | -0.43 |
# the mean value in total
total_avg = rfm.iloc[:, 0:3].mean()
# calculate the proportional gap with total mean
cluster_avg = rfm.groupby('General_Segment').mean().iloc[:, 0:3]
prop_rfm = cluster_avg/total_avg - 1
prop_rfm.round(2)
Recency | Frequency | MonetaryValue | |
---|---|---|---|
General_Segment | |||
Bronze | 1.08 | -0.83 | -0.87 |
Gold | -0.78 | 1.49 | 1.56 |
Silver | -0.22 | -0.45 | -0.48 |
# heatmap with RFM
f, (ax1, ax2) = plt.subplots(1,2, figsize=(15, 5))
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='Blues',ax=ax1)
ax1.set(title = "Heatmap of K-Means")
# a heatmap with the RFM quantile (General_Segment) averages
sns.heatmap(prop_rfm, cmap= 'Oranges', fmt= '.2f', annot = True,ax=ax2)
ax2.set(title = "Heatmap of RFM quantile")
plt.suptitle("Heat Map of RFM",fontsize=20) #make title fontsize subtitle
plt.show()
You can extend the RFM data by adding a Tenure variable. Tenure is the time since a customer’s first transaction; it describes how long the customer has been with the company (a minimal sketch follows).
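A sketch of how a Tenure column could be added, assuming the df and snapshot_date defined earlier:
# Tenure: days between the snapshot date and each customer's first transaction
tenure = df.groupby('CustomerID')['InvoiceDate'].min().apply(lambda x: (snapshot_date - x).days)
rfm['Tenure'] = tenure
rfm.head()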
**Conclusion: we showed how to compute RFM values from customer purchase data, and we built two kinds of segmentation, one with RFM quantiles and one with K-Means clustering.**