Cohort Analysis for Data Science


This article was published as part of the Data Science Blogathon

After understanding and working with this practical tutorial, may:

  • Understand what cohort and cohort analysis is.
  • Handling missing values
  • Extraction of the month from the date
  • Assign cohort to each transaction
  • Assign cohort index to each transaction
  • Calculate the number of unique customers in each group.
  • Create a cohort table for the retention rate
  • Visualize the cohort table using the heat map
  • Interpret the retention rate

What is cohort and cohort analysis?

A cohort is a collection of users who have something in common. A traditional cohort, for instance, divide people by the week or month they were first acquired. When referring to non-time dependent groupings, the term segment is often used instead of cohort.

Cohort analysis is a descriptive analytical technique in cohort analysis. Clients are divided into mutually exclusive cohorts, which are then tracked over time. Vanity indicators do not offer the same level of perspective as cohort research. Helps in deeper interpretation of high-level patterns by providing consumer and product lifecycle metrics.

Generally, there are three main types of cohorts:

  • Time cohorts: customers who signed up for a product or service during a particular period of time.
  • Behavioral cohorts: customers who have purchased a product or subscribed to a service in the past.
  • Size cohorts: refer to the different sizes of customers who buy the company's products or services.

But nevertheless, we will be doing Time-based cohort analysis. Customers will be divided into acquisition cohorts based on the month of their first purchase. Later, the cohort index would be assigned to each of the customer's purchases, which will represent the number of months since the first transaction.


  1. Find the percentage of active clients compared to the total number of clients after each month: Customer Segmentations
  2. Interpret the retention rate

Here is the complete code for this tutorial. if you want to follow the information as you go through the tutorial.

Step involved in the analysis of the cohort retention rate

1. Data loading and cleaning

2. Assign the cohort and calculate the

Paso 2.1

  • Truncate the data object to a required one (here we need the month, so the date of the transaction)
  • Create groupby object with target column (here, customer_id)
  • Transform with a min function () to assign the smallest transaction date in the month value to each customer.

The result of this process is the cohort of the acquisition month for each client, namely, we have assigned the cohort of the acquisition month to each client.

Paso 2.2

  • Calculate time compensation by extracting integer values ​​for the year, month and day of a datetime object ().
  • Calculate the number of months between any transaction and the first transaction for each customer. We will use the TransactionMonth and CohortMonth values ​​to do this.

The result of this will be cohortIndex, namely, the difference between “TransactionMonth” Y “CohortMonth” in terms of the number of months and call the column “cohortIndex”.

Paso 2.3

  • Create a groupby object with CohortMonth and CohortIndex.
  • Count the number of customers in each group applying the pandas nunique function ().
  • Reset index and create pandas pivot with CohortMonth on rows, CohortIndex in columns and customer_id counts as values.

The result of this will be the table that will serve as the basis for calculating the retention rate and also other matrices.

3. Calculate business matrices: Retention rate.

Retention measures how many customers from each cohort have returned in the following months.

  • Using the data framework called cohort_counts, we will select the first columns (equal to the total number of customers in cohorts)
  • Calculate the proportion of how many of these customers returned in the following months.

The result gives a retention rate.

4. Viewing the retention rate

5. Interpretation of the retention rate


Monthly cohort retention rate.

Let's start:

Import libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import missingno as msno
from textwrap import wrap

Data loading and cleaning

# Loading dataset
transaction_df = pd.read_excel('transcations.xlsx')
# View data

Comprobando y trabajando con valor faltante

# Inspect missing values in the dataset
# Replace the ' 's with NaN
transaction_df = transaction_df.replace(" ",np. NaN)
# Impute the missing values with mean imputation
transaction_df = transaction_df.fillna(transaction_df.mean())
# Count the number of NaNs in the dataset to verify
for col in transaction_df.columns:
    # Check if the column is of object type
    if transaction_df[col].dtypes == 'object':
        # Impute with the most frequent value
        transaction_df[col] = transaction_df[col].fillna(transaction_df[col].value_counts().index[0])
# Count the number of NaNs in the dataset and print the counts to verify

Here, podemos ver que tenemos 1542 valores nulos. Que tratamos con valores medios y más frecuentes según el tipo de datos. Ahora que hemos completado nuestra limpieza y comprensión de datos, comenzaremos el análisis de cohorte.

Asignó las cohortes y calculó la compensación mensual.

# A function that will parse the date Time based cohort:  1 day of month
def get_month(x): return dt.datetime(x.year, x.month, 1) 
# Create transaction_date column based on month and store in TransactionMonth
transaction_df['TransactionMonth'] = transaction_df['transaction_date'].apply(get_month) 
# Grouping by customer_id and select the InvoiceMonth value
grouping = transaction_df.groupby('customer_id')['TransactionMonth'] 
# Assigning a minimum InvoiceMonth value to the dataset
transaction_df['CohortMonth'] = grouping.transform('min')
# printing top 5 rows


Cálculo de la compensación de tiempo en el mes como índice de cohorte

Calculating the time compensation for each transaction allows you to evaluate the metrics for each cohort in a comparable way.

First, we will create 6 variables that capture the integer value of years, months and days for Transaction and Cohort Date using the get_date_int function ().

def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column]
    return year, month, day
# Getting the integers for date parts from the `InvoiceDay` column
transcation_year, transaction_month, _ = get_date_int(transaction_df, 'TransactionMonth')
# Getting the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, _ = get_date_int(transaction_df, 'CohortMonth')

We will now calculate the difference between invoice dates and cohort dates in years, months separately. then calculate the total difference of months between the two. This will be the cohort or compensation index of our month, which we will use in the next section to calculate the retention rate.

#  Get the  difference in years
years_diff = transcation_year - cohort_year
# Calculate difference in months
months_diff = transaction_month - cohort_month
""" Extract the difference in months from all previous values
 "+1" in addeded at the end so that first month is marked as 1 instead of 0 for easier interpretation. 
transaction_df['CohortIndex'] = years_diff * 12 + months_diff  + 1 

Here, at first, we created a group() object with CohortMonth and CohortIndex and save it as a grouping.

Later, we call this object, we select the Customer identification column and calculate the average.

We then store the results as cohort_data. Later, reset the index before calling the pivot function to be able to access the columns now stored as indexes.

Finally, we create a PivotTable by skipping

  • CohortMes to the index parameter,
  • Cohort index to the column parameter,
  • Customer identification to the values parameter.

and round it to 1 digit and see what we get.

# Counting daily active user from each chort
grouping = transaction_df.groupby(['CohortMonth', 'CohortIndex'])
# Counting number of unique customer Id's falling in each group of CohortMonth and CohortIndex
cohort_data = grouping['customer_id'].apply(Pd. Series.nunique)
cohort_data = cohort_data.reset_index()
 # Assigning column names to the dataframe created above
cohort_counts = cohort_data.pivot(index='CohortMonth',
# Printing top 5 rows of Dataframe


Calculate business metrics: retention rate

The percentage of active customers compared to the total number of customers after a specified time interval is called the retention rate..

In this section, we will calculate the retention count for each cohort month matched with the cohort index

Now that we have a count of the retained customers for each cohorteMes Y cohortIndex. We will calculate the retention rate for each cohort.

We will create a pivot table for this purpose.

cohort_sizes = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
# Coverting the retention rate into percentage and Rounding off.

Retention Rate Data Frame Represents Retained Customer Across All Cohorts. We can read it as follows:

  • The index value represents the cohort
  • The columns represent the number of months since the current cohort

For instance: The value in CohortMonth 2017-01-01, CohortIndex 3 it is 35,9 and represents 35,9% of clients in the cohort 2017-01 were held in the 3is mes.

What's more, you can see in the DataFrame for retention rate:

  • Retention rate The first index, namely, the first month is from 100%, since all clients of that particular client signed up in the first month
  • The retention rate can increase or decrease in subsequent indices.
  • Values ​​towards the lower right have many NaN values.

Visualizing the retention rate

Before we start to draw our heat map, let's set the index of our retention rate data frame to a more readable string format.

average_standard_cost.index = average_standard_cost.index.strftime('%Y-%m')
# Initialize the figure
plt.figure(figsize=(16, 10))
# Adding a title
plt.title('Average Standard Cost: Monthly Cohorts', fontsize = 14)
# Creating the heatmap
sns.heatmap(average_standard_cost, annot = True,vmin = 0.0, vmax =20,cmap="YlGnBu", fmt="g")
plt.ylabel('Cohort Month')
plt.xlabel('Cohort Index')
plt.yticks( rotation='360')

Interpreting the retention rate

The most effective way to visualize and analyze cohort analysis data is through a heat map., as we did previously. Provides both actual metric values and color coding to see differences in numbers visually.

If you don't have a basic knowledge about heatmap, you can check my blog. Exploratory data analysis for beginners with Python, where I have talked about heat maps for beginners.

Here, have 12 cohorts for each month and 12 cohort indices. The darker the shades of blue, the higher the values. A) Yes, if we see in the month of the cohort 2017-07 in the 5th cohort index, we see the dark blue tone with a 48% which means that the 48% of the cohorts that signed in July 2017 they were active 5 months later.

This concludes our cohort analysis for the retention rate.. Similarly, we can perform cohort analysis for other commercial matrices.

Click here to learn more about cohort analysis for companies free with DataCamp.(Affiliate link)

Therefore, we completed our Cohort Analysis, where you learned about basic and cohort analyzes, conducting time cohorts, working with pandas pivot and creating a hold table along with visualization. We also learned how to explore other matrices.

Now, you can start creating and exploring the metrics that are important to your business on your own.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.