Analyze Cricket Data with Python: a practical guide


This article was published as part of the Data Science Blogathon


Python is a versatile language. It is used for general programming and development, as well as for complex tasks such as machine learning, data science, and data analytics. Not only is it easy to learn, it also has some wonderful libraries, which make it the first-choice programming language for many people.

In this article, we will look at one such use case. We will use Python to analyze the performance of the Indian cricketer MS Dhoni over his One Day International (ODI) career.


Data set

If you are familiar with the concept of web scraping, you can extract the data from this ESPN Cricinfo link. If you don't know about web scraping, do not worry! You can download the data directly from here. The data is available as an Excel file for download.

Once you have the dataset with you, you will need to load it in Python. You can use the below code snippet to load the dataset in Python:

# importing essential libraries and packages
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
# reading the dataset
df = pd.read_excel('MS_Dhoni_ODI_record.xlsx')

Once the dataset has been read, we should look at its first and last few rows to make sure it was imported correctly. The head of the dataset should look like this:


If the data loads correctly, we can go to the next step, data cleaning and preparation.
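One way to carry out the check described above is with head(), tail(), and dtypes. The sketch below uses a small stand-in DataFrame with illustrative values, since the real Excel file is not reproduced here; only the column names are taken from the rest of this article.

```python
import pandas as pd

# Hypothetical stand-in for the real file; the values are illustrative only.
sample = pd.DataFrame({
    'score': ['0', '12', '7*', 'DNB'],
    'opposition': ['v Team A', 'v Team A', 'v Team B', 'v Team B'],
    'date': ['2005-01-01', '2005-01-03', '2005-02-10', '2005-02-12'],
})

first_rows = sample.head(2)   # first two rows of the table
last_rows = sample.tail(2)    # last two rows of the table
print(first_rows)
print(last_rows)
print(sample.dtypes)          # column types, worth knowing before cleaning
```

On the real data, the same calls on df show whether the columns and rows came through as expected.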

Data cleaning and preparation

This data was extracted from a web page, so it is not very clean. We will start by removing the first two characters from the opposition string because they are not needed.

# removing the first 2 characters in the opposition string
df['opposition'] = df['opposition'].apply(lambda x: x[2:])

Next, we will create a column for the year in which the match was played. Make sure the date column is in datetime format in your DataFrame. If it is not, use pd.to_datetime() to convert it.

# creating a feature for match year
df['year'] = df['date'].dt.year.astype(int)
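If the date column was loaded as text, the conversion mentioned above could look like the following sketch. The sample dates and their format are assumptions for illustration; the real file may store dates differently.

```python
import pandas as pd

# Illustrative rows only; the real file may use a different date format.
sample = pd.DataFrame({'date': ['2005-04-05', '2006-02-13', '2007-08-31']})

# Convert text dates to pandas datetime values if they are not already.
if not pd.api.types.is_datetime64_any_dtype(sample['date']):
    sample['date'] = pd.to_datetime(sample['date'])

# The .dt accessor only works once the column holds datetime values.
sample['year'] = sample['date'].dt.year.astype(int)
print(sample)
```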

We will also create a column that indicates whether Dhoni was not out in that innings.

# creating a feature for being not out
df['score'] = df['score'].apply(str)
df['not_out'] = np.where(df['score'].str.endswith('*'), 1, 0)

Now we will drop the ODI number column because it is not needed.

# dropping the odi_number feature because it adds no value to the analysis
df.drop(columns="odi_number", inplace=True)

We will also drop all matches in which Dhoni did not bat and store the result in a new DataFrame.

# dropping those innings where Dhoni did not bat and storing in a new DataFrame
# .copy() makes df_new independent of df, so later column assignments do not trigger SettingWithCopyWarning
df_new = df.loc[((df['score'] != 'DNB') & (df['score'] != 'TDNB')), 'runs_scored':].copy()
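To see what the not-out flag and the did-not-bat filter do, here is a small sketch on made-up scores. It uses isin() as a compact equivalent of the two inequality checks; the score values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative scores: '*' marks a not-out innings, 'DNB'/'TDNB' mean he did not bat.
sample = pd.DataFrame({'score': ['45', '183*', 'DNB', '0', 'TDNB', '12*']})

# 1 when the score ends with '*', otherwise 0
sample['not_out'] = np.where(sample['score'].str.endswith('*'), 1, 0)

# Keep only the rows where he actually batted
batted = sample.loc[~sample['score'].isin(['DNB', 'TDNB'])].copy()
print(batted)
```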

Finally, we will fix the data types of all columns present in our new DataFrame.

# fixing the data types of numerical columns
df_new['runs_scored'] = df_new['runs_scored'].astype(int)
df_new['balls_faced'] = df_new['balls_faced'].astype(int)
df_new['strike_rate'] = df_new['strike_rate'].astype(float)
df_new['fours'] = df_new['fours'].astype(int)
df_new['sixes'] = df_new['sixes'].astype(int)

Career statistics

We will take a look at the descriptive statistics of MS Dhoni's ODI career. You can use the following code for this:

first_match_date = df['date'].min().strftime('%B %d, %Y') # first match
print('First match:', first_match_date)
last_match_date = df['date'].max().strftime('%B %d, %Y') # last match
print('\nLast match:', last_match_date)
number_of_matches = df.shape[0] # number of matches played in career
print('\nNumber of matches played:', number_of_matches)
number_of_inns = df_new.shape[0] # number of innings
print('\nNumber of innings played:', number_of_inns)
not_outs = df_new['not_out'].sum() # number of not outs in career
print('\nNot outs:', not_outs)
runs_scored = df_new['runs_scored'].sum() # runs scored in career
print('\nRuns scored in career:', runs_scored)
balls_faced = df_new['balls_faced'].sum() # balls faced in career
print('\nBalls faced in career:', balls_faced)
career_sr = (runs_scored / balls_faced)*100 # career strike rate
print('\nCareer strike rate: {:.2f}'.format(career_sr))
career_avg = (runs_scored / (number_of_inns - not_outs)) # career average (runs per dismissal)
print('\nCareer average: {:.2f}'.format(career_avg))
# date of the innings with the most runs, used to look up the original score string
highest_score_date = df_new.loc[df_new.runs_scored == df_new.runs_scored.max(), 'date'].values[0]
highest_score = df.loc[df.date == highest_score_date, 'score'].values[0] # highest score
print('\nHighest score in career:', highest_score)
hundreds = df_new.loc[df_new['runs_scored'] >= 100].shape[0] # number of 100s
print('\nNumber of 100s:', hundreds)
fifties = df_new.loc[(df_new['runs_scored']>=50)&(df_new['runs_scored']<100)].shape[0] # number of 50s
print('\nNumber of 50s:', fifties)
fours = df_new['fours'].sum() # number of fours in career
print('\nNumber of 4s:', fours)
sixes = df_new['sixes'].sum() # number of sixes in career
print('\nNumber of 6s:', sixes)

The output should look like this:


This gives us a good idea of the overall career of MS Dhoni. He started playing ODIs in 2004 and played his last one in 2019. In a career of about 15 years, he scored 10 hundreds and a staggering 73 fifties. He scored more than 10,000 runs with an average of 50.6 and a strike rate of 87.6. His highest score is 183*.

Now we will do a more thorough analysis of his performance against different teams. We will also look at his performance year by year, using visualizations to help.
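The average and strike rate quoted here follow the usual cricket definitions: the average divides runs by dismissals (innings minus not outs), and the strike rate is runs per 100 balls faced. A minimal worked sketch, using made-up numbers rather than Dhoni's real figures and two helper functions that are hypothetical names introduced for illustration:

```python
def batting_average(runs, innings, not_outs):
    """Runs per dismissal; returns None when the batter was never dismissed."""
    dismissals = innings - not_outs
    if dismissals == 0:
        return None
    return runs / dismissals


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced."""
    return runs / balls * 100


# Made-up numbers for illustration only.
avg = batting_average(runs=450, innings=12, not_outs=3)  # 450 / (12 - 3)
sr = strike_rate(runs=450, balls=500)                    # 450 / 500 * 100
print('Average: {:.2f}, strike rate: {:.2f}'.format(avg, sr))
```

Because not-out innings add runs without adding a dismissal, a batter who often finishes unbeaten can have an average well above his typical score per innings.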


First, we will see how many matches he has played against different oppositions. You can use the following code for this purpose:

# number of matches played against different oppositions
df['opposition'].value_counts().plot(kind='bar', title="Number of matches against different oppositions", figsize=(8, 5));

The output should look like this:


We can see that he has played most of his matches against Sri Lanka, Australia, England, West Indies, South Africa and Pakistan.

Let's see how many runs he has scored against different oppositions. You can use the following code snippet to generate the result:

runs_scored_by_opposition = pd.DataFrame(df_new.groupby('opposition')['runs_scored'].sum())
runs_scored_by_opposition.plot(kind='bar', title="Runs scored against different oppositions", figsize=(8, 5))

The output will look like this:


We can see that Dhoni has scored the most runs against Sri Lanka, followed by Australia, England and Pakistan. He has also played many games against these teams, so it makes sense.

To get a clearer picture, let's look at his batting average against each team. The following code snippet computes it:

innings_by_opposition = pd.DataFrame(df_new.groupby('opposition')['date'].count())
not_outs_by_opposition = pd.DataFrame(df_new.groupby('opposition')['not_out'].sum())
temp = runs_scored_by_opposition.merge(innings_by_opposition, left_index=True, right_index=True)
average_by_opposition = temp.merge(not_outs_by_opposition, left_index=True, right_index=True)
average_by_opposition.rename(columns = {'date': 'innings'}, inplace=True)
average_by_opposition['eff_num_of_inns'] = average_by_opposition['innings'] - average_by_opposition['not_out']
average_by_opposition['average'] = average_by_opposition['runs_scored'] / average_by_opposition['eff_num_of_inns']
average_by_opposition.replace(np.inf, np.nan, inplace=True)
major_nations = ['Australia', 'England', 'New Zealand', 'Pakistan', 'South Africa', 'Sri Lanka', 'West Indies']

To generate the graph, use the code snippet below:

plt.figure(figsize = (8, 5))
plt.plot(average_by_opposition.loc[major_nations, 'average'].values, marker="o")
plt.plot([career_avg]*len(major_nations), '--')
plt.title('Average against major teams')
plt.xticks(range(0, 7), major_nations)
plt.ylim(20, 70)
plt.legend(['Avg against opposition', 'Career average']);

The output will look like this:


As we can see, Dhoni has performed remarkably well against tough teams like Australia, England and Sri Lanka. His average against these teams is close to his career average or slightly higher. The only team he hasn't performed well against is South Africa.

Let's now look at his year-by-year statistics. We will start with how many matches he played each year after his debut. The code for that is:
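As an aside on how the per-opposition average was computed: the three groupby calls and two merges can also be expressed as a single named aggregation. The sketch below is an alternative formulation on made-up data with hypothetical team names, not the author's code. It also shows why infinite values are replaced: a team against which the batter was never dismissed yields a division by zero.

```python
import numpy as np
import pandas as pd

# Made-up innings against two hypothetical teams.
sample = pd.DataFrame({
    'opposition': ['Team A', 'Team A', 'Team A', 'Team B', 'Team B'],
    'runs_scored': [50, 30, 40, 20, 10],
    'not_out': [0, 1, 0, 1, 1],
})

# One pass over the groups: total runs, innings count, and not outs.
summary = sample.groupby('opposition').agg(
    runs_scored=('runs_scored', 'sum'),
    innings=('runs_scored', 'count'),
    not_out=('not_out', 'sum'),
)
summary['eff_num_of_inns'] = summary['innings'] - summary['not_out']
summary['average'] = summary['runs_scored'] / summary['eff_num_of_inns']

# Team B has zero dismissals, so its average is infinite; treat it as missing.
summary['average'] = summary['average'].replace(np.inf, np.nan)
print(summary)
```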

df['year'].value_counts().sort_index().plot(kind='bar', title="Matches played by year", figsize=(8, 5))

The plot will look like this:


We can see that in 2012, 2014 and 2016, Dhoni played very few ODI matches for India. In general, after 2005-2009, the average number of matches played per year decreased slightly.

We should also look at how many runs he scored each year. The code for that is:
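The sort_index() call in the matches-per-year plot matters because value_counts() orders its result by frequency, not by year. A small sketch with made-up years shows the difference:

```python
import pandas as pd

# Made-up match years, deliberately out of order.
years = pd.Series([2006, 2005, 2006, 2007, 2005, 2006], name='year')

by_frequency = years.value_counts()           # most frequent year first
by_year = years.value_counts().sort_index()   # chronological order

print(by_frequency)
print(by_year)
```

Without sort_index(), the bars would appear in order of match count, which makes a year-by-year trend hard to read.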

df_new.groupby('year')['runs_scored'].sum().plot(kind='line', marker="o", title="Runs scored by year", figsize=(8, 5))
years = df['year'].unique().tolist()

The output should look like this:


You can clearly see that Dhoni scored the most runs in 2009, followed by 2007 and 2008. The number of runs began to decline after 2010, partly because the number of matches played also started to decrease.

Finally, let's look at the progression of his career batting average by innings. This is time series data, so we plot it as a line chart. The code for that is:

df_new.reset_index(drop=True, inplace=True)
career_average = pd.DataFrame()
career_average['runs_scored_in_career'] = df_new['runs_scored'].cumsum()
career_average['innings'] = df_new.index.tolist()
career_average['innings'] = career_average['innings'].apply(lambda x: x+1)
career_average['not_outs_in_career'] = df_new['not_out'].cumsum()
career_average['eff_num_of_inns'] = career_average['innings'] - career_average['not_outs_in_career']
career_average['average'] = career_average['runs_scored_in_career'] / career_average['eff_num_of_inns']

The code snippet for the plot will be:

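plt.figure(figsize = (8, 5))
plt.plot(career_average['average']) # running average after each innings
plt.plot([career_avg]*career_average.shape[0], '--') # final career average as a reference line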
plt.title('Career average progression by innings')
plt.xlabel('Number of innings')
plt.legend(['Avg progression', 'Career average']);

The output graph will look like this:


We can see that after a slow start and a drop in performance around innings number 50, Dhoni's performance recovered substantially. Towards the end of his career, he consistently averaged above 50.


In this article, we analyzed the batting performance of Indian cricketer MS Dhoni. We looked at his overall career statistics, his performance against different opponents, and his performance year by year.
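To make the running-average calculation concrete, here is a minimal sketch on four made-up innings. It follows the same idea as the code above (cumulative runs divided by cumulative dismissals), with variable names chosen for this illustration.

```python
import pandas as pd

# Made-up innings in chronological order, not real data.
innings = pd.DataFrame({
    'runs_scored': [10, 40, 30, 20],
    'not_out':     [0, 1, 0, 0],
})

cum_runs = innings['runs_scored'].cumsum()                   # 10, 50, 80, 100
cum_innings = pd.Series(range(1, len(innings) + 1))          # 1, 2, 3, 4
cum_dismissals = cum_innings - innings['not_out'].cumsum()   # 1, 1, 2, 3
running_average = cum_runs / cum_dismissals
print(running_average.round(2).tolist())
```

Note how the not-out 40 in the second innings lifts the average sharply, because runs increase while dismissals stay the same. Early in a career this makes the curve volatile, which is one reason the real plot settles only after many innings.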

This article has been written by Vishesh Arora. You can connect with me at LinkedIn.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
