This article was published as part of the Data Science Blogathon
Introduction
Python is a versatile language. It is used for general programming and development, and also for complex tasks such as machine learning, data science and data analytics. It is easy to learn and has some wonderful libraries, which makes it the first-choice programming language for many people.
In this article, we will look at one of those Python use cases. We will use Python to analyze the performance of the Indian cricketer MS Dhoni in his One Day International (ODI) career.
Data set
If you are familiar with the concept of web scraping, you can extract the data from this ESPN Cricinfo link. If you don't know about web scraping, do not worry! You can download the data directly from here. The data is available as an Excel file for download.
Once you have the dataset with you, you will need to load it in Python. You can use the below code snippet to load the dataset in Python:
# importing essential libraries and packages
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# reading the dataset
df = pd.read_excel('MS_Dhoni_ODI_record.xlsx')
Once the dataset has been read, we need to look at its first and last few rows to make sure it was imported correctly. The first rows of the dataset should look like this:
If the data loads correctly, we can go to the next step, data cleaning and preparation.
Data cleaning and preparation
This data has been extracted from a web page, so it is not very clean. We will start by removing the first 2 characters from the opposition string because they are not necessary.
# removing the first 2 characters in the opposition string
df['opposition'] = df['opposition'].apply(lambda x: x[2:])
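As a small illustration of this step, here is a sketch on toy data. It assumes the two unwanted characters are a leading "v " (as in "v Bangladesh"), which the article does not state explicitly, and the team names are only examples. It uses the vectorized `.str` accessor, which gives the same result as the `apply` call.

```python
import pandas as pd

# Toy data; the "v " prefix is an assumption about how the scraped
# opposition strings look, not something confirmed by the dataset.
toy = pd.DataFrame({"opposition": ["v Bangladesh", "v Sri Lanka"]})

# Slice off the first two characters of every string in the column.
toy["opposition"] = toy["opposition"].str[2:]

cleaned = toy["opposition"].tolist()
print(cleaned)
```

If the prefix were ever missing from some rows, plain slicing would cut into the team name, so it is worth inspecting the unique values before and after this step.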
Next, we will create a column for the year in which the match was played. Make sure the date column is in datetime format in your DataFrame. If it is not, use pd.to_datetime() to convert it to datetime format.
# creating a feature for match year
df['year'] = df['date'].dt.year.astype(int)
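The conversion mentioned above could look roughly like the sketch below. It runs on a toy DataFrame with illustrative date strings, and converts the column only when it is not already a datetime type.

```python
import pandas as pd

# Toy stand-in for the real dataset; the date strings are illustrative.
toy = pd.DataFrame({"date": ["2004-12-23", "2005-04-05", "2019-07-09"]})

# Convert only if the column is not already stored as datetime.
if not pd.api.types.is_datetime64_any_dtype(toy["date"]):
    toy["date"] = pd.to_datetime(toy["date"])

# With a datetime column, the .dt accessor exposes the year directly.
toy["year"] = toy["date"].dt.year.astype(int)

years = toy["year"].tolist()
print(years)
```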
We will also create a column that indicates whether or not Dhoni was not out in that innings.
# creating a feature for being not out
df['score'] = df['score'].apply(str)
df['not_out'] = np.where(df['score'].str.endswith('*'), 1, 0)
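To see what this flag does, here is a minimal sketch on made-up scores. A trailing asterisk marks a not-out innings, and the column is cast to string first because a score read from Excel may arrive as a number.

```python
import numpy as np
import pandas as pd

# Made-up scores: a plain number, a not-out score, a "did not bat"
# marker, and an integer that was not read in as text.
toy = pd.DataFrame({"score": ["12", "183*", "DNB", 7]})

# Cast everything to str so that .str.endswith works on every row.
toy["score"] = toy["score"].astype(str)

# 1 when the score ends with '*', otherwise 0.
toy["not_out"] = np.where(toy["score"].str.endswith("*"), 1, 0)

flags = toy["not_out"].tolist()
print(flags)
```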
Now we will drop the ODI number column because it is not necessary.
# dropping the odi_number feature because it adds no value to the analysis
df.drop(columns="odi_number", inplace=True)
We will also remove all matches in which Dhoni did not bat from our records and store the result in a new DataFrame.
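# dropping those innings where Dhoni did not bat and storing in a new DataFrame
# .copy() makes df_new independent of df, which avoids SettingWithCopyWarning
# when its columns are modified later
df_new = df.loc[((df['score'] != 'DNB') & (df['score'] != 'TDNB')), 'runs_scored':].copy()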
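The same filter can be sketched on toy data with `isin`, which some readers may find easier to extend if more "did not bat" markers turn up. The "-" placeholder in `runs_scored` is an assumption about how such rows look in the scraped data. The other values are invented.

```python
import pandas as pd

# Invented rows; "-" as the placeholder for missing runs is an assumption.
toy = pd.DataFrame({
    "score": ["0", "DNB", "148", "TDNB"],
    "runs_scored": ["0", "-", "148", "-"],
})

# Keep only the innings in which the batsman actually batted.
did_bat = ~toy["score"].isin(["DNB", "TDNB"])
batted = toy.loc[did_bat].copy()

# After filtering, the remaining values can be cast to integers safely.
batted["runs_scored"] = batted["runs_scored"].astype(int)

runs = batted["runs_scored"].tolist()
print(runs)
```

Casting before filtering would fail on the placeholder rows, which is why the numeric conversion comes after this step.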
Finally, we will fix the data types of all columns present in our new DataFrame.
# fixing the data types of numerical columns
df_new['runs_scored'] = df_new['runs_scored'].astype(int)
df_new['balls_faced'] = df_new['balls_faced'].astype(int)
df_new['strike_rate'] = df_new['strike_rate'].astype(float)
df_new['fours'] = df_new['fours'].astype(int)
df_new['sixes'] = df_new['sixes'].astype(int)
Career statistics
We will take a look at the descriptive statistics of MS Dhoni's ODI career. You can use the following code for this:
# first match
first_match_date = df['date'].dt.date.min().strftime('%B %d, %Y')
print('First match:', first_match_date)

# last match
last_match_date = df['date'].dt.date.max().strftime('%B %d, %Y')
print('\nLast match:', last_match_date)

# number of matches played in career
number_of_matches = df.shape[0]
print('\nNumber of matches played:', number_of_matches)

# number of innings
number_of_inns = df_new.shape[0]
print('\nNumber of innings played:', number_of_inns)

# number of not outs in career
not_outs = df_new['not_out'].sum()
print('\nNot outs:', not_outs)

# runs scored in career
runs_scored = df_new['runs_scored'].sum()
print('\nRuns scored in career:', runs_scored)

# balls faced in career
balls_faced = df_new['balls_faced'].sum()
print('\nBalls faced in career:', balls_faced)

# career strike rate: runs per 100 balls faced
career_sr = (runs_scored / balls_faced)*100
print('\nCareer strike rate: {:.2f}'.format(career_sr))

# career average: runs per dismissal
career_avg = (runs_scored / (number_of_inns - not_outs))
print('\nCareer average: {:.2f}'.format(career_avg))

# highest score
highest_score_date = df_new.loc[df_new.runs_scored == df_new.runs_scored.max(), 'date'].values[0]
highest_score = df.loc[df.date == highest_score_date, 'score'].values[0]
print('\nHighest score in career:', highest_score)

# number of 100s
hundreds = df_new.loc[df_new['runs_scored'] >= 100].shape[0]
print('\nNumber of 100s:', hundreds)

# number of 50s
fifties = df_new.loc[(df_new['runs_scored']>=50)&(df_new['runs_scored']<100)].shape[0]
print('\nNumber of 50s:', fifties)

# number of fours in career
fours = df_new['fours'].sum()
print('\nNumber of 4s:', fours)

# number of sixes in career
sixes = df_new['sixes'].sum()
print('\nNumber of 6s:', sixes)
The output should look like this:
This gives us a good idea of the overall career of MS Dhoni. He started playing in 2004 and last played an ODI in 2019. In a career of more than 15 years, he has scored 10 hundreds and a staggering 73 fifties. He has scored more than 10,000 runs in his career with an average of 50.6 and a strike rate of 87.6. His highest score is 183*.
Now we will do a more thorough analysis of his performance against different teams. We will also look at his performance year by year. We will use visualizations for this.
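The average and strike rate quoted here follow the formulas used in the code: the batting average is runs divided by dismissals (innings minus not outs), and the strike rate is runs per 100 balls faced. The sketch below works through both on small made-up numbers. The function names are hypothetical helpers, not part of any library.

```python
def batting_average(runs, innings, not_outs):
    """Runs per dismissal; NaN when the batsman was never dismissed."""
    dismissals = innings - not_outs
    if dismissals == 0:
        return float("nan")
    return runs / dismissals


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced (assumes balls > 0)."""
    return runs / balls * 100


# Made-up example: 150 runs in 4 innings, 1 of them not out, off 200 balls.
avg = batting_average(runs=150, innings=4, not_outs=1)
sr = strike_rate(runs=150, balls=200)
print(avg, sr)
```

Because not-out innings add runs without adding a dismissal, a batsman who often finishes unbeaten can have an average well above his typical score per innings.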
Analysis
First, we will see how many matches he has played against different oppositions. You can use the following code for this purpose:
# number of matches played against different oppositions
df['opposition'].value_counts().plot(kind='bar', title="Number of matches against different oppositions", figsize=(8, 5));
The output should look like this:
We can see that he has played most of his matches against Sri Lanka, Australia, England, West Indies, South Africa and Pakistan.
Let's see how many runs he has scored against different oppositions. You can use the following code snippet to generate the result:
runs_scored_by_opposition = pd.DataFrame(df_new.groupby('opposition')['runs_scored'].sum())
runs_scored_by_opposition.plot(kind='bar', title="Runs scored against different oppositions", figsize=(8, 5))
plt.xlabel(None);
The output will look like this:
We can see that Dhoni has scored the most runs against Sri Lanka, followed by Australia, England and Pakistan. He has also played many games against these teams, so it makes sense.
To get a clearer picture, let's take a look at his batting average against each team. The following code snippet will help us obtain the desired result:
# innings and not outs against each opposition
innings_by_opposition = pd.DataFrame(df_new.groupby('opposition')['date'].count())
not_outs_by_opposition = pd.DataFrame(df_new.groupby('opposition')['not_out'].sum())

# combining runs, innings and not outs into one DataFrame
temp = runs_scored_by_opposition.merge(innings_by_opposition, left_index=True, right_index=True)
average_by_opposition = temp.merge(not_outs_by_opposition, left_index=True, right_index=True)
average_by_opposition.rename(columns = {'date': 'innings'}, inplace=True)

# average = runs / (innings - not outs)
average_by_opposition['eff_num_of_inns'] = average_by_opposition['innings'] - average_by_opposition['not_out']
average_by_opposition['average'] = average_by_opposition['runs_scored'] / average_by_opposition['eff_num_of_inns']

# an opposition with no dismissals gives a division by zero (inf), so mark it as missing
average_by_opposition.replace(np.inf, np.nan, inplace=True)

major_nations = ['Australia', 'England', 'New Zealand', 'Pakistan', 'South Africa', 'Sri Lanka', 'West Indies']
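As an alternative sketch of the same calculation, the three groupby results can be computed in one pass with named aggregation instead of two merges. This is not the author's code, just an equivalent formulation on invented toy data. It uses the same column names as the article.

```python
import numpy as np
import pandas as pd

# Invented innings; "Kenya" has no dismissals, to show the inf case.
toy = pd.DataFrame({
    "opposition": ["Australia", "Australia", "England", "England", "Kenya"],
    "runs_scored": [40, 60, 30, 50, 25],
    "not_out": [0, 1, 0, 0, 1],
})

# One groupby producing runs, innings and not outs per opposition.
by_opp = toy.groupby("opposition").agg(
    runs_scored=("runs_scored", "sum"),
    innings=("runs_scored", "count"),
    not_out=("not_out", "sum"),
)

# Average = runs / dismissals; zero dismissals yields inf, replaced by NaN.
dismissals = by_opp["innings"] - by_opp["not_out"]
by_opp["average"] = (by_opp["runs_scored"] / dismissals).replace(np.inf, np.nan)

print(by_opp)
```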
To generate the graph, use the code snippet below:
plt.figure(figsize = (8, 5))
# 'o' draws a circle marker at each data point
plt.plot(average_by_opposition.loc[major_nations, 'average'].values, marker='o')
# dashed horizontal line at the career average for comparison
plt.plot([career_avg]*len(major_nations), '--')
plt.title('Average against major teams')
plt.xticks(range(0, 7), major_nations)
plt.ylim(20, 70)
plt.legend(['Avg against opposition', 'Career average']);
The output will look like this:
As we can see, Dhoni has performed remarkably well against tough teams like Australia, England and Sri Lanka. His average against these teams is close to his career average or slightly higher. The only team he hasn't performed well against is South Africa.
Let's now look at his year-on-year statistics. We will start by looking at how many matches he played each year after his debut. The code for that will be:
df['year'].value_counts().sort_index().plot(kind='bar', title="Matches played by year", figsize=(8, 5))
plt.xticks(rotation=0);
The plot will look like this:
We can see that in 2012, 2014 and 2016, Dhoni played very few ODI matches for India. In general, after 2005-2009, the average number of matches played per year decreased slightly.
We should also look at how many runs he scored each year. The code for that will be:
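One detail worth knowing about the matches-per-year count: `value_counts` only reports years that appear in the data, so a year with no matches would be left out of the bar chart rather than shown as zero. The sketch below, on made-up years, shows how `reindex` could fill such gaps if they occurred.

```python
import pandas as pd

# Made-up match years; note that 2006 does not appear at all.
years = pd.Series([2005, 2004, 2005, 2007, 2005])

# Count matches per year and order the result chronologically.
per_year = years.value_counts().sort_index()

# Add any missing years in the range with a count of zero.
full = per_year.reindex(range(2004, 2008), fill_value=0)

print(full.tolist())
```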
df_new.groupby('year')['runs_scored'].sum().plot(kind='line', marker='o', title="Runs scored by year", figsize=(8, 5))
# show a tick for every year in which a match was played
years = df['year'].unique().tolist()
plt.xticks(years)
plt.xlabel(None);
The output should look like this:
You can clearly see that Dhoni scored the most runs in 2009, followed by 2007 and 2008. The number of runs began to decline after 2010 (because the number of matches played also started to decrease).
Finally, let's look at the progression of his career batting average by innings. This is time series data, so it is plotted as a line chart. The code for that will be:
df_new.reset_index(drop=True, inplace=True)
career_average = pd.DataFrame()
# cumulative runs after each innings
career_average['runs_scored_in_career'] = df_new['runs_scored'].cumsum()
# innings counter starting from 1
career_average['innings'] = df_new.index.tolist()
career_average['innings'] = career_average['innings'].apply(lambda x: x+1)
# cumulative not outs and dismissals after each innings
career_average['not_outs_in_career'] = df_new['not_out'].cumsum()
career_average['eff_num_of_inns'] = career_average['innings'] - career_average['not_outs_in_career']
# running career average after each innings
career_average['average'] = career_average['runs_scored_in_career'] / career_average['eff_num_of_inns']
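To make the running average easier to follow, here is a minimal sketch of the same cumulative calculation on four invented innings. After each innings it divides the runs scored so far by the dismissals so far.

```python
import pandas as pd

# Four invented innings; the second one is not out.
toy = pd.DataFrame({"runs_scored": [10, 30, 20, 40], "not_out": [0, 1, 0, 0]})

# Innings counter 1, 2, 3, ... aligned with the DataFrame index.
innings = pd.Series(range(1, len(toy) + 1), index=toy.index)

# Dismissals so far = innings so far minus not outs so far.
dismissals = innings - toy["not_out"].cumsum()

# Running average = cumulative runs / cumulative dismissals.
running_avg = toy["runs_scored"].cumsum() / dismissals

print(running_avg.round(2).tolist())
```

The not-out innings lifts the running average sharply (from 10 to 40 here) because it adds runs without adding a dismissal, which is one reason early-career averages can swing so much.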
The code snippet for the plot will be:
plt.figure(figsize = (8, 5))
plt.plot(career_average['average'])
# dashed horizontal line at the final career average
plt.plot([career_avg]*career_average.shape[0], '--')
plt.title('Career average progression by innings')
plt.xlabel('Number of innings')
plt.legend(['Avg progression', 'Career average']);
The output graph will look like this:
We can see that after a slow start and a dip in performance around innings number 50, Dhoni's performance recovered substantially. Towards the end of his career, he consistently averaged above 50.
EndNote
In this article, we analyzed the batting performance of Indian cricketer MS Dhoni. We looked at his overall career statistics, his performance against different opponents and his performance year by year.
This article has been written by Vishesh Arora. You can connect with me at LinkedIn.
The media shown in this article is not the property of DataPeaker and is used at the author's discretion.