*This article was published as part of the Data Science Blogathon*

## Introduction

In my previous article, I discussed the theoretical concepts behind outliers and tried to answer the question: **“When should we remove outliers, and when should we keep them?”**

To get the most out of this article, read that **article** first and then come back to this one, so that you have a clear picture of outlier analysis in data science projects.

In this article, we will try to answer the following questions together, along with a **Python** implementation:

👉 **How to deal with outliers?**

👉 **How to detect outliers?**

👉 **What are the techniques for the detection and removal of outliers?**


## Let us begin

**How to deal with outliers?**

👉 **Trimming:** Exclude the outliers from our analysis. With this technique, our dataset becomes thinner when many outliers are present. Its main advantage is that it is the **fastest** approach.
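As a minimal sketch (on a small, made-up pandas Series rather than a real dataset), trimming simply filters out the rows that fall outside chosen bounds:

```python
import pandas as pd

# Made-up toy data: one extreme value among typical ones
s = pd.Series([25, 30, 32, 28, 500])

# Trimming: keep only the points inside hand-picked bounds
lower, upper = 10, 100
trimmed = s[(s >= lower) & (s <= upper)]

print(trimmed.tolist())  # [25, 30, 32, 28] — the extreme value 500 is dropped
```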

👉 **Capping:** In this technique, we cap our outliers at a limit: every value above (or below) a chosen threshold is considered an outlier and is replaced by that threshold.

**For instance,** if you are working with an income feature, people above a certain income level may behave in much the same way as those with slightly lower incomes. In this case, you can cap income at that level and, consequently, deal with the outliers.
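A short sketch of that income example, with made-up numbers; `pandas.Series.clip` does the capping:

```python
import pandas as pd

# Made-up incomes (in thousands); one extreme earner
income = pd.Series([40, 55, 62, 48, 900])

# Cap the feature at a chosen level: values above it are set to the cap
cap = 100
capped = income.clip(upper=cap)

print(capped.tolist())  # [40, 55, 62, 48, 100]
```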

👉 **Treat outliers as missing values:** Treat outliers as if they were missing observations, and handle them in the same way you handle missing values.
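A brief sketch of this idea, with made-up data and assumed limits: mark the outliers as NaN, then impute them like any other missing value (here, with the median):

```python
import numpy as np
import pandas as pd

s = pd.Series([7.1, 6.8, 7.4, 15.0, 6.9])

# Assume anything outside [5, 10] is an outlier; mark it as missing
s_masked = s.where(s.between(5.0, 10.0), np.nan)

# Then handle it like any missing value, e.g. impute with the median
s_filled = s_masked.fillna(s_masked.median())
```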

You can check the article on missing values **here**.

👉 **Discretization:** In this technique, we form groups and include the outliers in a particular group, forcing them to behave in the same way as the other points in that group. This technique is also known as **binning**.

You can learn more about discretization **here**.
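A tiny sketch of binning with `pandas.cut` (made-up ages; the group labels are arbitrary): the extreme value simply lands in an edge group and is treated like its neighbours.

```python
import pandas as pd

ages = pd.Series([3, 22, 25, 27, 31, 95])

# Split the range into 3 equal-width groups; outliers fall into the edge bins
groups = pd.cut(ages, bins=3, labels=['low', 'mid', 'high'])

print(groups.tolist())
```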

**How to detect outliers?**

👉 **For normal distributions:** use the empirical rule of the normal distribution.

– Data points below **mean − 3·sigma** or above **mean + 3·sigma** are **outliers**, where mean and sigma are the **average** and **standard deviation** of the column in question.

Fig. Characteristics of a normal distribution

**Image source: Link**
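The rule above can be sketched in a few lines (on synthetic, roughly normal data, so the values and parameters here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic, approximately normal feature
rng = np.random.default_rng(42)
col = pd.Series(rng.normal(loc=7.0, scale=0.6, size=1000))

# Empirical rule: points beyond mean ± 3·sigma are outliers
upper = col.mean() + 3 * col.std()
lower = col.mean() - 3 * col.std()
outliers = col[(col < lower) | (col > upper)]

print(len(outliers))  # typically only a handful (~0.3% of the data)
```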


👉 **For skewed distributions:** Use the Inter-Quartile Range proximity rule (IQR).

– Data points below **Q1 − 1.5·IQR** or above **Q3 + 1.5·IQR** are **outliers**, where Q1 and Q3 are the **25th** and **75th percentiles** of the dataset, respectively, and IQR is the interquartile range, given by Q3 − Q1.

Fig. IQR to detect outliers

**Image source: Link**
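The IQR proximity rule in a few lines, on made-up, skewed toy data:

```python
import pandas as pd

marks = pd.Series([12, 15, 14, 16, 13, 18, 17, 90])  # skewed by one value

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# IQR proximity rule: flag points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]

print(outliers.tolist())  # [90]
```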


👉 **For other distributions:** use a percentile-based approach.

**For instance,** data points above the 99th percentile or below the 1st percentile are considered outliers.

Fig. Percentile representation

**Image source: Link**
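A minimal sketch of the percentile approach (on made-up data from 1 to 100):

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1, 101, dtype=float))  # 1.0 .. 100.0

# Anything beyond the 1st/99th percentile is flagged as an outlier
p01 = values.quantile(0.01)
p99 = values.quantile(0.99)
outliers = values[(values < p01) | (values > p99)]

print(outliers.tolist())  # [1.0, 100.0]
```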

**Techniques for the detection and removal of outliers:**

👉 **Z score treatment:**

**Assumption** – the features are normally or approximately normally distributed.

**Step 1: Import the required dependencies**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

**Step 2: Read and load the dataset**

```python
df = pd.read_csv('placement.csv')
df.sample(5)
```

**Step 3: Plot the distributions of the features**

```python
import warnings
warnings.filterwarnings('ignore')

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

**Step 4: Find the boundary values**

```python
print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())
```

**Output:**

```
Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842
```

**Step 5: Find the outliers**

```python
df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]
```

**Step 6: Trim the outliers**

```python
new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df
```

**Step 7: Capping — compute the limits**

```python
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
```

**Step 8: Now, apply the cap**

```python
df['cgpa'] = np.where(
    df['cgpa'] > upper_limit,
    upper_limit,
    np.where(
        df['cgpa'] < lower_limit,
        lower_limit,
        df['cgpa']
    )
)
```

**Step 9: Now view the statistics using the “describe” function**

```python
df['cgpa'].describe()
```

**Output:**

```
count    1000.000000
mean        6.961499
std         0.612688
min         5.113546
25%         6.550000
50%         6.960000
75%         7.370000
max         8.808934
Name: cgpa, dtype: float64
```

**This completes our Z-score based technique!!**


👉 **Filtering based on IQR:**

This approach is used when our data distribution is skewed.

**Step 1: Import the necessary dependencies**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

**Step 2: Read and load the dataset**

```python
df = pd.read_csv('placement.csv')
df.head()
```

**Step 3: Plot the distributions of the features**

```python
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

**Step 4: Draw a box plot for the skewed feature**

```python
sns.boxplot(df['placement_exam_marks'])
```

**Step 5: Find the IQR**

```python
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25
```

**Step 6: Find the upper and lower limits**

```python
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
```

**Step 7: Find the outliers**

```python
df[df['placement_exam_marks'] > upper_limit]
df[df['placement_exam_marks'] < lower_limit]
```

**Step 8: Trimming**

```python
new_df = df[df['placement_exam_marks'] < upper_limit]
new_df.shape
```

**Step 9: Compare the plots after trimming**

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df['placement_exam_marks'])
plt.show()
```

**Step 10: Capping**

```python
new_df_cap = df.copy()
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)
```

**Step 11: Compare the plots after capping**

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df_cap['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df_cap['placement_exam_marks'])
plt.show()
```

**This completes our IQR-based technique!!**


👉 **Percentile:**

– This technique works by setting a particular threshold value, which is chosen based on our approach to the problem.

– When we remove outliers by capping them in this way, the method is known as **Winsorization**.

– Here we always maintain **symmetry** on both sides: if we trim 1% from the right tail, we also trim 1% from the left tail.
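A sketch of symmetric 1% Winsorization using `quantile` plus `clip` (made-up data; `scipy.stats.mstats.winsorize` offers the same idea as a ready-made function):

```python
import numpy as np
import pandas as pd

heights = pd.Series(np.arange(1, 101, dtype=float))

# Symmetric winsorization: clip both tails at the 1st and 99th percentiles
lo, hi = heights.quantile(0.01), heights.quantile(0.99)
wins = heights.clip(lower=lo, upper=hi)
```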

**Step 1: Import the necessary dependencies**

```python
import numpy as np
import pandas as pd
import seaborn as sns
```

**Step 2: Read and load the dataset**

```python
df = pd.read_csv('weight-height.csv')
df.sample(5)
```

**Step 3: Plot the distribution of the “Height” feature**

```python
sns.distplot(df['Height'])
```

**Step 4: Draw the box plot of the “Height” feature**

```python
sns.boxplot(df['Height'])
```

**Step 5: Find the upper and lower limits**

```python
upper_limit = df['Height'].quantile(0.99)
lower_limit = df['Height'].quantile(0.01)
```

**Step 6: Apply trimming**

```python
# The limits computed above are approximately 74.78 and 58.13
new_df = df[(df['Height'] <= upper_limit) & (df['Height'] >= lower_limit)]
```

**Step 7: Compare the distribution and box plot after trimming**

```python
sns.distplot(new_df['Height'])
sns.boxplot(new_df['Height'])
```

👉 **Winsorization:**

**Step 8: Apply capping (Winsorization)**

```python
df['Height'] = np.where(df['Height'] >= upper_limit, upper_limit,
               np.where(df['Height'] <= lower_limit, lower_limit,
                        df['Height']))
```

**Step 9: Compare the distribution and box plot after capping**

```python
sns.distplot(df['Height'])
sns.boxplot(df['Height'])
```

**This completes our percentile-based technique!!**


**Final notes**

*Thank you for reading!*

If you liked this and want to know more, visit my other articles on data science and machine learning by clicking on the **Link**

Feel free to contact me at **Linkedin, Email**.

Anything not mentioned or do you want to share your thoughts? Feel free to comment below and I'll get back to you.

**About the Author**

**Chirag Goyal**

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the **Indian Institute of Technology Jodhpur (IITJ)**. I am passionate about machine learning, deep learning, and artificial intelligence.


** The media shown in this article is not the property of DataPeaker and is used at the author's discretion. **