# How to detect and eliminate outliers

## Introduction

In my previous article, I discussed the theoretical concepts behind outliers and tried to answer the question: "When should we remove outliers, and when should we keep them?"

To get the most out of this article, read that one first so that you have a clear idea of outlier analysis in data science projects. Here we will cover:

👉 How to deal with outliers?

👉 How to detect outliers?

👉 What are the techniques for the detection and removal of outliers?

## Let us begin

### How to deal with outliers?

👉 Trimming: Exclude the outliers from our analysis. With this technique, our data set becomes thin when many outliers are present in it. Its main advantage is that it is the fastest approach.

👉 Capping: In this technique, we cap our outliers, i.e., we set an upper and a lower limit, and any value above or below those limits is treated as an outlier; the number of outliers in the data set also informs the choice of that capping value.

For instance, if you are working with an income feature, people above a certain income level may behave in much the same way as those with slightly lower incomes. In this case, you can cap the income values at that level and, consequently, deal with the outliers.

👉 Treat outliers as missing values: Assume the outliers are missing observations and handle them accordingly, i.e., exactly as you would handle missing values. A minimal sketch follows below.

You can check out my article on missing values here.
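As an illustration, here is a minimal sketch that marks outliers as NaN and then imputes them with the median. The `income` column and its values are made up for illustration, and the IQR proximity rule used to flag outliers is explained later in this article:

```
import pandas as pd

# Hypothetical data: a numeric 'income' column with one extreme value
df = pd.DataFrame({'income': [25, 30, 32, 28, 35, 400, 27, 31]})

# Flag values outside the IQR proximity-rule limits as outliers
q1, q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
iqr = q3 - q1
is_outlier = (df['income'] < q1 - 1.5 * iqr) | (df['income'] > q3 + 1.5 * iqr)

# Replace the outliers with NaN, then impute them like any other missing value
df['income'] = df['income'].mask(is_outlier)
df['income'] = df['income'].fillna(df['income'].median())
```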

👉 Discretization: In this technique, when forming groups, we include the outliers in a particular group and force them to behave in the same way as the other points in that group. This technique is also known as binning (see the sketch below).
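As a rough sketch of binning (reusing the hypothetical `income` column from the previous example), pd.cut groups values into intervals, so an extreme value simply lands in the highest bin:

```
import pandas as pd

df = pd.DataFrame({'income': [25, 30, 32, 28, 35, 400, 27, 31]})

# Discretize into groups; everything above 40 falls into a single bin,
# so the outlier (400) behaves like any other point in the 'high' group
bins = [0, 20, 30, 40, float('inf')]
labels = ['very low', 'low', 'medium', 'high']
df['income_group'] = pd.cut(df['income'], bins=bins, labels=labels)
print(df)
```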

### How to detect outliers?

👉 For normal distributions: Use the empirical rule of the normal distribution.

– The data points that fall below mean - 3σ or above mean + 3σ are outliers,

where mean and σ are the average value and the standard deviation of a particular column.

Fig. Characteristics of a normal distribution

👉 For skewed distributions: Use the interquartile range (IQR) proximity rule.

– The data points that fall below Q1 - 1.5·IQR or above Q3 + 1.5·IQR are outliers,

where Q1 and Q3 are the 25th and 75th percentiles of the data set respectively, and IQR is the interquartile range, given by Q3 - Q1.

Fig. IQR to detect outliers

👉 For other distributions: Use a percentile-based approach.

For instance, data points that lie above the 99th percentile or below the 1st percentile are considered outliers.

Fig. Percentile representation
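Before the full walkthroughs below, here is a minimal sketch that expresses all three rules as boolean masks on a hypothetical numeric Series (the values are made up for illustration):

```
import pandas as pd

s = pd.Series([12, 15, 14, 10, 13, 95, 11, 14, 13, 12])

# Rule 1 (normal distributions): beyond mean +/- 3 standard deviations
normal_mask = (s < s.mean() - 3 * s.std()) | (s > s.mean() + 3 * s.std())

# Rule 2 (skewed distributions): the IQR proximity rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Rule 3 (other distributions): percentile thresholds
pct_mask = (s < s.quantile(0.01)) | (s > s.quantile(0.99))

print(s[iqr_mask])  # the IQR rule flags the extreme value 95
```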

### Techniques for the detection and removal of outliers

👉 Z-score treatment:

Assumption: the features are normally or approximately normally distributed.

Step 1: Import the required dependencies

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step 2: Read and load the data set

```
df = pd.read_csv('placement.csv')
df.sample(5)
```

Step 3: Plot the distribution of the features

```
import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.distplot(df['cgpa'])
plt.subplot(1,2,2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step 4: Find the boundary values

```print("Highest allowed",df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed",df['cgpa'].mean() - 3*df['cgpa'].std())```

Output:

```
Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842
```

Step 5: Find the outliers

`df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]`

Step 6: Trim the outliers

```
new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df
```

Step 7: Compute the capping limits

```
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
```

Step 8: Now, apply capping

```
# Replace values beyond the limits with the limits themselves
df['cgpa'] = np.where(
    df['cgpa'] > upper_limit,
    upper_limit,
    np.where(
        df['cgpa'] < lower_limit,
        lower_limit,
        df['cgpa']
    )
)
```

Step 9: Now view the statistics using the describe() function

`df['cgpa'].describe()`

Output:

```
count    1000.000000
mean        6.961499
std         0.612688
min         5.113546
25%         6.550000
50%         6.960000
75%         7.370000
max         8.808934
Name: cgpa, dtype: float64
```

This completes our Z-score based technique!!
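As an aside, the same ±3σ rule can be written more compactly with scipy. This is just a sketch, assuming a freshly loaded placement.csv (note that scipy.stats.zscore uses the population standard deviation, so the cut-offs can differ slightly from the manual version above):

```
from scipy import stats
import pandas as pd

df = pd.read_csv('placement.csv')

# z-score of every value in the column
z = stats.zscore(df['cgpa'])

# Keep only the rows whose z-score lies within +/- 3
new_df = df[abs(z) < 3]
```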

👉 Filtering based on IQR:

Used when our data distribution is skewed.

Step 1: Import the necessary dependencies

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step 2: Read and load the data set

```
df = pd.read_csv('placement.csv')
```

Step 3: Plot the distribution of the features

```
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.distplot(df['cgpa'])
plt.subplot(1,2,2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step 4: Draw a box plot of the skewed feature

`sns.boxplot(df['placement_exam_marks'])`

Step 5: Find the IQR

```
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25
```

Step 6: Find the upper and lower limits

```
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
```

Step 7: Find the outliers

```
# Outliers above the upper limit
df[df['placement_exam_marks'] > upper_limit]
# Outliers below the lower limit
df[df['placement_exam_marks'] < lower_limit]
```

Step 8: Trimming

```
new_df = df[df['placement_exam_marks'] < upper_limit]
new_df.shape
```

Step 9: Compare the plots after trimming

```
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2,2,2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.distplot(new_df['placement_exam_marks'])
plt.subplot(2,2,4)
sns.boxplot(new_df['placement_exam_marks'])
plt.show()
```

Step 10: Capping

```
new_df_cap = df.copy()

# Replace values beyond the limits with the limits themselves
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)
```

Step 11: Compare the plots after capping

```
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2,2,2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.distplot(new_df_cap['placement_exam_marks'])
plt.subplot(2,2,4)
sns.boxplot(new_df_cap['placement_exam_marks'])
plt.show()
```

This completes our IQR-based technique!!
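If you need this procedure often, the trimming and capping steps can be wrapped in a small helper. A minimal sketch (the helper name `iqr_cap` is my own, not a library function):

```
def iqr_cap(series, k=1.5):
    # Compute the IQR proximity-rule limits
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    # clip() replaces values beyond the limits with the limits themselves
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Usage: cap the skewed column in a single line
df['placement_exam_marks'] = iqr_cap(df['placement_exam_marks'])
```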

👉 Percentile:

– This technique works by setting a particular threshold value, which is decided based on our approach to the problem.

– When we handle outliers by capping them rather than removing them, that particular method is known as Winsorization.

– Here we always maintain symmetry on both sides, meaning that if we trim 1% from the right tail, we also trim 1% from the left tail.

Step 1: Import the necessary dependencies

```
import numpy as np
import pandas as pd
import seaborn as sns
```

Step 2: Read and load the data set

```
df = pd.read_csv('weight-height.csv')
df.sample(5)
```

Step 3: Plot the distribution of the "Height" feature

`sns.distplot(df['Height'])`

Step 4: Draw the box plot of the "Height" feature

`sns.boxplot(df['Height'])`

Step 5: Find the upper and lower limits

```
upper_limit = df['Height'].quantile(0.99)
lower_limit = df['Height'].quantile(0.01)
```

Step 6: Apply trimming

```
# The limits from Step 5 (about 74.78 and 58.13 for this data set)
new_df = df[(df['Height'] <= upper_limit) & (df['Height'] >= lower_limit)]
```

Step 7: Compare the distribution and box plot after trimming

```
sns.distplot(new_df['Height'])
sns.boxplot(new_df['Height'])
```

👉 Winsorization:

Step 8: Apply capping (Winsorization)

```
# Cap values at the percentile-based limits
df['Height'] = np.where(df['Height'] >= upper_limit,
                        upper_limit,
                        np.where(df['Height'] <= lower_limit,
                                 lower_limit,
                                 df['Height']))
```

Step 9: Compare the distribution and box plot after capping

```
sns.distplot(df['Height'])
sns.boxplot(df['Height'])
```

This completes our percentile-based technique!!
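For reference, scipy also ships a ready-made Winsorization routine, scipy.stats.mstats.winsorize. A minimal sketch, assuming the same 'Height' column and symmetric 1% limits as above:

```
from scipy.stats.mstats import winsorize

# limits are the fractions to cap in each tail: 1% on the left, 1% on the right
df['Height'] = winsorize(df['Height'], limits=[0.01, 0.01])
```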

## Final notes

If you liked this article and want to know more, check out my other articles on data science and machine learning by clicking on the link.

Feel free to contact me on LinkedIn or via email.

Did I miss anything, or do you want to share your thoughts? Feel free to comment below and I'll get back to you.

## Chirag Goyal

I am currently pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am passionate about machine learning, deep learning, and artificial intelligence.