How to detect and eliminate outliers


This article was published as part of the Data Science Blogathon

Introduction

In my previous article, I talked about the theoretical concepts of outliers and tried to answer the question: "When should we remove outliers and when should we keep them?".

To better understand this article, you should read that article first and then move on to this one, so that you have a clear idea about outlier analysis in data science projects.

In this article, we will try to answer the following questions, together with the Python implementation:

👉 How to deal with outliers?

👉 How to detect outliers?

👉 What are the techniques for the detection and removal of outliers?

Let us begin


How to deal with outliers?

👉 Trimming: Exclude outliers from our analysis. With this technique, our data becomes thin when many outliers are present in the data set. Its main advantage is that it is the fastest approach.

👉 Capping: In this technique, we cap our outliers: we set a limit so that all values above (or below) a particular value are considered outliers and are replaced with that bounding value.

For instance, if you are working with an income feature, people above a certain income level may behave in the same way as those with lower incomes. In this case, you can cap income at that level, keep the rest of the data intact and, consequently, deal with the outliers.

👉 Treat outliers as missing values: Assume outliers are missing observations and treat them accordingly, i.e., the same way as missing values.

You can check the article on missing values here.
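As a minimal sketch of this idea (with a hypothetical `income` column and the IQR rule as the detection criterion, both my own choices), outliers can be converted to NaN and then imputed like any other missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical income values (in thousands) with one obvious outlier
df = pd.DataFrame({'income': [35.0, 42.0, 38.0, 40.0, 41.0, 39.0, 500.0]})

# Flag outliers with the IQR proximity rule and mark them as missing
q1, q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
iqr = q3 - q1
mask = (df['income'] < q1 - 1.5 * iqr) | (df['income'] > q3 + 1.5 * iqr)
df.loc[mask, 'income'] = np.nan

# Treat them exactly like any other missing value, e.g. median imputation
df['income'] = df['income'].fillna(df['income'].median())
```

Any missing-value strategy (mean/median imputation, model-based imputation, or row removal) can then be applied unchanged.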

👉 Discretization: In this technique, when forming groups we include the outliers in a particular group and force them to behave in the same way as the other points in that group. This technique is also known as Binning.

You can learn more about discretization here.
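A small sketch of discretization with pandas (the values and bin labels are my own, illustrative choices): with equal-frequency binning, an extreme value simply lands in the top bin alongside the other large values.

```python
import pandas as pd

# Hypothetical marks with one extreme value
s = pd.Series([12, 15, 18, 22, 25, 27, 30, 95])

# Discretize into 4 equal-frequency bins; the outlier (95) falls into
# the highest bin and behaves like the other points in that group
binned = pd.qcut(s, q=4, labels=['low', 'mid-low', 'mid-high', 'high'])
```

`pd.cut` (equal-width bins) works the same way, but equal-frequency bins are usually safer here, since a single extreme value can stretch equal-width bins badly.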

How to detect outliers?

👉 For normal distributions: Use the empirical rule of the normal distribution.

– The data points below mean - 3 * (sigma) or above mean + 3 * (sigma) are outliers,

where mean and sigma are the average value and standard deviation of the particular column.


Fig. Characteristics of a normal distribution

Image source: Link
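As a quick, self-contained illustration of this rule (on synthetic data with one injected extreme value; the numbers are my own, not from the article's data set):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=1000)   # roughly normal sample
data = np.append(data, 120.0)                    # inject one obvious outlier

# Empirical rule: flag points beyond mean +/- 3 * sigma
mean, sigma = data.mean(), data.std()
lower, upper = mean - 3 * sigma, mean + 3 * sigma
outliers = data[(data < lower) | (data > upper)]
```

For a truly normal sample, only about 0.3% of points fall outside these limits, so anything flagged deserves a closer look.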

👉 For skewed distributions: Use the Inter-Quartile Range proximity rule (IQR).

– The data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are outliers,

where Q1 and Q3 are the 25th and 75th percentiles of the data set respectively, and IQR is the interquartile range, given by Q3 - Q1.


Fig. IQR to detect outliers

Image source: Link
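A tiny worked example of the rule (on a skewed, made-up array of my own):

```python
import numpy as np

marks = np.array([5, 7, 8, 9, 10, 11, 12, 14, 40])  # skewed, one extreme

# IQR proximity rule: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(marks, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]
```

Here Q1 = 8, Q3 = 12 and IQR = 4, giving limits of 2 and 18, so only the value 40 is flagged.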

👉 For other distributions: Use a percentile-based approach.

For instance, data points above the 99th percentile or below the 1st percentile can be considered outliers.


Fig. Percentile representation

Image source: Link

Techniques for the detection and elimination of outliers:

👉 Z score treatment:

Assumption: the features are normally or approximately normally distributed.

Step 1: Import the required dependencies

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Read and load the data set

df = pd.read_csv('placement.csv')
df.sample(5)


Step 3: Plot the distribution of the features

import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.distplot(df['cgpa'])
plt.subplot(1,2,2)
sns.distplot(df['placement_exam_marks'])
plt.show()


Step 4: Find the limit values

print("Highest allowed",df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed",df['cgpa'].mean() - 3*df['cgpa'].std())

Output:

Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842

Step 5: Find the outliers

df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]

Step 6: Trim the outliers

new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df

Step 7: Find the capping limits

upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()

Step 8: Now apply capping

df['cgpa'] = np.where(
    df['cgpa']>upper_limit,
    upper_limit,
    np.where(
        df['cgpa']<lower_limit,
        lower_limit,
        df['cgpa']
    )
)

Step 9: Now view the statistics using the "describe" function

df['cgpa'].describe()

Output:

count    1000.000000
mean        6.961499
std         0.612688
min         5.113546
25%         6.550000
50%         6.960000
75%         7.370000
max         8.808934
Name: cgpa, dtype: float64

This completes our Z-score based technique!!
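The trimming and capping steps above can be wrapped in a small reusable helper. This is only a sketch: the function name and its `action` parameter are my own, not part of the original code.

```python
import numpy as np
import pandas as pd

def zscore_outliers(df, col, action='trim', k=3):
    """Trim or cap values of `col` that lie beyond mean +/- k * std."""
    upper = df[col].mean() + k * df[col].std()
    lower = df[col].mean() - k * df[col].std()
    if action == 'trim':
        # Keep only rows inside the limits
        return df[(df[col] > lower) & (df[col] < upper)]
    # action == 'cap': replace values beyond the limits with the limits
    out = df.copy()
    out[col] = out[col].clip(lower, upper)
    return out
```

For example, `zscore_outliers(df, 'cgpa', 'cap')` reproduces Steps 7-8 in one call, without mutating the original DataFrame.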

👉 Filtering based on IQR:

This is used when the data distribution is skewed.

Step 1: Import the necessary dependencies

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Read and load the data set

df = pd.read_csv('placement.csv')
df.head()

Step 3: Plot the distribution of the features

plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
sns.distplot(df['cgpa'])
plt.subplot(1,2,2)
sns.distplot(df['placement_exam_marks'])
plt.show()

Step 4: Draw a box plot of the skewed feature

sns.boxplot(df['placement_exam_marks'])


Step 5: Find the IQR

percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25

Step 6: Find the upper and lower limits

upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

Step 7: Find the outliers

df[df['placement_exam_marks'] > upper_limit]
df[df['placement_exam_marks'] < lower_limit]

Step 8: Trimming

new_df = df[df['placement_exam_marks'] < upper_limit]
new_df.shape

Step 9: Compare the plots after trimming

plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2,2,2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.distplot(new_df['placement_exam_marks'])
plt.subplot(2,2,4)
sns.boxplot(new_df['placement_exam_marks'])
plt.show()


Step 10: Capping

new_df_cap = df.copy()
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)

Step 11: Compare the plots after capping

plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2,2,2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2,2,3)
sns.distplot(new_df_cap['placement_exam_marks'])
plt.subplot(2,2,4)
sns.boxplot(new_df_cap['placement_exam_marks'])
plt.show()


This completes our IQR-based technique!!
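As with the Z-score technique, the IQR steps can be collected into small helpers (a sketch; the function names are my own, not from the original code):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits under the IQR proximity rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_cap(df, col, k=1.5):
    """Cap `col` at the IQR limits, returning a new DataFrame."""
    lower, upper = iqr_bounds(df[col], k)
    out = df.copy()
    out[col] = out[col].clip(lower, upper)
    return out
```

The multiplier `k = 1.5` is the conventional choice; a stricter `k = 3` is sometimes used to flag only "far out" points.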

👉 Percentile:

– This technique works by setting a particular threshold value, which is decided based on our approach to the problem.

– When we remove outliers by capping at a percentile, that particular method is known as Winsorization.

– Here we always keep symmetry on both sides: if we remove 1% from the right tail, we also remove 1% from the left tail.

Step 1: Import the necessary dependencies

import numpy as np
import pandas as pd
import seaborn as sns

Step 2: Read and load the data set

df = pd.read_csv('weight-height.csv')
df.sample(5)


Step 3: Plot the distribution of the "Height" feature

sns.distplot(df['Height'])

Step 4: Draw the box plot of the "Height" feature

sns.boxplot(df['Height'])


Step 5: Find the upper and lower limits

upper_limit = df['Height'].quantile(0.99)
lower_limit = df['Height'].quantile(0.01)

Step 6: Apply trimming

new_df = df[(df['Height'] <= upper_limit) & (df['Height'] >= lower_limit)]

Step 7: Compare the distribution and the box plot after trimming

sns.distplot(new_df['Height'])
sns.boxplot(new_df['Height'])


👉 Winsorization:

Step 8: Apply capping (Winsorization)

df['Height'] = np.where(df['Height'] >= upper_limit,
        upper_limit,
        np.where(df['Height'] <= lower_limit,
        lower_limit,
        df['Height']))

Step 9: Compare the distribution and the box plot after capping

sns.distplot(df['Height'])
sns.boxplot(df['Height'])


This completes our percentile-based technique!!
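The percentile-based trimming and Winsorization steps can likewise be wrapped in one helper (a sketch; the function name and parameters are my own, though the 1%/99% defaults match the example above):

```python
import pandas as pd

def winsorize_column(df, col, lower_pct=0.01, upper_pct=0.99):
    """Cap `col` at the given percentiles (symmetric by default)."""
    lower = df[col].quantile(lower_pct)
    upper = df[col].quantile(upper_pct)
    out = df.copy()
    out[col] = out[col].clip(lower, upper)
    return out
```

SciPy offers a similar ready-made routine, `scipy.stats.mstats.winsorize`, if you prefer not to roll your own.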

Final notes

Thank you for reading!

If you liked this article and want to know more, visit my other articles on data science and machine learning by clicking on the Link.

Feel free to contact me on LinkedIn or by email.

Is there anything I did not mention, or do you want to share your thoughts? Feel free to comment below and I will get back to you.

About the Author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am very excited about machine learning, deep learning, and artificial intelligence.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
