Understanding Your Data with Exploratory Data Analysis


This article was published as part of the Data Science Blogathon.

Introduction

Well! We all love cakes. If you take a closer look at the baking process, you will notice how the right combination of ingredients, plus a clever leavening agent like baking powder, decides the rise and fall of your cake.

"Baking a cake" may seem out of place in a data science article, but I think it is a relatable and delightful analogy for understanding the importance of EDA in the data science process.

What the clever leavening agent (baking powder) is to baking a cake, exploratory data analysis is to the data science pipeline.

Before your mouth waters for a cake like mine does, let's understand:

What exactly is exploratory data analysis?

Exploratory data analysis is an approach to data analysis that employs a variety of techniques to:

  • Get insight into the data.
  • Perform sanity checks (to be sure that the insights we are extracting actually come from the correct dataset).
  • Find out where data is missing.
  • Check for outliers.
  • Summarize the data.
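The checks above can be sketched with a few lines of pandas. This is a minimal illustration on hypothetical toy data (the column names merely echo the Black Friday example used later), not the article's own code:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "Purchase": [8370, 15200, 1422, np.nan, 7969],
    "City_Category": ["A", "C", "A", "B", "A"],
})

# Summarize the data / get a first insight.
print(df.describe(include="all"))

# Find out where data is missing.
missing = df.isna().sum()
print(missing)

# Check for outliers with a simple z-score rule.
purchase = df["Purchase"].dropna()
z = (purchase - purchase.mean()) / purchase.std()
outliers = purchase[z.abs() > 3]
```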

Take the famous "Black Friday Sales" case study to understand why we need EDA.


The core problem is understanding customer behavior by predicting the purchase amount. But isn't that too abstract? It leaves you puzzled as to what to do with the data, especially when you have so many different products across various categories.

Before continuing, think a bit about this question: would you put all the ingredients available in the kitchen, as is, into the oven to bake a cake?

Obviously, the answer is no! Likewise, before feeding the complete dataset as-is into a machine learning model, we will want to:

  1. Extract important information
    1. Identification of variables (whether the data contains categorical variables, numeric variables, or a combination of both).
    2. The behavior of the variables (whether a variable ranges from 0 to 10 or from 0 to 1 million).
    3. The relationships between variables (how variables depend on each other).
  2. Check data consistency
    1. Ensure that all data is present (if we have collected data for three years, any missing weeks can be a problem in later stages).
    2. Are there any missing values?
    3. Are there outliers in the dataset? (For instance, a person aged 2,000 years is definitely an anomaly.)
  3. Feature engineering
    1. Create new features from existing raw features in the dataset.
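The first and third points can be sketched in pandas. The data and the derived `Age_Lower` feature below are hypothetical illustrations, loosely modeled on the age-bucket columns common in such datasets:

```python
import pandas as pd

# Toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "Age": ["0-17", "26-35", "26-35"],
    "Purchase": [8370, 15200, 1422],
})

# 1. Identification of variables: categorical vs. numeric.
categorical = df.select_dtypes(include="object").columns.tolist()
numeric = df.select_dtypes(include="number").columns.tolist()

# 3. Feature engineering: derive a new feature from an existing raw one,
# e.g. the lower bound of the "Age" bucket as an integer.
df["Age_Lower"] = df["Age"].str.split("-").str[0].astype(int)
```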

**EDA, in essence, can make or break any machine learning model.**

Steps in exploratory data analysis


There are five steps in EDA:

  1. Variable identification: In this step, we identify each variable and discover its type. Depending on our needs, we can change the data type of any variable.
    Statistics plays an important role in data analysis: it is a set of rules and concepts for the analysis and interpretation of data. Different types of analysis need to be performed depending on the requirements. Let's study them.
  2. Univariate analysis: In univariate analysis, we study each individual feature/variable available in the dataset. There are two types of variables: continuous and categorical. The cheat sheet below lists various graphical techniques that can be applied to analyze them.

    Continuous variable:

    To demonstrate univariate analysis on one of the continuous variables in the Black Friday Sales dataset, "Purchase", I created a function that takes the data as input and draws a KDE plot describing the feature's distribution.
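The article's helper is shown as a screenshot; as a stand-in, here is a minimal sketch that estimates the density with `scipy.stats.gaussian_kde` and returns the curve for plotting. The function name and the toy "Purchase" data are assumptions, not the original code:

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def kde_curve(data: pd.DataFrame, feature: str, grid_size: int = 200):
    """Fit a kernel density estimate for a continuous feature.

    Returns (xs, density); plot with e.g. plt.plot(xs, density).
    """
    values = data[feature].dropna().to_numpy()
    kde = gaussian_kde(values)
    xs = np.linspace(values.min(), values.max(), grid_size)
    return xs, kde(xs)

# Toy purchase amounts standing in for the "Purchase" column.
df = pd.DataFrame({"Purchase": np.random.default_rng(0).normal(9000, 2500, 500)})
xs, density = kde_curve(df, "Purchase")
```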

    Categorical variable

    To demonstrate univariate analysis on the categorical variables in the Black Friday Sales dataset, `City_Category` and `Marital_Status`, I created a function that takes the data and a feature as input and returns a count plot showing the frequency of each category in the feature.

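Again the original helper is an image; a minimal non-plotting analogue (hypothetical function name, toy data) simply counts category frequencies, which is exactly what a count plot displays:

```python
import pandas as pd

def count_categories(data: pd.DataFrame, feature: str) -> pd.Series:
    """Frequency of each category in `feature`; feed the result to a bar plot."""
    return data[feature].value_counts()

# Toy rows echoing the City_Category / Marital_Status columns.
df = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", "A"],
    "Marital_Status": [0, 1, 0, 0, 1, 1],
})
counts = count_categories(df, "City_Category")
```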

  3. Bivariate analysis: In bivariate analysis, we study the relationship between any two variables, which can be categorical-continuous, categorical-categorical, or continuous-continuous (as shown in the cheat sheet below, along with the graphical techniques used to analyze them).
    In Black Friday Sales, we have categorical independent variables and a continuous target variable, so we can perform categorical-continuous analysis to understand the relationship between them.
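A categorical-continuous view is essentially a group-wise summary of the continuous target. A minimal sketch on toy data (the numbers are invented, chosen so the counts and the averages disagree, mirroring the inference drawn below):

```python
import pandas as pd

df = pd.DataFrame({
    "City_Category": ["A", "B", "B", "B", "C", "C"],
    "Purchase": [8000, 9000, 9500, 8800, 12000, 11000],
})

# Univariate view: how many customers per city category.
counts = df["City_Category"].value_counts()

# Bivariate (categorical-continuous) view: average purchase per category.
avg_purchase = df.groupby("City_Category")["Purchase"].mean()
```

Here category B has the most customers, yet category C has the highest average purchase — two different stories from the same data.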

    Inference:
    From the two previous analyses, we observed in the univariate analysis that the number of customers is highest in city category B. But the bivariate analysis between `City_Category` and `Purchase` tells a different story: the average purchase is highest in city category C. Such inferences give us better intuition about the data, which in turn aids in better data preparation and feature engineering.

    It is important to note that relying solely on univariate and bivariate analysis can be quite misleading, so to verify the inferences drawn from them, you can validate with hypothesis testing. We can use a t-test, chi-square test, or ANOVA to quantify whether two samples are significantly similar or different. Here I have created a function to analyze categorical-continuous relationships that returns the value of the t-statistic.
    In the univariate analysis, we observed a significant difference between the number of married and unmarried customers. From the t-test, however, we obtain a p-value of 0.89, which is greater than the significance level of 0.05, showing that there is no significant difference between the average purchase of single and married customers.
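The article's t-test helper is shown only as a screenshot; a minimal sketch of the same idea with `scipy.stats.ttest_ind` follows (function name and simulated data are assumptions — both groups are drawn from the same distribution, so the p-value is expected to be large):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def t_test_by_group(data: pd.DataFrame, group_col: str, value_col: str):
    """Two-sample t-test on `value_col` between the two levels of `group_col`."""
    groups = [g[value_col].dropna() for _, g in data.groupby(group_col)]
    return ttest_ind(groups[0], groups[1], equal_var=False)

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Marital_Status": [0] * 100 + [1] * 100,
    "Purchase": rng.normal(9000, 2000, 200),  # same distribution for both groups
})
stat, p_value = t_test_by_group(df, "Marital_Status", "Purchase")
```

A p-value above the 0.05 significance level means we cannot reject the hypothesis that the two group means are equal.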

  4. Missing value treatment: The main goal of this step is to find out whether there is a specific reason why values are missing and how to treat them. If we do not treat them, they can interfere with the patterns learned from the data, which in turn can degrade model performance. Missing values can be dealt with by filling them with the mean, median, or mode, or by using imputers.
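A one-line sketch of median filling on toy data (mean or mode work the same way via `.mean()` / `.mode()`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Purchase": [8000.0, np.nan, 9500.0, np.nan, 12000.0]})

# Fill missing purchases with the column median.
df["Purchase"] = df["Purchase"].fillna(df["Purchase"].median())
```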
  5. Outlier removal: It is essential that we detect the presence of outliers, as some predictive models are sensitive to them, and treat them accordingly.
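One common treatment is the 1.5 × IQR rule; a minimal sketch on toy data, reusing the article's example of an impossible age:

```python
import pandas as pd

df = pd.DataFrame({"Age": [23, 31, 27, 45, 2000, 38]})

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = df[df["Age"].between(lower, upper)]
```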

Final notes

In this article, I have briefly discussed the importance of EDA in the data science pipeline and the steps involved in proper analysis. I have also shown how an incorrect or incomplete analysis can be quite misleading and can significantly affect the performance of machine learning models.

"If you don't bake your data, you're just another person with an opinion" ;)
