How to Apply One-Hot Encoding to Multi-Category Variables


In this article, we will learn how to handle multi-category variables using the one-hot encoding feature engineering technique.

But before continuing, let's have a short discussion of feature engineering and one-hot encoding.

Feature engineering

Feature engineering is the process of extracting features from raw data using domain knowledge of the problem. These features can be used to improve the performance of machine learning algorithms, and when performance improves, we get the best accuracy. We can also say that feature engineering is the same as applied machine learning. Feature engineering is the most important art in machine learning: it creates the huge difference between a good model and a bad model. It is also the third step in the life cycle of any data science project.

Feature engineering for machine learning models is somewhat complicated, as different models often require different approaches for different types of data, such as:

  • Continuous data
  • Categorical features
  • Missing values
  • Normalization
  • Dates and time

But here we will discuss only categorical features. Categorical features are those whose data type is the object type: the values of a categorical feature are not in numerical form but in the form of objects.

There are many techniques for handling categorical variables, some are:

  • Label encoding or ordinal encoding
  • One-hot encoding
  • Dummy encoding
  • Effect encoding
  • Binary encoding
  • BaseN encoding
  • Hash encoding
  • Target encoding

Here, we will handle categorical features with one-hot encoding, so first, let's discuss one-hot encoding.

One-hot encoding

We know that categorical variables contain label values instead of numeric values. The number of possible values is often limited to a fixed set, and categorical variables are often called nominal. Many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric.

This means that categorical data must be converted to a numeric form. If the categorical variable is an output variable, you may also want to convert model predictions back to categorical form to present them or use them in some application.

For instance, data on gender comes in the form of 'male' and 'female'.

We could encode these categories as integers, but allowing the model to assume a natural ordering between categories can result in poor performance or unexpected results.
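
To see the problem concretely, below is a minimal sketch of integer encoding with pandas (pd.factorize is just one of several ways to do this); a model fed these codes could wrongly infer an order between the genders:

# integer (label) encoding: each category gets an arbitrary integer code
import pandas as pd

gender = pd.Series(['male', 'female', 'female', 'male'])
codes, uniques = pd.factorize(gender)
print(codes)    # [0 1 1 0] -- the codes are arbitrary, yet they look ordered
print(uniques)  # Index(['male', 'female'], dtype='object')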

One-hot encoding can be applied on top of the integer representation: the integer-encoded variable is removed, and a new binary variable is added for each unique integer value.

For instance, if we encode a variable for color with the values red and blue:

Red    Blue
0      1
1      0
0      1
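
The same table can be produced in code; a minimal sketch using the pandas get_dummies() helper (note that it orders the dummy columns alphabetically):

# one-hot encoding the color example; dtype=int gives 0/1 instead of booleans
import pandas as pd

colors = pd.Series(['blue', 'red', 'blue'])
print(pd.get_dummies(colors, dtype=int))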

Now we will begin our journey. As the first step, we take a house price prediction dataset.

Data set

Here we will use the house_price dataset, which is used to predict the price of a house according to the size of the area.

If you want to download the house price prediction dataset, click here.

Module import

Now, we have to import the important Python modules used for one-hot encoding:

# importing pandas
import pandas as pd
# importing numpy
import numpy as np
# importing OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

Here, we use pandas for data analysis, NumPy for n-dimensional arrays, and, from sklearn, the OneHotEncoder class, which is important for categorical encoding.

Now we have to read this data using Python.

Read data set

Generally, a dataset comes in CSV form, and the dataset we use is also a CSV file. To read the CSV file we will use the pandas read_csv() function. See below:

# reading dataset
df = pd.read_csv('house_price.csv')
df.head()

Output:

[screenshot: the first five rows of the dataset]

But we only need the categorical variables for the one-hot encoder, and we will work with categorical variables alone for easier understanding.

To separate the categorical variables from the data, we have to check which features hold categorical values.

Checking categorical values

To check the values we use the pandas select_dtypes() function, which selects columns based on their data types.

# checking features
cat = df.select_dtypes(include="O").keys()
# display variables
cat

Output:

[screenshot: the index of categorical (object-type) column names]

Now we have to drop the numeric columns from the dataset and use only the categorical variables. We will use just three categorical columns of the dataset to apply one-hot encoding.

Create a new data frame

Now, to use categorical variables, we will create a new data frame from selected categorical columns.

# creating new df
# setting columns we use
new_df = pd.read_csv('house_price.csv', usecols=['Neighborhood','Exterior1st','Exterior2nd'])
new_df.head()

Output:

[screenshot: the first five rows of the Neighborhood, Exterior1st, and Exterior2nd columns]

Now we have to find out how many unique categories are present in each categorical column.

Find unique values

To find unique values we will use the pandas unique() function.

# unique values in each column
for x in new_df.columns:
    # printing the number of unique values
    print(x, ':', len(new_df[x].unique()))

Output:

Neighborhood : 25
Exterior1st : 15
Exterior2nd : 16
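
These counts show why plain one-hot encoding is problematic here: it would create one dummy column per unique label, that is, 25 + 15 + 16 = 56 new columns from just three features. A quick way to confirm this, assuming new_df from above:

# full one-hot encoding of all three columns blows up the feature space
print(pd.get_dummies(new_df).shape)  # (rows, 56)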

Now, we will use our technique to apply one-hot encoding on multi-category variables.

Technique for multi-category variables

The technique is to limit one-hot encoding to the 10 most frequent labels of the variable. This means that we make a binary variable only for each of the 10 most frequent labels, which is equivalent to grouping all other labels into a new category that, in this case, is simply dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

Most frequent categories

Here we will display the 20 most frequent categories.

Suppose we take the categorical variable Neighborhood.

# finding the top 20 categories
new_df.Neighborhood.value_counts().sort_values(ascending=False).head(20)

Output:

[screenshot: Neighborhood category counts in descending order, top 20]

Looking at this output, you will notice that the NAmes label is repeated 225 times in the Neighborhood column, and as we go down, this number decreases.

So we take the top 10 results, convert those 10 labels with one-hot encoding, and the remaining labels become zero.


List of most frequent categorical labels

# make a list of the top 10 most frequent labels
top_10 = [x for x in new_df.Neighborhood.value_counts().sort_values(ascending=False).head(10).index]
top_10

Output:

['NAmes',
 'CollgCr',
 'OldTown',
 'Edwards',
 'Somerst',
 'Gilbert',
 'NridgHt',
 'Sawyer',
 'NWAmes',
 'SawyerW']

These are the top 10 categorical labels in the Neighborhood column.

Make binary variables

Now, we make the 10 binary variables for the top_10 labels:

# make binary labels
for label in top_10:
    new_df[label] = np.where(new_df['Neighborhood'] == label, 1, 0)

# display the original column alongside the new binary columns
new_df[['Neighborhood'] + top_10]

Output:

    Neighborhood  NAmes  CollgCr  OldTown  Edwards  Somerst  Gilbert  NridgHt  Sawyer  NWAmes  SawyerW
0   CollgCr           0        1        0        0        0        0        0       0       0        0
1   Veenker           0        0        0        0        0        0        0       0       0        0
2   CollgCr           0        1        0        0        0        0        0       0       0        0
3   Crawfor           0        0        0        0        0        0        0       0       0        0
4   NoRidge           0        0        0        0        0        0        0       0       0        0
5   Mitchel           0        0        0        0        0        0        0       0       0        0
6   Somerst           0        0        0        0        1        0        0       0       0        0
7   NWAmes            0        0        0        0        0        0        0       0       1        0
8   OldTown           0        0        1        0        0        0        0       0       0        0
9   BrkSide           0        0        0        0        0        0        0       0       0        0
10  Sawyer            0        0        0        0        0        0        0       1       0        0
11  NridgHt           0        0        0        0        0        0        1       0       0        0

You can see how the top_10 labels are now converted to binary format.

Let's take an example: look at index 1 in the table, Veenker, which does not belong to our top_10 categories, so it results in 0 in all the dummy columns.
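
We can confirm this with a quick check, reusing the new_df and top_10 objects defined above:

# index 1 is Veenker, which is not in top_10, so all its dummies are 0
print(new_df.loc[1, 'Neighborhood'])   # Veenker
print(new_df.loc[1, top_10].sum())     # 0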

Now we will do it for all the categorical variables that we have selected previously.

One-hot encoding for all selected variables

# for all categorical variables we selected
def top_x(df2, variable, top_x_labels):
    # make a binary column for each of the top labels
    for label in top_x_labels:
        df2[variable + '_' + label] = np.where(df2[variable] == label, 1, 0)

# read the data again
data = pd.read_csv('house_price.csv', usecols=['Neighborhood','Exterior1st','Exterior2nd'])
# encode Neighborhood into the 10 most frequent categories
top_x(data, 'Neighborhood', top_10)
# display data
data.head()

Output:

[screenshot: the data frame with the new Neighborhood_<label> binary columns]

In the same way, we can apply one-hot encoding to all the multi-category variables we selected.
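
For completeness, here is one way to repeat the procedure for the other two columns we selected (a sketch that reuses the top_x helper defined above):

# encode the remaining categorical columns into their 10 most frequent labels
for col in ['Exterior1st', 'Exterior2nd']:
    top_labels = list(data[col].value_counts().sort_values(ascending=False).head(10).index)
    top_x(data, col, top_labels)

data.head()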

Now we will look at the advantages and disadvantages of one-hot encoding for multiple-category variables.

Advantages

  • Easy to implement.
  • Does not require much time for variable exploration.
  • Does not massively expand the feature space.

Disadvantages

  • It does not add any information that could make the variable more predictive.
  • It does not keep the information of the ignored labels.
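
As a side note, since we imported OneHotEncoder at the beginning: recent scikit-learn releases can cap the number of dummies per feature on their own. This is a minimal sketch, assuming scikit-learn >= 1.2 (max_categories and infrequent-category grouping arrived in 1.1, and the sparse_output argument in 1.2):

# cap each feature at 10 output categories; rarer labels are pooled
# into a single "infrequent" bucket
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(max_categories=10, sparse_output=False)
encoded = ohe.fit_transform(new_df[['Neighborhood']])
print(ohe.get_feature_names_out())  # 9 frequent labels + 1 infrequent bucket
print(encoded.shape)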

Final notes

To summarize, we learned how to handle multi-category variables with one-hot encoding; without a technique like this, it can be a very difficult task. Thanks for reading this article.

Connect with me on Linkedin: Profile

Read my other articles: https://www.analyticsvidhya.com/blog/author/mayurbadole2407/

Thank you

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
