Building Machine Learning Pipelines with Pyspark

Share on facebook
Share on twitter
Share on linkedin
Share on telegram
Share on whatsapp



  • Here's a quick introduction to building ML pipelines with PySpark
  • The ability to build these machine learning pipelines is a must-have skill for any aspiring data scientist.
  • This is a practical article with a structured code approach from PySpark, So prepare your favorite python IDE!


Take a moment to reflect on this.: What are the skills an aspiring data scientist must possess to obtain a position in the industry?

A machine learning The project has many moving components that need to come together before we can run it successfully. The ability to know how to build an end-to-end machine learning pipeline is a valuable asset. As a data scientist (aspiring or established), you should know how these machine learning pipelines work.

This is, in a nutshell, the fusion of two disciplines: data science and software engineering. These two go hand in hand for a data scientist. It's not just about building models, we need to have the software skills to build enterprise-grade systems.


Then, in this article, we will focus on the basic idea behind building these machine learning pipelines using PySpark. This is a practical article, so start your favorite python IDE and let's get going!!

Note: This is the part 2 from my PySpark series for beginners. You can check out the introductory article below:

Table of Contents

  1. Perform Basic Operations on a Spark Data Frame
    1. Read a CSV file
    2. Defining the schema
  2. Data exploration using PySpark
    1. Check the dimensions of the data
    2. Describe the data
    3. Missing value count
    4. Find count of unique values ​​in a column
  3. Encode categorical variables using PySpark
    1. String indexing
    2. A hot coding
  4. Vector Assembler
  5. Building Machine Learning Pipelines with PySpark
    1. Transformers and estimators
    2. Examples of pipes

Perform Basic Operations on a Spark Data Frame

An essential step (and first) in any data science project is to understand the data before building any Machine learning model. Most data science wannabes stumble here, they just don't spend enough time understanding what they're working with. There is a tendency to rush and build models, a fallacy that should to avoid.

We will follow this principle in this article.. I will follow a structured approach at all times to ensure that we do not miss any critical steps.

First, let's take a moment and understand each variable we will be working with here. We are going to use a data set of a India vs Bangladesh cricket match. Let's see the different variables we have in the data set:

  • Batter: Unique identification of the batter (whole)
  • Batsman_Name: Batter's name (String)
  • Bowler: Unique identification of the bowler (whole)
  • Bowler_Name: Bowler's Name (String)
  • Comment: Description of the event as broadcast (chain)
  • Detail: Another chain that describes events as windows and extra deliveries (Chain)
  • Fired: Unique identification of the batter if discarded (String)
  • ID: unique queue id (chain)
  • Isball: Whether the delivery was legal or not (boolean)
  • Isboundary: Whether the batter hit a limit or not (tracks)
  • Iswicket: Whether the batter fired or not (tracks)
  • Upon: About the number (Double)
  • Careers: It runs on that particular installment (whole)
  • Timestamp: Time the data was recorded (timestamp)

So let's get started, agree?

Read a CSV file

When we turn on Spark, the SparkSession The variable is appropriately available under the name ‘Spark – spark‘. We can use this to read various types of files, as CSV, JSON, TEXT, etc. This allows us to save the data as a Spark data frame.

By default, treats the data type of all columns as a string. You can check the data types using the printSchema function in the data frame:


Defining the schema

Now, we don't want all columns in our dataset to be treated as strings. Then, What can we do about it?

We can define the custom schema for our data frame in Spark. For this, we need to create an object of StructType which has a list of StructField. And of course, we should define StructField with a column name, the data type of the column and whether null values ​​are allowed for the particular column or not.

Please refer to the following code snippet to understand how to create this custom schema:


Remove columns from data

In any machine learning project, we always have some columns that are not needed to solve the problem. I'm sure you've faced this dilemma before too, either in industry or in a online hackathon.

In our case, we can use the drop function to remove the column from the data. Use the asterisk


pyspark machine learning pipeline

Data exploration using PySpark

Check the dimensions of the data

Unlike Pandas, Spark dataframes do not have the shape function to check the dimensions of the data. Instead, we can use the code below to check the dimensions of the dataset:

Describe the data Sparks describe The function gives us most of the statistical results as the mean, count, minimum, maximum and standard deviation. You can use the abstract


pyspark machine learning pipeline

Missing value count

It's weird when we get a data set with no missing values. Can you remember the last time it happened?

It is important to check the number of missing values ​​present in all columns. Knowing the count helps us deal with missing values ​​before creating any machine learning model with that data..


pyspark machine learning pipeline

Value counts of a column Unlike Pandas, we don't have the value_counts () function in Spark dataframes. You can use the group by


pyspark machine learning pipeline

Encode categorical variables using PySpark

Most machine learning algorithms accept data only in numerical form. Therefore, it is essential to convert any categorical variable present in our data set to numbers.

Remember that we cannot just remove them from our dataset, as they may contain useful information. It would be a nightmare to lose that just because we don't want to figure out how to use them!!

Let's look at some of the methods for encoding categorical variables using PySpark.

String indexing


pyspark machine learning pipeline

One-Hot Coding

One-hot encoding is a concept every data scientist should know. I have trusted him several times when dealing with missing values. It's a lifesaver! Here is the warning: Spark’s OneHotEncoder

does not directly encode the categorical variable. First, we need to use String Indexer to convert the variable to numeric form and then use OneHotEncoderEstimator

to encode multiple columns of the dataset.


pyspark machine learning pipeline

Vector Assembler

A vector assembler combines a given list of columns into a single vector column.

This is typically used at the end of the data exploration and preprocessing steps. In this stage, we usually work with some raw or transformed features that can be used to train our model. Vector Assembler converts them to a single column of features to train the machine learning model


pyspark machine learning pipeline

Building Machine Learning Pipelines with PySpark

A machine learning project generally involves steps like data pre-processing, feature extraction, fitting the model and evaluating results. We need to perform many transformations on the data in sequence. As you can imagine, keeping track of them can become a tedious task.

This is where machine learning pipelines come in..

A pipeline allows us to keep the data flow of all the relevant transformations that are required to achieve the final result. We need to define the stages of the pipeline that act as a chain of command for Spark to run. Here,

each stage is a transformer or estimator.

Transformers and estimators As the name suggests, Transformers

convert one data frame to another by updating the current values ​​of a particular column (how to convert categorical columns to numeric) or mapping it to some other values ​​using defined logic. An estimator implements the to fit in() method in a data frame and produces a model. For instance, Logistic regression is an estimator that trains a classification model when we call the to fit in()


Let's understand this with the help of some examples.

Examples of pipes


pyspark machine learning pipelines

  • We have created the data frame. Suppose we have to transform the data in the following order: stage_1: Label Encode o String Index la columna
  • Category 1 stage_2: Label Encode o String Index la columna
  • category_2 stage_3: One-Hot Encode la columna indexada


pyspark machine learning pipelines At every stage, we will pass the name of the input and output column and configure the pipeline passing the stages defined in the list of Pipeline



pyspark machine learning pipelines

Now, Let's take a more complex example of how to configure a pipeline. Here, we will make transformations in the data and we will build a logistic regression model.


pyspark machine learning pipelines

  • Now, suppose this is the order of our channeling: stage_1: Label Encode o String Index la columna
  • feature_2 stage_2: Label Encode o String Index la columna
  • feature_3 stage_3: One Hot Encode the indexed column of feature_2 Y
  • feature_3
  • stage_4: Create a vector of all the characteristics needed to train a logistic regression model


pyspark machine learning pipelines


pyspark machine learning pipelines


pyspark machine learning pipelines


Final notes

This was a short but intuitive article on how to build machine learning pipelines using PySpark. I will reiterate it again because it is so important: you need to know how these pipes work. This is a big part of your role as a data scientist..

Have you worked on an end-to-end machine learning project before? Or have you been part of a team that built these pipes in an industrial setting? Let's connect in the comment section below and discuss.


Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.