Spark SQL and the Catalyst Optimizer | Analyze data with Spark SQL


Overview

  • Relational databases are ubiquitous, but what happens when you need to scale your infrastructure?
  • We will discuss the role Spark SQL plays in this situation and why it is such a useful tool to learn.
  • This tutorial also shows how Spark SQL works, using a Python case study.

Introduction

Almost all organizations use relational databases for a variety of tasks, from managing and tracking large amounts of information to organizing and processing transactions. It is one of the first concepts we are taught when we learn programming.

And let's be thankful for that, because databases are a crucial part of a data scientist's skill set! You just can't get by without knowing how they work. They are a key aspect of any machine learning project.

Structured Query Language (SQL) is easily the most popular language when it comes to databases. Unlike many programming languages, it is easy to learn and helps us get started with the data extraction process. For most data science jobs, proficiency in SQL ranks higher than proficiency in most other programming languages.


But there is a major challenge with SQL: you will have a hard time getting it to work on huge data sets. This is where Spark SQL takes the front seat and bridges the gap. I will talk more about this in the next section.

This hands-on tutorial will introduce you to the world of Spark SQL: how it works, what features it offers, and how you can implement it using Python. We will also talk about an important concept you will often come across in interviews: the Catalyst Optimizer.

Let us begin!

Note: If you are completely new to the SQL world, I highly recommend the following course:

Table of Contents

  1. Challenges with relational database scaling
  2. Spark SQL overview
  3. Spark SQL features
  4. How does Spark SQL execute a query?
  5. What is a Catalyst Optimizer?
  6. Running SQL commands with Spark
  7. Apache Spark use case at scale

Challenges with relational database scaling

The question is: why should you learn Spark SQL? I mentioned this briefly before, but let's look at it in a little more detail now.

Relational databases for a large project (such as a machine learning system) can contain hundreds or even thousands of tables, and most features in one table map to features in other tables. These databases are designed to run on a single machine in order to maintain the rules of the table mappings and avoid the problems of distributed computing.

This often becomes a problem for organizations when they want to scale with this design. It requires more complex and expensive hardware, with significantly more processing power and storage capacity. As you can imagine, upgrading from simpler to more complex hardware can be a big challenge.

An organization may even need to take its website offline for some time to make the necessary changes. During this period, it loses business from new customers it could potentially have acquired.

What's more, as data volume increases, organizations struggle to handle this huge amount of data using traditional relational databases. This is where Spark SQL comes into the picture.

Spark SQL overview

Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.

~ Geoffrey Moore

Hadoop and MapReduce frameworks have been around for a long time in big data analytics. But these frameworks require a lot of read and write operations on disk, which makes them very expensive in terms of time and speed.

Apache Spark is one of the most efficient and widely used data processing frameworks in enterprises today. It is true that Spark can be costly, since it requires a lot of RAM for in-memory computation, but it is still a favorite among data scientists and big data engineers.

In the Spark ecosystem, we have the following components:

  1. MLlib: This is Spark's scalable machine learning library, which provides high-quality algorithms for regression, clustering, classification, and more. You can get started building machine learning pipelines with MLlib using this article: How to build machine learning pipelines using PySpark?
  2. Spark Streaming: We are generating data at an unprecedented rate and scale right now. How do we ensure that our machine learning pipeline keeps producing results as soon as the data is generated and collected? Learn how to use a machine learning model to make predictions on streaming data with PySpark.
  3. GraphX: This is Spark's API for graphs, a graph computation engine that supports parallel graph processing.
  4. Spark SQL: This is a distributed framework for structured data processing provided by Spark.

We know that relational databases also store the relationships between different variables and different tables, and are designed in such a way that they can handle complex queries.

Spark SQL is an amazing blend of relational processing and Spark's functional programming. It provides support for multiple data sources and makes SQL queries possible, resulting in a very powerful tool for analyzing structured data at scale.


Spark SQL features

Spark SQL has a ton of awesome features, but I wanted to highlight a few key ones that you will use a lot in your role:

  1. Query structured data within Spark programs: Most of you may already be familiar with SQL, so you don't need to learn how to define complex functions in Python or Scala to use Spark. You can run the exact same query on your largest data sets (see the sketch after this list).
  2. Compatible with Hive: You can run not only plain SQL but also existing Hive queries on the Spark SQL engine, which provides full compatibility with current Hive queries.
  3. One way to access data: In typical enterprise-level projects, you do not have a single common data source. Instead, you must handle various types of files and databases. Spark SQL supports almost all file types and gives you a common way to access a variety of data sources, such as Hive, Avro, Parquet, JSON and JDBC.
  4. Performance and scalability: When working with large data sets, there is a chance that errors occur while a query is running. Spark SQL supports full mid-query fault tolerance, so it can work with thousands of nodes simultaneously.
  5. User-defined functions: UDFs are a Spark SQL feature that lets you define new column-based functions that extend the Spark SQL vocabulary for transforming data sets.
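
To make features 1 and 5 concrete, here is a minimal sketch. The SparkSession, the tiny people DataFrame and the age_bucket function are all invented for illustration and are not taken from the original article:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("spark-sql-features").getOrCreate()

# Hypothetical toy data: (age, gender)
people = spark.createDataFrame([(25, "F"), (40, "M"), (31, "F")],
                               ["age", "gender"])

# Feature 1: query structured data with plain SQL inside a Spark program
people.createOrReplaceTempView("people")
spark.sql("SELECT gender, COUNT(*) AS cnt FROM people GROUP BY gender").show()

# Feature 5: register a user-defined function that extends the SQL vocabulary
spark.udf.register("age_bucket",
                   lambda age: "30+" if age >= 30 else "under 30",
                   StringType())
spark.sql("""
    SELECT age_bucket(age) AS bucket, COUNT(*) AS cnt
    FROM people
    GROUP BY age_bucket(age)
""").show()
```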

How does Spark SQL execute a query?

How does Spark SQL work, essentially? Let's understand the process in this section.

  • Analysis: First, when you run a query, Spark SQL determines the relations to be computed. The query is parsed into an abstract syntax tree (AST), the elements used in the query are checked for correct usage, and a logical plan to execute the query is then created.


  • Logical optimization: In the next step, rule-based optimization is applied to the logical plan, using techniques such as:
    • filtering data early if the query contains a WHERE clause,
    • using any available index on the tables, since it can improve performance, and
    • making sure the different data sources are joined in the most efficient order.
  • Physical planning: In this step, one or more physical plans are formed from the logical plan. Spark SQL then selects the plan that can execute the query most efficiently, that is, using the fewest computational resources.
  • Code generation: In the final step, Spark SQL generates code. This involves generating Java bytecode to run on each machine. Catalyst uses a special feature of the Scala language called quasiquotes to make code generation easier. (You can inspect the plans produced along the way; see the sketch after this list.)

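You don't have to take these steps on faith: any query can print the plans Catalyst produces for it. A minimal sketch, reusing the hypothetical people view registered in the earlier snippet:

```python
# explain(True) prints the parsed logical plan, the analyzed logical plan,
# the optimized logical plan and the final physical plan for the query.
query = spark.sql(
    "SELECT gender, COUNT(*) AS cnt FROM people WHERE age > 30 GROUP BY gender"
)
query.explain(True)
```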

What is a Catalyst Optimizer?

Optimization means updating the existing system or workflow so that it works more efficiently while using fewer resources. Spark SQL implements an optimizer known as the Catalyst Optimizer, which supports both rule-based and cost-based optimization techniques.

In rule-based optimization, a set of predefined rules determines how the query will be executed. The optimizer rewrites the existing query in a better form to improve performance.

For instance, say an index is available on the table. Then, according to the rules, the index will be used for query execution, and WHERE filters will be applied to the initial data first if possible (rather than last).

However, there are cases where using an index actually slows a query down. We know that a fixed set of rules cannot always make the best decision, right?

Here is the problem: rule-based optimization does not take the data distribution into account. This is where we turn to a cost-based optimizer, which uses statistics about the table, its indexes and the data distribution to make better decisions.
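
Cost-based optimization in Spark SQL relies on statistics you collect yourself and is controlled by a configuration flag. A minimal sketch, reusing the hypothetical people DataFrame from before; the table name people_tbl is also just an example (ANALYZE TABLE works on catalog tables, not on temporary views):

```python
# Persist the data as a catalog table so statistics can be collected on it.
people.write.mode("overwrite").saveAsTable("people_tbl")

# Enable the cost-based optimizer (rule-based optimization is always applied).
spark.conf.set("spark.sql.cbo.enabled", "true")

# Collect table-level and column-level statistics for Catalyst to use.
spark.sql("ANALYZE TABLE people_tbl COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE people_tbl COMPUTE STATISTICS FOR COLUMNS age, gender")
```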

Running SQL commands with Spark

Time to code!

I have created a random data set of 25 million rows. You can download the full data set here. It is a text file with comma-separated values. First, we will import the required libraries, read the dataset and see how Spark partitions the data:

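The original code and output appear as screenshots; below is a sketch of what this step might look like. The file name data.txt and the SparkSession setup are assumptions, not taken from the original:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-tutorial").getOrCreate()
sc = spark.sparkContext

# Read the comma-separated text file as an RDD of lines
# ("data.txt" is a placeholder for the downloaded file).
raw = sc.textFile("data.txt")

# Peek at the first two rows and check how Spark partitioned the data.
print(raw.take(2))
print("Number of partitions:", raw.getNumPartitions())
```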

Here,

  • The first value in each row is the person's age (which must be an integer)
  • The second value is the person's blood group (which must be a string)
  • The third and fourth values are the city and gender (both strings), and
  • The final value is an id (which is of integer type)

We will map the data in each row to specific data types and names using Spark's Row class:

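A sketch of the parsing step; the column names (age, blood_group, city, gender, id) are my own choice based on the description above, not the original ones:

```python
from pyspark.sql import Row

def parse_line(line):
    # Split the comma-separated values and cast each field to the
    # type described above.
    fields = line.split(",")
    return Row(age=int(fields[0]),
               blood_group=fields[1],
               city=fields[2],
               gender=fields[3],
               id=int(fields[4]))

rows = raw.map(parse_line)
print(rows.take(5))
```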

Next, we will create a DataFrame using the parsed rows. Our goal is to find the value counts of the gender variable using a simple groupBy on the DataFrame:

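A sketch of this step, following on from the parsed rows above:

```python
# Build a DataFrame from the parsed Row objects and inspect the first rows.
df = spark.createDataFrame(rows)
df.show(5)

# Value counts of the gender variable using the DataFrame API.
# In a Jupyter notebook, put %%time at the top of this cell to measure it.
df.groupBy("gender").count().show()
```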

It took around 26 ms to calculate the value counts over 25 million rows using a groupBy on the DataFrame. You can measure the time by putting %%time at the top of the corresponding cell in your Jupyter notebook.

Now, we will perform the same query using Spark SQL and see if it improves performance or not.

First, you need to register the DataFrame as a temporary table using the registerTempTable function. This creates an in-memory table that is scoped only to the cluster in which it was created. The lifespan of this temporary table is limited to the current session. It is stored using Hive's in-memory columnar format, which is highly optimized for relational data.

What's more, you don't even need to write complex functions to get results if you are comfortable with SQL! Here, you just pass the same SQL query to get the desired results on the larger data:

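A sketch of the same aggregation expressed in SQL; the temporary table name df_table is arbitrary:

```python
# Register the DataFrame as a temporary table so it can be queried with SQL
# (createOrReplaceTempView is the newer equivalent of registerTempTable).
df.registerTempTable("df_table")

# The same aggregation as a SQL query; time the cell with %%time in Jupyter.
spark.sql("SELECT gender, COUNT(*) AS cnt FROM df_table GROUP BY gender").show()
```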

It took only around 18 ms to calculate the value counts. This is even faster than the Spark DataFrame version.

Next, we will run another SQL query to calculate the average age per city:

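And a sketch of that query:

```python
# Average age per city, again as a plain SQL query.
spark.sql("""
    SELECT city, AVG(age) AS avg_age
    FROM df_table
    GROUP BY city
""").show()
```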

Apache Spark use case at scale

We know that Facebook has more than 2 billion monthly active users, and with more data comes equally complex challenges. For a single query, they may need to analyze tens of terabytes of data. Facebook believes that Spark has matured to the point where it can be compared with Hive for a number of batch processing use cases.

Let me illustrate this using a case study from Facebook itself. One of their tasks was to prepare the features for the entity ranking that Facebook uses in its various online services. They previously used a Hive-based infrastructure, which was resource-intensive and difficult to maintain, as the pipeline was divided into hundreds of Hive jobs. They then built a faster and more manageable pipeline with Spark. You can read their full walkthrough here.

They compared the results of the Spark and Hive pipelines. Here is a comparison chart of latency (the end-to-end elapsed time of the job), which clearly shows that Spark is much faster than Hive.

[Chart: job latency, Spark pipeline vs. Hive pipeline]

Final notes

We covered the core idea behind Spark SQL in this article and also learned how to use it to our advantage. We also took a large data set and applied what we learned in Python.

Spark SQL is relatively unknown to many aspiring data scientists, but it will come in handy in your industry role or even in interviews. It is quite an important addition in the eyes of hiring managers.

Share your thoughts and suggestions in the comment section below.
