SQL window functions: an essential knowledge for Big Data engineers

Share on facebook
Share on twitter
Share on linkedin
Share on telegram
Share on whatsapp

Contents

Overview

  • Get to know the functions of the SQL window
  • Understand what aggregate functions lack and why we need window functions in SQL

Introduction

Data is proliferating at an astonishing rate, Growing up 44 zettabytes en 2020! And it goes without saying that the technology to handle such gigantic amounts of data is also changing at a proportional rate..

Today we have a wide variety of tools such as Hive and Spark to handle Big Data. But, even though they differ in certain respects, still employ the basics of SQL, making it very easy for people from all walks of life to manipulate Big Data with a breeze. Although we still fail in certain aspects of SQL. Then, in this article, I'm going to talk about one of those aspects in particular: window functions.

window-functions-e28093-a-must-know-for-data-engineers-and-data-scientiststop-5-gan-libraries-you-must-know-9130551

That's right! There are window functions in SQL, it is not a joke! And considering the astonished expression on his face, this article seems to be the need of the moment. Even I was not very aware of these functions until recently, which shows how underestimated these functions are.

But wait until the end of this article because Window Functions will really blow your mind with the simplicity with which they solve such complex problems.. And if, data engineers, data scientists, data analysts and everyone else who flirts with data should give these roles their due credit.

Before moving on, I suggest you familiarize yourself with basic SQL functions by reviewing this article: 24 commonly used SQL functions. And if you are interested in learning SQL in a course format, check our course: Structured query language (SQL) for data science.

Table of Contents

  • We present the data set
  • Where are aggregate functions lagging?
  • What are window functions in SQL?
  • Understanding SQL window functions – Over clause
  • Windows with PARTITION BY
  • Organize rows within partitions
  • Window functions
    1. Row number
    2. Rango vs Dense_Rank
    3. Nth_Value
    4. Ntile
    5. Lead and lag
  • SQL code file

We present the data set

before continuing, let me introduce you to the fictitious dataset that we will be working on in this article. Suppose there is a company that maintains name records, employee's job and salary as follows:

sql-employee-dataset-1998124

We will use this sample dataset to understand the concepts in this article.. Very well, Let's start!

Where are aggregate functions lagging?

Suppose you want to determine the total salary of all employees in the company. How would you do it? You can simply use the SUM aggregate function () in the SALARY column.

sql-sum-1576119

Easy.

How about determining the total salary of employees by job category? Use the last query and add a GROUP BY clause on the JOB column.

sql-group-by-5473358

Excellent!

Now let me ask you two more questions:

  1. Show total salary and total salary by job category along with the value of each row.
  2. Organize salary in descending order within each job category.

sql-windows-function-meme-9729450

Did you get it? ¿No? Probably?

These were definitely not as easy as the first ones that you could get instantly. But why?

Good, if you think about it, previous queries required simple aggregate functions to solve the problem. SQL aggregate functions only give us a single value for the aggregate row group (think about the first query we wrote).

But the latest queries cannot be solved simply by using such functions. Those queries want us to keep the original identity of the individual rows, something aggregate functions fail to address. Therefore, to resolve these types of queries, we need different types of functions: window functions.

What are window functions in SQL?

Window functions perform calculations on a set of rows that are related to each other. But, unlike aggregate functions, window functions do not collapse the result of the rows into a single value. However, all rows keep their original identity and the calculated result is returned for each row.

Understanding SQL window functions – Over clause

For instance, if i were to show the total salary of the employees along with each row value, it would look like this:

over-clause-sql-7280516

the UPON clause means a window of rows on which a window function is applied. Can be used with aggregate functions, as we have used it here with the SUM function, thus making it a window function. Or it can also be used with non-aggregated functions that are only used as window functions (we'll learn more about them in later sections).

Then, the syntax to define a simple window function that outputs the same value for all rows is as follows:

window_function_name () OVER ()

But, How about we apply the window function to specific rows instead of the whole table?

Windows with PARTITION BY

the PARTITION BY The clause is used in conjunction with the OVER clause. Divide the rows into different partitions. Then, the window function acts on these partitions.

For instance, to display the total salary by job category for all rows, we would have to modify our original SQL query as follows:

partition-by-sql-8513715

As you can see, the total_work_wage The column represents the sum of the sales for that specific job category and not for the entire table.

Then, the syntax to define window function for row partition is as follows:

window_function_name () OVER ()

Now, How about arranging the rows within each partition?

Organize rows within partitions

We know that to organize rows in a table, we can use the ORDER BY clause. Then, to organize rows within each partition, we have to modify the OVER clause with the ORDER BY clause.

ordered-window-function-sql-5670519

Here, the ranks have been divided according to their job category, as indicated in the JOB column. As it scrolls down, You will notice that the SALARY column has been sorted in descending order and the orderly_work_wage The column represents the running total for the job category (start over after every partition).

Then, The syntax to define the window function for partitioning rows and arranging them in order is as follows:

window_function_name () OVER ( )

Window functions

Now that we know how to define window functions using the OVER clause and some of its modified versions, We can finally move on to working with window functions!

1. Row_Number

Sometimes, your dataset may not have a column that describes the sequential order of the rows, as is the case with our data set. Then, we can make use of the ROW NUMBER() window function. Assign a unique sequential number to each row in the table.

row_number-sql-9831525

Notice that the numbering starts from 1. What's more, to avoid any conflict with the MySQL keyword for the function, I have put the column name in quotes.

But, since it is a window function, we can also limit it to partitions and then order those partitions.

row-number-order-by-clause-sql-9379153

Here, We have divided the rows in the WORK column and have arranged them according to the employee's SALARY. Notice how the numbering restarts every time a new partition starts.

But suppose we want to classify employees based on their salaries.

2. Ranking vs Dense_Rank

the RANK() The window function, as the name suggests, sorts the rows within your partition based on the given condition.

rank-function-sql-1-1438222

Notice the highlighted part. In the case of ROW_NUMBER (), we have a sequential number. Secondly, in the case of RANK (), we have the same range for rows with the same value.

But here's a problem. Although rows with the same value are assigned the same rank, subsequent rank skips missing rank. This would not give us the desired results if we were to return "top N different" values ​​from a table. Therefore, we have a different function to solve this problem.

the DENSE_RANK () The function is similar to RANK () except for one difference, does not skip any ranges when sorting rows.

dense-rank-sql-1-5031301

Here, all ranges are distinct and increase sequentially within each partition. Compared to the RANK function (), no range has been skipped within a partition.

3. Nth_Value

If you want to retrieve the nth value of a window frame for an expression, you can use window function NTH_VALUE (expression, N).

For instance, to recover the third highest salary in each JOB category, we can divide the rows according to the column WORK, then sort the rows within the partitions according to decreasing salary and, Finally, use the NTH_VALUE function to retrieve the value. The command will be the following:

nth-value-sql-6847841

You must have noticed something different after the Order By clause. That is the Framework clause. Determine the subset of the partition (the milestone) which will be used by the window function to calculate the value of the current row.

Here, I mentioned that all previous and next rows of a current row will be considered within the frame when applying the window function. But, Why did i use frame clause here and not with other functions? This is because the other window functions work on the entire partition, even if a frame clause is provided. Pero solo NTH_VALUE () can work on frames within a partition.

Now suppose you want to generate the first value of each partition. Although there is a FIRST_VALUE () function too, I'm going to use the NTH_VALUE for the same.

first-value-sql-9260489

In the same way, we also have a LAST_VALUE () function. But I am going to determine the last value within each partition as above, although using decreasing order of rows.

last-value-sql-5344188

4. Ntile

Sometimes, you may want to sort the rows within the partition into a certain number of groups. This is useful when you want to determine the percentile, quartile, etc. in which a particular row is located. the NTILE () The function is used for such purposes. Returns the group number for each row in the partition.

For instance, Let's find the quartile of each row according to the employee's SALARY:

ntile-function-sql-7983059

Similarly, you can divide the rows into different number of groups and calculate the NTILE for different partitions.

5. Lead and lag

Often, you may want to compare the value of the current row with that of the previous or next row. Helps in easy data analysis. the LEAD() Y DELAY() The window functions are there for this purpose only.

lead-function-sql-3883301

Here, we create a new column containing SALARY from the next row within each partition ordered by salary using the LEAD function. Notice that the last row of each partition contains a null value because there is no subsequent row to extract data from.

Now, let's do the same with the LAG function.

lag-function-sql-9836435

Here, we create two new columns. The first column contains SALARY from the previous row within each partition ordered by salary. While the second column contains the difference between SALARY of the previous row and the current row. As you can see, this is very useful for a quick analysis of the difference between wages within the same partition.

SQL code file

All SQL code related to window functions for this article can be found in this link.

Final notes

We have seen quite a few window functions already and I hope you can now appreciate the beauty of window functions in SQL. But the learning doesn't stop here. After mastering the fundamentals, it's time to master the tools to handle Big Data.

If you want to move to the Data Engineering domain, I suggest the following articles for an easy transition:

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.