Partitioning and grouping in Hive

Contents

Overview

  • Understand the meaning of partitioning and grouping in Hive in detail.
  • We will see, how to create partitions and cubes on the Hive.

Introduction

You may have seen an encyclopedia in your school or university library. It is a set of books that will give you information about almost anything. Do you know what is the best of the encyclopedia?

partitioning-vs-bucketing-in-apache-hive-2850988

Yes, you guessed it correctly. The words are arranged alphabetically. For instance, has a word in mind “Pyramids”. You will go directly to pick up the book with the title "P". You don't have to look for that in other books. Can you imagine how difficult the task of looking for a single book would be if they were stored in no order?

Here, storing words alphabetically represents indexing, but using a different location for words that start with the same character is known as grouping.

There are similar types of storage techniques, as partitions and groupings, in Apache Hive so we can get faster results for search queries. In this article, we will see what partition and grouping is, and when to use which.

Table of Contents

  1. What is partitioning?
  2. When to use partitioning?
  3. What is grouping?
  4. When to use grouping?

What is partitioning?

Apache Hive allows us to organize the table in multiple partitions where we can group the same type of data. Used to distribute the load horizontally. Let's understand with an example:

Suppose we have to create a table in the hive containing the product details for a fashion ecommerce company. It has the following columns:

screenshot-from-2020-11-03-02-39-29-4664491

Now, the first filter most customers use is Gender, then select categories like Shirt, its size and color. Let's see how to create the partitions for this example.

CREATE TABLE products ( product_id string,
                        brand      string,
                        size       string,
                        discount   float,
                        price      float )
PARTITIONED BY (gender string,
                category string,
                color string);

Now, the hive will store the data in the directory structure as:

/user/hive/warehouse/mytable/gender=male/category=shoes/color=black

screenshot-from-2020-11-11-14-45-38-9488485

Data partitioning gives us performance benefits and also helps us organize data. Now, let's see when to use partition in hive.

When to use partitioning?

  • When the column with a high search query has a low cardinality. For instance, if you create a partition with the country name, a maximum of 195 partitions and the hive will be able to manage this many directories.
  • Secondly, do not partition columns with very high cardinality. For instance, Product ID, timestamp and price because it will create millions of directories that will be impossible for the hive to manage.
  • It is effective when the data volume on each partition is not very high. For instance, if you have the airline data and want to calculate the total number of flights in one day. Then, the result will take longer to calculate over the partition “Dubai”, since it has one of the busiest airports in the world, while for a country like “Albania” will return results faster.

What is grouping?

In the example above, we know that we cannot partition on the column price because its data type is float and there are an infinite number of possible unique prices.

Hive will have to generate a separate directory for each of the unique prices and it would be very difficult for Hive to manage them. Instead of this, we can manually define the number of deposits we want for these columns.

screenshot-from-2020-11-11-14-41-05-3002838

In grouping, partitions can be subdivided into groups based on a column's hash function. Provides additional structure to the data that can be used for more efficient queries.

CREATE TABLE products ( product_id string,
                        brand string,
                        size string,
                        discount float,
                        price float )
PARTITIONED BY (gender string,
                category string,
                color string)
CLUSTERED BY (price) INTO 50 BUCKETS;

Now, will only be created 50 deposits no matter how many unique values ​​are in the price column. For instance, in the first cube, all products with a price [ 0 – 500 ] Iran, and in the next group of products with a price [ 500 – 200 ] and so on.

When to use grouping?

  • We cannot split into a column with a very high cardinality. Too many partitions will result in multiple Hadoop files, which will increase the load on the same node, since it has to transport the metadata of each of the partitions.
  • If some combinations of the map side are involved in your queries, grouped tables are a good option. Map side join is a process where two tables are joined using map function alone without any reduced function. I recommend that you read this article to better understand the combinations of the side of the map: The side of the map joins in Hive

Final notes

In this article, we have seen what partition and grouping is, how to create them and what are their pros and cons.

I highly recommend that you check out the following resources to learn more about Apache Hive:

If you have any questions related to this article, let me know in the comment section below.

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.