An Introduction to Apache Pig for Absolute Beginners


This article was published as part of the Data Science Blogathon

This article focuses on Apache Pig, a high-level platform for processing and analyzing large amounts of data.

OVERVIEW

At a top-level overview, Pig is an abstraction over MapReduce. Pig runs on Hadoop and therefore uses both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Data flows are executed by an engine and are used to analyze data sets as streams of data. Pig includes a high-level language called Pig Latin to express these data flows.

The input to Pig is Pig Latin, which is compiled into MapReduce jobs. Pig uses MapReduce under the hood to do all of its data processing: it compiles Pig Latin scripts into a series of one or more MapReduce jobs that it then runs.
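As a taste of what such a dataflow looks like, here is a minimal Pig Latin sketch. The file name sales.txt and its fields are hypothetical, used only for illustration:

```pig
-- A minimal dataflow: load, filter, store.
-- 'sales.txt' and its schema are hypothetical.
sales = LOAD 'sales.txt' USING PigStorage(',') AS (item:chararray, amount:double);
big   = FILTER sales BY amount > 100.0;
STORE big INTO 'big_sales';
```

Each statement names a step in the flow; Pig compiles the whole script into one or more MapReduce jobs behind the scenes.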

Apache Pig was developed at Yahoo to make Hadoop easy to learn and work with. It was created because MapReduce programming was getting quite difficult, and many users were not comfortable writing low-level MapReduce code. Today, Pig is an open-source project under the Apache Software Foundation.

TABLE OF CONTENTS

  1. Pig characteristics
  2. Pig vs MapReduce
  3. Pig architecture
  4. Pig execution options
  5. Pig Grunt shell commands
  6. Pig data types
  7. Pig operators
  8. Pig Latin script example

1. PIG CHARACTERISTICS

Let's look at some of the characteristics of Pig.

  • Pig has a rich set of operators such as JOIN, ORDER, FILTER, etc.
  • It is easy to program, as Pig Latin is similar to SQL.
  • Tasks in Apache Pig are converted into MapReduce jobs automatically, so programmers can focus on the language semantics rather than on MapReduce.
  • You can create your own user-defined functions (UDFs) in Pig.
  • Functions written in other programming languages such as Java can be embedded in Pig Latin scripts.
  • Apache Pig can handle all kinds of data (structured, semi-structured, and unstructured) and stores the results in HDFS.
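As a sketch of that rich operator set, a join followed by an ordering might look like this. The files and schemas here are hypothetical:

```pig
-- Hypothetical customer/order data, joined and sorted much like in SQL
customers = LOAD 'customers.txt' USING PigStorage(',') AS (custid:int, name:chararray);
orders    = LOAD 'orders.txt'    USING PigStorage(',') AS (orderid:int, custid:int, amount:double);
joined    = JOIN customers BY custid, orders BY custid;
ranked    = ORDER joined BY amount DESC;
DUMP ranked;
```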

2. PIG VS MAPREDUCE

Let's see the difference between Pig and MapReduce.

Pig has several advantages over MapReduce.

Apache Pig is a data flow language: it allows users to describe how data from one or more inputs should be read and processed, and then stored to one or more outputs, in parallel. MapReduce, on the other hand, is a programming paradigm.

Apache Pig is a high-level language, while MapReduce jobs are written in low-level, compiled Java code.

Pig's syntax for performing joins across multiple files is intuitive and quite simple, much like SQL, whereas MapReduce code gets complex if you want to write join operations.

Apache Pig's learning curve is very gentle, while experience with Java and the MapReduce libraries is a must to write MapReduce code.

A few lines of a Pig script can do the equivalent of many lines of MapReduce code; MapReduce needs far more code to perform the same operations.

Apache Pig is easy to debug and test, while MapReduce programs take a long time to code, test, and so on. Developing in Pig Latin is therefore less costly than developing in MapReduce.
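The classic word-count program illustrates this difference: in Java MapReduce it takes dozens of lines, while in Pig Latin it is only a handful. Here input.txt is a hypothetical text file:

```pig
-- word count in a few lines of Pig Latin
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```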

3. PIG ARCHITECTURE

Now let's look at Pig's architecture.


Pig sits on top of Hadoop. Pig scripts can be run in the Grunt shell or through the Pig server. The Pig runtime parses, optimizes, and compiles the script, eventually converting it into MapReduce jobs. Pig uses HDFS to store intermediate data between MapReduce jobs and then writes its final output to HDFS.

4. PIG EXECUTION OPTIONS

Apache Pig can run in two modes of execution. Both produce the same results.

4.1. Local mode

In local mode, Pig runs in a single JVM and uses the local file system. The command to start it is

pig -x local

4.2. Hadoop mode

In Hadoop (MapReduce) mode, Pig translates scripts into MapReduce jobs and runs them on a Hadoop cluster. The command to start it is

pig -x mapreduce

(Equivalently, pig -exectype mapreduce.)

Apache Pig can be run in three ways in either of the two modes above.

  1. Interactive mode / Grunt shell: type Pig commands interactively at the grunt> prompt
  2. Batch mode / script file: put the Pig commands in a script file and run the script
  3. Embedded program / UDF: embed Pig commands in a Java program and run it
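For instance, batch mode simply means saving statements in a .pig file and passing it to the pig command. The script name and data file below are hypothetical:

```pig
-- myscript.pig: run with `pig -x local myscript.pig`
A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
B = FILTER A BY age > 30;
DUMP B;
```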

5. PIG GRUNT SHELL COMMANDS

The Grunt shell can be used to write and run Pig Latin statements interactively. File system and operating system commands can be invoked from it using the fs and sh commands. Let's look at some basic Pig commands.

5.1. fs command

The fs command allows you to run HDFS commands from within Pig.

5.1.1 To list all directories in HDFS

grunt> fs -ls;

Now, all files in HDFS will be displayed.

5.1.2. To create a new mydir directory in HDFS

grunt> fs -mkdir mydir/;

The above command will create a new directory called mydir in HDFS.

5.1.3. To delete a directory

grunt> fs -rmdir mydir;

The above command will delete the created directory mydir.

5.1.4. To copy a file to HDFS

grunt> fs -put sales.txt sales/;

Here, the file called sales.txt is the source file that will be copied to the destination directory in HDFS, namely, sales.

5.1.5. To exit Grunt Shell

grunt> quit;

The above command will exit the grunt shell.

5.2. sh command

The sh command allows you to execute Unix shell commands from within Pig.

5.2.1. To show the current date

grunt> sh date;

This command will display the current date.

5.2.2. To list local files

grunt> sh ls;

This command will show all files on the local system.

5.3. To run a Pig Latin script from the Grunt shell

grunt> run salesreport.pig;

The above command will run the Pig Latin script file “salesreport.pig” from the Grunt shell.

5.4. To run a Pig Latin script from the Unix prompt

$ pig salesreport.pig

The above command will run the Pig Latin script file “salesreport.pig” from the Unix prompt.

6. PIG DATA TYPES

Pig Latin consists of the following data types.

6.1. Data atom

A data atom is a single value, such as a string or a number. Atoms are of scalar types like int, float, double, and chararray.

For instance, “john”, 9.0

6.2. Tuple

A tuple is similar to a record: an ordered sequence of fields. Each field can hold data of any type.

For instance, (‘john’, ‘james’) is a tuple.

6.3. Data bag

It consists of a collection of tuples and is equivalent to a “table” in SQL. The tuples are not required to be unique and can have an arbitrary number of fields, each of which can be of any type.

For instance, {(‘john’, ‘james’), (‘king’, ‘mark’)} is a data bag containing two tuples.

6.4. Data map

This type of data contains a collection of key-value pairs. The key must be a chararray and should be unique; values can be of any type.

For instance, [name#(‘john’, ‘james’), age#22] is a data map where name and age are the keys, and (‘john’, ‘james’) and 22 are their values.
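The four data types above can all appear together in a LOAD schema. A hypothetical sketch (the file name and field names are made up for illustration):

```pig
-- a schema combining atoms, a tuple, a bag, and a map
emp = LOAD 'employees.txt' AS (
    name:chararray,                                    -- data atom
    address:tuple(city:chararray, country:chararray),  -- tuple
    phones:bag{t:(number:chararray)},                  -- data bag
    details:map[]                                      -- data map
);
DESCRIBE emp;
```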

7. PIG OPERATORS

Below is the content of the student.txt file.

John,23,Hyderabad
James,45,Hyderabad
Sam,33,Chennai
,56,Delhi
,43,Mumbai

7.1. LOAD

The LOAD operator loads data from the given file system.

A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

The student file data, with the column names 'name', 'age', and 'city', will be loaded into the relation A. Because the file is comma-delimited, PigStorage(',') is specified as the loader.

7.2. DUMP

The DUMP operator is used to display the content of a relation. Here, the contents of A will be displayed.

DUMP A;
//results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)
(,56,Delhi)
(,43,Mumbai)

7.3. STORE

The STORE operator saves results to the file system.

STORE A INTO 'myoutput' USING PigStorage('*');

Here, the data present in A will be stored in the myoutput directory, with fields separated by '*'.

The stored output file will contain:
John*23*Hyderabad
James*45*Hyderabad
Sam*33*Chennai
*56*Delhi
*43*Mumbai

7.4. FILTER

B = FILTER A by name is not null;

The FILTER operator filters a relation by some condition. Here, name is a column in A; rows with non-null values of name will be stored in the relation B.

DUMP B;
//results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)

7.5. FOREACH … GENERATE

C = FOREACH A GENERATE name, city;

The FOREACH operator is used to access individual columns. Here, the name and city fields of each row of A are projected and stored in C.

DUMP C;
//results
(John,Hyderabad)
(James,Hyderabad)
(Sam,Chennai)
(,Delhi)
(,Mumbai)
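Other relational operators such as GROUP can be chained in the same way. Continuing with the relation A loaded above, here is a sketch that counts students per city (using COUNT_STAR so that the rows with a null name are still counted):

```pig
-- group by city and count all rows in each group
G      = GROUP A BY city;
counts = FOREACH G GENERATE group AS city, COUNT_STAR(A) AS num_students;
DUMP counts;
```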

8. PIG LATIN SCRIPT EXAMPLE

We have a file called people whose fields are the employee's ID, name, and hours worked.

001,Rajiv,21
002,siddarth,12
003,Rajesh,22

First, load this data into a relation called employee. Filter it for employees who work less than 20 hours and store them in parttime. Sort parttime in descending order of hours and save the result to a file called part_time. Then view its contents and schema.

The script will be

employee = LOAD 'people' USING PigStorage(',') AS (empid:chararray, name:chararray, hours:int);
parttime = FILTER employee BY hours < 20;
sorted = ORDER parttime BY hours DESC;
STORE sorted INTO 'part_time';
DUMP sorted;
DESCRIBE sorted;
//results
(002,siddarth,12)
sorted: {empid: chararray, name: chararray, hours: int}

FINAL NOTES

These are some of the basics of Apache Pig. I hope you enjoyed reading this article. Start practicing
with the Cloudera environment.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
