An Introduction to Apache Pig for Absolute Beginners


This article was published as part of the Data Science Blogathon

This article focuses on Apache Pig, a high-level platform for processing and analyzing large amounts of data.

OVERVIEW

At a top-level overview, Pig is an abstraction over MapReduce. Pig runs on Hadoop and therefore uses both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Data flows are executed by an engine and are used to analyze data sets as streams of data. Pig includes a high-level language called Pig Latin to express these data flows.

The input to Pig is Pig Latin, which is compiled into MapReduce jobs. Pig uses MapReduce under the hood to do all of its data processing: it compiles Pig Latin scripts into a series of one or more MapReduce jobs that it then runs.
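As a taste of what such a dataflow looks like, here is a minimal Pig Latin sketch. The file name sales.txt and its fields are hypothetical, used only for illustration:

```pig
-- A minimal dataflow: load, filter, store.
-- 'sales.txt' and its schema are hypothetical.
sales = LOAD 'sales.txt' USING PigStorage(',') AS (item:chararray, amount:double);
big   = FILTER sales BY amount > 100.0;
STORE big INTO 'big_sales';
```

Each statement names a step in the flow; Pig compiles the whole script into one or more MapReduce jobs behind the scenes.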

Apache Pig was developed at Yahoo to make Hadoop easy to learn and work with. It was created because MapReduce programming was getting quite difficult, and many users were not comfortable writing low-level MapReduce code. Today, Pig is an open-source project under the Apache Software Foundation.

TABLE OF CONTENTS

  1. Pig characteristics
  2. Pig vs MapReduce
  3. Pig architecture
  4. Pig execution options
  5. Pig Grunt shell commands
  6. Pig data types
  7. Pig operators
  8. Pig Latin script example

1. PIG CHARACTERISTICS

Let's look at some of the characteristics of Pig.

  • Pig has a rich set of operators such as JOIN, ORDER, FILTER, etc.
  • It is easy to program, as Pig Latin is similar to SQL.
  • Tasks in Apache Pig are converted into MapReduce jobs automatically, so programmers can focus on the language semantics rather than on MapReduce.
  • You can create your own user-defined functions (UDFs) in Pig.
  • Functions written in other programming languages such as Java can be embedded in Pig Latin scripts.
  • Apache Pig can handle all kinds of data (structured, semi-structured, and unstructured) and stores the results in HDFS.
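As a sketch of that rich operator set, a join followed by an ordering might look like this. The files and schemas here are hypothetical:

```pig
-- Hypothetical customer/order data, joined and sorted much like in SQL
customers = LOAD 'customers.txt' USING PigStorage(',') AS (custid:int, name:chararray);
orders    = LOAD 'orders.txt'    USING PigStorage(',') AS (orderid:int, custid:int, amount:double);
joined    = JOIN customers BY custid, orders BY custid;
ranked    = ORDER joined BY amount DESC;
DUMP ranked;
```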

2. PIG VS MAPREDUCE

Let's see the difference between Pig and MapReduce.

Pig has several advantages over MapReduce.

Apache Pig is a data flow language: it allows users to describe how data from one or more inputs should be read and processed, and then stored to one or more outputs, in parallel. MapReduce, on the other hand, is a programming paradigm.

Apache Pig is a high-level language, while MapReduce jobs are written in low-level, compiled Java code.

Pig's syntax for performing joins across multiple files is intuitive and quite simple, much like SQL, whereas MapReduce code gets complex if you want to write join operations.

Apache Pig's learning curve is very gentle, while experience with Java and the MapReduce libraries is a must to write MapReduce code.

A few lines of a Pig script can do the equivalent of many lines of MapReduce code; MapReduce needs far more code to perform the same operations.

Apache Pig is easy to debug and test, while MapReduce programs take a long time to code, test, and so on. Developing in Pig Latin is therefore less costly than developing in MapReduce.
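The classic word-count program illustrates this difference: in Java MapReduce it takes dozens of lines, while in Pig Latin it is only a handful. Here input.txt is a hypothetical text file:

```pig
-- word count in a few lines of Pig Latin
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```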

3. PIG ARCHITECTURE

Now let's look at Pig's architecture.


Pig sits on top of Hadoop. Pig scripts can be run in the Grunt shell or through the Pig server. The Pig runtime parses, optimizes, and compiles the script, eventually converting it into MapReduce jobs. Pig uses HDFS to store intermediate data between MapReduce jobs and then writes its final output to HDFS.

4. PIG EXECUTION OPTIONS

Apache Pig can run in two modes of execution. Both produce the same results.

4.1. Local mode

In local mode, Pig runs in a single JVM and uses the local file system. The command to start it is

pig -x local

4.2. Hadoop mode

In Hadoop (MapReduce) mode, Pig translates scripts into MapReduce jobs and runs them on a Hadoop cluster. The command to start it is

pig -x mapreduce

(Equivalently, pig -exectype mapreduce.)

Apache Pig can be run in three ways in either of the two modes above.

  1. Interactive mode / Grunt shell: type Pig commands interactively at the grunt> prompt
  2. Batch mode / script file: put the Pig commands in a script file and run the script
  3. Embedded program / UDF: embed Pig commands in a Java program and run it
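For instance, batch mode simply means saving statements in a .pig file and passing it to the pig command. The script name and data file below are hypothetical:

```pig
-- myscript.pig: run with `pig -x local myscript.pig`
A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
B = FILTER A BY age > 30;
DUMP B;
```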

5. PIG GRUNT SHELL COMMANDS

The Grunt shell can be used to write and run Pig Latin statements interactively. File system and operating system commands can be invoked from it using the fs and sh commands. Let's look at some basic Pig commands.

5.1. fs command

The fs command allows you to run HDFS commands from within Pig.

5.1.1 To list all directories in HDFS

grunt> fs -ls;

Now, all files in HDFS will be displayed.

5.1.2. To create a new mydir directory in HDFS

grunt> fs -mkdir mydir/;

The above command will create a new directory called mydir in HDFS.

5.1.3. To delete a directory

grunt> fs -rmdir mydir;

The above command will delete the created directory mydir.

5.1.4. To copy a file to HDFS

grunt> fs -put sales.txt sales/;

Here, the file called sales.txt is the source file that will be copied to the destination directory in HDFS, namely, sales.

5.1.5. To exit Grunt Shell

grunt> quit;

The above command will exit the grunt shell.

5.2. sh command

The sh command allows you to execute Unix shell commands from within Pig.

5.2.1. To show the current date

grunt> sh date;

This command will display the current date.

5.2.2. To list local files

grunt> sh ls;

This command will show all files on the local system.

5.3. To run a Pig Latin script from the Grunt shell

grunt> run salesreport.pig;

The above command will run the Pig Latin script file “salesreport.pig” from the Grunt shell.

5.4. To run a Pig Latin script from the Unix prompt

$ pig salesreport.pig

The above command will run the Pig Latin script file “salesreport.pig” from the Unix prompt.

6. PIG DATA TYPES

Pig Latin consists of the following data types.

6.1. Data atom

A data atom is a single value, such as a string or a number. Atoms are of scalar types like int, float, double, and chararray.

For instance, “john”, 9.0

6.2. Tuple

A tuple is similar to a record: an ordered sequence of fields. Each field can hold data of any type.

For instance, (‘john’, ‘james’) is a tuple.

6.3. Data bag

It consists of a collection of tuples and is equivalent to a “table” in SQL. The tuples are not required to be unique and can have an arbitrary number of fields, each of which can be of any type.

For instance, {(‘john’, ‘james’), (‘king’, ‘mark’)} is a data bag containing two tuples.

6.4. Data map

This type of data contains a collection of key-value pairs. The key must be a chararray and should be unique; values can be of any type.

For instance, [name#(‘john’, ‘james’), age#22] is a data map where name and age are the keys, and (‘john’, ‘james’) and 22 are their values.
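The four data types above can all appear together in a LOAD schema. A hypothetical sketch (the file name and field names are made up for illustration):

```pig
-- a schema combining atoms, a tuple, a bag, and a map
emp = LOAD 'employees.txt' AS (
    name:chararray,                                    -- data atom
    address:tuple(city:chararray, country:chararray),  -- tuple
    phones:bag{t:(number:chararray)},                  -- data bag
    details:map[]                                      -- data map
);
DESCRIBE emp;
```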

7. PIG OPERATORS

Below is the content of the student.txt file.

John,23,Hyderabad
James,45,Hyderabad
Sam,33,Chennai
,56,Delhi
,43,Mumbai

7.1. LOAD

The LOAD operator loads data from the given file system.

A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

The student file data, with the column names 'name', 'age', and 'city', will be loaded into the relation A. Because the file is comma-delimited, PigStorage(',') is specified as the loader.

7.2. DUMP

The DUMP operator is used to display the content of a relation. Here, the contents of A will be displayed.

DUMP A;
//results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)
(,56,Delhi)
(,43,Mumbai)

7.3. STORE

The STORE operator saves results to the file system.

STORE A INTO 'myoutput' USING PigStorage('*');

Here, the data present in A will be stored in the myoutput directory, with fields separated by '*'.

The stored output file will contain:
John*23*Hyderabad
James*45*Hyderabad
Sam*33*Chennai
*56*Delhi
*43*Mumbai

7.4. FILTER

B = FILTER A by name is not null;

The FILTER operator filters a relation by some condition. Here, name is a column in A; rows with non-null values of name will be stored in the relation B.

DUMP B;
//results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)

7.5. FOREACH … GENERATE

C = FOREACH A GENERATE name, city;

The FOREACH operator is used to access individual columns. Here, the name and city fields of each row of A are projected and stored in C.

DUMP C;
//results
(John,Hyderabad)
(James,Hyderabad)
(Sam,Chennai)
(,Delhi)
(,Mumbai)
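Other relational operators such as GROUP can be chained in the same way. Continuing with the relation A loaded above, here is a sketch that counts students per city (using COUNT_STAR so that the rows with a null name are still counted):

```pig
-- group by city and count all rows in each group
G      = GROUP A BY city;
counts = FOREACH G GENERATE group AS city, COUNT_STAR(A) AS num_students;
DUMP counts;
```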

8. PIG LATIN SCRIPT EXAMPLE

We have a file called people whose fields are the employee's ID, name, and hours worked.

001,Rajiv,21
002,siddarth,12
003,Rajesh,22

First, load this data into a relation called employee. Filter it for employees who work less than 20 hours and store them in parttime. Sort parttime in descending order of hours and save the result to a file called part_time. Then view its contents and schema.

The script will be

employee = LOAD 'people' USING PigStorage(',') AS (empid:chararray, name:chararray, hours:int);
parttime = FILTER employee BY hours < 20;
sorted = ORDER parttime BY hours DESC;
STORE sorted INTO 'part_time';
DUMP sorted;
DESCRIBE sorted;
//results
(002,siddarth,12)
sorted: {empid: chararray, name: chararray, hours: int}

FINAL NOTES

These are some of the basics of Apache Pig. I hope you enjoyed reading this article. Start practicing
with the Cloudera environment.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
