Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is called Pig Latin. Pig scripts get converted into MapReduce jobs that are executed on data stored in HDFS. Pig can handle structured, semi-structured, or unstructured data and store results back in HDFS. Common Pig operations include joining, sorting, filtering, grouping, and using built-in and user-defined functions.
2. CONTENTS:
PIG background
PIG Architecture
PIG Latin Basics
PIG Execution Modes
PIG Processing: loading and transforming data
PIG Built-in functions
Filtering, grouping, sorting
Installation of Pig and Pig Latin commands
3. Apache Pig was
originally developed
at Yahoo Research around
2006 for researchers to have
an ad hoc way of creating
and executing MapReduce
jobs on very large data sets.
In 2007, it was moved into
the Apache Software
Foundation.
4. The story goes that the researchers
working on the project initially
referred to it simply as 'the
language'. Eventually they needed
to call it something.
Off the top of his head, one
researcher suggested Pig, and the
name stuck.
It is quirky yet memorable and easy
to spell.
While some have hinted that the
name sounds coy or silly, it has
provided us with an entertaining
nomenclature, such as Pig Latin for
the language, Grunt for the shell,
and PiggyBank for the CPAN-like
shared repository.
5. This Pig tutorial covers basic and advanced concepts of Pig and is
designed for beginners and professionals alike.
Pig is a high-level data flow platform for executing MapReduce programs
on Hadoop. It was developed by Yahoo. The language for Pig is Pig Latin.
Pig provides an engine for executing data flows in parallel on Hadoop. It
includes a language, Pig Latin, for expressing these data flows.
Pig Latin includes operators for many of the traditional data operations
(join, sort, filter, etc.), as well as the ability for users to develop their own
functions for reading, processing, and writing data.
Pig is an Apache open source project. This means users are free to
download it as source or binary, use it for themselves, contribute to it,
and—under the terms of the Apache License—use it in their products and
change it as they see fit.
6. Apache Pig is a high-level data flow platform for executing
MapReduce programs on Hadoop. The language used for Pig
is Pig Latin.
Pig scripts are internally converted to MapReduce jobs
and executed on data stored in HDFS. Apart from that,
Pig can also execute its jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-
structured, or unstructured, and stores the corresponding
results in HDFS (Hadoop Distributed File System). Every task
that can be achieved using Pig can also be achieved using
Java MapReduce.
7. FEATURES OF APACHE PIG
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
8. ADVANTAGES OF APACHE PIG
•Less code - Pig requires fewer lines of
code to perform an operation.
•Reusability - Pig code is flexible
enough to be reused.
•Nested data types - Pig provides
useful nested data types such as
tuple, bag, and map.
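As a sketch of the nested data types in practice (the file name and schema here are hypothetical):

```pig
-- students.txt is a hypothetical input file whose second column
-- is a bag of tuples and whose third column is a map
A = LOAD 'students.txt'
    AS (name:chararray,
        phones:bag{t:tuple(num:chararray)},
        props:map[]);
-- look up a map value by key with the # operator
B = FOREACH A GENERATE name, props#'city';
DUMP B;
```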
11. HIVE VS PIG VS SQL – WHEN TO USE WHAT?
When to Use Hive
Facebook widely uses Apache Hive for analytics. Furthermore, they usually
promote the Hive language due to its extensive feature list and similarities with SQL. Here are
some scenarios where Apache Hive is ideal to use:
• To query large datasets: Apache Hive is especially suited to analytics on
huge datasets. It offers an easy way to quickly carry out complex querying on
datasets and to inspect the datasets stored in the Hadoop ecosystem.
• For extensibility: Apache Hive provides a range of user APIs that help in building
custom behaviour for the query engine.
• For someone familiar with SQL concepts: If you are familiar with SQL, Hive
will be very easy to use, as you will see many similarities between the two. Hive uses
clauses such as SELECT, WHERE, ORDER BY, and GROUP BY, just as SQL does.
• To work on structured data: For structured data, Hive is widely adopted
everywhere.
• To analyse historical data: Apache Hive is a great tool for analysing and querying
historical data collected over a period of time.
12. When to Use Pig
Apache Pig, developed at Yahoo Research in 2006, is known for its extensibility and
scope for optimization. The language uses a multi-query approach that reduces data-scanning
time. It usually runs on the client side of a Hadoop cluster. It is also quite easy to use when you
are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
• To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big
data. It is a data flow system that uses Pig Latin, a simple language for data queries and
manipulation.
• As a programmer with scripting knowledge: Programmers with scripting
experience can learn to use Apache Pig quickly and efficiently.
• For fast processing: Apache Pig is often faster than Hive because it uses a multi-query
approach, and it is widely known for its speed.
• When you don’t want to work with a schema: With Apache Pig, there is no need
to create a schema before loading data.
• For SQL-like functions: Pig has many SQL-like functions, along with the cogroup
function.
13. When to Use SQL
SQL is a general-purpose database management language used around the globe. It has
been evolving to meet user expectations for decades. It is declarative and hence
focuses explicitly on ‘what’ is needed.
It is popularly used for transactional as well as analytical queries. When the
requirements are not too demanding, SQL works as an excellent tool. Here are a few
scenarios –
• For better performance: SQL is known for its ability to pull data quickly and
frequently. It supports OLTP (Online Transaction Processing) applications and performs
well for them, whereas Hive is slow for online transactional needs.
• When the datasets are small: SQL works well with small datasets and
performs much better on smaller amounts of data. It also offers many ways to
optimise queries.
• For frequent data manipulation: If your requirements involve frequent
modification of records, or you need to update a large number of records often, SQL
can perform these activities well. SQL also provides a fully interactive experience to
the user.
15. APACHE PIG RUN MODES
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimentation, and
prototyping.
• Here, Pig runs on the local host.
• Local mode works on the local file system; the input and output data are
stored in the local file system.
The command to start the Grunt shell in local mode:
$ pig -x local
MapReduce Mode
• MapReduce mode is also known as Hadoop mode.
• It is the default mode.
• In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the
cluster.
• It can be executed against a pseudo-distributed or fully distributed Hadoop
installation.
• Here, the input and output data reside on HDFS.
The command to start the Grunt shell in MapReduce mode:
$ pig
16. WAYS TO EXECUTE PIG PROGRAMS
These are the ways of executing a Pig program in
local and MapReduce mode: -
• Interactive Mode - In this mode, Pig is executed in the
Grunt shell. To invoke the Grunt shell, run the pig command.
Once in Grunt, we can enter Pig Latin
statements and commands interactively at the command line.
• Batch Mode - In this mode, we run a script file with a
.pig extension. These files contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own
functions, called UDFs (User
Defined Functions), using programming languages
such as Java and Python.
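For instance, batch mode might look like the following (the script name, input file, and contents are illustrative):

```pig
-- wordcount.pig: a hypothetical script, run with `pig wordcount.pig`
lines  = LOAD 'input.txt' AS (line:chararray);
-- split each line into words and un-nest the resulting bag
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';
```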
18. The language used to analyse data in Hadoop using Pig is known
as Pig Latin.
It is a high-level data processing language which provides a rich
set of data types and operators to perform various operations on
the data.
To perform a particular task using Pig,
programmers need to write a Pig script in the Pig Latin
language and execute it using one of the execution
mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts will go through a series of
transformations applied by the Pig Framework, to produce the
desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
The architecture of Apache Pig is shown below.
19. APACHE PIG COMPONENTS
As shown in the figure, there are various components in the Apache
Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax
of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which
carries out logical optimizations such as projection pushdown.
20. Compiler
The compiler compiles the optimized logical
plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to
Hadoop in sorted order, where they are executed
to produce the desired results.
21. PIG LATIN DATA MODEL
The data model of Pig Latin is
fully nested and it allows
complex non-atomic datatypes
such as map and tuple.
Given below is the
diagrammatical representation
of Pig Latin’s data model.
22. An atomic value is one that is indivisible within the
context of a database field definition (e.g. integer,
real, a code of some sort, etc.).
Field values that are not atomic fall into two
undesirable types (Elmasri & Navathe 1989,
pp. 139, 141): composite and multivalued.
23. Atom
Any single value in Pig Latin, irrespective of its data
type, is known as an Atom. It is stored as a string and
can be used as a string or a number. int, long, float,
double, chararray, and bytearray are the atomic types
of Pig. A piece of data or a simple atomic value is
known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is
known as a tuple, the fields can be of any type. A tuple
is similar to a row in a table of RDBMS.
Example − (Raja, 30)
24. Bag
A bag is an unordered set of tuples.
In other words, a collection of tuples (non-unique) is known as a
bag.
Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{}’.
It is similar to a table in RDBMS, but unlike a table in RDBMS, it
is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the
same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known
as an inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
25. Map
A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should be
unique. The value may be of any type. A map is
represented by ‘[]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin
are unordered (there is no guarantee that tuples are
processed in any particular order).
26. The Grunt shell is Pig’s interactive command shell.
The Grunt shell of Apache Pig is mainly used to write
Pig Latin scripts.
Pig scripts can be executed from the Grunt shell, the
native shell provided by Apache Pig for executing Pig
queries.
We can also invoke shell commands using sh and fs.
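For example, sh and fs might be used from Grunt like this (the paths are illustrative):

```pig
-- run a local shell command from the Grunt shell
grunt> sh ls /tmp
-- run an HDFS filesystem command from the Grunt shell
grunt> fs -ls /user/data
```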
27. JOB EXECUTION FLOW
The developer creates the scripts, which reside on
the local file system.
When the developer submits a Pig script, it is
handed to the Pig Latin compiler.
The compiler then splits the task and runs a series of MR
jobs.
Meanwhile, the Pig compiler retrieves data from HDFS.
After the MR jobs run, the output file is written
back to HDFS.
28. a. Pig Execution Modes
We can run Pig in two execution modes.
These modes depend on where the Pig script is going
to run and on where the data resides.
Data can be stored on a single machine or in a
distributed environment such as a cluster.
There are three different ways to run Pig programs:
• Non-interactive shell or script mode - the user
creates a file containing the code and executes the script.
• Grunt shell or interactive mode - Apache Pig
commands are entered interactively.
• Embedded mode - Pig is invoked from a host language
such as Java, much as JDBC is used to run SQL
programs from Java.
29. b. Pig Local Mode
In this mode, Pig runs in a single
JVM and accesses the local file system.
This mode is better suited to small data
sets.
Parallel mapper execution is not possible,
since older versions of Hadoop are not thread-safe.
The user can pass -x local to enter Pig’s
local mode of execution.
In local mode, Pig always looks for local file system
paths while loading data.
30. c. Pig MapReduce Mode
In this mode, the user has a proper Hadoop
cluster set up with Pig installed on it. By default,
Apache Pig runs in MapReduce mode. Pig
translates the queries into MapReduce
jobs and runs them on top of the Hadoop cluster. Hence,
this mode runs as MapReduce on a
distributed cluster.
Statements such as LOAD and STORE read
data from the HDFS file system and write the output
back; they are also used to process the data.
31. d. Storing Results
Intermediate data is generated during the
processing of MR jobs.
Pig stores this data in a non-permanent location
on HDFS storage; a temporary location is created
inside HDFS for this intermediate data.
We can use DUMP to print the final results
to the output screen.
Output results are stored using the STORE
operator.
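A minimal sketch of the two options (the file and relation names are hypothetical):

```pig
results = LOAD 'sales.txt' AS (item:chararray, amount:int);
-- print to the console for inspection
DUMP results;
-- persist to HDFS with the STORE operator, tab-delimited
STORE results INTO 'sales_out' USING PigStorage('\t');
```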
32. Type Description
int 4-byte integer
long 8-byte integer
float 4-byte (single precision) floating point
double 8-byte (double precision) floating point
bytearray Array of bytes; blob
chararray String (“hello world”)
boolean True/False (case insensitive)
datetime A date and time
biginteger Java BigInteger
bigdecimal Java BigDecimal
33. Type Description
Tuple Ordered set of fields (a “row / record”)
Bag Collection of tuples (a “resultset / table”)
Map A set of key-value pairs
Keys must be of type chararray
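These types appear in LOAD schemas, for example (the file name and fields are hypothetical):

```pig
-- declare scalar types for each field at load time
emp = LOAD 'employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, salary:double,
          hired:datetime, active:boolean);
```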
34. BinStorage
Loads and stores data in machine-readable (binary) format
PigStorage
Loads and stores data as structured, field delimited text files
TextLoader
Loads unstructured data in UTF-8 format
PigDump
Stores data in UTF-8 format
YourOwnFormat!
via UDFs
35. Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' using PigStorage()
AS (id, name, salary);
Each LOAD statement defines a new bag
Each bag can have multiple elements (atoms)
Each element can be referenced by name or position ($n)
A bag is immutable
A bag can be aliased and referenced later
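Referencing elements by name versus by position might look like this (schema hypothetical):

```pig
var = LOAD 'employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, salary:float);
-- reference by name (name) and by position ($2 is the third field)
pay = FOREACH var GENERATE name, $2;
```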
36. STORE
Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
Fails if directory exists
Writes output files, part-[m|r]-xxxxx, to the directory
PigStorage can be used to specify a field delimiter
DUMP
Writes output to the screen
grunt> DUMP processed;
37. FOREACH
Applies expressions to every record in a bag
FILTER
Filters by expression
GROUP
Collect records with the same key
ORDER BY
Sorting
DISTINCT
Removes duplicates
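As a sketch, FOREACH, the first operator above, applied to a hypothetical relation:

```pig
emp    = LOAD 'employees.txt' AS (name:chararray, salary:double);
-- apply an expression to every record in the bag
bumped = FOREACH emp GENERATE name, salary * 1.10 AS new_salary;
DUMP bumped;
```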
38. Use the FILTER operator to restrict tuples or rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
39. Use the GROUP…ALL operator to group data
Use GROUP when only one relation is involved
Use COGROUP when multiple relations are involved
Basic syntax:
alias2 = GROUP alias1 ALL;
Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
40. Use the ORDER…BY operator to sort a relation based on one
or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
41. Use the DISTINCT operator to remove duplicate tuples in
a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2= DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
42. FLATTEN
Used to un-nest tuples as well as bags
INNER JOIN
Used to perform an inner join of two or more relations based on
common field values
OUTER JOIN
Used to perform left, right or full outer joins
SPLIT
Used to partition the contents of a relation into two or more
relations
SAMPLE
Used to select a random data sample with the stated sample
size
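SPLIT and SAMPLE, sketched on a hypothetical relation:

```pig
data   = LOAD 'scores.txt' AS (name:chararray, score:int);
-- partition the relation into two relations by condition
SPLIT data INTO passed IF score >= 50, failed IF score < 50;
-- take roughly a 10% random sample of the tuples
subset = SAMPLE data 0.1;
```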
43. Use the JOIN operator to perform an inner
equijoin of two or more relations based on common
field values
The JOIN operator always performs an inner join
Inner joins ignore null keys
Filter null keys before the join
JOIN and COGROUP operators perform similar
functions
JOIN creates a flat set of output records
COGROUP creates a nested set of output records
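An inner join on a common key might be written as follows (the relations and schemas are hypothetical):

```pig
emps  = LOAD 'employees.txt'   AS (id:int, name:chararray, dept_id:int);
depts = LOAD 'departments.txt' AS (dept_id:int, dept:chararray);
-- inner equijoin on dept_id; tuples with null keys are dropped
joined = JOIN emps BY dept_id, depts BY dept_id;
```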
45. Use the OUTER JOIN operator to perform left, right, or full
outer joins
Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
The keywords LEFT, RIGHT, and FULL imply left outer, right outer,
and full outer joins respectively
Outer joins will only work provided the relations which need
to produce nulls (in the case of non-matching keys) have
schemas
Outer joins will only work for two-way joins
To perform a multi-way outer join perform multiple two-way outer
join statements
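A left outer join, following the syntax above (the relations are hypothetical; both have schemas so nulls can be produced):

```pig
emps  = LOAD 'employees.txt'   AS (id:int, name:chararray, dept_id:int);
depts = LOAD 'departments.txt' AS (dept_id:int, dept:chararray);
-- keep all emps tuples; unmatched rows get nulls for depts fields
joined = JOIN emps BY dept_id LEFT OUTER, depts BY dept_id;
```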
46. UDFs are natively written in Java and packaged as a jar file
Other supported languages include Jython, JavaScript, Ruby,
Groovy, and Python
Register the jar with the REGISTER statement
Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
47. DEFINE can be used to work with UDFs and also
streaming commands
Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING
PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* Define UDFs to a more readable format */
DEFINE MAXNUM
org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float,
gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;