By,
Anuja Gunale-Kasle
CONTENTS:
PIG background
PIG Architecture
PIG Latin Basics
PIG Execution Modes
PIG Processing: loading and transforming data
PIG Built-in functions
Filtering, grouping, sorting
Data Installation of PIG and PIG Latin commands
Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
• The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something.
• Off the top of his head, one researcher suggested Pig, and the name stuck.
• It is quirky yet memorable and easy to spell.
• While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.
• This tutorial covers basic and advanced concepts of Pig and is designed for beginners and professionals.
• Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. It was developed by Yahoo. The language for Pig is Pig Latin.
• Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows.
• Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
• Pig is an Apache open source project. This means users are free to download it as source or binary, use it for themselves, contribute to it, and, under the terms of the Apache License, use it in their products and change it as they see fit.
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is Pig Latin.
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java MapReduce.
FEATURES OF APACHE PIG
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
ADVANTAGES OF APACHE PIG
• Less code - Pig needs fewer lines of code to perform an operation than hand-written MapReduce.
• Reusability - Pig code is flexible enough to be reused.
• Nested data types - Pig provides the useful concept of nested data types such as tuple, bag, and map.
HIVE VS PIG VS SQL – WHEN TO USE WHAT?
When to Use Hive
• Facebook widely uses Apache Hive for analytical purposes. Furthermore, they usually promote the Hive language because of its extensive feature list and its similarities with SQL. Here are some of the scenarios in which Apache Hive is ideal:
• To query large datasets: Apache Hive is specially used for analytics purposes on
huge datasets. It is an easy way to approach and quickly carry out complex querying on
datasets and inspect the datasets stored in the Hadoop ecosystem.
• For extensibility: Apache Hive contains a range of user APIs that help in building
the custom behaviour for the query engine.
• For someone familiar with SQL concepts: If you are familiar with SQL, Hive
will be very easy to use as you will see many similarities between the two. Hive uses the
clauses like select, where, order by, group by, etc. similar to SQL.
• To work on Structured Data: In case of structured data, Hive is widely adopted
everywhere.
• To analyse historical data: Apache Hive is a great tool for analysis and querying
of the data which is historical and collected over a period.
When to Use Pig
• Apache Pig, developed by Yahoo Research in 2006, is known for its extensibility and optimization scope. It uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of a Hadoop cluster. It is also quite easy to use when you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios:
• To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation.
• As a programmer with scripting knowledge: Programmers with scripting knowledge can learn to use Apache Pig easily and efficiently.
• For fast processing: Because of its multi-query approach, Apache Pig can be faster than Hive for some workloads.
• When you don’t want to work with a schema: With Apache Pig, there is no need to create a schema before loading data.
• For SQL-like functions: It has many functions related to SQL, along with the cogroup function.
When to Use SQL
• SQL is a general-purpose database language used around the globe. It has been evolving with user expectations for decades. It is declarative and hence focuses explicitly on ‘what’ is needed.
• It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios:
• For better performance: SQL is known for its ability to pull data quickly and frequently. It supports OLAP (Online Analytical Processing) applications and performs better for these applications. Hive is slow for online transactional needs.
• When the datasets are small: SQL works well with small datasets and performs much better for smaller amounts of data. It also offers many ways to optimise queries.
• For frequent data manipulation: If you need to modify records frequently or update a large number of records frequently, SQL can perform these activities well. SQL also provides an entirely interactive experience to the user.
APACHE PIG RUN MODES
• Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
• It executes in a single JVM and is used for development, experimentation, and prototyping.
• Here, files are installed and run using localhost.
• The local mode works on the local file system. The input and output data are stored in the local file system.
• The command for the local mode Grunt shell:
$ pig -x local
MapReduce Mode
• The MapReduce mode is also known as Hadoop Mode.
• It is the default mode.
• In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be executed against a pseudo-distributed or fully distributed Hadoop installation.
• Here, the input and output data are present on HDFS.
• The command for MapReduce mode:
$ pig -x mapreduce
WAYS TO EXECUTE PIG PROGRAM
The following are the ways of executing a Pig program in local and MapReduce mode:
• Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line.
• Batch Mode - In this mode, we can run a script file with a .pig extension. These files contain Pig Latin commands.
• Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python.
PIG ARCHITECTURE
The language used to analyse data in Hadoop using Pig is known
as Pig Latin.
It is a high-level data processing language which provides a rich
set of data types and operators to perform various operations on
the data.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded).
After execution, these scripts will go through a series of
transformations applied by the Pig Framework, to produce the
desired output.
Internally, Apache Pig converts these scripts into a series of
MapReduce jobs, and thus, it makes the programmer’s job easy.
The architecture of Apache Pig is shown below.
APACHE PIG COMPONENTS
As shown in the figure, there are various components in the Apache
Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax
of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which
carries out the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical
plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order and executed there, producing the desired results.
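The DAG produced by the parser can be pictured with a small model. This is a minimal sketch in plain Python, not Pig's internal classes: the operator names and the `plan` dictionary are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only to walk the graph so that every operator comes after the operators feeding it.

```python
from graphlib import TopologicalSorter

# Hypothetical logical plan: each key is an operator (node) and its value
# is the set of operators whose output it consumes (incoming edges).
plan = {
    "LOAD students": set(),
    "FILTER by age": {"LOAD students"},
    "GROUP by age": {"FILTER by age"},
    "STORE result": {"GROUP by age"},
}

def execution_order(dag):
    """Return the operators in an order where every input comes first."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(plan)
print(order)
```

The nodes are the logical operators and the edges are the data flows, which is the same picture the Parser section describes.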
PIG LATIN DATA MODEL
The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple.
Given below is the diagrammatical representation of Pig Latin’s data model.
An atomic value is one that is indivisible within the context of a database field definition (e.g. an integer, a real number, a code of some sort).
Field values that are not atomic are of two undesirable types (Elmasri & Navathe 1989 p.139,41):
• Composite
• Multivalued
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)
Bag
A bag is an unordered collection of tuples; the tuples need not be unique.
Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{}’.
• It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})
Map
A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should be
unique. The value might be of any type. It is represented
by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin
are unordered (there is no guarantee that tuples are
processed in any particular order).
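To make the nesting concrete, the types above can be mirrored with ordinary Python values. This is only an analogy under stated assumptions: Python tuples, lists and dicts stand in for Pig tuples, bags and maps, and the variable names are hypothetical.

```python
# Plain-Python stand-ins for Pig's data model (an analogy, not Pig's own types).
atom = "Raja"                                          # atom: a single value
record = ("Raja", 30)                                  # tuple: ordered fields
bag = [("Raja", 30), ("Mohammad", 45), ("Raja", 30)]   # bag: duplicates allowed
contact = {"name": "Raja", "age": 30}                  # map: string keys -> any value

# A tuple whose third field is an inner bag.
nested = ("Raja", 30, [("9848022338", "raja@gmail.com")])

# Fields can be read by position, like $0 and $1 in Pig Latin.
first_field = record[0]
inner_bag_size = len(nested[2])
print(first_field, inner_bag_size, contact["age"])
```

A list is used for the bag because, unlike a Python set, it can hold the same tuple more than once; its ordering should be ignored when reading it as a bag.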
Grunt is Pig's interactive shell.
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Pig scripts can be executed from the Grunt shell, the native shell that Apache Pig provides for running Pig queries.
We can also invoke shell and file system commands from Grunt using sh and fs.
JOB EXECUTION FLOW
The developer writes the scripts, which are stored in the local file system as functions.
When the developer submits a Pig script, it is passed to the Pig Latin compiler.
The compiler then splits the task and runs a series of MR jobs.
Meanwhile, the Pig compiler retrieves data from HDFS.
After the MR jobs have run, the output file goes back to HDFS.
a. Pig Execution Modes
We can run Pig in two execution modes.
These modes depend on where the Pig script is going to run and on where the data resides.
The data can be stored on a single machine or in a distributed environment such as a cluster.
There are three different ways to run Pig programs:
• Non-interactive shell or script mode - The user has to create a file, put the code in it and execute the script.
• Grunt shell or interactive shell - Used for running Apache Pig commands interactively.
• Embedded mode - Lets us run Pig programs from Java, much as JDBC is used to run SQL programs from Java.
b. Pig Local mode
In this mode, Pig runs in a single JVM and accesses the local file system.
This mode is better suited to small data sets.
Parallel mapper execution is not possible in this mode, because older versions of Hadoop are not thread-safe.
The user can pass -x local to get into Pig's local mode of execution.
In this mode, Pig always looks for a local file system path when loading data.
c. Pig Map Reduce Mode
In this mode, the user needs a proper Hadoop cluster setup with Pig installed on it. By default, Apache Pig runs in MR mode. Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster. Hence, in this mode MapReduce runs on a distributed cluster.
Statements such as LOAD and STORE read data from the HDFS file system and show output. These statements are also used to
d. Storing Results
Intermediate data is generated during the processing of MR jobs.
Pig stores this data in a non-permanent location on HDFS storage.
A temporary location is created inside HDFS for storing this intermediate data.
We can use DUMP to get the final results on the output screen.
The output results are stored using the STORE operator.
Type        Description
int         4-byte integer
long        8-byte integer
float       4-byte (single precision) floating point
double      8-byte (double precision) floating point
bytearray   Array of bytes; blob
chararray   String (“hello world”)
boolean     True/False (case insensitive)
datetime    A date and time
biginteger  Java BigInteger
bigdecimal  Java BigDecimal

Type        Description
Tuple       Ordered set of fields (a “row / record”)
Bag         Collection of tuples (a “resultset / table”)
Map         A set of key-value pairs; keys must be of type chararray
BinStorage - Loads and stores data in machine-readable (binary) format
PigStorage - Loads and stores data as structured, field-delimited text files
TextLoader - Loads unstructured data in UTF-8 format
PigDump - Stores data in UTF-8 format
YourOwnFormat! - via UDFs
LOAD
Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag
Each bag can have multiple elements (atoms)
Each element can be referenced by name or position ($n)
A bag is immutable
A bag can be aliased and referenced later
STORE
Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
PigStorage can be used to specify a field delimiter
DUMP
Write output to screen
grunt> DUMP processed;
FOREACH - Applies expressions to every record in a bag
FILTER - Filters by expression
GROUP - Collects records with the same key
ORDER BY - Sorting
DISTINCT - Removes duplicates
Use the FILTER operator to restrict tuples or rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
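The FILTER example can be checked by hand with a short plain-Python model. This is a sketch of the semantics only, not Pig code; the `keep` helper is a hypothetical name and the rows are copied from the example above.

```python
# Rows of alias1 from the example, with fields (col1, col2, col3).
alias1 = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def keep(row):
    """Mirror the predicate (col1 == 8) OR (NOT (col2 + col3 > col1))."""
    col1, col2, col3 = row
    return col1 == 8 or not (col2 + col3 > col1)

# FILTER keeps only the rows for which the expression is true.
alias2 = [row for row in alias1 if keep(row)]
print(alias2)
```

For instance, (1,2,3) is dropped because col1 is not 8 and 2 + 3 > 1 holds, while (7,2,5) is kept because 2 + 5 > 7 does not hold.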
• Use the GROUP operator to group data (GROUP…ALL places all tuples in a single group)
• Use GROUP when only one relation is involved
• Use COGROUP when multiple relations are involved
• Basic syntax:
alias2 = GROUP alias1 { ALL | BY expression };
• Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
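The grouping in this example can be modelled in a few lines of plain Python. This is a hedged sketch of what GROUP … BY produces, not Pig's implementation; `group_by` is a hypothetical helper and the `F` float suffix from the Pig output is dropped.

```python
from collections import defaultdict

# Rows of alias1 from the example, with fields (col1, col2, col3).
alias1 = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by(rows, key_index):
    """Collect rows that share a key into one (key, bag) pair per key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    # Sorted only to make the printout stable; relations are unordered.
    return sorted(groups.items())

alias2 = group_by(alias1, 1)
for key, bag in alias2:
    print(key, bag)
```

Each output record has two fields: the group key and a bag holding the complete original tuples, which is why the key value also appears inside every tuple of the bag.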
Use the ORDER…BY operator to sort a relation based on one
or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
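The same ordering can be reproduced in plain Python as a quick check. This is a sketch of the semantics only; `order_by` is a hypothetical helper, and the relative order of rows with equal col3 values is an artifact of Python's stable sort rather than something ORDER BY promises.

```python
# Rows of alias1 from the example, with fields (col1, col2, col3).
alias1 = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

def order_by(rows, field_index, descending=False):
    """Return a new list sorted on one field; the input is left unchanged."""
    return sorted(rows, key=lambda row: row[field_index], reverse=descending)

# ORDER alias1 BY col3 DESC
alias2 = order_by(alias1, 2, descending=True)
print(alias2)
```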
Use the DISTINCT operator to remove duplicate tuples in
a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
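A minimal plain-Python model of DISTINCT, assuming whole tuples are compared field by field. The `distinct` helper is a hypothetical name; keeping the first occurrence is a choice made here for readable output, since a relation carries no order.

```python
# Rows of alias1 from the example.
alias1 = [(8, 3, 4), (1, 2, 3), (4, 3, 3), (4, 3, 3), (1, 2, 3)]

def distinct(rows):
    """Drop duplicate tuples, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))

alias2 = distinct(alias1)
print(alias2)
```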
FLATTEN - Used to un-nest tuples as well as bags
INNER JOIN - Used to perform an inner join of two or more relations based on common field values
OUTER JOIN - Used to perform left, right or full outer joins
SPLIT - Used to partition the contents of a relation into two or more relations
SAMPLE - Used to select a random data sample with the stated sample size
Use the JOIN operator to perform an inner equi-join of two or more relations based on common field values
In this form, the JOIN operator always performs an inner join
Inner joins ignore null keys
• Filter null keys before the join
JOIN and COGROUP operators perform similar functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
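The difference between the flat output of JOIN and the nested output of COGROUP can be seen in a small plain-Python model. This is a sketch of the semantics under stated assumptions, not Pig's implementation: `inner_join` and `cogroup` are hypothetical helpers, the rows are copied from the example, and the nested-loop join is chosen for clarity rather than speed.

```python
alias1 = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
alias2 = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def inner_join(left, right, left_key=0, right_key=0):
    """Flat output: one concatenated tuple per matching pair of rows."""
    return [a + b for a in left for b in right
            # Null (None) keys never match, as in an inner join.
            if a[left_key] is not None and a[left_key] == b[right_key]]

def cogroup(left, right, left_key=0, right_key=0):
    """Nested output: one (key, left_bag, right_bag) tuple per key."""
    keys = sorted({a[left_key] for a in left} | {b[right_key] for b in right})
    return [(k,
             [a for a in left if a[left_key] == k],
             [b for b in right if b[right_key] == k]) for k in keys]

alias3 = inner_join(alias1, alias2)
print(sorted(alias3))
print(cogroup(alias1, alias2)[0])
```

The join yields seven flat tuples, the same rows as the DUMP above, whereas the cogroup yields one record per key, including keys such as 7 and 2 that appear on only one side and therefore get an empty bag for the other.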
Use the OUTER JOIN operator to perform left, right, or full outer joins
• Pig Latin syntax closely adheres to the SQL standard
• The keyword OUTER is optional
• The keywords LEFT, RIGHT and FULL imply left outer, right outer and full outer joins respectively
Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas
Outer joins will only work for two-way joins
• To perform a multi-way outer join, perform multiple two-way outer join statements
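Why the null-producing side needs a schema becomes clearer in a small plain-Python model of a left outer join. This is a sketch only: `left_outer_join` is a hypothetical helper, the rows are a hypothetical subset of the earlier JOIN example, and `right_width` stands in for the schema of the right-hand relation.

```python
alias1 = [(1, 2, 3), (4, 2, 1), (7, 2, 5)]
alias2 = [(1, 3), (4, 6), (4, 9)]

def left_outer_join(left, right, right_width, left_key=0, right_key=0):
    """Keep every left row; pad with None when no right row matches."""
    result = []
    for a in left:
        matches = [b for b in right if b[right_key] == a[left_key]]
        if matches:
            result.extend(a + b for b in matches)
        else:
            # right_width plays the role of the schema: it tells us how
            # many nulls to produce for the missing side.
            result.append(a + (None,) * right_width)
    return result

alias3 = left_outer_join(alias1, alias2, right_width=2)
print(alias3)
```

Without knowing how many fields the right-hand relation has, there would be no way to decide how many nulls to append to the unmatched row (7,2,5).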
User Defined Functions (UDFs)
Natively written in Java, packaged as a jar file
Other supported languages include Jython, JavaScript, Ruby, Groovy, and Python
Register the jar with the REGISTER statement
Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE can be used to work with UDFs and also
streaming commands
Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y `stream.pl` INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* Define UDFs to a more readable format */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
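What the FOREACH … GENERATE statement does with the aliased UDF can be sketched in plain Python. The rows below are hypothetical stand-ins for 'student_data', and `maxnum` is a stand-in for the MAXNUM alias, not the PiggyBank function itself.

```python
# Hypothetical rows standing in for 'student_data': (name, gpa1, gpa2).
students = [("John", 3.5, 3.9), ("Mary", 3.8, 3.6)]

def maxnum(x, y):
    """Stand-in for the MAXNUM alias: return the larger of two numbers."""
    return max(x, y)

# FOREACH ... GENERATE: build one output tuple per input tuple.
b = [(name, maxnum(gpa1, gpa2)) for name, gpa1, gpa2 in students]
print(b)
```

The alias plays the same role as the short function name here: the script calls MAXNUM instead of spelling out the fully qualified class name on every use.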
THANK YOU…

 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 

Apache PIG

  • 2. CONTENTS: PIG background; PIG Architecture; PIG Latin Basics; PIG Execution Modes; PIG Processing: loading and transforming data; PIG Built-in functions; Filtering, grouping, sorting data; Installation of PIG and PIG Latin commands
  • 3. Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
  • 4. The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.
  • 5. This Pig tutorial covers basic and advanced concepts of Pig and is designed for beginners and professionals. Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. It was developed by Yahoo. The language for Pig is Pig Latin. Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig is an Apache open source project. This means users are free to download it as source or binary, use it for themselves, contribute to it, and, under the terms of the Apache License, use it in their products and change it as they see fit.
  • 6. Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is Pig Latin. Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark. Pig can handle any type of data (structured, semi-structured or unstructured) and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
  • 7. FEATURES OF APACHE PIG: Join Datasets; Sort Datasets; Filter; Data Types; Group By; User Defined Functions
  • 8. ADVANTAGES OF APACHE PIG • Less code: Pig needs fewer lines of code to perform an operation. • Reusability: Pig code is flexible enough to be reused. • Nested data types: Pig provides useful nested data types such as tuple, bag, and map.
  • 9.
  • 10.
  • 11. HIVE VS PIG VS SQL – WHEN TO USE WHAT? When to Use Hive: Facebook uses Apache Hive widely for analytics and promotes the Hive language for its extensive feature list and its similarity to SQL. Apache Hive is a good fit in the following scenarios: • To query large datasets: Apache Hive is designed for analytics on huge datasets. It offers an easy way to run complex queries on, and inspect, datasets stored in the Hadoop ecosystem. • For extensibility: Apache Hive provides a range of user APIs that help in building custom behaviour for the query engine. • For someone familiar with SQL concepts: If you are familiar with SQL, Hive will be very easy to use because the two have many similarities. Hive uses clauses such as select, where, order by, and group by, similar to SQL. • To work on structured data: Hive is widely adopted for structured data. • To analyse historical data: Apache Hive is a great tool for analysing and querying historical data collected over a period of time.
  • 12. When to Use Pig: Apache Pig, developed by Yahoo Research in 2006, is known for its extensibility and its scope for optimization. The language uses a multi-query approach that reduces the time spent scanning data. It usually runs on the client side of a Hadoop cluster. It is also quite easy to use if you are familiar with the SQL ecosystem. You can use Apache Pig in the following scenarios: • To use as an ETL tool: Apache Pig is an excellent ETL (Extract-Transform-Load) tool for big data. It is a data flow system that uses Pig Latin, a simple language for data queries and manipulation. • As a programmer with scripting knowledge: Programmers with scripting knowledge can learn Apache Pig easily and use it efficiently. • For fast processing: Apache Pig can be faster than Hive because its multi-query approach reduces repeated scans of the data, and it is well known for its speed. • When you don't want to work with a schema: With Apache Pig, there is no need to create a schema before loading data. • For SQL-like functions: It has many functions related to SQL, along with the cogroup function.
  • 13. When to Use SQL: SQL is a general-purpose database management language used around the globe. It has been updated in line with user expectations for decades. It is declarative and hence focuses explicitly on 'what' is needed. It is popularly used for transactional as well as analytical queries. When the requirements are not too demanding, SQL works as an excellent tool. Here are a few scenarios: • For better performance: SQL is known for its ability to pull data quickly and frequently. It supports OLAP (Online Analytical Processing) applications and performs better for these applications. Hive is slow in the case of online transactional needs. • When the datasets are small: SQL works well with small datasets and performs much better for smaller amounts of data. It also offers many ways to optimise data access. • For frequent data manipulation: If your requirement needs frequent modification of records, or you need to update a large number of records frequently, SQL can perform these activities well. SQL also provides an entirely interactive experience to the user.
  • 14.
  • 15. APACHE PIG RUN MODES: Apache Pig executes in two modes: Local Mode and MapReduce Mode. Local Mode • It executes in a single JVM and is used for development, experimentation and prototyping. • Here, files are installed and run using localhost. • The local mode works on the local file system; the input and output data are stored there. The command for the local mode Grunt shell: $ pig -x local MapReduce Mode • The MapReduce mode is also known as Hadoop Mode. • It is the default mode. • In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster. • It can be executed against a pseudo-distributed or fully distributed Hadoop installation. • Here, the input and output data are present on HDFS. The command for MapReduce mode:
  • 16. WAYS TO EXECUTE PIG PROGRAM: A Pig program can be executed in the following ways in both local and MapReduce mode: • Interactive Mode: Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter Pig Latin statements and commands interactively at the command line. • Batch Mode: We run a script file with a .pig extension. These files contain Pig Latin commands. • Embedded Mode: We can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python.
  • 18. The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data. To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, which makes the programmer's job easier. The architecture of Apache Pig is shown below.
  • 19. APACHE PIG COMPONENTS: As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components. Parser: Initially, Pig scripts are handled by the parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges. Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
  • 20. Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs. Execution engine: The MapReduce jobs are submitted to Hadoop in sorted order. Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
  • 21. PIG LATIN DATA MODEL The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.
  • 22. An atomic value is one that is indivisible within the context of a database field definition (e.g. integer, real, code of some sort, etc.). Field values that are not atomic are of two undesirable types (Elmasri & Navathe 1989 p.139,41): composite and multivalued.
  • 23. Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field. Example − 'raja' or '30'. Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS. Example − (Raja, 30)
  • 24. Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example − {(Raja, 30), (Mohammad, 45)}. A bag can be a field in a relation; in that context, it is known as an inner bag. Example − {Raja, 30, {9848022338, raja@gmail.com}}
  • 25. Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by '[]'. Example − [name#Raja, age#30]. Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
  • 26. The Grunt shell is the native shell provided by Apache Pig. It is mainly used to write Pig Latin scripts, and Pig scripts and queries can be executed from it. We can also invoke shell commands from Grunt using sh and fs.
  • 27. JOB EXECUTION FLOW: The developer creates the scripts, which are stored in the local file system as functions. When the developer submits a Pig script, it is passed to the Pig Latin compiler. The compiler then splits the task and runs a series of MR jobs. Meanwhile, the Pig compiler retrieves data from HDFS. After the MR jobs have run, the output file goes back to HDFS.
  • 28. a. Pig Execution Modes: We can run Pig in two execution modes. These modes depend on where the Pig script is going to run and where the data resides. We can store data on a single machine or in a distributed environment such as a cluster. There are three different ways to run Pig programs. The first is the non-interactive shell, or script mode: the user creates a file, loads the code and executes the script. The second is the Grunt shell, or interactive shell, for running Apache Pig commands. The last is embedded mode, in which Pig programs are run from Java, much as JDBC is used to run SQL programs from Java.
  • 29. b. Pig Local mode: In this mode, Pig runs in a single JVM and accesses the local file system. This mode is better suited to small data sets. Parallel mapper execution is not possible in this mode, since older versions of Hadoop are not thread-safe. The user can pass -x local to enter Pig's local mode of execution. In this mode, Pig always looks for a local file system path when loading data.
  • 30. c. Pig MapReduce Mode: In this mode, the user needs a proper Hadoop cluster setup and installation. By default, Apache Pig runs in MR mode. Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster. Hence, in this mode the MapReduce jobs run on a distributed cluster. Statements like LOAD and STORE read data from the HDFS file system and show output. These statements are also used to
  • 31. d. Storing Results: Intermediate data is generated during the processing of MR jobs. Pig stores this data in a temporary location created inside HDFS. We can use DUMP to print the final results to the screen. The output results are stored using the STORE operator.
  • 32. Type (Description): int (4-byte integer); long (8-byte integer); float (4-byte, single-precision floating point); double (8-byte, double-precision floating point); bytearray (array of bytes; blob); chararray (string, e.g. "hello world"); boolean (True/False, case insensitive); datetime (a date and time); biginteger (Java BigInteger); bigdecimal (Java BigDecimal)
  • 33. Type (Description): Tuple (ordered set of fields; a "row / record"); Bag (collection of tuples; a "resultset / table"); Map (a set of key-value pairs; keys must be of type chararray)
  • 34. BinStorage: loads and stores data in machine-readable (binary) format; PigStorage: loads and stores data as structured, field-delimited text files; TextLoader: loads unstructured data in UTF-8 format; PigDump: stores data in UTF-8 format; YourOwnFormat!: via UDFs
  • 35. LOAD: Loads data from an HDFS file. var = LOAD 'employees.txt'; var = LOAD 'employees.txt' AS (id, name, salary); var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary); Each LOAD statement defines a new bag. Each bag can have multiple elements (atoms). Each element can be referenced by name or position ($n). A bag is immutable. A bag can be aliased and referenced later.
  • 36. STORE: Writes output to an HDFS file in a specified directory. grunt> STORE processed INTO 'processed_txt'; Fails if the directory exists. Writes output files, part-[m|r]-xxxxx, to the directory. PigStorage can be used to specify a field delimiter. DUMP: Writes output to the screen. grunt> DUMP processed;
  • 37. FOREACH: applies expressions to every record in a bag; FILTER: filters by expression; GROUP: collects records with the same key; ORDER BY: sorting; DISTINCT: removes duplicates
  • 38. Use the FILTER operator to restrict tuples or rows of data. Basic syntax: alias2 = FILTER alias1 BY expression; Example: DUMP alias1; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3) alias2 = FILTER alias1 BY (col1 == 8) OR (NOT (col2+col3 > col1)); DUMP alias2; (4,2,1) (8,3,4) (7,2,5) (8,4,3)
  • 39. Use the GROUP…ALL operator to group data. Use GROUP when only one relation is involved. Use COGROUP when multiple relations are involved. Basic syntax: alias2 = GROUP alias1 ALL; Example: DUMP alias1; (John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F) (Joe,18,3.8F) alias2 = GROUP alias1 BY col2; DUMP alias2; (18,{(John,18,4.0F),(Joe,18,3.8F)}) (19,{(Mary,19,3.8F)}) (20,{(Bill,20,3.9F)})
  • 40. Use the ORDER…BY operator to sort a relation based on one or more fields. Basic syntax: alias = ORDER alias BY field_alias [ASC|DESC]; Example: DUMP alias1; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3) alias2 = ORDER alias1 BY col3 DESC; DUMP alias2; (7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
  • 41. Use the DISTINCT operator to remove duplicate tuples in a relation. Basic syntax: alias2 = DISTINCT alias1; Example: DUMP alias1; (8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3) alias2 = DISTINCT alias1; DUMP alias2; (8,3,4) (1,2,3) (4,3,3)
  • 42. FLATTEN: used to un-nest tuples as well as bags; INNER JOIN: used to perform an inner join of two or more relations based on common field values; OUTER JOIN: used to perform left, right or full outer joins; SPLIT: used to partition the contents of a relation into two or more relations; SAMPLE: used to select a random data sample with the stated sample size
  • 43. Use the JOIN operator to perform an inner equijoin of two or more relations based on common field values. The JOIN operator always performs an inner join. Inner joins ignore null keys, so filter null keys before the join. The JOIN and COGROUP operators perform similar functions: JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
  • 44. DUMP Alias1; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3) DUMP Alias2; (2,4) (8,9) (1,3) (2,7) (2,9) (4,6) (4,9) Join Alias1 by Col1 to Alias2 by Col1: Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1; DUMP Alias3; (1,2,3,1,3) (4,2,1,4,6) (4,3,3,4,6) (4,2,1,4,9) (4,3,3,4,9) (8,3,4,8,9) (8,4,3,8,9)
  • 45. Use the OUTER JOIN operator to perform left, right, or full outer joins. Pig Latin syntax closely adheres to the SQL standard. The keyword OUTER is optional: the keywords LEFT, RIGHT and FULL imply left outer, right outer and full outer joins respectively. Outer joins only work if the relations that need to produce nulls (in the case of non-matching keys) have schemas. Outer joins only work for two-way joins; to perform a multi-way outer join, perform multiple two-way outer join statements.
  • 46. User Defined Functions (UDFs) are natively written in Java and packaged as a jar file. Other supported languages include Jython, JavaScript, Ruby, Groovy, and Python. Register the jar with the REGISTER statement. Optionally, alias it with the DEFINE statement. REGISTER /src/myfunc.jar; A = LOAD 'students'; B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
  • 47. DEFINE can be used to work with UDFs and also with streaming commands. It is useful when dealing with complex input/output formats. /* read and write comma-delimited data */ DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(',')); A = STREAM X THROUGH Y; /* Define UDFs to a more readable format */ DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX; A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double); B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2); DUMP B;
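
The FILTER, GROUP, ORDER, DISTINCT and JOIN results on slides 38 to 44 can be cross-checked outside Pig. The following is a minimal Python sketch, not Pig code. The helper names (pig_filter, pig_group, pig_order, pig_distinct, pig_join) are hypothetical and only mimic the operators' semantics on small in-memory lists of tuples, assuming columns are addressed by position (col1 is index 0). Real Pig relations are unordered, so only the contents of each result are meaningful, not the row order.

```python
from collections import defaultdict

# Sample relation used on the FILTER, ORDER and JOIN slides: (col1, col2, col3).
alias1 = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
# Second relation used on the JOIN slide.
alias2 = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def pig_filter(relation, predicate):
    """Mimic FILTER ... BY: keep the tuples for which the predicate holds."""
    return [t for t in relation if predicate(t)]


def pig_group(relation, key):
    """Mimic GROUP ... BY: map each key value to the bag (list) of its tuples."""
    groups = defaultdict(list)
    for t in relation:
        groups[key(t)].append(t)
    return dict(groups)


def pig_order(relation, key, desc=False):
    """Mimic ORDER ... BY. Python keeps ties in input order; Pig gives no such guarantee."""
    return sorted(relation, key=key, reverse=desc)


def pig_distinct(relation):
    """Mimic DISTINCT: drop duplicate tuples (first occurrence kept)."""
    return list(dict.fromkeys(relation))


def pig_join(left, right, left_key, right_key):
    """Mimic an inner equijoin: concatenate tuples with equal keys, ignoring null keys."""
    index = defaultdict(list)
    for r in right:
        if right_key(r) is not None:
            index[right_key(r)].append(r)
    return [
        l + r
        for l in left
        if left_key(l) is not None
        for r in index.get(left_key(l), [])
    ]


if __name__ == "__main__":
    # FILTER alias1 BY (col1 == 8) OR (NOT (col2+col3 > col1));
    print(pig_filter(alias1, lambda t: t[0] == 8 or not (t[1] + t[2] > t[0])))
    # ORDER alias1 BY col3 DESC;
    print(pig_order(alias1, key=lambda t: t[2], desc=True))
    # JOIN alias1 BY col1, alias2 BY col1;
    print(pig_join(alias1, alias2, lambda t: t[0], lambda t: t[0]))
```

Each helper returns a new list or dict and leaves its input untouched, which loosely mirrors the slide's point that a bag is immutable and each statement defines a new alias.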