Pig Latin is a data flow language and execution framework for parallel computation. It allows users to express data analysis programs intuitively as a series of steps, which Pig runs on Hadoop for scalable processing. Key features include a simple declarative language, support for nested data types, user-defined functions, and a debugging environment. The document provides an overview of Pig Latin concepts such as loading and transforming data, filtering, joining, and outputting results. It also compares Pig Latin to MapReduce and SQL, highlighting Pig's advantages for iterative data analysis tasks on large datasets.
1. CSC 5800: Intelligent Systems
CSC 8710: Data Management Systems: Big Data Algorithms and Tools
Pig Latin: A Not-So-Foreign Language for Data Processing
2. What we will be covering
Introduction
MapReduce Overview
Pig Overview
Pig Features
Pig Latin
Pig Debugger
Demo
3. Introduction
Enormous data
Innovation critically depends upon analyzing terabytes of data collected every day
SQL can resolve structured-data problems
Parallel database processing
– Data is too large to be analyzed serially.
– It has to be analyzed in parallel.
– Shared-nothing clusters are the way to go.
4. Parallel DB Products
Teradata, Oracle RAC, Netezza
Expensive at web scale
Programmers have to write complex SQL queries; because of this, declarative programming is not preferred
5. Procedural programming
The Map-Reduce programming model
It can easily perform a group-by aggregation in parallel over a cluster of machines
The programmer provides a map function, which is used to filter or transform records
The reduce function performs the aggregation
Appealing to the programmer because there are only 2 high-level declarative functions to enable parallel processing
6. MapReduce Overview
Programming Model
– Caters to large-scale data analytics
– Works over Hadoop
– Java based
– Splits data into independent chunks and processes them in parallel
Program structure
– Mapper
– Reducer
– Driver Program
7. MapReduce Driver Program
Works as the ‘main’ function for an MR job
Takes care of:
– Number of arguments
– Input Data Location
– Input Data Types
– Output Data Location
– Output Data Types
– Number of Mappers
– Number of Reducers
8. Mapper and Reducer Class
Mapper Class
– Main task is to perform the per-record processing logic
– Handles tasks like:
• Filtering
• Splitting
• Tokenizing
• Transforming
Reducer Class
– Works as an aggregator
– Aggregates the intermediate results gathered from the Mapper
9. Word Count Execution
Input (three lines): “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”
Map: each mapper emits a (word, 1) pair per token, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1)
Shuffle & Sort: pairs are grouped by word across all mappers
Reduce: each reducer sums the counts for the words assigned to it
Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
10. MapReduce Word Count Program
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in an input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

// Reducer: sums the counts gathered for each word
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
11. MapReduce Limitations
The one-input, two-stage data flow is extremely rigid.
– To perform tasks like joins or iterative computations, workarounds have to be devised.
– Custom code is needed even for common tasks like filtering, transforming, or projection.
– The code is difficult to reuse and maintain.
Moreover, its opaque data types, its rigid workflow, and the fact that people have to learn Java make it a tough choice.
12. Pig
An Apache open source project.
Provides an engine for executing data flows in parallel on Hadoop.
Includes a language called ‘Pig Latin’ for expressing these data flows.
A high-level data workflow language.
It has the best of both worlds:
– High-level declarative querying, like SQL
– Low-level procedural programming, like MapReduce
14. Why Choose Pig
Written like SQL, compiled into MapReduce
Fully nested data model
Extensive support for UDFs
Can answer multiple questions in a single workflow
A = LOAD './input.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE COUNT(B), group;
STORE D INTO './output';
15. Features and Motivation
The design goal of Pig is to provide programmers with an appealing experience for performing ad-hoc analysis of extremely large data sets.
– Data flow language
– Quick start and interoperability
– Nested data model
– UDFs
– Debugging environment
16. Data Flow Language
Each step specifies a single high-level data transformation
Different from SQL, where the whole computation is expressed as one declarative query with a single output
This step-by-step style gives the system opportunities to optimize the program
– Example:
A = LOAD 'input.txt';
B = FILTER A BY UDF(Column1);
C = FILTER B BY Column1 > 0.8;
17. Quick Start and Interoperability
Data load
– Capability for ad-hoc analysis
– Can run queries directly on data from a dump of a search engine
– Users just have to provide a function that tells Pig how to parse the content of a file into tuples
– Similarly for output
• Any output format
• These functions can be reused
• Output can be visualized or dumped to Excel directly
18. Pig as part of a workflow
Pig easily becomes part of a workflow ecosystem
– Can take most input types
– Can produce output in many forms
– Doesn’t take over the data, i.e., it does not lock the data being processed
– Read-only data analysis
19. Optional Data Schemas
A schema can be provided by the user:
– In the beginning
– On the fly
– Example:
• A = LOAD 'input.txt' AS (Column1, Column2);
• B = FILTER A BY Column1 > 5;
If the schema is not provided, the columns can be referred to as ‘$0’, ‘$1’, ‘$2’, … for the 1st, 2nd, 3rd column, etc.
Example:
A = LOAD 'input.txt';
B = FILTER A BY $0 > 5;
20. Nested Data Model
Suppose, for each document, we want to extract each term and its positions.
Format of output: Map<documentId, Set<positions>> per term
SQL data model (flat rows):
Term | Document ID | Position
Hi   | 1           | 2
Hi   | 1           | 5
Or keep it in normalized form, i.e.,
– term_info(termId, termString)
– position_info(termId, position, documentId)
21. Problem Resolved Using Pig
In Pig we have complex data types like map, tuple, and bag, which can occur as a field of a table itself.
Example:
Term | Document ID | Positions
Hi   | 1           | (2, 5, 8, …)
This approach is good because it is closer to how a programmer thinks.
Data is stored on disk in the same nested fashion.
It makes it easier for users to write UDFs.
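With the nested model, the positions can be loaded directly as a bag inside each tuple. A minimal sketch, where the input file name and field names are hypothetical choices for illustration:
term_info = LOAD 'term_positions'
    AS (term: chararray, docId: int, positions: bag{t: (pos: int)});
-- 'term_positions' and the field names are assumptions, not from the slides;
-- each tuple now carries all positions of a term in a document as one nested bag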
22. UDFs
A significant part of data analysis is custom processing
For example, a user might want to apply natural language stemming
Or check whether a page is spam, among many other tasks
To support this, Pig Latin has extensive support for UDFs; most such tasks can be solved with a UDF
A UDF can take non-atomic input and can produce non-atomic output as well
Currently, UDFs can be written in Java or Python
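A minimal sketch of plugging a UDF into a pipeline; the jar name, class path, and the IsSpam UDF are hypothetical:
REGISTER myudfs.jar;                       -- hypothetical jar containing the UDF
DEFINE IsSpam com.example.udf.IsSpam();    -- hypothetical class name
good_pages = FILTER pages BY NOT IsSpam(pageContent);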
23. Debugging Environment
In any language, getting a data processing program to work correctly usually takes many iterations
The first few iterations mostly produce errors
With large-scale data this results in serious waste of time and resources
Debuggers can help
Pig has a novel debugging environment
It generates concise example data from the input
The data samples are carefully chosen to resemble the real data as far as possible, and are specially carved to exercise each step of the program
24. Pig Latin
The language in which data workflow statements are written
It runs on a shell called ‘Grunt’
It has a shared UDF repository named Piggybank
We can create our own custom UDFs and add them to Piggybank
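For example, in the Grunt shell one can register Piggybank and call one of its UDFs; a sketch, where the exact jar location and class path vary by Pig version:
grunt> REGISTER piggybank.jar;
grunt> DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
grunt> B = FOREACH A GENERATE Reverse($0);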
25. Data Model
A rich, yet simple data model
Atoms
– Simple atomic values like a string or a number
Tuple
– A collection of fields, each of which can be of any data type
– Analogous to rows in SQL
Bag
– A collection of tuples, or of both tuples and atoms
– Can also be heterogeneous
26. Data Model (cont.)
Example of a relation:
T = (‘alice’, (‘labours’, 1), {(‘ipod’, 2), ‘james’})
– ‘alice’ is an atom
– (‘labours’, 1) is a tuple; tuples are written with round braces
– {(‘ipod’, 2), ‘james’} is a bag; bags are written with curly braces (this one is heterogeneous)
27. Specifying Input Data: LOAD
It is the first step in a Pig Latin program
Specifies what the input files are
And how their contents are to be deserialized, i.e., converted into Pig’s data model
LOAD command
– Example
queries = LOAD 'query_log.csv'
    USING PigStorage(',')
    AS (userId, queryString, timestamp);
28. LOAD (cont.)
Both the ‘USING’ clause and the ‘AS’ clause are optional
We can work without them, as shown earlier ($0 for the first field)
PigStorage is a pre-defined load function
A custom function can be used instead of PigStorage
29. Per-Tuple Processing: FOREACH
Similar to a FOR statement
It is used to apply processing to each tuple of the dataset
Example
– expanded_query = FOREACH queries GENERATE userId, Expand(queryString), timestamp;
It is not a filtering command
‘Expand’ can take atomic input and generate a bag of outputs (see the sketch below)
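When a UDF such as Expand returns a bag, FLATTEN can unnest it back into top-level tuples, one per bag element. A sketch reusing the slide’s hypothetical Expand UDF:
expanded_query = FOREACH queries GENERATE userId, FLATTEN(Expand(queryString)), timestamp;
-- one output tuple per element of the bag that Expand returns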
30. Per-Tuple Processing: FOREACH (cont.)
The semantics of FOREACH are such that there is no dependency between different input tuples, therefore permitting an efficient parallel implementation
31. Discarding Unwanted Data: FILTER
Acts like a WHERE clause
Any comparison expression can be provided
– query = FILTER queries BY userId neq 'bot';
We can also provide a UDF, like
– query = FILTER queries BY NOT IsBot(userId);
32. COGROUP
Similar to JOIN
Groups tuples from different inputs together by key, keeping each input in its own bag
Easy to use with UDFs
– grouped_data = COGROUP results BY queryString, revenue BY queryString;
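Each output tuple of COGROUP holds the group key plus one bag per input, which is convenient to hand to a UDF. A sketch following the paper’s revenue-attribution scenario; distributeRevenue is a hypothetical UDF:
grouped_data = COGROUP results BY queryString, revenue BY queryString;
-- each tuple: (queryString, {matching results tuples}, {matching revenue tuples})
url_revenues = FOREACH grouped_data
    GENERATE FLATTEN(distributeRevenue(results, revenue));  -- distributeRevenue is hypothetical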
33. JOIN
Not all users want to use COGROUP
Often a simple equi-join is all that is required
– Example
join_result = JOIN results BY queryString, revenue BY queryString;
Other types of join are also supported (see the sketch after this list):
– Left outer
– Right outer
– Full outer
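The outer variants use the same shape; a sketch of a left outer join, which keeps results tuples that have no matching revenue:
join_result = JOIN results BY queryString LEFT OUTER, revenue BY queryString;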
34. Other Commands
Relational operators
– UNION
– CROSS
– ORDER
– DISTINCT
– LIMIT
Eval functions
– CONCAT
– COUNT
– DIFF
Sketches of a few of these follow this list.
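A few sketches over the queries relation loaded earlier; more_queries is a hypothetical second relation with the same schema:
all_queries  = UNION queries, more_queries;   -- more_queries is hypothetical
user_ids     = FOREACH queries GENERATE userId;
unique_users = DISTINCT user_ids;
ordered      = ORDER queries BY timestamp DESC;
top10        = LIMIT ordered 10;
by_user      = GROUP queries BY userId;
counts       = FOREACH by_user GENERATE group, COUNT(queries);  -- COUNT is an eval function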
35. PARALLEL Clause
It is used to increase the parallelism of a job
We can specify the number of reduce tasks for the MR jobs created by Pig
It only affects the reduce tasks
There is no control over the map side
The system can also figure out the number of reducers itself
If not specified, only one reduce task is used
36. PARALLEL Clause (cont.)
Can be applied only to commands that involve a reduce phase
– COGROUP
– CROSS
– DISTINCT
– GROUP
– JOIN
– ORDER
A = LOAD 'File1';
B = LOAD 'File2';
C = CROSS A, B PARALLEL 10;
37. SPLIT Clause
We can split an input relation into several by providing conditions
A = LOAD 'data' AS (F1:int, F2:int, F3:int);
-- A contains: (1,2,3) and (2,5,7)
SPLIT A INTO B IF F1 < 2, C IF F2 == 5;
-- B: (1,2,3)
-- C: (2,5,7)
Any expression can be written
UDFs can be used
It is not partitioning: a tuple can go to more than one output, or to none
38. Output
There are two ways to output results
– STORE
• Used when you want to save the output to a location
STORE output_1 INTO 'hadoopuser/output';
– DUMP
• Used to display the result in the Grunt shell itself
• Dumping doesn’t store the output anywhere
DUMP query_result;
39. Building a Logical Plan
The Pig interpreter first parses each command the client issues
It verifies that the input files, bags, and columns referred to by the command are valid
It builds a logical plan for every bag the user defines
No processing is carried out yet
Processing is triggered only when the user invokes a STORE or DUMP command
This is called a lazy execution approach
It helps with optimizations such as FILTER reordering, as sketched below
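Lazy plan building is what makes such reordering possible; a sketch reusing the earlier two-filter example, where isGoodUDF is hypothetical:
A = LOAD 'input.txt' AS (Column1: double);
B = FILTER A BY isGoodUDF(Column1);  -- expensive hypothetical UDF filter, as written
C = FILTER B BY Column1 > 0.8;       -- cheap comparison
STORE C INTO 'out';
-- nothing runs until STORE, so Pig may push the cheap comparison
-- ahead of the UDF filter before executing the plan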
40. Debugging Environment
This is used to avoid running the complete program on the entire dataset
The user can create a sample dataset by hand
But it is difficult to tailor such datasets, and users end up with self-cooked data
Pig Pen is Pig’s debugging environment
It automatically creates a side dataset, called the sandbox dataset
Pig Pen has its own user interface
41. Pig Pen
Outputs can be easily analyzed
Errors can be rectified early
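In current Pig releases, this example-generation idea surfaces as the ILLUSTRATE command; a sketch using the LOAD example from earlier:
queries = LOAD 'query_log.csv' USING PigStorage(',') AS (userId, queryString, timestamp);
real_queries = FILTER queries BY userId != 'bot';
ILLUSTRATE real_queries;  -- prints a small, representative sample flowing through each step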
42. Future Work
User interface
– A drag-and-drop style would help
– Easier creation of logical plan diagrams
UDF support for more languages
Unified environment
– Currently lacks control structures like loops
– Pig has to be embedded in a host language for iterative tasks
43. Summary
A not-so-foreign language
Aims at a sweet spot between SQL and MapReduce
Reusable and easy to use
Novel debugging environment: Pig Pen
Pig has an active and growing user base at Yahoo!
Pigs
– Eat anything
– Live anywhere
– Are domestic