Pig is an engine for executing data flows in parallel on Hadoop. It uses a language called Pig Latin to analyze large datasets. Pig provides relational operators like FOREACH, GROUP, and FILTER to process data in parallel. A hands-on example demonstrates loading dividend data, grouping it by stock symbol, calculating the average dividend for each symbol, and storing the results.
2. Pig
Pig - Introduction
An engine for executing data flows in parallel on
Hadoop
● Language - Pig Latin
● Pig Engine
● Can be used with or without Hadoop
3. Pig
Pig - Why do we need it?
● Programmers struggle a lot in writing
MapReduce tasks
● Pig scripts are easier to write and maintain
● Pig provides many relational operators which
are hard to write in MapReduce
4. Pig
Pig - Use Cases
1. Analyzing Data
2. Iterative processing
3. Batch Processing
4. ETL - Extract, Transform and Load
5. Pig
Pig - Philosophy
A. Pigs eat anything
• Data: Relational, nested, or unstructured.
B. Pigs live anywhere
• Parallel processing Language. Not only Hadoop
C. Pigs are domestic animals
• Controllable, provides custom functions in java/python
D. Pigs fly
• Designed for performance
6. Pig
Pig Latin - A Data Flow Language
Allows describing:
1. How data flows
2. Should be read
3. Processed
4. Stored to multiple outputs in parallel
Complex Workflows
1. Multiple inputs are joined
7. Pig
1. MapReduce mode
• Access HDFS and Hadoop cluster
• Used in production
2. Local mode
• Access local files and local machine
• Used for testing locally
• Fastens the development
Pig - Modes
8. Pig
• Login to CloudxLab Linux console.
• Type pig
• Invoke commands in grunt shell to access files in HDFS
• Control hadoop from grunt shell
Pig - MapReduce Mode
9. Pig
• Login to CloudxLab Linux console.
• Type pig -x local
• Invoke commands in grunt shell to access files in local file system
Pig - Local Mode
10. Pig
Pig - Data Types
1. int - Signed 32-bit integer - Example - 8
2. long - Signed 64-bit integer - Example - 5L
3. float - 32-bit floating point - Example - 5.5F
4. double - 64-bit floating point - Example - 10.5
5. chararray - character array - Example - ‘CloudxLab’
6. bytearray - blob - Example - Any binary data
7. datetime - Example - 1970-01-01T00:00:00.000+00:00
11. Pig
Pig - Complex Data Types
http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+
and+More
13. Pig
Pig - Store / Dump
STORE
• Stores the data to HDFS and other storages
DUMP
• Prints the value on the screen - print()
• Useful for debugging
14. Pig
Pig - Lazy Evaluation
• Each processing step results in new relation
○ except for dump and store
• No value or operation is evaluated until the value or the
transformed data is required
• This reduces the repeated calculation
15. Pig
Pig - Relational Operators - FOREACH
Takes expressions & applies to every record
1. divs = LOAD '/data/NYSE_dividends' AS (name:chararray,
stock_symbol:chararray, date:chararray, dividends:float);
2. values = FOREACH divs GENERATE stock_symbol, dividends;
3. STORE values INTO 'values_1';
4. cat values_1
16. Pig
Pig - FOREACH - Question
Will below code generate any reducer code?
gain = FOREACH divs GENERATE ticker, val;
DUMP gain;
17. Pig
Pig - Relational Operators - GROUP
Groups the data in relations based on the keys
(John, 18, 4.0F)
(Mary, 19, 3.8F)
(Bill, 20, 3.9F)
(Joe, 18, 3.8F)
DESCRIBE A;
A: {
name: chararray,
age: int,
gpa: float
}
A
(18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(19, {(Mary, 19, 3.8F)})
(20, {(Bill, 20, 3.9F)})
B = GROUP A BY age;
DESCRIBE B;
B: {
group: int,
A: {
name: chararray,
age: int,
gpa: float
}
}