Pig

Pig: High-Level Data Flow Language
COPYRIGHT (C) CHIRAG AHUJA

Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Implementation
Comparison with Map-Reduce
Optimization in Pig

The Map-Reduce Appeal
Scale: scalable due to simpler design
• Only parallelizable operations
• No transactions
Cost: runs on cheap commodity hardware
Procedural control rather than SQL: a processing “pipe”

Map-Reduce
[Figure: input records → map → key-value pairs (k1 v1), (k2 v2), (k1 v3), (k2 v4), (k1 v5) → grouped by key (k1: v1, v3, v5; k2: v2, v4) → reduce → output records]
Just a group-by-aggregate?
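To make the group-by-aggregate reading concrete, here is a minimal single-process sketch in Python of the three phases in the figure. It is not Hadoop code: the helper names (`map_fn`, `reduce_fn`, `map_reduce`) and the choice of counting as the aggregate are assumptions made for illustration only.

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit (key, value) pairs for one input record (hypothetical helper)."""
    key, value = record
    yield key, value

def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key (here: count them)."""
    return key, len(values)

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: every record is processed independently of the others.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# The same five pairs that appear in the figure.
records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]
result = map_reduce(records, map_fn, reduce_fn)
print(result)  # [('k1', 3), ('k2', 2)]
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows the shape of the computation.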

Java Example
[Figure: a Java map-reduce program with three parts labeled: map, reduce, and job configuration]

Disadvantages
1. Extremely rigid data flow: map (M), then reduce (R)
Other flows are constantly hacked in:
• Join, union
• Split
• Chains (e.g. M, M, R, M)
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
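As an illustration of point 2, the following Python sketch shows roughly what coding a join by hand in the map-reduce style involves: each map function tags its records with their source, and the reduce function separates the two sides again before pairing them up. All names and sample data here are hypothetical, and the sketch runs in one process rather than on a cluster.

```python
from collections import defaultdict

def map_visit_counts(record):
    # Tag each value with its source so the reducer can tell the sides apart.
    url, count = record
    yield url, ("counts", count)

def map_url_info(record):
    url, category = record
    yield url, ("info", category)

def reduce_join(url, tagged_values):
    # Split the values for this key back into the two input sides.
    counts = [value for tag, value in tagged_values if tag == "counts"]
    categories = [value for tag, value in tagged_values if tag == "info"]
    # Inner join: pair every left value with every right value for this key.
    return [(url, count, category) for count in counts for category in categories]

def run_join(visit_counts, url_info):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in visit_counts:
        for key, value in map_visit_counts(record):
            groups[key].append(value)
    for record in url_info:
        for key, value in map_url_info(record):
            groups[key].append(value)
    joined = []
    for url in sorted(groups):
        joined.extend(reduce_join(url, groups[url]))
    return joined

# Hypothetical sample data; "c.com" has no category and drops out of the join.
visit_counts = [("a.com", 3), ("b.com", 1), ("c.com", 7)]
url_info = [("a.com", "news"), ("b.com", "sports")]
joined = run_join(visit_counts, url_info)
print(joined)  # [('a.com', 3, 'news'), ('b.com', 1, 'sports')]
```

Note that the join logic is buried inside `reduce_join`, which is the kind of hidden semantics point 3 refers to: a system looking only at the map and reduce functions cannot tell that a join is being performed.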

Pros And Cons
Need a high-level, general data flow language

Enter Pig Latin

What is Pig
• A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
• Compiles down to Map-Reduce jobs
• Developed at Yahoo!
• Open source

Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls

In Pig Latin
-- Count visits per url
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate group as url, COUNT(visits) as numVisits;
-- Attach each url's category, then keep the top 10 urls per category
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
-- TOP is Pig's built-in; ranking by numVisits (column 1) is an assumption
topUrls = foreach gCategories generate TOP(10, 1, visitCounts);
store topUrls into '/data/topUrls';
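For readers who want to check what each step of the script computes, here is a rough in-memory equivalent in Python. The sample data, the function name `top_urls_by_category`, and the choice to rank urls by visit count are all assumptions made for this sketch; it illustrates the data flow and is not how Pig executes it.

```python
from collections import Counter, defaultdict

# Hypothetical sample data with the same fields as the script.
visits = [  # (user, url, time)
    ("amy", "a.com", 1), ("bob", "a.com", 2), ("amy", "b.com", 3),
    ("cat", "c.com", 4), ("bob", "c.com", 5), ("amy", "c.com", 6),
]
url_info = [  # (url, category, pRank)
    ("a.com", "news", 0.9), ("b.com", "news", 0.4), ("c.com", "sports", 0.7),
]

def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visit counts with url info on url (inner join)
    category_of = {url: category for url, category, _p_rank in url_info}
    by_category = defaultdict(list)  # group the joined rows by category
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))
    # foreach category generate the top n urls, ranked by visit count
    # (ties broken by url so the result is deterministic)
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

top_urls = top_urls_by_category(visits, url_info)
print(top_urls)  # {'news': [('a.com', 2), ('b.com', 1)], 'sports': [('c.com', 3)]}
```

Each comment in the function corresponds to one statement of the Pig Latin script, which is the point of the comparison: the script reads as a sequence of named data flow steps rather than as map and reduce functions.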

Pig Compilation

Implementation
[Figure: user → SQL or Pig → automatic rewrite + optimize → Hadoop Map-Reduce → cluster]

Java vs. Pig
[Figure: two bar charts comparing Hadoop (Java) and Pig, one for lines of code and one for development time in minutes]
• 1/20 the lines of code
• 1/16 the development time
• Performance is comparable (Java is slightly better)

Summary
Big demand for parallel data processing
◦ Emerging tools that do not look like SQL DBMSs
◦ Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But Map-Reduce is too low-level and rigid
Pig Latin: the sweet spot between Map-Reduce and SQL
