2. Who Am I?
• Pig committer and PMC Member
• HCatalog committer and mentor
• Member of ASF and Incubator PMC
• Co-founder of Hortonworks
• Author of Programming Pig from O’Reilly
Photo credit: Steven Guarnaccia, The Three Little Pigs
4. Example
For all of your registered users, you want to count how many came to your
site this month. You want this count both by geography (zip code) and by
demographic group (age and gender).
Data flow (diagram):
Load Users, Load Logs -> Semi-join
Semi-join -> Count by zip -> Store results
Semi-join -> Count by age, gender -> Store results
5. In Pig Latin
-- Load web server logs
logs = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801'
and date < '20110901';
-- Load users
users = load 'users' using HCatLoader();
-- Remove any users that did not visit this month
grpd = cogroup thismonth by userid, users by userid;
fltrd = filter grpd by not IsEmpty(thismonth);
visited = foreach fltrd generate flatten(users);
-- Count by zip code
grpbyzip = group visited by zip;
cntzip = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');
-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo = foreach grpbydemo
generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
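For readers who do not know Pig Latin, the same data flow can be sketched in plain Python over in-memory lists. This is only an illustration of the semi-join and the two counts, not how Pig executes the script; the sample rows are made up, and the field names are assumed to match the script above.

```python
from collections import Counter

# Hypothetical sample data; field names mirror the Pig Latin script.
users = [
    {"userid": 1, "zip": "94040", "age": 30, "gender": "f"},
    {"userid": 2, "zip": "94040", "age": 41, "gender": "m"},
    {"userid": 3, "zip": "10001", "age": 30, "gender": "f"},
]
logs = [
    {"userid": 1, "date": "20110803"},
    {"userid": 1, "date": "20110815"},
    {"userid": 3, "date": "20110820"},
    {"userid": 2, "date": "20110705"},  # outside the month, filtered out
]

# Keep only this month's log rows (yyyymmdd strings compare correctly).
thismonth = [row for row in logs if "20110801" <= row["date"] < "20110901"]

# Semi-join: keep a user once if they have at least one log row this month.
active_ids = {row["userid"] for row in thismonth}
visited = [user for user in users if user["userid"] in active_ids]

# Count by zip code and by (age, gender).
cnt_zip = Counter(user["zip"] for user in visited)
cnt_demo = Counter((user["age"], user["gender"]) for user in visited)
```

User 1 has two log rows but is counted once, which is the point of using a semi-join rather than a regular join.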
6. Pig’s Place in the Data World
Stages (diagram): Data Collection, Data Factory, Data Warehouse
Data Factory (Pig): Pipelines, Iterative Processing, Research
Data Warehouse (Hive): BI Tools, Analysis
7. Why not MapReduce?
• Pig provides a number of standard data operators
– Five different implementations of join (hash, fragment-replicate,
merge, sparse merge, skewed)
– Order by provides total ordering across reducers in a balanced
way
• Pig provides optimizations that are hard to do by hand
– Multi-query: Pig will combine certain types of operations
together in a single pipeline to reduce the number of times data
is scanned
• User Defined Functions provide a way to inject your code
into the data transformation
– can be written in Java or Python
– can do column transformation (TOUPPER) and aggregation
(SUM)
– can be written to take advantage of the combiner
• Control flow can be done via Python or Java
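To show what "written to take advantage of the combiner" means, here is a minimal plain-Python sketch of an average computed from partial results. It is not Pig's UDF API; the function names initial, intermediate and final_result are hypothetical and only mirror the idea of splitting an aggregate into stages that can be merged in any order.

```python
from functools import reduce

def initial(value):
    # Turn one input value into a partial result: (sum, count).
    return (value, 1)

def intermediate(a, b):
    # Merge two partial results; safe to apply repeatedly, in any order.
    return (a[0] + b[0], a[1] + b[1])

def final_result(partial):
    # Produce the final answer from the fully merged partial result.
    total, count = partial
    return total / count

# Two hypothetical map-side chunks holding values for the same group key.
chunks = [[1, 2, 3], [4, 5]]

# Combine within each chunk first (the combiner's job), then across chunks.
partials = [reduce(intermediate, map(initial, chunk)) for chunk in chunks]
average = final_result(reduce(intermediate, partials))
```

Only the small (sum, count) pairs need to cross the network, instead of every input value.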
8. Embedding Example: Compute Pagerank
PageRank:
A system of linear equations (as many as there
are pages on the web, yeah, a lot):
It can be approximated iteratively: compute the
new page rank based on the page ranks of
the previous iteration. Start with some value.
Ref: http://en.wikipedia.org/wiki/PageRank
Slide courtesy of Julien Le Dem
9. Or more visually
Each page sends a fraction of its
PageRank to each page it links to.
The fraction is inversely proportional
to its number of outgoing links.
Slide courtesy of Julien Le Dem
11. Let’s zoom in
Pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Callouts on the embedded script:
• Iterate 10 times
• Pass parameters as a dictionary
• Just run P, that was declared above
• The output becomes the new input
Slide courtesy of Julien Le Dem
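The iteration described by the callouts can be sketched in plain Python, without Pig. This is an illustration of the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) only, not the embedded script from the slide. The three-page graph, the starting value of 1.0 and the damping factor d = 0.85 are assumptions made for the example.

```python
def pagerank_step(ranks, links, d=0.85):
    """One iteration: every page gets (1-d) plus d times the shares sent to it."""
    new_ranks = {page: 1.0 - d for page in ranks}
    for page, targets in links.items():
        if not targets:
            continue  # a page with no outgoing links sends nothing
        # C(page) is the number of outgoing links, so each target gets an equal share.
        share = d * ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Hypothetical graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# Start with some value for every page.
ranks = {page: 1.0 for page in links}

# Iterate 10 times; the output of one pass becomes the input of the next.
for _ in range(10):
    ranks = pagerank_step(ranks, links)
```

In the embedded version each pass is a Pig job, and the Python loop only supplies the input and output locations for that pass.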
12. Recently Added Features
• New in 0.9 (released July 2011):
– Embedding in Python
– Macros and Imports
• New in 0.10 (expected to release in Dec 2011):
– Boolean data type
– Hash based aggregation for aggregates with
low cardinality keys
– UDFs to build and apply bloom filters
– UDFs in JRuby (may slip to next release)
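As background for the bloom filter item, here is a minimal plain-Python sketch of building a filter from a small data set and applying it to a larger one. It does not use Pig's UDFs; the BloomFilter class, its parameters and the sample rows are all hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

# Build the filter from the small side of a join (hypothetical user ids).
active_users = BloomFilter()
for userid in [1, 3]:
    active_users.add(userid)

# Apply it to the large side to drop rows that cannot match.
logs = [(1, "a"), (2, "b"), (3, "c")]
candidates = [row for row in logs if active_users.might_contain(row[0])]
```

The filter never drops a matching row, but it can let a few non-matching rows through, so the join that follows is still needed.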
13. Learn More
• Read the online documentation:
http://pig.apache.org/
• Programming Pig from O’Reilly Press
• Join the mailing lists:
– user@pig.apache.org for user questions
– dev@pig.apache.org for developer issues
• Follow me on Twitter, @alanfgates
SQL is a query language
• Declarative, what not how
• Oriented around answering a question
• Requires uniform schema
• Requires metadata
• Known by everyone
• A great choice for answering queries, building reports, use with automated tools
Pig Latin is a data flow language
• Script defines a data flow
• Intended for pipelines where there may be tens or hundreds of steps
• Built for raw world of Hadoop where schemas are optional, data may not be clean, etc.
• Can operate with or without metadata
• A great choice for ETL pipelines, data models, iterative processing, and research on raw data