Pig
      Dataflow Scripting for Hadoop


      Alan F. Gates
      @alanfgates




© Hortonworks, Inc 2011
Who Am I?

•   Pig committer and PMC Member
•   HCatalog committer and mentor
•   Member of ASF and Incubator PMC
•   Co-founder of Hortonworks
•   Author of Programming Pig from O’Reilly




           Photo credit: Steven Guarnaccia, The Three Little Pigs
Who Are You?




Example


For all of your registered users, you want to count how many came to your
site this month. You want this count both by geography (zip code) and by
demographic group (age and gender).

Dataflow:
  Load Users, Load Logs -> Semi-join
  Semi-join -> Count by zip -> Store results
  Semi-join -> Count by age, gender -> Store results
In Pig Latin
-- Load web server logs
logs      = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801'
            and date < '20110901';

-- Load users
users     = load 'users' using HCatLoader();

-- Remove any users that did not visit this month.
-- cogroup names each bag after its input relation, so the
-- log bag is 'thismonth', not 'logs'.
grpd      = cogroup thismonth by userid, users by userid;
fltrd     = filter grpd by not IsEmpty(thismonth);
visited   = foreach fltrd generate flatten(users);

-- Count by zip code
grpbyzip  = group visited by zip;
cntzip    = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo   = foreach grpbydemo
             generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
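
For readers who want to check the logic without a cluster, the same dataflow can be sketched in plain Python on a few in-memory rows. This is only an illustration of what the script computes, not how Pig executes it; the sample rows and the helper name visited_users are made up for the example.

```python
from collections import Counter

# Hypothetical sample rows standing in for the 'users' and 'server_logs' tables.
users = [
    {"userid": 1, "zip": "27601", "age": 34, "gender": "f"},
    {"userid": 2, "zip": "27601", "age": 41, "gender": "m"},
    {"userid": 3, "zip": "94107", "age": 34, "gender": "f"},
]
logs = [
    {"userid": 1, "date": "20110803"},
    {"userid": 1, "date": "20110815"},
    {"userid": 3, "date": "20110820"},
    {"userid": 2, "date": "20110715"},  # outside August, so filtered out
]


def visited_users(users, logs, start, end):
    """Semi-join: keep each user once if any log row falls in [start, end)."""
    active = {row["userid"] for row in logs if start <= row["date"] < end}
    return [u for u in users if u["userid"] in active]


visited = visited_users(users, logs, "20110801", "20110901")

# Count by zip code, then by (age, gender).
by_zip = Counter(u["zip"] for u in visited)
by_demo = Counter((u["age"], u["gender"]) for u in visited)
```

User 1 appears twice in the logs but is counted once, which is the point of the semi-join: the logs decide who is kept, but contribute no rows of their own.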
Pig’s Place in the Data World




 Data Collection   Data Factory           Data Warehouse
                   Pig                    Hive

                   Pipelines              BI Tools
                   Iterative Processing   Analysis
                   Research



Why not MapReduce?

• Pig provides a number of standard data operators
   – Five different implementations of join (hash, fragment-replicate,
     merge, merge-sparse, skewed)
   – Order by provides total ordering across reducers in a balanced
     way
• Provides optimizations that are hard to do by hand
   – Multi-query: Pig will combine certain types of operations
     together in a single pipeline to reduce the number of times data
     is scanned
• User Defined Functions provide a way to inject your code
  into the data transformation
   – can be written in Java or Python
   – can do column transformation (TOUPPER) and aggregation
     (SUM)
   – can be written to take advantage of the combiner
• Control flow can be done via Python or Java
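
To make the combiner point concrete, here is a small Python sketch of an aggregate split into three stages so that partial results can be merged before the reduce. Pig's Java UDF API expresses this idea through its Algebraic interface; the function names below (initial, intermediate, final, algebraic_count) are illustrative only and are not Pig API calls.

```python
def initial(record):
    """Map side: turn one input record into a partial count."""
    return 1


def intermediate(partials):
    """Combiner: merge partial counts. It may run zero or more times."""
    return sum(partials)


def final(partials):
    """Reduce side: merge the remaining partial counts into the result."""
    return sum(partials)


def algebraic_count(splits):
    """Simulate map -> combine -> reduce over a list of input splits."""
    combined = [intermediate([initial(r) for r in split]) for split in splits]
    return final(combined)


splits = [["a", "b", "c"], ["d"], []]
total = algebraic_count(splits)
```

Because each stage only ever sees partial counts, the amount of data shuffled to the reducer is one number per split instead of one per record.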

Embedding Example: Compute PageRank


PageRank:
A system of linear equations (as many as there
  are pages on the web, yeah, a lot):


It can be approximated iteratively: compute the
   new page rank based on the page ranks of
   the previous iteration. Start with some value.

Ref: http://en.wikipedia.org/wiki/PageRank


                                   Slide courtesy of Julien Le Dem
Or more visually




Each page sends a fraction of its
 PageRank to the pages it links to.
 The fraction is inversely proportional
 to its number of outgoing links.
              Slide courtesy of Julien Le Dem
Let’s zoom in



pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Annotations on the embedded script:
• Iterate 10 times
• Pass parameters as a dictionary
• Just run P, that was declared above
• The output becomes the new input
      Slide courtesy of Julien Le Dem
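
The code on the original slide is not reproduced in this transcript. As a stand-in, the loop structure the annotations describe can be sketched in plain Python, with an in-memory function playing the role of the compiled Pig script P. Everything here is an assumption for illustration: the names pagerank_step and pagerank, the damping factor d=0.85, the starting rank of 1.0, and the requirement that every link target also appears as a key in links.

```python
def pagerank_step(links, ranks, d):
    """One pass of PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    new_ranks = {page: 1.0 - d for page in links}
    for page, outgoing in links.items():
        if not outgoing:
            continue  # a page with no links sends nothing
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            new_ranks[target] += d * share
    return new_ranks


def pagerank(links, d=0.85, iterations=10):
    """Iterate a fixed number of times, feeding each output back in as input."""
    ranks = {page: 1.0 for page in links}  # start with some value
    for _ in range(iterations):
        params = {"d": d, "ranks": ranks}  # parameters passed as a dictionary
        ranks = pagerank_step(links, params["ranks"], params["d"])
    return ranks


# Hypothetical three-page web: A links to B and C, B to C, C to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In the embedded version each call to the step would be a Pig job reading the previous iteration's output; here a dictionary is simply handed from one iteration to the next.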
Recently Added Features

• New in 0.9 (released July 2011):
  – Embedding in Python
  – Macros and Imports
• New in 0.10 (should release in Dec 2011)
  – Boolean data type
  – Hash based aggregation for aggregates with
    low cardinality keys
  – UDFs to build and apply bloom filters
  – UDFs in JRuby (may slip to next release)
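
As background for the bloom filter item, here is a minimal Python sketch of the data structure itself: build the filter from the keys of a small input, then use it to discard rows of a large input that cannot match before doing the real join. This is a generic illustration, not Pig's implementation; the class name, the default sizes, and the SHA-256 based hashing are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Build from the small side, apply to the large side.
small_keys = ["u1", "u7", "u42"]
bloom = BloomFilter()
for key in small_keys:
    bloom.add(key)

large_rows = [("u1", "click"), ("u2", "view"), ("u42", "click")]
candidates = [row for row in large_rows if bloom.might_contain(row[0])]
```

Rows that pass the filter still need the real join, since a bloom filter can report false positives; it never drops a row whose key was added.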

Learn More

• Read the online documentation:
  http://pig.apache.org/
• Programming Pig from O’Reilly
  Press
• Join the mailing lists:
  – user@pig.apache.org for user
    questions
  – dev@pig.apache.org for developer
    issues
• Follow me on
  Twitter, @alanfgates
Questions





TriHUG November Pig Talk by Alan Gates


Editor's notes

  1. Say a little about Hortonworks
  2. SQL is a query language
     • Declarative, what not how
     • Oriented around answering a question
     • Requires uniform schema
     • Requires metadata
     • Known by everyone
     • A great choice for answering queries, building reports, use with automated tools
     Pig Latin is a data flow language
     • Script defines a data flow
     • Intended for pipelines where there may be tens or hundreds of steps
     • Built for raw world of Hadoop where schemas are optional, data may not be clean, etc.
     • Can operate with or without metadata
     • A great choice for ETL pipelines, data models, iterative processing, and research on raw data