SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
Developing Pig on TezDeveloping Pig on Tez
Mark WagnerMark Wagner
Committer, Apache PigCommitter, Apache Pig
LinkedInLinkedIn
Cheolsoo ParkCheolsoo Park
VP, Apache PigVP, Apache Pig
NetflixNetflix
What is Pig
●
Apache project since 2008
●
Higher level language for Hadoop that provides a dataflow language
with a MapReduce based execution engine
A = LOAD 'input.txt';
B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
Pig Concepts
●
LOAD
●
STORE
●
FOREACH ___ GENERATE ___
●
FILTER ___ BY ___
Pig Concepts
GROUP ___ BY ___
●
'Blocking' operator
●
Translates to a MapReduce shuffle
Pig Concepts
Joins:
●
Hash Join
●
Replicated Join
●
Skewed Join
Pig Latin
A = LOAD 'input.txt';
B = FOREACH A GENERATE
flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
Logical Plan
Physical Plan
Map Reduce Plan Map
Reduce
What's the problem
●
Extra intermediate output
●
Artificial synchronization barriers
●
Inefficient use of resources
●
Multiquery Optimizer
●
Alleviates some problems
●
Has its own
Apache Tez
●
Incubating project
●
Express data processing as a directed acyclic graph
●
Runs on YARN
●
Aims for lower latency and higher throughput than Map Reduce
Tez Concepts
●
Job expressed as directed acyclic graph (DAG)
●
Processing done at vertices
●
Data flows along edges
Mapper
Reducer
Processor Processor
Processor
Processor
Benefits & Optimizations
●
Fewer synchronization barriers
●
Container Reuse
●
Object caches at the vertices
●
Dynamic parallelism estimation
●
Custom data transfer between processors
What we've done for Pig
●
New execution engine based on Tez
●
Physical Plan translated to Tez Plan instead of Map Reduce Plan
●
Same Physical Plan and operators
●
Custom processors run the execution plan on Tez
Along the way
●
New pluggable execution backend
●
Made operator set more generic
●
Motivated Tez improvements
Group By
LOAD
GROUP BY, SUM
Identity
GROUP BY
HDFS
LOAD
GROUP BY, STORE
GROUP BY, SUM
f = LOAD ‘foo’
AS (x:int, y:int);
g = GROUP f BY x;
h = FOREACH g GENERATE
group AS r,
SUM(f.y) as s;
i = GROUP h BY s;
Join
LOAD l, r
JOIN, STORE
LOAD r
JOIN, STORE
LOAD l
l = LOAD ‘left’ AS (x, y);
r = LOAD ‘right’ AS (x, z);
j = JOIN l BY x, r BY x;
Group By
LOAD
GROUP f BY x,
GROUP f BY y
LOAD g, h
JOIN
HDFS
LOAD
JOIN
GROUP BYGROUP BY
f = LOAD ‘foo’
AS (x:int, y:int);
g = GROUP f BY x;
h = GROUP f BY y;
i = JOIN g BY group,
h BY group;
Order By
SAMPLE
AGGREGATE
PARTITION
SORT
HDFS
LOAD, SAMPLE
SORT
PARTITION
AGGREGATE
f = LOAD ‘foo’ AS (x, y);
o = ORDER f BY x;
Performance Comparison
Replicated Join
(2.8x)
Join + Group By
(1.5x)
Join + Group By +
Order By (1.5x)
3 way Split + Join
+ Group By (2.6x)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
Map Reduce
Tez
How it started
Shared interests across organizations
• Similar data platform architecture.
• Pig for ETL jobs
• Hive for ad-hoc queries
How it started
Shared interests across organizations
• Hortonworks wants Tez to succeed.
Community meet-ups helped
• Twitter presented summer intern’s POC work at Tez meet-up.
• Pig devs exchanged interests.
Organizing team
Organizing team
Community meet-ups helped
• Tez team hosted tutorial sessions for Pig devs.
• Pig team got together to brainstorm implementation design.
Companies showed commitment to the project
• Hortonworks: Daniel Dai
• LinkedIn: Alex Bain, Mark Wagner
• Netflix: Cheolsoo Park
• Yahoo: Olga Natkovich, Rohini Palaniswamy
Building trust
Make Pig 2x faster within 6 months
• Hive-on-Tez showed 2x performance gain.
• Rewriting the Pig backend within 6 months seemed reasonable.
Setting goals
Acting as team
Sprint
• Monthly planning meetings
• Twice-a-week stand-up conference calls
Issues / discussions
• PIG-3446 umbrella jira for Pig on Tez
• Whiteboard discussions at meetings
• Pig old timer Daniel Dai acted as mentor.
• Everyone got to work on core functionalities.
• Everyone became an expert on the Pig backend.
Knowledge transfer
Sharing credit
• Elected as a new committer and PMC chair.
• Gave talks at Hadoop User Group and Pig User Group meet-ups.
• Speaking at ApacheCon and upcoming Hadoop Summit.
Further collaborations
Looking for more collaborations
• Parquet Hive SerDe improvements.
• Sharing experiences with SQL-on-Hadoop solutions.
Mind shift
“If we can’t hire all these good people, why don’t we use them in a
collaboration?”
• Collaboration instead of competition.
Mind shift
“Why do we reinvent the wheel?”
• Share the same technologies while creating different services.
Believe in the Apache way

Contenu connexe

Similaire à Developing Pig on Tez (ApacheCon 2014)

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gatestrihug
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)ch koti
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjugDavid Morin
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data CertificationAdam Doyle
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to HadoopHortonworks
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
When JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft productsWhen JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft productsAnthony Viard
 

Similaire à Developing Pig on Tez (ApacheCon 2014) (20)

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gates
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
When JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft productsWhen JHipster meets Microsoft-JHipster and Microsoft products
When JHipster meets Microsoft-JHipster and Microsoft products
 

Dernier

Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projectssmsksolar
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 

Dernier (20)

Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

Developing Pig on Tez (ApacheCon 2014)

  • 1. Developing Pig on TezDeveloping Pig on Tez Mark WagnerMark Wagner Committer, Apache PigCommitter, Apache Pig LinkedInLinkedIn Cheolsoo ParkCheolsoo Park VP, Apache PigVP, Apache Pig NetflixNetflix
  • 2. What is Pig ● Apache project since 2008 ● Higher level language for Hadoop that provides a dataflow language with a MapReduce based execution engine A = LOAD 'input.txt'; B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO './output.txt';
  • 3. Pig Concepts ● LOAD ● STORE ● FOREACH ___ GENERATE ___ ● FILTER ___ BY ___
  • 4. Pig Concepts GROUP ___ BY ___ ● 'Blocking' operator ● Translates to a MapReduce shuffle
  • 6. Pig Latin A = LOAD 'input.txt'; B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO './output.txt';
  • 9. Map Reduce Plan Map Reduce
  • 10. What's the problem ● Extra intermediate output ● Artificial synchronization barriers ● Inefficient use of resources ● Multiquery Optimizer ● Alleviates some problems ● Has its own
  • 11. Apache Tez ● Incubating project ● Express data processing as a directed acyclic graph ● Runs on YARN ● Aims for lower latency and higher throughput than Map Reduce
  • 12. Tez Concepts ● Job expressed as directed acyclic graph (DAG) ● Processing done at vertices ● Data flows along edges Mapper Reducer Processor Processor Processor Processor
  • 13. Benefits & Optimizations ● Fewer synchronization barriers ● Container Reuse ● Object caches at the vertices ● Dynamic parallelism estimation ● Custom data transfer between processors
  • 14. What we've done for Pig ● New execution engine based on Tez ● Physical Plan translated to Tez Plan instead of Map Reduce Plan ● Same Physical Plan and operators ● Custom processors run the execution plan on Tez
  • 15. Along the way ● New pluggable execution backend ● Made operator set more generic ● Motivated Tez improvements
  • 16. Group By LOAD GROUP BY, SUM Identity GROUP BY HDFS LOAD GROUP BY, STORE GROUP BY, SUM f = LOAD ‘foo’ AS (x:int, y:int); g = GROUP f BY x; h = FOREACH g GENERATE group AS r, SUM(f.y) as s; i = GROUP h BY s;
  • 17. Join LOAD l, r JOIN, STORE LOAD r JOIN, STORE LOAD l l = LOAD ‘left’ AS (x, y); r = LOAD ‘right’ AS (x, z); j = JOIN l BY x, r BY x;
  • 18. Group By LOAD GROUP f BY x, GROUP f BY y LOAD g, h JOIN HDFS LOAD JOIN GROUP BYGROUP BY f = LOAD ‘foo’ AS (x:int, y:int); g = GROUP f BY x; h = GROUP f BY y; i = JOIN g BY group, h BY group;
  • 20. Performance Comparison Replicated Join (2.8x) Join + Group By (1.5x) Join + Group By + Order By (1.5x) 3 way Split + Join + Group By (2.6x) 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Map Reduce Tez
  • 21. How it started Shared interests across organizations • Similar data platform architecture. • Pig for ETL jobs • Hive for ad-hoc queries
  • 22. How it started Shared interests across organizations • Hortonworks wants Tez to succeed.
  • 23. Community meet-ups helped • Twitter presented summer intern’s POC work at Tez meet-up. • Pig devs exchanged interests. Organizing team
  • 24. Organizing team Community meet-ups helped • Tez team hosted tutorial sessions for Pig devs. • Pig team got together to brainstorm implementation design.
  • 25. Companies showed commitment to the project • Hortonworks: Daniel Dai • LinkedIn: Alex Bain, Mark Wagner • Netflix: Cheolsoo Park • Yahoo: Olga Natkovich, Rohini Palaniswamy Building trust
  • 26. Make Pig 2x faster within 6 months • Hive-on-Tez showed 2x performance gain. • Rewriting the Pig backend within 6 months seemed reasonable. Setting goals
  • 27. Acting as team Sprint • Monthly planning meetings • Twice-a-week stand-up conference calls Issues / discussions • PIG-3446 umbrella jira for Pig on Tez • Whiteboard discussions at meetings
  • 28. • Pig old timer Daniel Dai acted as mentor. • Everyone got to work on core functionalities. • Everyone became an expert on the Pig backend. Knowledge transfer
  • 29. Sharing credit • Elected as a new committer and PMC chair. • Gave talks at Hadoop User Group and Pig User Group meet-ups. • Speaking at ApacheCon and upcoming Hadoop Summit.
  • 30. Further collaborations Looking for more collaborations • Parquet Hive SerDe improvements. • Sharing experiences with SQL-on-Hadoop solutions.
  • 31. Mind shift “If we can’t hire all these good people, why don’t we use them in a collaboration?” • Collaboration instead of competition.
  • 32. Mind shift “Why do we reinvent the wheel?” • Share the same technologies while creating different services.
  • 33. Believe in the Apache way