SlideShare une entreprise Scribd logo
1  sur  31
Embedding Pig in scripting languages What happens when you feed a Pig to a Python? Julien Le Dem – Principal Engineer - Content Platforms at Yahoo! Pig committer julien@ledem.net @julienledem
Disclaimer No animals were hurt in the making of this presentation I’m cute I’m hungry Picture credits: OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/ Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/
What for ? Simplifying the implementation of  iterative algorithms: Loop and exit criteria Simpler User Defined Functions Easier parameter passing
Before The implementation  has the following  artifacts:
Pig Script(s) warshall_n_minus_1 = LOAD '$workDir/warshall_0' 	USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); to_join_n_minus_1 = LOAD '$workDir/to_join_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1; followed = FOREACH joined GENERATE FLATTEN(followRel(to_join_n_minus_1,warshall_n_minus_1)); followed_byid = GROUP followed BY id1; warshall_n = FOREACH followed_byid GENERATE group, FLATTEN(coalesceLine(followed.(id2, status))); to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0!=$1; STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage; STORE to_join_n INTO '$workDir/to_join_1 USING BinStorage;
External loop #!/usr/bin/python  import os num_iter=int(10) for i in range(num_iter): os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig') os.rename('output_results/p_z_u','output_results/p_z_u.'+str(i)) os.system('cpoutput_results/p_z_u.nxtoutput_results/p_z_u'); 	os.rename('output_results/p_z_u.nxt','output_results/p_z_u.'+str(i+1)) os.rename('output_results/p_s_z','output_results/p_s_z.'+str(i)) os.system('cpoutput_results/p_s_z.nxtoutput_results/p_s_z'); 	os.rename('output_results/p_s_z.nxt','output_results/p_s_z.'+str(i+1))
Java UDF(s)
Development Iteration
So… What happens? Credits:  Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/
After One script (to rule them all): 	- main program 	- UDFs as script functions 	- embedded Pig statements All the algorithm in one place
References It uses JVM implementations of scripting languages (Jython, Rhino). This is a joint effort, see the following Jiras:  	in Pig 0.8: PIG-928 Python UDFs    in Pig0.9: PIG-1479 embedding, PIG-1794 JavaScript support Doc: http://pig.apache.org/docs/
Examples 1) Simple example: fixed loop iteration 2) Adding convergence criteria and accessing intermediary output 3)More advanced example with UDFs
1) A Simple Example PageRank: A system of linear equations (as many as there are pages on the web, yeah, a lot):  It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value. Ref: http://en.wikipedia.org/wiki/PageRank
Or more visually Each page sends a fraction of its PageRank to the pages linked to. Inversely proportional to the number of links.
Let’s zoom in pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Iterate 10 times Pass parameters as a dictionary Pass parameters as a dictionary Just run P, that was declared above The output becomes the new input
Practical result Applied to the English Wikipedia link graph: http://wiki.dbpedia.org/Downloads36#owikipediapagelinks It turns out that the max PageRank is awarded to: http://en.wikipedia.org/wiki/United_States Thanks @ogrisel for the suggestion
2) Same example, one step further Now let’s say that we define a threshold as a convergence criteria instead of a fixed iteration count.
Same thing as previously Computation of the maximum difference with the previous iteration … continued next slide
The main program The parameter-less bind() uses the variables in the current scope We can easily read the output of Pig from the grid Stop if we reach a threshold
3) Now somethingmore complex Compute a transitive closure:  find the connected components of a graph.  - Useful if you’re doing de-duplication  - Requires iterations and UDFs
Or more visually Turn this:	                Into this:
Convergence Converges in :  log2(max(Diameter of a component)) Diameter = “the longest shortest path” Bottom line: the number of iterations will be reasonable.
UDFs are in the same script as the main program Zoom next slide Page 1/3
Zoom on UDFs The output schema of the UDF is defined using a decorator The native structures of the language can be used directly
Zoom next slides Zoom next slides Page 2/3
Zoom on the Pig script … UDFs are directly available, no extra declaration needed
Zoom on the loop Iterate a maximum of 10 times (2^10 maximum diameter of a component) Convergence criteria: all links have been followed
Final part: formatting Turning the graph representation into a component list representation This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main() Page 3/3
One more thing … I presented Python but JavaScript is available as well (experimental). The framework is extensible. Any JVM implementation of a language could be integrated (contribute!). The examples can be found at: https://github.com/julienledem/Pig-scripting-examples
Questions? ? ?

Contenu connexe

Tendances

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
DataWorks Summit
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
DataWorks Summit
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
Mitsuharu Hamba
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop User Group
 

Tendances (20)

Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Word Embedding for Nearest Words
Word Embedding for Nearest WordsWord Embedding for Nearest Words
Word Embedding for Nearest Words
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
GPU Accelerated Machine Learning
GPU Accelerated Machine LearningGPU Accelerated Machine Learning
GPU Accelerated Machine Learning
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java Professionals
 

En vedette

Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 

En vedette (12)

BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Pig and Python to Process Big Data
Pig and Python to Process Big DataPig and Python to Process Big Data
Pig and Python to Process Big Data
 

Similaire à Embedding Pig in scripting languages

Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
CherryBerry2
 
Easy deployment & management of cloud apps
Easy deployment & management of cloud appsEasy deployment & management of cloud apps
Easy deployment & management of cloud apps
David Cunningham
 

Similaire à Embedding Pig in scripting languages (20)

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...
 
Incredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and GeneratorsIncredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and Generators
 
Pig
PigPig
Pig
 
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in R
 
Introduction to Google App Engine with Python
Introduction to Google App Engine with PythonIntroduction to Google App Engine with Python
Introduction to Google App Engine with Python
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
 
Easy deployment & management of cloud apps
Easy deployment & management of cloud appsEasy deployment & management of cloud apps
Easy deployment & management of cloud apps
 
Who pulls the strings?
Who pulls the strings?Who pulls the strings?
Who pulls the strings?
 
Python1
Python1Python1
Python1
 
High Performance NodeJS
High Performance NodeJSHigh Performance NodeJS
High Performance NodeJS
 
2nd puc computer science chapter 8 function overloading
 2nd puc computer science chapter 8   function overloading 2nd puc computer science chapter 8   function overloading
2nd puc computer science chapter 8 function overloading
 
Pemrograman Python untuk Pemula
Pemrograman Python untuk PemulaPemrograman Python untuk Pemula
Pemrograman Python untuk Pemula
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to Gophers
 
python introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptxpython introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptx
 
Puppet quick start guide
Puppet quick start guidePuppet quick start guide
Puppet quick start guide
 
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
 
Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1
 

Plus de Julien Le Dem

Plus de Julien Le Dem (18)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Embedding Pig in scripting languages

  • 1. Embedding Pig in scripting languages What happens when you feed a Pig to a Python? Julien Le Dem – Principal Engineer - Content Platforms at Yahoo! Pig committer julien@ledem.net @julienledem
  • 2. Disclaimer No animals were hurt in the making of this presentation I’m cute I’m hungry Picture credits: OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/ Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/
  • 3. What for ? Simplifying the implementation of iterative algorithms: Loop and exit criteria Simpler User Defined Functions Easier parameter passing
  • 4. Before The implementation has the following artifacts:
  • 5. Pig Script(s) warshall_n_minus_1 = LOAD '$workDir/warshall_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); to_join_n_minus_1 = LOAD '$workDir/to_join_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1; followed = FOREACH joined GENERATE FLATTEN(followRel(to_join_n_minus_1,warshall_n_minus_1)); followed_byid = GROUP followed BY id1; warshall_n = FOREACH followed_byid GENERATE group, FLATTEN(coalesceLine(followed.(id2, status))); to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0!=$1; STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage; STORE to_join_n INTO '$workDir/to_join_1 USING BinStorage;
  • 6. External loop #!/usr/bin/python import os num_iter=int(10) for i in range(num_iter): os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig') os.rename('output_results/p_z_u','output_results/p_z_u.'+str(i)) os.system('cpoutput_results/p_z_u.nxtoutput_results/p_z_u'); os.rename('output_results/p_z_u.nxt','output_results/p_z_u.'+str(i+1)) os.rename('output_results/p_s_z','output_results/p_s_z.'+str(i)) os.system('cpoutput_results/p_s_z.nxtoutput_results/p_s_z'); os.rename('output_results/p_s_z.nxt','output_results/p_s_z.'+str(i+1))
  • 9. So… What happens? Credits: Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/
  • 10. After One script (to rule them all): - main program - UDFs as script functions - embedded Pig statements All the algorithm in one place
  • 11. References It uses JVM implementations of scripting languages (Jython, Rhino). This is a joint effort, see the following Jiras: in Pig 0.8: PIG-928 Python UDFs in Pig0.9: PIG-1479 embedding, PIG-1794 JavaScript support Doc: http://pig.apache.org/docs/
  • 12. Examples 1) Simple example: fixed loop iteration 2) Adding convergence criteria and accessing intermediary output 3)More advanced example with UDFs
  • 13. 1) A Simple Example PageRank: A system of linear equations (as many as there are pages on the web, yeah, a lot): It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value. Ref: http://en.wikipedia.org/wiki/PageRank
  • 14. Or more visually Each page sends a fraction of its PageRank to the pages linked to. Inversely proportional to the number of links.
  • 15.
  • 16. Let’s zoom in pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Iterate 10 times Pass parameters as a dictionary Pass parameters as a dictionary Just run P, that was declared above The output becomes the new input
  • 17. Practical result Applied to the English Wikipedia link graph: http://wiki.dbpedia.org/Downloads36#owikipediapagelinks It turns out that the max PageRank is awarded to: http://en.wikipedia.org/wiki/United_States Thanks @ogrisel for the suggestion
  • 18. 2) Same example, one step further Now let’s say that we define a threshold as a convergence criteria instead of a fixed iteration count.
  • 19. Same thing as previously Computation of the maximum difference with the previous iteration … continued next slide
  • 20. The main program The parameter-less bind() uses the variables in the current scope We can easily read the output of Pig from the grid Stop if we reach a threshold
  • 21. 3) Now somethingmore complex Compute a transitive closure: find the connected components of a graph. - Useful if you’re doing de-duplication - Requires iterations and UDFs
  • 22. Or more visually Turn this: Into this:
  • 23. Convergence Converges in : log2(max(Diameter of a component)) Diameter = “the longest shortest path” Bottom line: the number of iterations will be reasonable.
  • 24. UDFs are in the same script as the main program Zoom next slide Page 1/3
  • 25. Zoom on UDFs The output schema of the UDF is defined using a decorator The native structures of the language can be used directly
  • 26. Zoom next slides Zoom next slides Page 2/3
  • 27. Zoom on the Pig script … UDFs are directly available, no extra declaration needed
  • 28. Zoom on the loop Iterate a maximum of 10 times (2^10 maximum diameter of a component) Convergence criteria: all links have been followed
  • 29. Final part: formatting Turning the graph representation into a component list representation This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main() Page 3/3
  • 30. One more thing … I presented Python but JavaScript is available as well (experimental). The framework is extensible. Any JVM implementation of a language could be integrated (contribute!). The examples can be found at: https://github.com/julienledem/Pig-scripting-examples