Pig: High-Level Data Flow Language
COPYRIGHT (C) CHIRAG AHUJA
Outline 
Map-Reduce and the need for Pig Latin 
Pig Latin 
Compilation into Map-Reduce 
Implementation 
Comparison with Map-Reduce 
Optimization in Pig 
The Map-Reduce Appeal 
Scale 
Scalable due to simpler design 
• Only parallelizable operations 
• No transactions 
Cost
Runs on cheap commodity hardware
Not SQL
Procedural control: a processing “pipe”
Map-Reduce 
[Figure: input records (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5) pass through two map tasks, are grouped by key (k1: v1, v3, v5; k2: v2, v4), and are handed to two reduce tasks that emit the output records.]
Just a group-by-aggregate?
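The flow in the diagram above can be imitated in a few lines of plain Python. This is only a single-process sketch, not how Hadoop is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`), the numeric values, and the choice of summing as the reduce step are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to each input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key, which is what happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reduce_fn):
    """Call reduce_fn once per key with all of that key's values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def identity_map(record):
    # Hypothetical map function: pass each (key, value) record through unchanged.
    return [record]


def sum_reduce(key, values):
    # Hypothetical reduce function: aggregate a key's values by summing them.
    return sum(values)


# Hypothetical input shaped like the slide: keys k1/k2 with made-up numeric values.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

grouped = shuffle(map_phase(records, identity_map))
result = reduce_phase(grouped, sum_reduce)
print(grouped)  # {'k1': [1, 3, 5], 'k2': [2, 4]}
print(result)   # {'k1': 9, 'k2': 6}
```

With an identity map and an aggregating reduce, the job is indeed just a group-by-aggregate, which is the question the slide raises.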
Java Example
[Figure: Java source for a Map-Reduce job, with three parts labeled: map, reduce, and job configuration.]
Disadvantages 
1. Extremely rigid data flow (M, then R)
• Other flows constantly hacked in: join, union, split, chains (M M R M)
2. Common operations must be coded by hand 
• Join, filter, projection, aggregates, sorting, distinct 
3. Semantics hidden inside map-reduce functions 
• Difficult to maintain, extend, and optimize
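To make point 2 concrete, here is a rough Python sketch of what hand-coding a join in the map-reduce style tends to involve: the map step tags each record with its source, and the reduce step pairs up the two sides for each key. It is a single-process imitation under assumed names (`tag_map`, `join_reduce`, `reduce_side_join`) and made-up data, not code from the deck or from Hadoop.

```python
from collections import defaultdict


def tag_map(source, record):
    """Map step: emit (join key, (source tag, payload)) so reduce can tell the inputs apart."""
    key, payload = record
    return key, (source, payload)


def join_reduce(key, tagged_values):
    """Reduce step: pair every left payload with every right payload for one key."""
    left = [payload for source, payload in tagged_values if source == "visits"]
    right = [payload for source, payload in tagged_values if source == "urlInfo"]
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(visits, url_info):
    """Inner join of two (url, payload) inputs, written as map, shuffle, reduce."""
    groups = defaultdict(list)
    for source, records in (("visits", visits), ("urlInfo", url_info)):
        for record in records:
            key, tagged = tag_map(source, record)
            groups[key].append(tagged)  # shuffle: collect values per key
    joined = []
    for key in sorted(groups):
        joined.extend(join_reduce(key, groups[key]))
    return joined


# Hypothetical sample data: (url, user) and (url, category).
visits = [("cnn.com", "amy"), ("bbc.com", "bob"), ("cnn.com", "carl")]
url_info = [("cnn.com", "news"), ("flickr.com", "photos")]

joined = reduce_side_join(visits, url_info)
print(joined)  # [('cnn.com', 'amy', 'news'), ('cnn.com', 'carl', 'news')]
```

The join logic lives entirely inside user functions, so a framework that only sees opaque map and reduce code cannot recognize it as a join, which is point 3.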
Pros And Cons 
Need a high-level, general data flow language 
Enter Pig Latin 
Need a high-level, general data flow language 
What is Pig
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Compiles down to Map-Reduce jobs
• Developed by Yahoo!
• Open-source language
Data Flow 
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
In Pig Latin 
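visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
-- after a group, the grouping key is available as "group"
visitCounts = foreach gVisits generate group as url, COUNT(visits) as visitCount;
urlInfo = load '/data/urlInfo' as (url, category, pRank);
-- new alias for the join result instead of reusing visitCounts
visitInfo = join visitCounts by url, urlInfo by url;
gCategories = group visitInfo by category;
-- TOP(n, column, relation): column 1 of visitInfo is visitCount
topUrls = foreach gCategories generate group as category, TOP(10, 1, visitInfo);
store topUrls into '/data/topUrls';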
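For readers who want to check what each step of the script computes, the same data flow can be traced in plain Python. This is an in-memory sketch, not how Pig executes the script; the sample rows, the function name `top_urls`, and the assumption that each url appears once in the url info are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data with the same fields as the script.
visits = [  # (user, url, time)
    ("amy", "cnn.com", 800),
    ("bob", "cnn.com", 805),
    ("amy", "bbc.com", 810),
    ("carl", "flickr.com", 815),
    ("bob", "flickr.com", 820),
    ("amy", "flickr.com", 825),
]
url_info = [  # (url, category, pRank)
    ("cnn.com", "news", 0.9),
    ("bbc.com", "news", 0.8),
    ("flickr.com", "photos", 0.7),
]


def top_urls(visits, url_info, n=10):
    """Return the n most visited urls per category as (url, visit count) rows."""
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)

    # join on url (inner join), then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:
            by_category[category].append((url, visit_counts[url]))

    # foreach category generate the top n rows by visit count
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


result = top_urls(visits, url_info, n=1)
print(result)  # {'news': [('cnn.com', 2)], 'photos': [('flickr.com', 3)]}
```

Each Python statement corresponds to one or two lines of the Pig Latin script, which is roughly the level at which the script describes the computation.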
Pig Compilation 
Implementation 
[Figure: the user writes SQL, Pig, or Map-Reduce; Pig applies automatic rewrite + optimize and runs the result on a Hadoop Map-Reduce cluster.]
Java vs. Pig 
[Figure: two bar charts comparing Hadoop (Java) and Pig, one for lines of code and one for development time in minutes.]
1/20 the lines of code
1/16 the development time
Performance is comparable (Java is slightly better)
Summary 
Big demand for parallel data processing 
◦ Emerging tools that do not look like SQL DBMS 
◦ Programmers like dataflow pipes over static files 
Hence the excitement about Map-Reduce 
But, Map-Reduce is too low-level and rigid 
Pig Latin 
Sweet spot between map-reduce and SQL
