Pig Latin: A Not-So-Foreign
Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava,
Ravi Kumar, Andrew Tomkins
Acknowledgement : Modified the slides from University of Waterloo
Motivation
 You're a procedural programmer
 You have huge data
 You want to analyze it
2
Motivation
 As a procedural programmer…
 May find writing queries in SQL unnatural and too restrictive
 More comfortable writing code as a series of statements rather than
one long query (part of why MapReduce has been so successful)
3
Motivation
 Data analysis goals
 Quick
 Exploit parallel processing power of a distributed system
 Easy
 Be able to write a program or query without a huge learning curve
 Have some common analysis tasks predefined
 Flexible
 Transform data sets into a workable structure without much
overhead
 Perform customized processing
 Transparent
 Have a say in how the data processing is executed on the system
5
Motivation
 Relational Distributed Databases
 Parallel database products expensive
 Rigid schemas
 Processing requires declarative SQL query construction
 Map-Reduce
 Relies on custom code for even common operations
 Requires workarounds for tasks whose data flow differs from the
expected Map → Combine → Reduce pattern
6
Motivation
 Relational Distributed Databases
 Map-Reduce
 Sweet Spot: Take the best of both SQL and Map-Reduce;
combine high-level declarative querying with low-level
procedural programming…Pig Latin!
7
Pig Latin Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-
pagerank urls in that category
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>10^6;
output = FOREACH big_groups GENERATE category,
AVG(good_urls.pagerank);
Outline
 System Overview
 Pig Latin (The Language)
 Data Structures
 Commands
 Pig (The Compiler)
 Logical & Physical Plans
 Optimization
 Efficiency
 Pig Pen (The Debugger)
 Conclusion
8
Big Picture
(Diagram: a Pig Latin script together with user-defined functions is fed to Pig, which
compiles and optimizes it into Map-Reduce statements; the Map-Reduce layer reads the
input data and writes the results)
10
Data Model
 Atom - simple atomic value (e.g., a number or string)
 Tuple
 Bag
 Map
11
Data Model
 Atom
 Tuple - sequence of fields; each field any type
 Bag
 Map
12
Data Model
 Atom
 Tuple
 Bag - collection of tuples
 Duplicates possible
 Tuples in a bag can have different numbers of fields and different field types
 Map
13
Data Model
 Atom
 Tuple
 Bag
 Map - collection of key-value pairs
 Key is an atom; value can be any type
14
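Putting the four types together - a single nested value in the spirit of the paper's running
example (values illustrative): 'alice' is an atom, the second field is a bag of tuples, and the
third is a map keyed by an atom:
('alice', {('lakers', 1), ('iPod', 2)}, ['age' -> 20])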
Data Model
 Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
 Fully nested
 More natural for procedural programmers (target user) than
normalization
 Data is often stored on disk in a nested fashion
 Makes user-defined functions easier to write
 No schema required
15
Data Model
 User-Defined Functions (UDFs)
 Can be used in many Pig Latin statements
 Useful for custom processing tasks
 Can use non-atomic values for input and output
 Currently must be written in Java
16
 Ex: spam_urls = FILTER urls BY isSpam(url);
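A minimal usage sketch, assuming isSpam is a Java UDF packaged in a jar (jar name
hypothetical); in Apache Pig the jar is typically registered before the function is called:
REGISTER myudfs.jar;                     -- make the user-defined classes visible to the script
spam_urls = FILTER urls BY isSpam(url);  -- the UDF is then used like a built-in function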
Speaking Pig Latin
 LOAD
 Input is assumed to be a bag (sequence of tuples)
 Can specify a deserializer with USING
 Can provide a schema with AS
newBag = LOAD ‘filename’
<USING functionName() >
<AS (fieldName1, fieldName2,…)>;
17
queries = LOAD ‘query_log.txt’
USING myLoad()
AS (userId, queryString, timestamp);
Speaking Pig Latin
 FOREACH
 Apply some processing to each tuple in a bag
 Each field can be:
 A fieldname of the bag
 A constant
 A simple expression (e.g., f1+f2)
 A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
 A UDF (e.g., sumTaxes(gst, pst))
newBag =
FOREACH bagName
GENERATE field1, field2, …;
18
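A concrete instance, reusing the queries bag loaded earlier; expandQuery is assumed to be a
UDF that returns a bag of expanded query strings:
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);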
Speaking Pig Latin
 FILTER
 Select a subset of the tuples in a bag
newBag = FILTER bagName
BY expression;
 Expression uses simple comparison operators (==, !=, <, >, …)
and logical connectors (AND, NOT, OR)
some_apples =
FILTER apples BY colour != ‘red’;
 Can use UDFs
some_apples =
FILTER apples BY NOT isRed(colour);
19
Speaking Pig Latin
 COGROUP
 Group two datasets together by a common attribute
 Groups data into nested bags
grouped_data = COGROUP results BY queryString,
revenue BY queryString;
20
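For context, the two bags being cogrouped could be prepared as follows; the file names and
schemas are assumptions in the style of the paper's running example:
results = LOAD 'results.txt' AS (queryString, url, position);
revenue = LOAD 'revenue.txt' AS (queryString, adSlot, amount);
grouped_data = COGROUP results BY queryString, revenue BY queryString;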
Speaking Pig Latin
 Why COGROUP and not JOIN?
url_revenues =
FOREACH grouped_data GENERATE
FLATTEN(distributeRev(results, revenue));
21
Speaking Pig Latin
 Why COGROUP and not JOIN?
 May want to process nested bags of tuples before taking the
cross product.
 Keeps to the goal of a single high-level data transformation per
Pig Latin statement.
 However, JOIN keyword is still available:
JOIN results BY queryString,
revenue BY queryString;
Equivalent
temp = COGROUP results BY queryString,
revenue BY queryString;
join_result = FOREACH temp GENERATE
FLATTEN(results), FLATTEN(revenue);
22
Speaking Pig Latin
 STORE (& DUMP)
 Output data to a file (or screen)
STORE bagName INTO ‘filename’
<USING serializer()>;
 Other Commands (incomplete)
 UNION - return the union of two or more bags
 CROSS - take the cross product of two or more bags
 ORDER - order tuples by a specified field(s)
 DISTINCT - eliminate duplicate tuples in a bag
 LIMIT - Limit results to a subset
23
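A few one-line sketches of these commands; bag names refer to earlier examples or are
hypothetical, and some clauses (e.g., LIMIT) may vary across Pig releases:
all_urls = UNION urls_2021, urls_2022;      -- union of two bags
url_pairs = CROSS results, revenue;         -- cross product of two bags
sorted_urls = ORDER good_urls BY pagerank;  -- order tuples by a field
cats = FOREACH urls GENERATE category;
distinct_cats = DISTINCT cats;              -- eliminate duplicate tuples
top_urls = LIMIT sorted_urls 100;           -- keep only 100 tuples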
Compilation
 Pig system does two tasks:
 Builds a Logical Plan from a Pig Latin script
 Supports execution platform independence
 No processing of data performed at this stage
 Compiles the Logical Plan to a Physical Plan and Executes
 Convert the Logical Plan into a series of Map-Reduce statements to
be executed (in this case) by Hadoop Map-Reduce
24
Compilation
 Building a Logical Plan
 Verify input files and bags referred to are valid
 Create a logical plan for each bag (variable) defined
25
Compilation
 Building a Logical Plan Example
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan so far: Load(user.dat)
26
Compilation
 Building a Logical Plan Example
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan so far: Load(user.dat) → Group
27
Compilation
 Building a Logical Plan Example
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan so far: Load(user.dat) → Group → Foreach
28
Compilation
 Building a Logical Plan Example
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan so far: Load(user.dat) → Group → Foreach → Filter
29
Compilation
 Building a Logical Plan Example
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan after optimization: Load(user.dat) → Filter → Group → Foreach
(the filter on the group key can safely be pushed ahead of the Group)
30
Compilation
 Building a Physical Plan
A = LOAD ‘user.dat’ AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == ‘kitchener’ OR city == ‘waterloo’;
STORE D INTO ‘local_user_count.dat’;
Plan: Load(user.dat) → Filter → Group → Foreach
Building the physical plan only happens when output is requested by STORE or DUMP
32
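A small sketch of this lazy behavior, reusing the script above: earlier statements only extend
the logical plan; the DUMP (or STORE) at the end is what triggers compilation and execution:
D = FILTER C BY city == 'kitchener' OR city == 'waterloo';  -- extends the logical plan only
DUMP D;  -- output requested: compile to map-reduce and execute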
Compilation
 Building a Physical Plan
 Step 1: Create a map-reduce job for each COGROUP
(Diagram: the plan Load(user.dat) → Filter → Group → Foreach, annotated with the Map
and Reduce stages of the job)
33
Compilation
 Building a Physical Plan
 Step 1: Create a map-reduce job for each COGROUP
 Step 2: Push other commands into the map and reduce functions where possible
 Certain commands may still require their own map-reduce job
(e.g., ORDER needs separate map-reduce jobs)
(Diagram: Load(user.dat) and Filter pushed into the Map stage; Group at the map-reduce
boundary; Foreach pushed into the Reduce stage)
34
Compilation
 Efficiency in Execution
 Parallelism
 Loading data - Files are loaded from HDFS
 Statements are compiled into map-reduce jobs
35
Compilation
 Efficiency with Nested Bags
 In many cases, the nested bags created in each tuple of a COGROUP
statement never need to physically materialize
 Aggregation is usually performed after a COGROUP, and the statements
doing that aggregation are pushed into the reduce function
 Applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
36
Compilation
 Efficiency with Nested Bags
(Diagram: the plan Load(user.dat) → Filter → Group → Foreach, with the Map stage
highlighted - Load and Filter run in the map)
37
Compilation
 Efficiency with Nested Bags
(Diagram: the same plan with a Combine stage added - the Foreach computing the COUNT
runs partially in the combiner)
38
Compilation
 Efficiency with Nested Bags
(Diagram: the same plan with the Reduce stage highlighted - the Foreach completes the
aggregation in the reduce function)
39
Compilation
 Efficiency with Nested Bags
 Why this works:
 COUNT is an algebraic function; it can be structured as a tree of sub-
functions with each leaf working on a subset of the data
(Diagram: a two-level tree - each Combine leaf computes COUNT over its subset of the
data, and the Reduce root applies SUM to the partial counts)
40
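A small worked illustration with hypothetical partial results: if two combiners each COUNT
the tuples they see for the same group, say 3 and 4 tuples, the reducer only needs to SUM
the partial counts:
COUNT(bag1 ∪ bag2) = SUM(COUNT(bag1), COUNT(bag2)) = SUM(3, 4) = 7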
Compilation
 Efficiency with Nested Bags
 Pig provides an interface for writing algebraic UDFs so they can take
advantage of this optimization as well.
 Inefficiencies
 Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to
materialize; this may cause a very large bag to spill to disk if it doesn't fit
in memory
 Every map-reduce job requires data be written and replicated to the
HDFS (although this is offset by parallelism achieved)
41
Debugging
 How to verify the semantics of an analysis program?
 Run the program against the whole data set - might take hours!
 Generate a sample dataset
 Empty result sets may occur for operations like join and filter
 Generally, testing with a sample dataset is difficult
 Pig-Pen
 Samples data from the large dataset for the Pig Latin statements
 Applies individual Pig Latin commands against the sample dataset
 If a command produces an empty result, the Pig system resamples
 Removes redundant samples
42
Debugging
 Pig-Pen
42
Debugging
 Pig-Latin command window and command generator
43
Debugging
 Sandbox Dataset (generated automatically!)
44
Debugging
 Pig-Pen
 Provides sample data that is:
 Real - taken from actual data
 Concise - as small as possible
 Complete - collectively illustrate the key semantics of each command
 Helps with schema definition
 Facilitates incremental program writing
45
Conclusion
 Pig is a data processing environment in Hadoop that is
specifically targeted towards procedural programmers
who perform large-scale data analysis.
 Pig-Latin offers high-level data manipulation in a
procedural style.
 Pig-Pen is a debugging environment for Pig-Latin
commands that generates samples from real data.
47
More Info
 Pig, http://hadoop.apache.org/pig/
 Hadoop, http://hadoop.apache.org
Anks-Thay!
48