EEDC Apache Pig Language
1. EEDC 34330 – Execution Environments for Distributed Computing
Apache Pig
Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-3
Group members:
Javier Álvarez – javicid@gmail.com
Francesc Lordan – francesc.lordan@gmail.com
Roger Rafanell – rogerrafanell@gmail.com
3. Part 1: Introduction
4. Why Apache Pig?
Today's Internet companies need to process huge data sets:
– Parallel databases can be prohibitively expensive at this scale.
– Programmers tend to find declarative languages such as SQL very unnatural.
– Other approaches such as map-reduce are low-level and rigid.
5. What is Apache Pig?
A platform for analyzing large data sets that:
– Is based on Pig Latin, a language that lies between declarative (SQL) and
procedural (C++) programming languages.
– At the same time, enables the construction of programs with an easily
parallelizable structure.
6. Which features does it have?
Dataflow Language
– Data processing is expressed step by step.
Quick Start & Interoperability
– Pig can work over any kind of input and produce any kind of output.
Nested Data Model
– Pig works with complex types like tuples, bags, ...
User Defined Functions (UDFs)
– Potentially in any programming language (only Java for the moment).
Only parallel
– Pig Latin only offers directives that can be parallelized in a straightforward way.
Debugging environment
– Debugging at programming time.
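As a sketch of the UDF feature: a user-defined Java function (here `myudfs.ExtractDomain`, an assumed jar and function name for illustration) is registered and then invoked like any built-in function.

```pig
-- Register a jar containing user-defined Java functions (hypothetical names)
REGISTER myudfs.jar;
visits = LOAD 'visits.txt' AS (user, url, time);
-- Apply the UDF to every tuple of the bag
domains = FOREACH visits GENERATE user, myudfs.ExtractDomain(url);
```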
7. Part 2: Pig Latin
8. Section 2.1: Data model
9. Data Model
Very rich data model consisting of 4 simple data types:
Atom: simple atomic value such as a string or a number.
'Alice'
Tuple: sequence of fields of any type of data.
('Alice', 'Apple')
('Alice', ('Barça', 'football'))
Bag: collection of tuples, with possible duplicates.
{ ('Alice', 'Apple'),
  ('Alice', ('Barça', 'football')) }
Map: collection of data items with an associated key (always an atom).
[ 'Fan of' → { ('Apple'),
               ('Barça', 'football') } ]
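These types can be declared as a schema when loading data. A minimal sketch, assuming a file `fans.txt` whose second column is a bag of (team, sport) tuples:

```pig
-- name is an atom; likes is a bag of tuples (nested data model)
fans = LOAD 'fans.txt'
       AS (name:chararray, likes:bag{t:tuple(team:chararray, sport:chararray)});
-- FLATTEN unnests the bag, producing one output tuple per inner tuple
flat = FOREACH fans GENERATE name, FLATTEN(likes);
```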
10. Section 2.2: Relational commands
13. Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6) })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2) })
14. Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
useravg: ('Amy', 0.7)
         ('Bob', 0.2)
15. Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer = FILTER useravg BY avgpr > 0.5;
answer: ('Amy', 0.7)
16. Relational commands
Other relational operators:
– STORE: exports data into a file.
STORE var1_name INTO 'output.txt';
– COGROUP: groups together tuples from different datasets.
COGROUP var1_name BY field_id, var2_name BY field_id;
– UNION: computes the union of two variables.
– CROSS: computes the cross product.
– ORDER: sorts a data set by one or more fields.
– DISTINCT: removes duplicated tuples in a dataset.
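A minimal sketch combining several of these operators (file names and schema are assumed for illustration):

```pig
a = LOAD 'visits_jan.txt' AS (user, url, time);
b = LOAD 'visits_feb.txt' AS (user, url, time);
both = UNION a, b;           -- concatenate the two datasets
uniq = DISTINCT both;        -- drop duplicate tuples
sorted = ORDER uniq BY user; -- globally sorted result
STORE sorted INTO 'all_visits';
```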
17. Part 3: Implementation
18. Implementation: Highlights
Works on top of the Hadoop ecosystem:
– The current implementation uses Hadoop as its execution platform.
On-the-fly compilation:
– Pig translates Pig Latin commands into Map and Reduce methods.
Lazy-style language:
– Pig tries to postpone data materialization (writes to disk) as much as
possible.
19. Implementation: Building the logical plan
Query parsing:
– The Pig interpreter parses the commands, verifying that the input files and
bags referenced are valid.
On-the-fly compilation:
– Pig compiles the logical plan for a bag into a physical plan (Map-Reduce
statements) only when the command can no longer be delayed and must be
executed.
Lazy characteristics:
– No processing is carried out while the logical plan is being built.
– Processing is triggered only when the user invokes the STORE command
on a bag.
– Lazy-style execution permits in-memory pipelining and other interesting
optimizations.
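The lazy behaviour can be illustrated with a short script (file names assumed): the first two statements only extend the logical plan, and nothing runs until STORE.

```pig
visits = LOAD 'visits.txt' AS (user, url, time); -- logical plan only, no job yet
recent = FILTER visits BY time > '10am';         -- still no processing
STORE recent INTO 'recent_visits';               -- triggers the Map-Reduce job
```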
20. Implementation: Map-Reduce plan compilation
(CO)GROUP:
– Each command is compiled into a distinct map-reduce job with its own
map and reduce functions.
– Parallelism is achieved since the output of multiple map instances is
repartitioned in parallel to multiple reduce instances.
LOAD:
– Parallelism is obtained since Pig operates over files residing in the
Hadoop distributed file system.
FILTER/FOREACH:
– Parallelism comes for free, since a map-reduce job already runs several
map and reduce instances in parallel.
ORDER (compiled into two map-reduce jobs):
– First: determines the quantiles of the sort key.
– Second: partitions the data according to the quantiles and performs a local
sort in the reduce phase, resulting in a globally sorted file.
21. Part 4: Conclusions
22. Conclusions
Advantages:
– Step-by-step syntax.
– Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
– Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, ...
– Takes advantage of Hadoop native properties such as parallelism, load balancing and fault tolerance.
– Debugging environment.
– Open source (IMPORTANT!!)
Disadvantages:
– UDF methods can be a source of performance loss (control relies on the user).
– Overhead while compiling Pig Latin into map-reduce jobs.
Usage scenarios:
– Temporal analysis: analyzing search logs, mainly studying how the search query distribution changes
over time.
– Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are
analyzed to compute metrics such as:
– how long is the average user session?
– how many links does a user click on before leaving a website?
– Others, ...
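One of the session metrics above, the average number of clicks per user, can be sketched in Pig Latin (file name and schema assumed):

```pig
clicks = LOAD 'clicks.txt' AS (user, url, time);
byuser = GROUP clicks BY user;
-- one tuple per user with that user's click count
peruser = FOREACH byuser GENERATE group AS user, COUNT(clicks) AS nclicks;
-- collapse everything into a single group to average across users
allusers = GROUP peruser ALL;
avgclicks = FOREACH allusers GENERATE AVG(peruser.nclicks);
```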