SlideShare une entreprise Scribd logo
1  sur  24
EEDC
                          34330
Execution                                   Apache Pig
Environments for
Distributed
Computing
Master in Computer Architecture,
Networks and Systems - CANS



                                           Homework number: 3
                                          Group number: EEDC-3
                                             Group members:
                                       Javier Álvarez – javicid@gmail.com
                                   Francesc Lordan – francesc.lordan@gmail.com
                                     Roger Rafanell – rogerrafanell@gmail.com
Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data Model
    2.2.- Programming Model

3.- Implementation

4.- Conclusions




                              2
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,      Part 1
Networks and Systems - CANS        Introduction
Why Apache Pig?

Today’s Internet companies needs to process hugh data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches such map-reduce are low-level and rigid.




                                       4
What is Apache Pig?

A platform for analyzing large data sets that:

   – It is based in Pig Latin which lies between declarative (SQL) and
     procedural (C++) programming languages.

   – At the same time, enables the construction of programs with an easy
     parallelizable structure.




                                      5
Which features does it have?
 Dataflow Language
   – Data processing is expressed step-by-step.

 Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

 Nested Data Model
   – Pig works with complex types like tuples, bags, ...

 User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

 Only parallel
   – Pig Latin forces to use directives that are parallelizable in a direct way.

 Debugging environment
   – Debugging at programming time.
                                        6
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,     Part 2
Networks and Systems - CANS        Pig Latin
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,    Section 2.1
Networks and Systems - CANS        Data Model
Data Model
Very rich data model consisting on 4 simple data types:

 Atom: Simple atomic value such as strings or numbers.
                                        ‘Alice’
 Tuple: Sequence of fields of any type of data.
                                   (‘Alice’, ‘Apple’)
                            (‘Alice’, (‘Barça’, ‘football’))
 Bag: collection of tuples with possible duplicates.
                                      (‘Alice’, ‘Apple’)
                               (‘Alice’, (‘Barça’, ‘football’))
 Map: collection of data items with an associated key (always an atom).
                              ‘Fan of’           (‘Apple’)
                                                (‘Barça’, ‘football’)

                                    ‘Age’  ’20’


                                            9
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,     Section 2.2
Networks and Systems - CANS        Programming
Programming Model
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);




visits: (‘Amy’, ‘cnn.com’, ‘8am’)
     (‘Amy’, ‘nytimes.com’, ‘9am’)
     (‘Bob’, ‘elmundotoday.com’, ’11am’)

pages: (‘cnn.com’, ‘0.8’)
    (‘nytimes.com’, ‘0.6’)
    (‘elmundotoday’, ‘0.2’)


                                      11
Programming Model
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url




v_p:(‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’)
    (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)
    (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)




                                      12
Programming Model
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user




user:   (‘Amy’, { (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’),
        (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)})

    (‘Bob’,   {‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)})


                                        13
Programming Model
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr




user:   (‘Amy’, ‘0.7’)
    (‘Bob’, ‘0.2’)




                             14
Programming Model
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr
answer = FILTER useravg BY avgpr > ‘0.5’




answer: (‘Amy’, ‘0.7’)




                             15
Programming Model
Other relational operators:

    – STORE : exports data into a file.
    STORE var1_name INTO 'output.txt‘;

    – COGROUP : groups together tuples from diferent datasets.
    COGROUP var1_name BY field_id, var2_name BY field_id

    –   UNION : computes the union of two variables.
    –   CROSS : computes the cross product.
    –   ORDER : sorts a data set by one or more fields.
    –   DISTINCT : removes replicated tuples in a dataset.




                                           16
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,        Part 3
Networks and Systems - CANS        Implementation
Implementation: Highlights

 Works on top of Hadoop ecosystem:
   – Current implementation uses Hadoop as execution platform.

 On-the-fly compilation:
   – Pig translates the Pig Latin commands to Map and Reduce methods.

 Lazy style language:
   – Pig try to pospone the data materialization (on disk writes) as much as
     possible.




                                     18
Implementation: Building the logical
plan
 Query parsing:
   – Pig interpreter parses the commands verifying that the input files and
     bags referenced are valid.

 On-the-fly compilation:
   – Pig compiles the logical plan for that bag into physical plan (Map-Reduce
     statements) when the command cannot be more delayed and must be
     executed.

 Lazy characteristics:
   – No processing are carried out when the logical plan are constructed.
   – Processing is triggered only when the user invokes STORE command
     on a bag.
   – Lazy style execution permits in-memory pipelining and other interesting
     optimizations.



                                      5
                                      19
Implementation: Map-Reduce plan
compilation
   CO(GROUP):
     – Each command is compiled in a distinct map-reduce job with its own map and reduce functions.
     – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to
        multiple reduce instances.


   LOAD:
     – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.


   FILTER/FOREACH:
     – Automatic parallelism is given since for a map-reduce job several map and reduce instances are run
         in parallel.


   ORDER (compiled in two map-reduce jobs):
     – First: Determine quantiles of the sort key
     – Second: Chops the job according the quantiles and performs a local sorting in the reduce phase
       resulting in a global sorted file.




                                                     20
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,      Part 4
Networks and Systems - CANS        Conclusions
Conclusions
   Advantages:
     –   Step-by-step syntaxis.
     –   Flexible: UDFs, not locked to a fixed schema (allows schema changes over the time).
     –   Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     –   Takes advantage of Hadoop native properties such: parallelism, load-balancing, fault-tolerance.
     –   Debugging environment.
     –   Open Source (IMPORTANT!!)


   Disadvantages:
     –   UDFs methods could be a source of performance loss (the control relies on the user side).
     –   Overhead while compiling Pig Latin into map-reduce jobs.


   Usage Scenarios:
     –   Temporal analysis: search logs mainly involves studying how search query distribution changes
         over time.
     –   Session analysis: web user sessions, i.e, sequences of page views and clicks made by users are
         analized to calculate some metrics such:
     –   how long is the average user session?
     –   how many links does a user click on before leaving a website?
     –   Others, ...


                                                    22
Q&A




      23
FLATTERED




            24

Contenu connexe

Tendances

Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHanborq Inc.
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop appsBasant Verma
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Robert Treat
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14jijukjoseph
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in HadoopBotond Balázs
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Distributed Tracing, from internal SAAS insights
Distributed Tracing, from internal SAAS insightsDistributed Tracing, from internal SAAS insights
Distributed Tracing, from internal SAAS insightsHuy Do
 

Tendances (20)

Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop apps
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in Hadoop
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Distributed Tracing, from internal SAAS insights
Distributed Tracing, from internal SAAS insightsDistributed Tracing, from internal SAAS insights
Distributed Tracing, from internal SAAS insights
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 

En vedette

Everything as a Service
Everything as a ServiceEverything as a Service
Everything as a Servicejavicid
 
The fighting forties life in and around brighouse over 70 years ago
The fighting forties   life in and around brighouse over 70 years agoThe fighting forties   life in and around brighouse over 70 years ago
The fighting forties life in and around brighouse over 70 years agoChris Helme
 
Lightcliffe Cemetery and some of its residents - by Chris Helme
Lightcliffe Cemetery and some of its residents - by Chris HelmeLightcliffe Cemetery and some of its residents - by Chris Helme
Lightcliffe Cemetery and some of its residents - by Chris HelmeChris Helme
 
Welholme Park History Project
Welholme Park History Project   Welholme Park History Project
Welholme Park History Project Chris Helme
 
Brighouse and christmas past + 1
Brighouse and christmas past + 1Brighouse and christmas past + 1
Brighouse and christmas past + 1Chris Helme
 
The 1940s & 50s remembered
The  1940s & 50s rememberedThe  1940s & 50s remembered
The 1940s & 50s rememberedChris Helme
 

En vedette (6)

Everything as a Service
Everything as a ServiceEverything as a Service
Everything as a Service
 
The fighting forties life in and around brighouse over 70 years ago
The fighting forties   life in and around brighouse over 70 years agoThe fighting forties   life in and around brighouse over 70 years ago
The fighting forties life in and around brighouse over 70 years ago
 
Lightcliffe Cemetery and some of its residents - by Chris Helme
Lightcliffe Cemetery and some of its residents - by Chris HelmeLightcliffe Cemetery and some of its residents - by Chris Helme
Lightcliffe Cemetery and some of its residents - by Chris Helme
 
Welholme Park History Project
Welholme Park History Project   Welholme Park History Project
Welholme Park History Project
 
Brighouse and christmas past + 1
Brighouse and christmas past + 1Brighouse and christmas past + 1
Brighouse and christmas past + 1
 
The 1940s & 50s remembered
The  1940s & 50s rememberedThe  1940s & 50s remembered
The 1940s & 50s remembered
 

Similaire à FLATTERED - Pig Latin for Large Data Analysis

A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialRoger Rafanell Mas
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1Hassy Veldstra
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Research computing at ILRI
Research computing at ILRIResearch computing at ILRI
Research computing at ILRIILRI
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 

Similaire à FLATTERED - Pig Latin for Large Data Analysis (20)

Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Research computing at ILRI
Research computing at ILRIResearch computing at ILRI
Research computing at ILRI
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Scala+data
Scala+dataScala+data
Scala+data
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 

Dernier

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Dernier (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

FLATTERED - Pig Latin for Large Data Analysis

  • 1. EEDC 34330 Execution Apache Pig Environments for Distributed Computing Master in Computer Architecture, Networks and Systems - CANS Homework number: 3 Group number: EEDC-3 Group members: Javier Álvarez – javicid@gmail.com Francesc Lordan – francesc.lordan@gmail.com Roger Rafanell – rogerrafanell@gmail.com
  • 2. Outline 1.- Introduction 2.- Pig Latin 2.1.- Data Model 2.2.- Programming Model 3.- Implementation 4.- Conclusions 2
  • 3. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 1 Networks and Systems - CANS Introduction
  • 4. Why Apache Pig? Today’s Internet companies needs to process hugh data sets: – Parallel databases can be prohibitively expensive at this scale. – Programmers tend to find declarative languages such as SQL very unnatural. – Other approaches such map-reduce are low-level and rigid. 4
  • 5. What is Apache Pig? A platform for analyzing large data sets that: – It is based in Pig Latin which lies between declarative (SQL) and procedural (C++) programming languages. – At the same time, enables the construction of programs with an easy parallelizable structure. 5
  • 6. Which features does it have?  Dataflow Language – Data processing is expressed step-by-step.  Quick Start & Interoperability – Pig can work over any kind of input and produce any kind of output.  Nested Data Model – Pig works with complex types like tuples, bags, ...  User Defined Functions (UDFs) – Potentially in any programming language (only Java for the moment).  Only parallel – Pig Latin forces to use directives that are parallelizable in a direct way.  Debugging environment – Debugging at programming time. 6
  • 7. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 2 Networks and Systems - CANS Pig Latin
  • 8. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Section 2.1 Networks and Systems - CANS Data Model
  • 9. Data Model Very rich data model consisting on 4 simple data types:  Atom: Simple atomic value such as strings or numbers. ‘Alice’  Tuple: Sequence of fields of any type of data. (‘Alice’, ‘Apple’) (‘Alice’, (‘Barça’, ‘football’))  Bag: collection of tuples with possible duplicates. (‘Alice’, ‘Apple’) (‘Alice’, (‘Barça’, ‘football’))  Map: collection of data items with an associated key (always an atom). ‘Fan of’  (‘Apple’) (‘Barça’, ‘football’) ‘Age’  ’20’ 9
  • 10. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Section 2.2 Networks and Systems - CANS Programming
  • 11. Programming Model visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); visits: (‘Amy’, ‘cnn.com’, ‘8am’) (‘Amy’, ‘nytimes.com’, ‘9am’) (‘Bob’, ‘elmundotoday.com’, ’11am’) pages: (‘cnn.com’, ‘0.8’) (‘nytimes.com’, ‘0.6’) (‘elmundotoday’, ‘0.2’) 11
  • 12. Programming Model visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url v_p:(‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’) (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’) (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’) 12
  • 13. Programming Model visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user user: (‘Amy’, { (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’), (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)}) (‘Bob’, {‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)}) 13
  • 14. Programming Model visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr user: (‘Amy’, ‘0.7’) (‘Bob’, ‘0.2’) 14
  • 15. Programming Model visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr answer = FILTER useravg BY avgpr > ‘0.5’ answer: (‘Amy’, ‘0.7’) 15
  • 16. Programming Model Other relational operators: – STORE : exports data into a file. STORE var1_name INTO 'output.txt‘; – COGROUP : groups together tuples from diferent datasets. COGROUP var1_name BY field_id, var2_name BY field_id – UNION : computes the union of two variables. – CROSS : computes the cross product. – ORDER : sorts a data set by one or more fields. – DISTINCT : removes replicated tuples in a dataset. 16
  • 17. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 3 Networks and Systems - CANS Implementation
  • 18. Implementation: Highlights  Works on top of Hadoop ecosystem: – Current implementation uses Hadoop as execution platform.  On-the-fly compilation: – Pig translates the Pig Latin commands to Map and Reduce methods.  Lazy style language: – Pig try to pospone the data materialization (on disk writes) as much as possible. 18
  • 19. Implementation: Building the logical plan  Query parsing: – Pig interpreter parses the commands verifying that the input files and bags referenced are valid.  On-the-fly compilation: – Pig compiles the logical plan for that bag into physical plan (Map-Reduce statements) when the command cannot be more delayed and must be executed.  Lazy characteristics: – No processing are carried out when the logical plan are constructed. – Processing is triggered only when the user invokes STORE command on a bag. – Lazy style execution permits in-memory pipelining and other interesting optimizations. 5 19
  • 20. Implementation: Map-Reduce plan compilation  CO(GROUP): – Each command is compiled in a distinct map-reduce job with its own map and reduce functions. – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to multiple reduce instances.  LOAD: – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.  FILTER/FOREACH: – Automatic parallelism is given since for a map-reduce job several map and reduce instances are run in parallel.  ORDER (compiled in two map-reduce jobs): – First: Determine quantiles of the sort key – Second: Chops the job according the quantiles and performs a local sorting in the reduce phase resulting in a global sorted file. 20
  • 21. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 4 Networks and Systems - CANS Conclusions
  • 22. Conclusions  Advantages: – Step-by-step syntaxis. – Flexible: UDFs, not locked to a fixed schema (allows schema changes over the time). – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, … – Takes advantage of Hadoop native properties such: parallelism, load-balancing, fault-tolerance. – Debugging environment. – Open Source (IMPORTANT!!)  Disadvantages: – UDFs methods could be a source of performance loss (the control relies on the user side). – Overhead while compiling Pig Latin into map-reduce jobs.  Usage Scenarios: – Temporal analysis: search logs mainly involves studying how search query distribution changes over time. – Session analysis: web user sessions, i.e, sequences of page views and clicks made by users are analized to calculate some metrics such: – how long is the average user session? – how many links does a user click on before leaving a website? – Others, ... 22
  • 23. Q&A 23
  • 24. FLATTERED 24