SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
How to Implement MapReduce Using




                              Presented By
                               Jamie Pitts

Thursday, December 9, 2010
A Problem Seeking A Solution

                  Given a corpus of html-stripped financial filings:
                        Identify and count unique subjects.


                                    Possible Solutions:
                    1. Use the unix toolbox.
                    2. Create a set of unwieldy perl scripts.
                    3. Solve it with MapReduce, perl, and Gearman!




Thursday, December 9, 2010
What is MapReduce?

                             A programming model for processing
                                and generating large datasets.

                                   The original 2004 paper by Dean and Ghemawat




Thursday, December 9, 2010
M.R. Execution Model

                       A large set of Input key/value pairs is reduced
                        into a smaller set of Output key/value pairs

                             1.   A Master process splits a corpus of documents into sets.
                                  Each is assigned to a separate Mapper worker.

                             2. A Mapper worker iterates over the contents of the split. A
                                Map function processes each. Keys are extracted and key/
                                value pairs are stored as intermediate data.

                             3.   When the mapping for a split is complete, the Master process
                                  assigns a Reducer worker to process the intermediate data
                                  for that split.

                             4. A Reducer worker iterates over the intermediate data. A
                                Reduce function accepts a key and a set of values and
                                returns a value. Unique key/value pairs are stored as
                                output data.



Thursday, December 9, 2010
M.R. Execution Model Diagram




Thursday, December 9, 2010
Applying MapReduce
                             Counting Unique Subjects in a Corpus of Financial Filings



                                                     Mapper                       Reducer




                                                Intermediate Data:             Output Data:
            Corpus of Filings,
           Split by Company ID                Redundant records of            Unique Records
                                              Subjects and Counts         of Subjects and Counts



Thursday, December 9, 2010
What is Gearman?

                             A protocol and server for distributing work
                                       across multiple servers.

                                      Job Server implemented in pure
                                      perl and in C

                                      Client and Workers can be
                                      implemented in any language

                                      Not tied to any particular
                                      processing model

                                      No single point of failure


                               Current Maintainers: Brian Aker and Eric Day


Thursday, December 9, 2010
Gearman Architecture

                             Code that Organizes   Job Server is the intermediary
                                   the Work        between Clients and Workers

                                                   Clients organize the work

                                                   Workers define and register
                                                   work functions with Job Servers.

                                                   Servers are aware of all available
                                                   Workers.

                                                   Clients and Workers are aware
                                                   of all available Servers.

                                                   Multiple Servers can be run on
                                                   one machine.
                       Code that Does the Work



Thursday, December 9, 2010
Perl and Gearman
                                Use Gearman::XS to Create a Pipeline Process




                             http://search.cpan.org/~dschoen/Gearman-XS/



Thursday, December 9, 2010
Perl and Gearman
                                         Client-Ser ver-Worker Architecture


                                               Gearman Job Servers




                              Each computer has two Gearman Job Servers running:
                      one each for handing mapping and reducing jobs to available workers.




                             Gearman::XS::Client                   Gearman::XS::Worker
                                                             Mappers           Reducers
              Split > Do Mapping > Do Reducing


                The Master uses Gearman clients to            Mapper and Reducer workers are
                  control the MapReduce pipeline.            perl processes, ready to accept jobs.



Thursday, December 9, 2010
Example
                             Processing Filings from Two Companies




                                   Mapper                     Reducer


                                             Mapper                     Reducer


             Filings from
            Two Companies




                                   Intermediate Data:            Output Data:
                                  Redundant records of          Unique Records
                                  Subjects and Counts       of Subjects and Counts



Thursday, December 9, 2010
Example: The Master Process
                              Setting Up the Mapper and Reducer Gearman Clients


                                                                     Connect the mapper client to
                                                                     the gearman server listening
                                                                     on 4730. More than one can
                                                                     be defined here.

                                                                     Connect the reducer client to
                                                                     the gearman server listening
                                                                     on 4731. More than one can
                                                                     be defined here.

                                                                     A unique reducer id is
                                                                     generated for each reducer
                                                                     client. This is used to
                                                                     identify the output.




Thursday, December 9, 2010
Example: The Master Process
                                  Submitting a Mapper Job For Each Split


                                                                    The company IDs are pushed
                                                                    into an array of splits. A jobs
                                                                    hash for both mappers and
                                                                    reducers is defined.

                                                                    A background mapper job for
                                                                    each split is submitted to
                                                                    gearman.

                                                                    The job handle is a unique ID
                                                                    generated by gearman when
                                                                    the job is submitted. This is
                                                                    used to identify the job in the
                                                                    jobs hash.

                                                                    The split is passed to the job as
                                                                    frozen workload data.




Thursday, December 9, 2010
Example: The Master Process
                      Job Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done


                                                                     Mapper and Reducer jobs are
                                                                     monitored inside of a loop.

                                                                     The job status is queried from
                                                                     the client using the job
                                                                     handle.

                                                                     If the job is completed and if it
                                                                     is a mapper job, a reducer is
                                                                     now run (detailed in the next
                                                                     slide).

                                                                     Completed jobs are deleted
                                                                     from the jobs hash.




Thursday, December 9, 2010
Example: The Master Process
                                  Detail on Submitting the Reducer Job


                                                                   A background reducer job is
                                                                   submitted to gearman.

                                                                   The job handle is a unique ID
                                                                   generated by gearman when
                                                                   the job is submitted. This is
                                                                   used to identify the job in the
                                                                   jobs hash.

                                                                   The split and reducer_id is
                                                                   passed to the job as frozen
                                                                   workload data.




Thursday, December 9, 2010
Example
                               The Result of Processing Filings from Two Companies




                                           Mapper                      Reducer


                                                          Mapper                         Reducer


         Our AdWords agreements are
         generally terminable at any
         time by our advertisers.
         Google AdSense is the program
         through which we distribute
         our advertisers' AdWords ads
         for display on the web sites
         of our Google Network
         members. Our AdSense program
         includes AdSense for search
         and AdSense for content.
         AdSense for search is our
         service for distributing
         relevant ads from our             1   AdWords
         advertisers for display with      1   Google Adsense
         search results on our Google      1   Adwords
         Network members' sites.           1   Google Network
                                           1   AdSense                  1     Ad Words
               ... Filings Text ...        1   AdSense                  14    AdMob
                                           1   AdSense                  299   AdSense
                                           1   Google Network           152   AdWords
                                           ... Redundant Counts ...     ... Aggregated Counts ...




Thursday, December 9, 2010
That’s It For Now
                             To Be Continued...



                                   Presented By
                                    Jamie Pitts

Thursday, December 9, 2010
Credits

                             Diagrams and other content were used
                                  from the following sources:

                             MapReduce: Simplified Data Processing on Large Clusters
                                      Jeffrey Dean and Sanjay Ghemawat

                                         Gearman (http://gearman.org)
                                           Brian Aker and Eric Day




Thursday, December 9, 2010

Contenu connexe

Tendances

Scalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingScalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingDigitalPreservationEurope
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developerVitor Tomaz
 
Lightspeed Preprint
Lightspeed PreprintLightspeed Preprint
Lightspeed Preprintjustanimate
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...kcitp
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Preventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenPreventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenGiuseppe Maxia
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
Solving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenSolving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenGiuseppe Maxia
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
What's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseWhat's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseDavid Bosschaert
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution ISSGC Summer School
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintijccsa
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mappingbasisspace
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Futurekarl.barnes
 

Tendances (19)

H04502048051
H04502048051H04502048051
H04502048051
 
Scalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross KingScalable Services For Digital Preservation Ross King
Scalable Services For Digital Preservation Ross King
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
[AzurePT] Desenvolvimento para o Windows Azure: Diferença para o developer
 
Lightspeed Preprint
Lightspeed PreprintLightspeed Preprint
Lightspeed Preprint
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Preventing multi master conflicts with tungsten
Preventing multi master conflicts with tungstenPreventing multi master conflicts with tungsten
Preventing multi master conflicts with tungsten
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Solving MySQL replication problems with Tungsten
Solving MySQL replication problems with TungstenSolving MySQL replication problems with Tungsten
Solving MySQL replication problems with Tungsten
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
What's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise ReleaseWhat's new in the OSGi 4.2 Enterprise Release
What's new in the OSGi 4.2 Enterprise Release
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
Hadoop at JavaZone 2010
Hadoop at JavaZone 2010Hadoop at JavaZone 2010
Hadoop at JavaZone 2010
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mapping
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Future
 

En vedette

Gearman Introduction
Gearman IntroductionGearman Introduction
Gearman IntroductionGreen Wang
 
Scalability In PHP
Scalability In PHPScalability In PHP
Scalability In PHPIan Selby
 
Distributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanDistributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanIssac Goldstand
 
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
Scale like an ant, distribute the workload - DPC, Amsterdam,  2011Scale like an ant, distribute the workload - DPC, Amsterdam,  2011
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011Helgi Þormar Þorbjörnsson
 
Gearman work queue in php
Gearman work queue in phpGearman work queue in php
Gearman work queue in phpBo-Yi Wu
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 

En vedette (9)

Gearman Introduction
Gearman IntroductionGearman Introduction
Gearman Introduction
 
Gearman For Beginners
Gearman For BeginnersGearman For Beginners
Gearman For Beginners
 
Scalability In PHP
Scalability In PHPScalability In PHP
Scalability In PHP
 
Gearman and Perl
Gearman and PerlGearman and Perl
Gearman and Perl
 
Distributed Applications with Perl & Gearman
Distributed Applications with Perl & GearmanDistributed Applications with Perl & Gearman
Distributed Applications with Perl & Gearman
 
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
Scale like an ant, distribute the workload - DPC, Amsterdam,  2011Scale like an ant, distribute the workload - DPC, Amsterdam,  2011
Scale like an ant, distribute the workload - DPC, Amsterdam, 2011
 
Gearman work queue in php
Gearman work queue in phpGearman work queue in php
Gearman work queue in php
 
MapReduce入門
MapReduce入門MapReduce入門
MapReduce入門
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 

Similaire à MapReduce Using Perl and Gearman

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
Java Abs Grid Information Retrival System
Java Abs   Grid Information Retrival SystemJava Abs   Grid Information Retrival System
Java Abs Grid Information Retrival Systemncct
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data OverviewPrabhu Thukkaram
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftLee Stott
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? DataWorks Summit
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Where to Deploy Hadoop: Bare-metal or Cloud?
Where to Deploy Hadoop:  Bare-metal or Cloud?Where to Deploy Hadoop:  Bare-metal or Cloud?
Where to Deploy Hadoop: Bare-metal or Cloud?Mike Wendt
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Building Applications on YARN
Building Applications on YARNBuilding Applications on YARN
Building Applications on YARNChris Riccomini
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 

Similaire à MapReduce Using Perl and Gearman (20)

Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Java Abs Grid Information Retrival System
Java Abs   Grid Information Retrival SystemJava Abs   Grid Information Retrival System
Java Abs Grid Information Retrival System
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Where to Deploy Hadoop: Bare-metal or Cloud?
Where to Deploy Hadoop:  Bare-metal or Cloud?Where to Deploy Hadoop:  Bare-metal or Cloud?
Where to Deploy Hadoop: Bare-metal or Cloud?
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Hadoop
HadoopHadoop
Hadoop
 
Building Applications on YARN
Building Applications on YARNBuilding Applications on YARN
Building Applications on YARN
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 

Dernier

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Dernier (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

MapReduce Using Perl and Gearman

  • 1. How to Implement MapReduce Using Presented By Jamie Pitts Thursday, December 9, 2010
  • 2. A Problem Seeking A Solution Given a corpus of html-stripped financial filings: Identify and count unique subjects. Possible Solutions: 1. Use the unix toolbox. 2. Create a set of unwieldy perl scripts. 3. Solve it with MapReduce, perl, and Gearman! Thursday, December 9, 2010
  • 3. What is MapReduce? A programming model for processing and generating large datasets. The original 2004 paper by Dean and Ghemawat Thursday, December 9, 2010
  • 4. M.R. Execution Model A large set of Input key/value pairs is reduced into a smaller set of Output key/value pairs 1. A Master process splits a corpus of documents into sets. Each is assigned to a separate Mapper worker. 2. A Mapper worker iterates over the contents of the split. A Map function processes each. Keys are extracted and key/ value pairs are stored as intermediate data. 3. When the mapping for a split is complete, the Master process assigns a Reducer worker to process the intermediate data for that split. 4. A Reducer worker iterates over the intermediate data. A Reduce function accepts a key and a set of values and returns a value. Unique key/value pairs are stored as output data. Thursday, December 9, 2010
  • 5. M.R. Execution Model Diagram Thursday, December 9, 2010
  • 6. Applying MapReduce Counting Unique Subjects in a Corpus of Financial Filings Mapper Reducer Intermediate Data: Output Data: Corpus of Filings, Split by Company ID Redundant records of Unique Records Subjects and Counts of Subjects and Counts Thursday, December 9, 2010
  • 7. What is Gearman? A protocol and server for distributing work across multiple servers. Job Server implemented in pure perl and in C Client and Workers can be implemented in any language Not tied to any particular processing model No single point of failure Current Maintainers: Brian Aker and Eric Day Thursday, December 9, 2010
  • 8. Gearman Architecture Code that Organizes Job Server is the intermediary the Work between Clients and Workers Clients organize the work Workers define and register work functions with Job Servers. Servers are aware of all available Workers. Clients and Workers are aware of all available Servers. Multiple Servers can be run on one machine. Code that Does the Work Thursday, December 9, 2010
  • 9. Perl and Gearman Use Gearman::XS to Create a Pipeline Process http://search.cpan.org/~dschoen/Gearman-XS/ Thursday, December 9, 2010
  • 10. Perl and Gearman Client-Ser ver-Worker Architecture Gearman Job Servers Each computer has two Gearman Job Servers running: one each for handing mapping and reducing jobs to available workers. Gearman::XS::Client Gearman::XS::Worker Mappers Reducers Split > Do Mapping > Do Reducing The Master uses Gearman clients to Mapper and Reducer workers are control the MapReduce pipeline. perl processes, ready to accept jobs. Thursday, December 9, 2010
  • 11. Example Processing Filings from Two Companies Mapper Reducer Mapper Reducer Filings from Two Companies Intermediate Data: Output Data: Redundant records of Unique Records Subjects and Counts of Subjects and Counts Thursday, December 9, 2010
  • 12. Example: The Master Process Setting Up the Mapper and Reducer Gearman Clients Connect the mapper client to the gearman server listening on 4730. More than one can be defined here. Connect the reducer client to the gearman server listening on 4731. More than one can be defined here. A unique reducer id is generated for each reducer client. This is used to identify the output. Thursday, December 9, 2010
  • 13. Example: The Master Process Submitting a Mapper Job For Each Split The company IDs are pushed into an array of splits. A jobs hash for both mappers and reducers is defined. A background mapper job for each split is submitted to gearman. The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash. The split is passed to the job as frozen workload data. Thursday, December 9, 2010
  • 14. Example: The Master Process Job Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done Mapper and Reducer jobs are monitored inside of a loop. The job status is queried from the client using the job handle. If the job is completed and if it is a mapper job, a reducer is now run (detailed in the next slide). Completed jobs are deleted from the jobs hash. Thursday, December 9, 2010
  • 15. Example: The Master Process Detail on Submitting the Reducer Job A background reducer job is submitted to gearman. The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash. The split and reducer_id is passed to the job as frozen workload data. Thursday, December 9, 2010
  • 16. Example The Result of Processing Filings from Two Companies Mapper Reducer Mapper Reducer Our AdWords agreements are generally terminable at any time by our advertisers. Google AdSense is the program through which we distribute our advertisers' AdWords ads for display on the web sites of our Google Network members. Our AdSense program includes AdSense for search and AdSense for content. AdSense for search is our service for distributing relevant ads from our 1 AdWords advertisers for display with 1 Google Adsense search results on our Google 1 Adwords Network members' sites. 1 Google Network 1 AdSense 1 Ad Words ... Filings Text ... 1 AdSense 14 AdMob 1 AdSense 299 AdSense 1 Google Network 152 AdWords ... Redundant Counts ... ... Aggregated Counts ... Thursday, December 9, 2010
  • 17. That’s It For Now To Be Continued... Presented By Jamie Pitts Thursday, December 9, 2010
  • 18. Credits Diagrams and other content were used from the following sources: MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Gearman (http://gearman.org) Brian Aker and Eric Day Thursday, December 9, 2010