Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Improving MySQL Performance with Hadoop
Sagar Jauhari, Manish Kumar
India: May 03 – May 04, 2012
San Francisco: September 30 – October 4, 2012

Program Agenda

●   Introduction
●   Inside Hadoop!
●   Integration with MySQL
●   Facebook's usage of MySQL & Hadoop
●   Twitter's usage of MySQL & Hadoop




Introduction
MySQL
   ●          12 million product installations
   ●          65,000 downloads each day
   ●          Part of the rapidly growing open source LAMP stack
   ●          MySQL Commercial Editions Available




Introduction
Hadoop
   ●          Highly scalable Distributed Framework
                 ○          Yahoo! has a 4000 node cluster!
   ●          Extremely powerful in terms of computation
                 ○          Sorts a TB of random integers in 62 seconds!




Introduction
Hadoop is ..
   ●          A scalable system for data storage and processing.
   ●          Fault tolerant
   ●          Parallelizes data processing across many nodes
   ●          Leverages its distributed file system (HDFS) to
              cheaply and reliably replicate chunks of data.
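
The replication idea in the last bullet can be sketched in a few lines of Python. This is a toy model, not HDFS code: the tiny block size, the node names, the replication factor of 3, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware).

```python
# Toy model of splitting a file into blocks and replicating each block.
# Assumptions (not from the slides): tiny block size, three replicas,
# round-robin placement. Real HDFS uses a rack-aware placement policy.


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block index to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Blocks that still have at least one replica on a live node."""
    return {b for b, holders in placement.items() if set(holders) - failed}


blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
```

With three replicas per block, every block in this toy cluster remains readable even if any two of the four nodes fail.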




Introduction
Who uses Hadoop?
 ● Yahoo: Ad systems and web search.
 ● Facebook: Reporting/analytics and machine learning.
 ● Twitter: Data warehousing, data analysis.
 ● Netflix: Movie recommendation algorithm uses Hive (which uses
   Hadoop, HDFS & MapReduce underneath).


Introduction
MySQL Vs Hadoop
                    MySQL                         Hadoop
Data capacity       TB+ (may require sharding)    PB+
Data per query      GB?                           PB+
Read/Write          Random read/write             Sequential scans, append-only
Query language      SQL                           Java MapReduce, scripting languages, Hive QL
Transactions        Yes                           No
Indexes             Yes                           No
Latency             Sub-second (hopefully)        Minutes to hours
Data structure      Structured                    Structured or unstructured

Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010

Inside Hadoop


                                                                       A shallow Deep Dive


Inside Hadoop
HDFS
    ●         A distributed, scalable, and portable file system
              written in Java
    ●         A Hadoop cluster typically has a single name node;
              a cluster of data nodes forms the HDFS cluster.

[Diagram: Name Node, HDFS, Map / Reduce Workers]



Inside Hadoop
HDFS
    ●         Uses the TCP/IP layer for communication
    ●         Stores large files across multiple machines
    ●         A single name node stores metadata in memory.

[Diagram: Name Node, HDFS, Map / Reduce Workers]



Inside Hadoop
HDFS




Inside Hadoop
Map Reduce
    ●         Design Goals
                  ○         Scalability
                  ○         Cost Efficiency
    ●         Implementation
                  ○         User Jobs are executed as 'map' and 'reduce' functions
                  ○         Work distribution and fault tolerance are managed by the framework


            Input -> Map -> Shuffle and sort -> Reduce -> Output




Inside Hadoop
Map Reduce
    ●         Map
                  ○         Map Reduce job splits input data into independent chunks
                  ○         Each chunk is processed by a map task, in parallel with
                            the other chunks
                  ○         Generic key-value computation
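
The three bullets above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the chunk size, the thread pool, and the example map function `length_of_record` are hypothetical, and Hadoop itself runs map tasks as separate processes spread over many nodes rather than as threads.

```python
# Sketch of the map phase: split the input into independent chunks and
# run the same map function on every chunk in parallel.
# Assumptions: chunk size, thread pool, and the example map function are
# illustrative only; Hadoop runs map tasks on many nodes.
from concurrent.futures import ThreadPoolExecutor


def split_input(records, chunk_size):
    """Split the records into independent chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_map_phase(chunks, map_fn, workers=4):
    """Apply map_fn to each record of each chunk; chunks run concurrently."""
    def map_task(chunk):
        pairs = []
        for record in chunk:
            pairs.extend(map_fn(record))
        return pairs

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in chunk order, so the output is deterministic.
        per_chunk = list(pool.map(map_task, chunks))
    return [pair for pairs in per_chunk for pair in pairs]


def length_of_record(record):
    """Example map function: emit one (record length, record) key-value pair."""
    return [(len(record), record)]


chunks = split_input(["Hadoop", "MySQL", "Hive", "Sqoop", "Pig"], chunk_size=2)
pairs = run_map_phase(chunks, length_of_record)
```

Because the map function is passed in as a parameter, the same driver works for any key-value computation.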




            Input -> Map -> Shuffle and sort -> Reduce -> Output




Inside Hadoop
Map Reduce
    ●         Reduce
                  ○         Data from data nodes is merge-sorted so that the key-value
                            pairs for a given key are contiguous
                  ○         The merged data is read sequentially and the values are
                            passed to the reduce method with an iterator that reads the
                            input file until the next key is encountered
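
A minimal Python sketch of this behaviour, assuming all intermediate pairs fit in memory (Hadoop merge-sorts files on disk instead). The function names and the example reduce function `total` are hypothetical.

```python
# Sketch of the shuffle-and-sort plus reduce steps.
# Assumption: everything fits in memory; Hadoop merge-sorts files on disk.
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Sort by key so that all pairs for a given key are contiguous."""
    return sorted(pairs, key=itemgetter(0))


def run_reduce_phase(sorted_pairs, reduce_fn):
    """Read the sorted pairs sequentially; call reduce_fn once per key."""
    results = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # `group` is an iterator that stops when the next key is reached.
        values = (value for _, value in group)
        results.append(reduce_fn(key, values))
    return results


def total(key, values):
    """Example reduce function (hypothetical): sum the values for a key."""
    return key, sum(values)


pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
reduced = run_reduce_phase(shuffle_and_sort(pairs), total)
```

The reduce function only ever sees an iterator over the values, so it does not need to hold all values for a key in memory at once.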



            Input -> Map -> Shuffle and sort -> Reduce -> Output




Inside Hadoop
Map Reduce
Input -> Map -> Shuffle and sort -> Reduce -> Output

Input (Word): Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop

Output:
     Word      Count
     Hadoop    2
     MySQL     1
     Hive      1
     Sqoop     1
     Pig       1


Inside Hadoop
How does Hadoop use MapReduce?
    ●         Framework consists of a single master JobTracker
              and one slave TaskTracker per cluster-node.
    ●         Master
                  ○         Schedules the jobs' component tasks on the slaves
                  ○         Monitors the jobs
                  ○         Re-executes the failed tasks
    ●         Slave
                  ○         Executes the tasks as directed by the master.
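
The master/slave division of labour can be illustrated with a toy scheduler in Python. This is not the JobTracker API: the function names, the retry limit, and the round-robin choice of worker are all assumptions made for the sketch.

```python
# Toy model of a master that schedules tasks on workers and re-executes failures.
# Assumptions: names, the retry limit, and the round-robin assignment are
# hypothetical; this is not the JobTracker API.


class TaskFailed(Exception):
    pass


def run_job(tasks, workers, execute, max_attempts=3):
    """Run every task; on failure, retry it on the next worker.

    `execute(worker, task)` returns the task result or raises TaskFailed.
    Returns (results, log), where log records every attempt.
    """
    results, log = {}, []
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[task] = execute(worker, task)
                log.append((task, worker, "ok"))
                break
            except TaskFailed:
                log.append((task, worker, "failed"))
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results, log


def flaky_execute(worker, task):
    """Example executor: worker 'w1' is broken and fails every task."""
    if worker == "w1":
        raise TaskFailed(task)
    return task.upper()


results, log = run_job(["t0", "t1"], ["w1", "w2"], flaky_execute)
```

The caller only sees the final results; the failed attempt on `w1` shows up in the log but not in the output.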



Inside Hadoop
Why Map Reduce ?
    ●         Language support
                  ○         Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R).
    ●         Scales horizontally
    ●         Programmer is isolated from individual failed tasks
                  ○         Tasks are restarted on another node




Inside Hadoop
Map Reduce Limitations
    ●         Not a good fit for problems that exhibit task-driven
              parallelism.
    ●         Requires a particular form of input: a set of (key,
              value) pairs.
    ●         A lot of MapReduce applications end up sharing data
              one way or another.



Integration with MySQL

Leveraging Hadoop to Improve MySQL Performance


Integration with MySQL

●     The benefits of MySQL to developers are the speed,
      reliability, data integrity and scalability it provides.
●     It can successfully process large amounts of data (in
      petabytes).
●     But applications that require massively parallel
      processing may need the benefits of a parallel
      processing system, such as Hadoop.



Integration with MySQL




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010



Integration with MySQL
Problem Statement: Word Count Problem
 ● In a large set of documents, find the number of occurrences
   of each word.




Integration with MySQL
Word count problem
Input -> Map -> Shuffle and sort -> Reduce -> Output

Input (Word): Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop

Output:
     Word      Count
     Hadoop    2
     MySQL     1
     Hive      1
     Sqoop     1
     Pig       1


Integration with MySQL
Mapping

Key and Value represent a row of data:
  key is the byte offset, value is a line.

Map(key, value)
  foreach (word in the value)
    output(word, 1)

Intermediate Output
  <word1>, 1
  <word2>, 1
  <word3>, 1
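
The map step for word count could look like this in Python. It is a sketch, not Hadoop API code: the helper `read_records` is hypothetical and only mimics the (byte offset, line) records a text input would supply, and words are assumed to be separated by whitespace.

```python
# Sketch of the word-count map function from the slide.
# Assumptions: words are separated by whitespace; read_records is a
# hypothetical helper that mimics (byte offset, line) input records.


def read_records(text):
    """Yield (byte offset, line) pairs, one per line of the input text."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def map_words(key, value):
    """Map(key, value): emit (word, 1) for each word in the line.

    The key (byte offset) is ignored, as in the slide's pseudocode.
    """
    for word in value.split():
        yield word, 1


text = "Hadoop MySQL\nHive Sqoop\nPig Hadoop\n"
records = list(read_records(text))
intermediate = [pair for key, value in records for pair in map_words(key, value)]
```

Note that the mapper emits `("Hadoop", 1)` twice; adding the two together is left to the reduce step.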

Integration with MySQL
Reducing

Hadoop aggregates the keys and calls reduce for each unique key:
  <word1>, (1,1,1,1,1,1…1)
  <word2>, (1,1,1)
  <word3>, (1,1,1,1,1,1)

Reduce(key, list)
  sum the list
  output(key, sum)

Final result:
  <word1>, 45823
  <word2>, 1204
  <word3>, 2693
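
The reduce step can be sketched in Python in the same spirit. The in-memory grouping with a dictionary is an assumption standing in for Hadoop's distributed shuffle and sort, and the function names are hypothetical. The input is the six-word example used earlier in the deck.

```python
# Sketch of the word-count reduce step from the slide.
# Assumption: grouping is done with an in-memory dict; Hadoop does this
# with a distributed shuffle and sort.
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values for each unique key: word -> [1, 1, ...]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """Reduce(key, list): sum the list and output (key, sum)."""
    return key, sum(values)


intermediate = [("Hadoop", 1), ("MySQL", 1), ("Hive", 1),
                ("Sqoop", 1), ("Pig", 1), ("Hadoop", 1)]
grouped = group_by_key(intermediate)
final = dict(reduce_counts(word, ones) for word, ones in grouped.items())
```

Here `final` maps Hadoop to 2 and every other word to 1, matching the output table of the word count example.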



Integration with MySQL

                                                                                       Demo




Integration with MySQL
Video




Facebook's usage of MySQL & Hadoop

● Facebook collects terabytes of data every day from around 800 million
  users.
● MySQL handles pretty much every user interaction: likes,
  shares, status updates, alerts, requests, etc.
● Hadoop/Hive Warehouse
  – 4800 cores, 2 petabytes (July 2009)
  – 4800 cores, 12 petabytes (Sept 2009)
● Hadoop Archival Store
  – 200 TB



Facebook's usage of MySQL & Hadoop
Hive
    ●         Data warehouse system for Hadoop.
    ●         Facilitates easy data summarization.
    ●         Hive translates HiveQL to MapReduce code.
    ●         Querying
                  ○         Provides a mechanism to project structure onto data stored in Hadoop
                  ○         Allows querying the data using a SQL-like language called HiveQL




Facebook's usage of MySQL & Hadoop




Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hive Vs SQL

                      RDBMS                            Hive
Language              SQL-92 standard (maybe)          Subset of SQL-92 plus Hive-specific extensions
Update capabilities   INSERT, UPDATE and DELETE        INSERT but not UPDATE or DELETE
Transactions          Yes                              No
Latency               Sub-second                       Minutes or more
Indexes               Any number of indexes,           No indexes, data is always scanned (in parallel)
                      very important for performance
Data size             TBs                              PBs
Data per query        GBs                              PBs

Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010


Hadoop Implementation
At Twitter
    ●         > 12 terabytes of new data per day!
    ●         Most stored data is LZO-compressed
    ●         Uses Scribe to write logs to Hadoop
                  ○         Scribe: a log collection framework created and
                            open-sourced by Facebook.
    ●         Hadoop is used for data warehousing and data analysis.




References

    ●         Leveraging Hadoop to Augment MySQL Deployments - Sarah
              Sproehnle, Cloudera
    ●         http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
    ●         http://semanticvoid.com
    ●         http://michael-noll.com
    ●         http://hadoop.apache.org/




Legal Disclaimer

    ●         All other products, company names, brand names,
              trademarks and logos are the property of their
              respective owners.




Thank You


GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Dernier (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Improving MySQL performance with Hadoop

  • 1. Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
  • 2. Improving MySQL Performance with Hadoop. Sagar Jauhari, Manish Kumar
  • 3. India: May 03 – May 04, 2012. San Francisco: September 30 – October 4, 2012.
  • 4. Program Agenda ● Introduction ● Inside Hadoop! ● Integration with MySQL ● Facebook's usage of MySQL & Hadoop ● Twitter's usage of MySQL & Hadoop
  • 5. Introduction: MySQL ● 12 million product installations ● 65,000 downloads each day ● Part of the rapidly growing open source LAMP stack ● MySQL Commercial Editions available
  • 6. Introduction: Hadoop ● Highly scalable distributed framework ○ Yahoo! has a 4000 node cluster! ● Extremely powerful in terms of computation ○ Sorts a TB of random integers in 62 seconds!
  • 7. Introduction: Hadoop is ... ● A scalable system for data storage and processing. ● Fault tolerant ● Parallelizes data processing across many nodes ● Leverages its distributed file system (HDFS)* to cheaply and reliably replicate chunks of data.
  • 8. Introduction: Who uses Hadoop? ● Yahoo: ■ Ad systems and web search. ● Facebook: ■ Reporting/analytics and machine learning. ● Twitter: ■ Data warehousing, data analysis. ● Netflix: ■ Movie recommendation algorithm uses Hive (which uses Hadoop, HDFS & MapReduce underneath)
  • 9. Introduction: MySQL vs. Hadoop
      Attribute         MySQL                         Hadoop
      Data capacity     TB+ (may require sharding)    PB+
      Data per query    GB?                           PB+
      Read/write        Random read/write             Sequential scans, append-only
      Query language    SQL                           Java MapReduce, scripting languages, HiveQL
      Transactions      Yes                           No
      Indexes           Yes                           No
      Latency           Sub-second (hopefully)        Minutes to hours
      Data structure    Structured                    Structured or unstructured
      Courtesy: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 10. Inside Hadoop: A Shallow Deep Dive
  • 11. Inside Hadoop: HDFS ● A distributed, scalable, and portable file system written in Java ● A Hadoop instance typically has a single name node; a cluster of data nodes forms the HDFS cluster. [Diagram: Name Node, Map / Reduce Workers]
  • 12. Inside Hadoop: HDFS ● Uses the TCP/IP layer for communication ● Stores large files across multiple machines ● A single name node stores metadata in memory. [Diagram: Name Node, HDFS, Map / Reduce Workers]
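The HDFS bullets above (large files spread over several machines, one name node holding the metadata in memory) can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyCluster`, `BLOCK_SIZE`, `REPLICATION`, and the round-robin placement are invented for illustration and are not HDFS APIs or its real placement policy.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 2   # assumed number of copies kept of each block


class ToyCluster:
    """Hypothetical in-process model of one name node plus several data nodes."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Name node: metadata only, held in memory.
        # Maps (filename, block index) -> names of the data nodes holding that block.
        self.block_locations = {}
        # Data nodes: each one stores the raw bytes of the blocks it was given.
        self.disks = {name: {} for name in self.node_names}

    def put(self, filename, data):
        """Split data into blocks and place each block on REPLICATION nodes."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for index, block in enumerate(blocks):
            # Round-robin placement (an assumption, not the real HDFS policy).
            targets = [self.node_names[(index + r) % len(self.node_names)]
                       for r in range(REPLICATION)]
            self.block_locations[(filename, index)] = targets
            for name in targets:
                self.disks[name][(filename, index)] = block
        return len(blocks)

    def read(self, filename, failed=()):
        """Reassemble a file, skipping replicas on nodes listed in failed."""
        parts = []
        index = 0
        while (filename, index) in self.block_locations:
            live = [name for name in self.block_locations[(filename, index)]
                    if name not in failed]
            if not live:
                raise IOError(f"block {index} of {filename} has no live replica")
            parts.append(self.disks[live[0]][(filename, index)])
            index += 1
        return b"".join(parts)
```

With three hypothetical nodes and two replicas per block, a read still succeeds when any single node is marked as failed, which is the property the replication bullet on the earlier slide describes.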
  • 13. Inside Hadoop: HDFS
  • 14. Inside Hadoop: Map Reduce ● Design Goals ○ Scalability ○ Cost Efficiency ● Implementation ○ User jobs are executed as 'map' and 'reduce' functions ○ Work distribution and fault tolerance are managed by the framework [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
  • 15. Inside Hadoop: Map Reduce ● Map ○ A Map Reduce job splits the input data into independent chunks ○ The chunks are processed by map tasks in parallel ○ Generic key-value computation [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
  • 16. Inside Hadoop: Map Reduce ● Reduce ○ Data from the data nodes is merge-sorted so that the key-value pairs for a given key are contiguous ○ The merged data is read sequentially, and the values for each key are passed to the reduce method through an iterator that reads the input until the next key is encountered [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
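The reduce-side behaviour described above (merge-sorting so that each key's pairs are contiguous, then handing the values to reduce through an iterator) can be sketched with the Python standard library. This is an illustration only, not Hadoop code: `merge_and_reduce` is a hypothetical helper, and each input run is assumed to be already sorted by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_and_reduce(sorted_runs, reduce_fn):
    """Merge sorted runs of (key, value) pairs and call reduce_fn once per key.

    sorted_runs: iterable of lists, each already sorted by key (assumption).
    reduce_fn:   callable(key, values_iterator) -> result; it should consume
                 the iterator before returning.
    """
    # Merge step: after this, all pairs with the same key are contiguous.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    # groupby reads sequentially and stops each group at the next key.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        yield key, reduce_fn(key, values)
```

For example, merging the runs `[("hadoop", 1), ("hive", 1)]` and `[("hadoop", 1), ("pig", 1)]` with a summing `reduce_fn` groups both `"hadoop"` pairs into a single call.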
  • 17. Inside Hadoop: Map Reduce [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
      Input words: Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop
      Output (word, count): Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1
  • 18. Inside Hadoop: How does Hadoop use MapReduce? ● The framework consists of a single master JobTracker and one slave TaskTracker per cluster node. ● Master ○ Schedules the jobs' component tasks on the slaves ○ Monitors the jobs ○ Re-executes the failed tasks ● Slave ○ Executes the tasks as directed by the master.
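The master's three duties listed above (schedule, monitor, re-execute failed tasks) can be sketched as a toy loop. Everything here is hypothetical and far simpler than the real JobTracker: `run_job`, the task callables, the worker names, and the "next worker in the list" retry rule are assumptions made for illustration.

```python
def run_job(tasks, workers, max_attempts=3):
    """Toy master: run every task, retrying a failed task on another worker.

    tasks:   dict of task id -> callable(worker_name) that returns a result
             or raises an exception to signal failure (assumed interface).
    workers: list of worker names.
    Returns (results, log) where log records every attempt.
    """
    results = {}
    log = []
    for index, (task_id, task) in enumerate(tasks.items()):
        for attempt in range(max_attempts):
            # Schedule: pick a worker; each retry moves on to the next one.
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[task_id] = task(worker)
            except Exception:
                # Monitor: record the failure, then loop to re-execute.
                log.append((task_id, worker, "failed"))
            else:
                log.append((task_id, worker, "done"))
                break
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, log
```

The job submitter only sees `results`; the failed attempt shows up in `log` but never surfaces as an error unless every attempt fails.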
  • 19. Inside Hadoop: Why Map Reduce? ● Language support ○ Java, PHP, Hive, Pig, Python, Wukong (Ruby), Rhipe (R). ● Scales horizontally ● Programmer is isolated from individual failed tasks ○ Tasks are restarted on another node
  • 20. Inside Hadoop: Map Reduce Limitations ● Not a good fit for problems that exhibit task-driven parallelism. ● Requires a particular form of input: a set of (key, value) pairs. ● A lot of MapReduce applications end up sharing data one way or another.
  • 21. Integration with MySQL: Leveraging Hadoop to Improve MySQL Performance
  • 22. Integration with MySQL ● The benefits of MySQL to developers are the speed, reliability, data integrity and scalability it provides. ● It can successfully process large amounts of data (in petabytes). ● But applications that require massively parallel processing may need the benefits of a parallel processing system, such as Hadoop.
  • 23. Integration with MySQL. Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 24. Integration with MySQL: Problem Statement ● Word Count Problem: in a large set of documents, find the number of occurrences of each word.
  • 25. Integration with MySQL: Word count problem [Diagram: Input → Map → Shuffle and sort → Reduce → Output]
      Input words: Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop
      Output (word, count): Hadoop 2, MySQL 1, Hive 1, Sqoop 1, Pig 1
  • 26. Integration with MySQL: Mapping ● Key and value represent a row of data: the key is the byte offset, the value is a line. ● Map (key, value): foreach (word in the value) output (word, 1) ● Intermediate output: <word1>, 1; <word2>, 1; <word3>, 1
  • 27. Integration with MySQL: Reducing ● Hadoop aggregates the keys and calls reduce for each unique key: <word1>, (1,1,1,1,1,1…1); <word2>, (1,1,1); <word3>, (1,1,1,1,1,1) ● Reduce (key, list): sum the list, output (key, sum) ● Final result: <word1>, 45823; <word2>, 1204; <word3>, 2693
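The mapping and reducing steps for the word count problem can be sketched end to end in plain Python. This is a single-process illustration of the data flow, not the demo shown in the talk and not Hadoop's Java API; the function names (`map_line`, `shuffle`, `reduce_word`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_line(offset, line):
    """Map: the key is the byte offset (unused here), the value is one line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate (word, 1) pairs so each word maps to its list of ones."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return dict(sorted(grouped.items()))


def reduce_word(word, counts):
    """Reduce: sum the list for one key and output (key, sum)."""
    return word, sum(counts)


def word_count(text):
    """Run map, shuffle and reduce over every line of text."""
    intermediate = []
    offset = 0
    for line in text.splitlines(keepends=True):
        intermediate.extend(map_line(offset, line))
        offset += len(line.encode("utf-8"))  # byte offset of the next line
    return dict(reduce_word(word, counts)
                for word, counts in shuffle(intermediate).items())
```

Feeding it the six words from the word count diagram (Hadoop, MySQL, Hive, Sqoop, Pig, Hadoop) yields a count of 2 for Hadoop and 1 for each of the others.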
  • 28. Integration with MySQL: Demo
  • 29. Integration with MySQL: Video
  • 30. Facebook's usage of MySQL & Hadoop ● Facebook collects terabytes of data every day from around 800 million users. ● MySQL handles pretty much every user interaction: likes, shares, status updates, alerts, requests, etc. ● Hadoop/Hive Warehouse ○ 4800 cores, 2 petabytes (July 2009) ○ 4800 cores, 12 petabytes (Sept 2009) ● Hadoop Archival Store ○ 200 TB
  • 31. Facebook's usage of MySQL & Hadoop: Hive ● Data warehouse system for Hadoop. ● Facilitates easy data summarization. ● Hive translates HiveQL to MapReduce code. ● Querying ○ Provides a mechanism to project structure onto this data ○ Allows querying the data using a SQL-like language called HiveQL
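Two of the Hive ideas above, projecting structure onto raw data and turning a SQL-like query into map and reduce steps, can be illustrated loosely in Python. This is a conceptual sketch, not how Hive actually compiles queries: the `visits` table, its columns, and the helpers `project` and `count_by` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical table definition: column names projected onto raw tab-separated lines.
VISITS_SCHEMA = ("user_id", "country", "page")


def project(line, schema=VISITS_SCHEMA):
    """Apply structure at read time: raw text line -> dict keyed by column name."""
    return dict(zip(schema, line.rstrip("\n").split("\t")))


def count_by(lines, column):
    """Roughly the data flow behind a query such as
        SELECT country, COUNT(*) FROM visits GROUP BY country
    where map emits (group key, 1) and reduce sums the ones per key.
    """
    grouped = defaultdict(list)
    for line in lines:                       # map phase
        row = project(line)
        grouped[row[column]].append(1)       # shuffle: collect by group key
    return {key: sum(ones) for key, ones in grouped.items()}  # reduce phase
```

The raw lines stay untyped text on disk; the schema only matters when a query reads them, which is why the same files can be queried under different projections.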
  • 32. Facebook's usage of MySQL & Hadoop. Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 33. Hive vs. SQL
      Attribute             RDBMS                                                   Hive
      Language              SQL-92 standard (maybe)                                 Subset of SQL-92 plus Hive-specific extensions
      Update capabilities   INSERT, UPDATE and DELETE                               INSERT but not UPDATE or DELETE
      Transactions          Yes                                                     No
      Latency               Sub-second                                              Minutes or more
      Indexes               Any number of indexes, very important for performance   No indexes, data is always scanned (in parallel)
      Data size             TBs                                                     PBs
      Data per query        GBs                                                     PBs
      Image Source: Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010
  • 34. Hadoop Implementation At Twitter ● > 12 terabytes of new data per day! ● Most stored data is LZO compressed ● Uses Scribe to write logs to Hadoop ○ Scribe: a log collection framework created and open-sourced by Facebook. ● Hadoop used for data warehousing, data analysis.
  • 35. References ● Leveraging Hadoop to Augment MySQL Deployments, Sarah Sproehnle, Cloudera, 2010 ● http://engineering.twitter.com/2010/04/hadoop-at-twitter.html ● http://semanticvoid.com ● http://michael-noll.com ● http://hadoop.apache.org/
  • 36. Legal Disclaimer ● All other products, company names, brand names, trademarks and logos are the property of their respective owners.
  • 38. Thank You