Hadoop User Group
                           Ashish Thusoo
                            Jan 16, 2013




Qubole Inc., Proprietary
About Me


             •        Big Data Veteran

             •      Ran the data infrastructure team at Facebook
                   before starting Qubole

             •        Co-created Hive in 2007 @ Facebook



What is Qubole?
            •      A comprehensive cloud data platform based
                  on Hadoop and Hive for data in the cloud

                 -         Turnkey Infrastructure


                 -         Cloud Optimized Stack


                 -         Open Data Formats


            •      Useful for exploring data and creating batch
                  processing applications/data pipelines
Why Qubole?
     [Diagram, labeled BOTTLENECK: Heterogeneous Data (Structured & Unstructured) ->
      The Intermediaries (Data Scientists and Engineers) ->
      End Users (User Ops, Product Managers, etc.)]



Qubole Service

     [Architecture diagram; labels shown:]
     Cloud Sources: Logs, Events, DBs, Metrics
     Connectors
     Cloud Data Platform (Elastic . Robust . Fast)
     Big Data Technology Stack on EC2 / S3
     Cloud Data Service: Explore, Schedule, SDK
     Interfaces: API, ODBC
     Data Marts
Cloud vs Bare Metal


            •        Dynamic vs Fixed Provisioning

            •        Separation between Compute and Storage

            •        Purchasing and Budgeting




Dynamic Provisioning



            •        Advantage: Transient Clusters

            •        Burden: How big of a cluster do I need?

            •        Solution: Auto-scaled Hadoop



Challenges: Auto-scaled Hadoop
            http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/




            •        Adapting to Burstiness

                 -       Current load alone is not enough; future load must
                       also be predicted

            •        Adapting Statefully

                 -       Removing HDFS nodes is risky without
                       decommissioning

Implementation: Auto-scaled Hadoop
            http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/




            •       TaskTrackers report their launch times to the
                  JobTracker (JT)

            •       The JT computes the amount of time required to
                  finish existing workloads

            •      If that time is above a certain threshold,
                  more nodes are added

            •        At hourly boundaries the nodes are removed
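
The scale-up and scale-down rules on this slide can be sketched roughly as follows. This is only an illustration, not Qubole's implementation: the `ClusterState` fields, the way finish time is estimated, and the five-minute window before each hourly boundary are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    # All fields are hypothetical inputs the JobTracker is assumed to know.
    pending_task_seconds: float  # estimated task time still to be run
    slots_per_node: int          # concurrent tasks one node can run
    nodes: int                   # current cluster size


def estimated_finish_seconds(state: ClusterState) -> float:
    """Time to drain the existing workload if the cluster size stays fixed."""
    return state.pending_task_seconds / (state.nodes * state.slots_per_node)


def nodes_to_add(state: ClusterState, threshold_seconds: float, max_nodes: int) -> int:
    """Return how many nodes to add so the workload fits under the threshold."""
    if estimated_finish_seconds(state) <= threshold_seconds:
        return 0
    needed = math.ceil(
        state.pending_task_seconds / (threshold_seconds * state.slots_per_node)
    )
    return max(0, min(needed, max_nodes) - state.nodes)


def near_hourly_boundary(launch_time: float, now: float, window_seconds: int = 300) -> bool:
    """True if a node is within `window_seconds` of completing an hour since launch."""
    elapsed_in_hour = (now - launch_time) % 3600
    return elapsed_in_hour >= 3600 - window_seconds
```

Reporting launch times is what makes the last check possible: a node is only considered for removal when it is close to the end of an hour counted from its own launch.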
Implementation: Auto-scaled Hadoop
       http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/




      •       Restrictions on Deleting Nodes:

          -      Nodes Containing Task Outputs of Current Jobs


          -      Fast Decommissioning Done for Data Nodes


          -      Minimum Cluster Size Threshold


      •               Fast Decommissioning - possible because
                  HDFS is a cache for us
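
A minimal sketch of how these restrictions could be combined into one check. The `Node` fields and the selection order are assumptions for illustration; the slides do not describe the actual bookkeeping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    # Hypothetical per-node flags; real clusters would derive these from
    # JobTracker and NameNode state.
    name: str
    holds_current_task_output: bool  # map output still needed by a running job
    decommissioned: bool             # DataNode decommissioning has finished


def removable_nodes(nodes: List[Node], min_cluster_size: int) -> List[Node]:
    """Pick nodes that are safe to delete without shrinking below the minimum."""
    budget = max(0, len(nodes) - min_cluster_size)
    candidates = [
        n for n in nodes
        if not n.holds_current_task_output and n.decommissioned
    ]
    return candidates[:budget]
```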
Compute & Storage on the Cloud (EC2/S3)

            •      On the cloud Compute and Storage are
                  Separate!!

            •        Advantage: Don’t Pay for CPU for Storing Data

            •      Burden: Separation Can Cause Slowness &
                  Variability

            •        Solutions:

                 -         Caching File System

Caching File System
                       http://www.qubole.com/blog/index.php/columnar-cloud-cache/








         •        Benefits:
            -      Masks the performance variance associated with S3 while
                  reading data

            -      Columnar caching on the fly enables data to be persisted in
                  open formats while still giving the benefits of performance
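
The idea of a read-through cache that converts data to columns on first access can be sketched as below. This is a toy model under stated assumptions, not the design from the linked post: `fetch_rows` stands in for a slow remote read, rows are plain dictionaries that all share the same keys, and the cache lives in memory.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class ColumnarReadThroughCache:
    """Caches remote row data locally, reorganized by column."""

    def __init__(self, fetch_rows: Callable[[str], List[Row]]):
        self._fetch_rows = fetch_rows  # hypothetical slow remote reader
        self._columns: Dict[Tuple[str, str], List[object]] = {}
        self.remote_reads = 0

    def read_column(self, path: str, column: str) -> List[object]:
        key = (path, column)
        if key not in self._columns:
            # Cache miss: read the file once from the remote store and
            # store every column, so later reads avoid the remote store.
            rows = self._fetch_rows(path)
            self.remote_reads += 1
            names = rows[0].keys() if rows else []
            for name in names:
                self._columns[(path, name)] = [row[name] for row in rows]
        return self._columns.get(key, [])
```

The source data stays in its original format in the remote store; only the cached copy is columnar.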



Masking S3 Latency
             http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/




       •      File Operations in S3 are much slower than in
             HDFS

       •      Problem: This leads to bad performance when
             data is spread across many files

       •      Solution:
         -     Fast Split Generation Algorithm

         -           Pipelined File Opens
Faster Split Generation
                http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/




         •      Directory operations with merging instead of
               per-file metadata (up to 8x speedup)
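
To illustrate why a directory-level listing helps, the sketch below counts remote calls against a fake object store. `FakeObjectStore` and its methods are invented for the example, and the "merging" step mentioned on the slide is not modeled because the slide does not specify it.

```python
from typing import Dict, List, Tuple

Split = Tuple[str, int, int]  # (key, offset, length)


class FakeObjectStore:
    """In-memory stand-in for a remote store; counts every remote call."""

    def __init__(self, objects: Dict[str, int]):
        self.objects = dict(objects)  # key -> size in bytes
        self.calls = 0

    def head(self, key: str) -> int:
        self.calls += 1
        return self.objects[key]

    def list_prefix(self, prefix: str) -> List[Tuple[str, int]]:
        self.calls += 1
        return sorted((k, s) for k, s in self.objects.items() if k.startswith(prefix))


def make_splits(key: str, size: int, split_size: int) -> List[Split]:
    return [(key, off, min(split_size, size - off)) for off in range(0, size, split_size)]


def splits_per_file(store: FakeObjectStore, keys: List[str], split_size: int) -> List[Split]:
    """One metadata call per file."""
    splits: List[Split] = []
    for key in keys:
        splits.extend(make_splits(key, store.head(key), split_size))
    return splits


def splits_from_listing(store: FakeObjectStore, prefix: str, split_size: int) -> List[Split]:
    """One listing call returns keys and sizes for the whole prefix."""
    splits: List[Split] = []
    for key, size in store.list_prefix(prefix):
        splits.extend(make_splits(key, size, split_size))
    return splits
```

Both functions produce the same splits; the difference is the number of round trips, which grows with the file count only in the per-file version.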




Pipelined File Opens
                http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/




         •      Open S3 files before they are read (30%
               improvement in simple queries)
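
One way to picture pipelined opens is to start opening the next files in background threads while the current file is being read. The sketch below uses Python's standard `ThreadPoolExecutor`; the `open_file` and `read_handle` callables and the `lookahead` depth are assumptions for illustration, not details from the talk.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def read_pipelined(
    paths: Iterable[str],
    open_file: Callable[[str], object],
    read_handle: Callable[[object], object],
    lookahead: int = 2,
) -> List[object]:
    """Read files in order while up to `lookahead` opens run ahead of the reader."""
    todo = list(paths)
    depth = max(1, lookahead)
    results: List[object] = []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        next_index = 0
        # Prime the pipeline with the first few opens.
        while next_index < len(todo) and len(pending) < depth:
            pending.append(pool.submit(open_file, todo[next_index]))
            next_index += 1
        while pending:
            handle = pending.popleft().result()
            # Start the next open before reading, so the open latency
            # overlaps with the read of the current file.
            if next_index < len(todo):
                pending.append(pool.submit(open_file, todo[next_index]))
                next_index += 1
            results.append(read_handle(handle))
    return results
```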




Purchasing Instances

       •      Buying Instances at Spot Prices vs On-Demand
             Prices

       •      Benefits: Cheaper on average by 50-60%

       •      Problems: Spot instances are not guaranteed
             and can be taken away at any time
         -     Bad for MapReduce

         -           Disastrous for HDFS
Spotted Hadoop Clusters
          http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/




      •      Simplified Spot Bidding Strategy
       -       Configuring Bidding Timeouts

       -       Configuring % of instances through spot

       -       Configuring bid prices

      •      Spot Instance Aware HDFS Block Placement
       -             Ensures One Replica of Each Block Resides on On-Demand
                  Nodes
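
The two ideas on this slide can be sketched together: splitting a cluster request by a configured spot percentage, and choosing replica locations so that one copy always lands on an on-demand node. The function names, the `DataNode` type, and the random selection are assumptions for illustration; this is not HDFS's actual block placement policy API.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DataNode:
    name: str
    is_spot: bool


def split_request(total_nodes: int, spot_percent: int) -> Tuple[int, int]:
    """Return (on_demand_count, spot_count) for a configured spot percentage."""
    spot = math.floor(total_nodes * spot_percent / 100)
    return total_nodes - spot, spot


def place_replicas(
    nodes: List[DataNode],
    replication: int = 3,
    rng: Optional[random.Random] = None,
) -> List[DataNode]:
    """Choose distinct nodes for a block, with the first replica on on-demand."""
    rng = rng or random.Random()
    on_demand = [n for n in nodes if not n.is_spot]
    if not on_demand:
        raise ValueError("at least one on-demand node is required")
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the node count")
    first = rng.choice(on_demand)  # the replica that survives spot loss
    others = [n for n in nodes if n != first]
    return [first] + rng.sample(others, replication - 1)
```

With this placement, losing every spot node at once still leaves one copy of each block on the on-demand nodes.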
Conclusion


            •        Cloud is Different from Bare Metal

            •       Check out more optimizations that we have
                  made to run Hadoop and Hive optimally in the
                  cloud at our blog
                           http://www.qubole.com/blog/


Thank you.
         Free Sign up for Qubole at https://api.qubole.com/users/sign_up
                    Careers at http://www.qubole.com/careers




Qubole Inc., Proprietary

Contenu connexe

Tendances

Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinDmitriy Ryaboy
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And HadoopEdureka!
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with ExamplesJoe McTee
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 

Tendances (20)

Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant Twin
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 

Similaire à Hadoop User Group Ashish Thusoo Jan 16 2013

Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
Consolidate and prepare for cloud efficiencies
Consolidate and prepare for cloud efficienciesConsolidate and prepare for cloud efficiencies
Consolidate and prepare for cloud efficienciesDLT Solutions
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesDataWorks Summit
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureRiccardo Romani
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopDataWorks Summit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 

Similaire à Hadoop User Group Ashish Thusoo Jan 16 2013 (20)

Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Consolidate and prepare for cloud efficiencies
Consolidate and prepare for cloud efficienciesConsolidate and prepare for cloud efficiencies
Consolidate and prepare for cloud efficiencies
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 

Plus de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Plus de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Hadoop User Group Ashish Thusoo Jan 16 2013

  • 1. Hadoop User Group Ashish Thusoo Jan 16, 2013 Qubole Inc., Proprietary
  • 2. About Me
      • Big Data Veteran
      • Ran the data infrastructure team at Facebook before starting Qubole
      • Co-created Hive in 2007 @ Facebook
  • 3. What is Qubole?
      • A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
          - Turnkey Infrastructure
          - Cloud Optimized Stack
          - Open Data Formats
      • Useful for exploring data and creating batch processing applications/data pipelines
  • 4. Why Qubole?
      [Diagram] Heterogeneous Data (Structured & Unstructured) → The Intermediaries (Data Scientists and Engineers), labeled BOTTLENECK → End Users (User Ops, Product Managers etc.)
  • 5. Qubole Service
      [Diagram labels]
      • Cloud Data Service: Explore, Schedule, SDK
      • API, ODBC
      • Sources: Logs, Events, DBs, Metrics
      • Connectors
      • Cloud Data Platform: Elastic . Robust . Fast
      • Big Data Technology Stack
      • EC2 / S3 Cloud
      • Data Marts
  • 6. Cloud vs Bare Metal
      • Dynamic vs Fixed Provisioning
      • Separation between Compute and Storage
      • Purchasing and Budgeting
  • 7. Dynamic Provisioning
      • Advantage: Transient Clusters
      • Burden: How big of a cluster do I need?
      • Solution: Auto-scaled Hadoop
  • 8. Challenges: Auto-scaled Hadoop
      http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
      • Adapting to Burstiness
          - Current load is not enough; also need to predict future load
      • Adapting Statefully
          - Removing HDFS nodes is risky without decommissioning
  • 9. Implementation: Auto-scaled Hadoop
      http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
      • TaskTrackers report their launch times to the JobTracker
      • JT computes the amount of time required to finish existing workloads
      • If the time is above a certain threshold, then more nodes are added
      • At hourly boundaries the nodes are removed
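
The scale-up rule on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not Qubole's implementation: the function names, the slots-per-node model, and the `max_nodes` cap are assumptions made for the example, and the completion estimate simply divides the remaining task time by the number of available slots.

```python
import math


def estimated_completion_secs(remaining_task_secs, total_slots):
    """Estimate how long the current workload needs with the slots available."""
    if total_slots <= 0:
        return math.inf
    return remaining_task_secs / total_slots


def nodes_to_add(remaining_task_secs, nodes, slots_per_node,
                 threshold_secs, max_nodes):
    """Return how many nodes to add so the estimate meets the threshold.

    All parameters are hypothetical; a real JobTracker would derive
    remaining_task_secs from task progress reports.
    """
    current = estimated_completion_secs(remaining_task_secs,
                                        nodes * slots_per_node)
    if current <= threshold_secs:
        return 0  # existing nodes finish in time, no scale-up
    # Smallest cluster size whose estimate is at or below the threshold.
    needed = math.ceil(remaining_task_secs / (threshold_secs * slots_per_node))
    return max(0, min(needed, max_nodes) - nodes)
```

For example, with 36,000 task-seconds outstanding, 5 nodes of 2 slots each, and a 30-minute threshold, the sketch asks for 5 more nodes. It reacts only to current load; the burstiness concern from the previous slide (predicting future load) is not modeled.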
  • 10. Implementation: Auto-scaled Hadoop
      http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
      • Restrictions on Deleting Nodes:
          - Nodes Containing Task Outputs of Current Jobs
          - Fast Decommissioning Done for Data Nodes
          - Minimum Cluster Size Threshold
      • Fast Decommissioning - possible because HDFS is a cache for us
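
The removal restrictions above, combined with the hourly-boundary rule from the previous slide, might be expressed as the filter below. It is a sketch under stated assumptions: the `Node` record, the `window_secs` parameter, and measuring the hour from each node's reported launch time are illustrative choices, and HDFS decommissioning itself is not modeled.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass
class Node:
    name: str
    launch_time: float                      # seconds, as reported by the TaskTracker
    holds_current_task_output: bool = False


def near_hour_boundary(node, now, window_secs=300):
    """True when the node is within window_secs of completing an hour of uptime."""
    elapsed_in_hour = (now - node.launch_time) % HOUR
    return HOUR - elapsed_in_hour <= window_secs


def removable_nodes(nodes, now, min_cluster_size, window_secs=300):
    """Pick nodes that may be removed without violating the restrictions."""
    candidates = [
        n for n in nodes
        if near_hour_boundary(n, now, window_secs)
        and not n.holds_current_task_output   # keep outputs of running jobs
    ]
    # Never shrink below the minimum cluster size.
    allowed = max(0, len(nodes) - min_cluster_size)
    return candidates[:allowed]
```

A node that passes this filter would still need its HDFS data handled before termination; the slide notes that this step can be fast because HDFS serves as a cache in this setup.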
  • 11. Compute & Storage on the Cloud (EC2/S3)
      • On the cloud Compute and Storage are Separate!!
      • Advantage: Don’t Pay for CPU for Storing Data
      • Burden: Separation Can Cause Slowness & Variability
      • Solutions:
          - Caching File System
  • 12. Caching File System
      http://www.qubole.com/blog/index.php/columnar-cloud-cache/
  • 13. Caching File System
      http://www.qubole.com/blog/index.php/columnar-cloud-cache/
      • Benefits:
          - Masks the performance variance associated with S3 while reading data
          - Columnar caching on the fly enables data to be persisted in open formats while still delivering the performance benefits
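
The slides do not describe how the cache is built, so the following is only a toy illustration of the general idea of read-through columnar caching. Everything here is an assumption: the class name, the JSON-lines object format, and keeping the columns in memory rather than on local disk. The first read of an object fetches it from the (simulated) remote store and splits it into columns; later reads of any column are served locally.

```python
import json


class ColumnarReadThroughCache:
    """Toy read-through cache that stores remote row data column by column."""

    def __init__(self, fetch_object):
        self._fetch_object = fetch_object   # callable: key -> JSON-lines text
        self._columns = {}                  # (key, column) -> list of values
        self._loaded = set()                # keys already fetched
        self.remote_reads = 0               # counts trips to the remote store

    def read_column(self, key, column):
        if key not in self._loaded:
            self.remote_reads += 1
            text = self._fetch_object(key)
            rows = [json.loads(line) for line in text.splitlines() if line.strip()]
            # Convert the row-oriented object into columns once.
            names = {name for row in rows for name in row}
            for name in names:
                self._columns[(key, name)] = [row.get(name) for row in rows]
            self._loaded.add(key)
        return self._columns.get((key, column), [])
```

The source object stays in its open, row-oriented format in the remote store; only the local cache is columnar, which is the property the slide highlights.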
  • 14. Masking S3 Latency
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • File Operations in S3 are much slower than in HDFS
      • Problem: This leads to bad performance when data is distributed across a lot of files
      • Solution:
          - Fast Split Generation Algorithm
          - Pipelined File Opens
  • 15. Faster Split Generation
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • Directory operations with merging instead of per-file metadata (up to 8x speedup)
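
One way to read this slide: rather than issuing a metadata request per input file, list each directory once and merge the listing with the set of requested files. The sketch below illustrates that reading with an in-memory stand-in for the object store. `FakeObjectStore`, its `head` and `list_dir` methods, and the split layout are all hypothetical, and the request counter only shows why fewer round trips are needed; it makes no claim about the 8x figure.

```python
import posixpath
from collections import defaultdict


class FakeObjectStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, sizes):
        self._sizes = dict(sizes)           # path -> size in bytes
        self.requests = 0

    def head(self, path):
        self.requests += 1                  # one request per file
        return self._sizes[path]

    def list_dir(self, directory):
        self.requests += 1                  # one request per directory
        return {p: s for p, s in self._sizes.items()
                if posixpath.dirname(p) == directory}


def sizes_per_file(store, paths):
    """Baseline: fetch metadata for each file separately."""
    return {p: store.head(p) for p in paths}


def sizes_per_directory(store, paths):
    """List each directory once and merge with the requested files."""
    wanted = defaultdict(set)
    for p in paths:
        wanted[posixpath.dirname(p)].add(p)
    sizes = {}
    for directory, files in wanted.items():
        listing = store.list_dir(directory)
        sizes.update({p: listing[p] for p in files})
    return sizes


def make_splits(sizes, split_bytes):
    """Cut each file into (path, offset, length) splits of at most split_bytes."""
    splits = []
    for path in sorted(sizes):
        offset = 0
        while offset < sizes[path]:
            length = min(split_bytes, sizes[path] - offset)
            splits.append((path, offset, length))
            offset += length
    return splits
```

Both functions return the same sizes, so split generation is unchanged; the saving grows with the number of files per directory.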
  • 16. Pipelined File Opens
      http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
      • Open S3 files before they are read (30% improvement in simple queries)
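
A rough sketch of the pipelining idea: while one file is being read, the open for the next file is already in flight on a background thread, so the open latency overlaps with useful work. The function name and the `open_fn` / `read_fn` callables are assumptions for illustration; the slide does not say how the actual implementation schedules the opens.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all_pipelined(paths, open_fn, read_fn):
    """Read files in order, opening file i+1 in the background while file i is read."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_fn, paths[0])
        for i in range(len(paths)):
            handle = pending.result()       # wait for the prefetched open
            if i + 1 < len(paths):
                # Start opening the next file before reading this one.
                pending = pool.submit(open_fn, paths[i + 1])
            results.append(read_fn(handle))
    return results
```

With a slow `open_fn` (as with S3) and a non-trivial `read_fn`, the total time approaches the larger of the two per-file costs instead of their sum.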
  • 17. Purchasing Instances
      • Buying Instances on Spot Prices vs On-Demand Prices
      • Benefits: Cheaper on average by 50-60%
      • Problems: Spot instances are not guaranteed and can be taken away at any time
          - Bad for MapReduce
          - Disastrous for HDFS
  • 18. Spotted Hadoop Clusters
      http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
      • Simplified Spot Bidding Strategy
          - Configuring Bidding Timeouts
          - Configuring % of instances through spot
          - Configuring bid prices
      • Spot Instance Aware HDFS Block Placement
          - Ensures One Replica of Each Block Resides on On-Demand Nodes
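
The block placement constraint can be illustrated with a short sketch. This is not HDFS's placement policy API or Qubole's code; `place_replicas` and its parameters are hypothetical. It picks the first replica from the on-demand nodes and the rest from any remaining node, so losing every spot instance at once still leaves one copy of the block.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose distinct nodes for a block, with at least one on-demand node."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not on_demand_nodes:
        raise ValueError("need at least one on-demand node")
    # Pin the first replica to an on-demand node.
    first = rng.choice(on_demand_nodes)
    # Remaining replicas may land on any other node, spot or on-demand.
    others = [n for n in on_demand_nodes + spot_nodes if n != first]
    count = min(replication - 1, len(others))
    return [first] + rng.sample(others, count)
```

A production policy would also weigh rack locality and free space; those factors are left out to keep the constraint itself visible.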
  • 19. Conclusion
      • Cloud is Different from Bare Metal
      • Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog http://www.qubole.com/blog/
  • 20. Thank you.
      • Free Sign up for Qubole at https://api.qubole.com/users/sign_up
      • Careers at http://www.qubole.com/careers