Hadoop User Group Ashish Thusoo Jan 16 2013


This document discusses Qubole, a cloud data platform for Hadoop and Hive. It describes challenges in running big data technologies in the cloud like dynamic provisioning and separation of compute and storage. Qubole addresses these through techniques such as auto-scaling Hadoop clusters, caching file systems, faster split generation and pipelined file opens to optimize performance for cloud storage like S3. It also discusses using spot instances to lower costs through strategies to make Hadoop resilient to spot interruptions.

Hadoop User Group
Ashish Thusoo
Jan 16, 2013

Qubole Inc., Proprietary

About Me

• Big Data Veteran

• Ran the data infrastructure team at Facebook
before starting Qubole

• Co-created Hive in 2007 @ Facebook


What is Qubole?
• A comprehensive cloud data platform based
on Hadoop and Hive for data in the cloud

- Turnkey Infrastructure

- Cloud Optimized Stack

- Open Data Formats

• Useful for exploring data and creating batch
processing applications/data pipelines

Why Qubole?

[Diagram: Heterogeneous Data (Structured & Unstructured) reaches End Users (User Ops, Product Managers, etc.) only through The Intermediaries (Data Scientists and Engineers), who are the BOTTLENECK]


Qubole Service

[Architecture diagram. Labels: Cloud Data Service (Explore, Schedule, SDK, API, ODBC); Cloud Data Platform (Elastic . Robust . Fast); Big Data Technology Stack; EC2 / S3; Connectors; sources: Logs, Events, Metrics, Data Marts, DBs, Cloud Sources]


Cloud vs Bare Metal

• Dynamic vs Fixed Provisioning

• Separation between Compute and Storage

• Purchasing and Budgeting


Dynamic Provisioning

• Advantage: Transient Clusters

• Burden: How big of a cluster do I need?

• Solution: Auto-scaled Hadoop


Challenges: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

• Adapting to Burstiness

- Current load alone is not enough; future load must also be predicted

• Adapting Statefully

- Removing HDFS nodes is risky without
decommissioning


Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

• TaskTrackers report their launch times to the
JobTracker (JT)

• JT computes amount of time required to
finish existing workloads

• If the time is above a certain threshold then
more nodes are added

• At hourly boundaries the nodes are removed
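
The scale-up rule in the bullets above can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: `ClusterLoad`, `nodes_to_add`, and every parameter name are hypothetical, and the estimate simply divides the remaining work by the number of task slots.

```python
from dataclasses import dataclass


@dataclass
class ClusterLoad:
    remaining_task_seconds: float  # estimated work left across existing jobs
    total_slots: int               # task slots currently available


def estimated_completion_seconds(load: ClusterLoad) -> float:
    """Time to drain the existing workload if every slot stays busy."""
    return load.remaining_task_seconds / max(load.total_slots, 1)


def nodes_to_add(load: ClusterLoad, threshold_seconds: float,
                 slots_per_node: int, max_new_nodes: int) -> int:
    """Return how many nodes to add so the estimate drops to the threshold."""
    if estimated_completion_seconds(load) <= threshold_seconds:
        return 0
    # Slots needed to finish within the threshold, rounded up.
    needed_slots = -(-load.remaining_task_seconds // threshold_seconds)
    extra_slots = needed_slots - load.total_slots
    # Nodes needed to supply those slots, rounded up and capped.
    extra_nodes = -(-extra_slots // slots_per_node)
    return int(min(max(extra_nodes, 0), max_new_nodes))
```

A real predictor would also have to account for burstiness, which this sketch ignores.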

Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

• Restrictions on Deleting Nodes:

- Nodes Containing Task Outputs of Current Jobs

- Fast Decommissioning Done for Data Nodes

- Minimum Cluster Size Threshold

• Fast Decommissioning - possible because
HDFS is a cache for us
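
A rough sketch of how the removal restrictions might combine with the hourly boundary. All names here (`Node`, `near_hour_boundary`, `removable_nodes`, the 300-second window) are assumptions made for illustration, and the hour is assumed to be measured from each node's reported launch time.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    launch_time: float            # epoch seconds, as reported to the JobTracker
    holds_live_task_output: bool  # task outputs still needed by current jobs


def near_hour_boundary(node: Node, now: float, window: float = 300.0) -> bool:
    """True when the node is within `window` seconds of completing an hour."""
    seconds_into_hour = (now - node.launch_time) % 3600
    return seconds_into_hour >= 3600 - window


def removable_nodes(nodes, now, min_cluster_size):
    """Pick nodes that may be released without violating the restrictions."""
    candidates = [n for n in nodes
                  if near_hour_boundary(n, now) and not n.holds_live_task_output]
    # Never shrink below the minimum cluster size.
    allowed = max(len(nodes) - min_cluster_size, 0)
    return candidates[:allowed]
```

Decommissioning of the data node itself is not modeled here; the sketch only covers which nodes are eligible.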

Compute & Storage on the
Cloud (EC2/S3)

• On the cloud Compute and Storage are
Separate!!

• Advantage: Don’t Pay for CPU for Storing Data

• Burden: Separation Can Cause Slowness &
Variability

• Solutions:
- Caching File System

Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/


Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/

• Benefits:
- Masks the performance variance associated with S3 while
reading data

- On-the-fly columnar caching lets data be persisted in
open formats while still delivering good performance
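
To illustrate the first benefit only, here is a minimal read-through cache: the first read of a key goes to the slow remote store, and later reads are served from local disk. The class and its `fetch` callback are hypothetical, and the columnar conversion described on the slide is not modeled.

```python
import os
from typing import Callable


class ReadThroughCache:
    """Serve repeated reads from local disk instead of the remote store."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: str):
        self.fetch = fetch          # hypothetical remote read, e.g. an S3 GET
        self.cache_dir = cache_dir
        self.remote_reads = 0

    def _local_path(self, key: str) -> str:
        return os.path.join(self.cache_dir, key.replace("/", "_"))

    def read(self, key: str) -> bytes:
        path = self._local_path(key)
        if os.path.exists(path):
            # Cache hit: no remote latency or variance on this read.
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch(key)
        self.remote_reads += 1
        with open(path, "wb") as f:
            f.write(data)
        return data
```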


Masking S3 Latency
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

• File Operations in S3 are much slower than
in HDFS

• Problem: This leads to bad performance when
data is spread across many files

• Solution:
- Fast Split Generation Algorithm

- Pipelined File Opens

Faster Split Generation
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

• Directory operations with merging instead of
per-file metadata (up to 8x speedup)
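
One plausible reading of this bullet, sketched with an in-memory stand-in for the object store: instead of one metadata request per file, issue one listing per parent directory and merge the results against the requested paths. `FakeObjectStore` and both helper functions are hypothetical; the linked blog post describes the actual algorithm.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, sizes):
        self.sizes = dict(sizes)   # path -> size in bytes
        self.requests = 0

    def stat(self, path):
        self.requests += 1
        return self.sizes[path]

    def list_prefix(self, prefix):
        self.requests += 1
        return {p: s for p, s in self.sizes.items() if p.startswith(prefix)}


def per_file_metadata(store, paths):
    """Baseline: one metadata request per input file."""
    return {p: store.stat(p) for p in paths}


def bulk_metadata(store, paths):
    """One listing per parent directory, merged against the input paths."""
    prefixes = sorted({p.rsplit("/", 1)[0] + "/" for p in paths})
    listed = {}
    for prefix in prefixes:
        listed.update(store.list_prefix(prefix))
    return {p: listed[p] for p in paths if p in listed}
```

With many files per directory, the number of requests grows with the number of directories rather than the number of files.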


Pipelined File Opens
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

• Open S3 files before they are read (30%
improvement in simple queries)
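
The idea can be sketched with a background thread that opens the next file while the current one is being read. `open_file` and `read_file` are hypothetical callbacks standing in for the S3 open and the record reader; this is an illustration of the pipelining pattern, not the Hadoop code.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all_pipelined(paths, open_file, read_file):
    """Open file i+1 in the background while file i is being read."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_file, paths[0])
        for next_path in list(paths[1:]) + [None]:
            handle = pending.result()  # blocks only if the open is not done yet
            if next_path is not None:
                # Start opening the next file before reading this one.
                pending = pool.submit(open_file, next_path)
            results.append(read_file(handle))
    return results
```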


Purchasing Instances

• Buying Instances on Spot Prices vs
On-Demand Prices

• Benefits: Cheaper on average by 50-60%

• Problems: Spot instances are not guaranteed
and can be taken away at any time
- Bad for MapReduce

- Disastrous for HDFS

Spotted Hadoop Clusters
http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/

• Simplified Spot Bidding Strategy
- Configuring Bidding Timeouts

- Configuring % of instances through spot

- Configuring bid prices

• Spot Instance Aware HDFS Block Placement
- Ensures One Replica of Each Block Resides on an On-Demand
Node
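
A minimal sketch of the placement constraint, assuming nodes are plain names split into two lists. `place_replicas` and its parameters are hypothetical; real HDFS block placement also weighs racks and load, which this ignores.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose nodes for one block, keeping one replica on an on-demand node."""
    if not on_demand_nodes:
        raise ValueError("need at least one on-demand node")
    # The first replica always lands on an on-demand node.
    first = rng.choice(on_demand_nodes)
    # Remaining replicas may go anywhere else, spot or on-demand.
    others = [n for n in on_demand_nodes + spot_nodes if n != first]
    extra = rng.sample(others, min(replication - 1, len(others)))
    return [first] + extra
```

If every spot node is reclaimed at once, each block still has a surviving copy on the on-demand replica.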

Conclusion

• Cloud is Different from Bare Metal

• Check out more optimizations that we have
made to run Hadoop and Hive optimally in the
cloud at our blog
http://www.qubole.com/blog/


Thank you.
Sign up for Qubole for free at https://api.qubole.com/users/sign_up
Careers at http://www.qubole.com/careers


