This document discusses Qubole, a cloud data platform for Hadoop and Hive. It describes challenges in running big data technologies in the cloud like dynamic provisioning and separation of compute and storage. Qubole addresses these through techniques such as auto-scaling Hadoop clusters, caching file systems, faster split generation and pipelined file opens to optimize performance for cloud storage like S3. It also discusses using spot instances to lower costs through strategies to make Hadoop resilient to spot interruptions.
Hadoop User Group Ashish Thusoo Jan 16 2013
1. Hadoop User Group
Ashish Thusoo
Jan 16, 2013
Qubole Inc., Proprietary
2. About Me
• Big Data Veteran
• Ran the data infrastructure team at Facebook
before starting Qubole
• Co-created Hive in 2007 @ Facebook
3. What is Qubole?
• A comprehensive cloud data platform based
on Hadoop and Hive for data in the cloud
- Turnkey Infrastructure
- Cloud Optimized Stack
- Open Data Formats
• Useful for exploring data and creating batch
processing applications/data pipelines
4. Why Qubole?
• Heterogeneous Data (Structured & Unstructured)
• The Intermediaries (Data Scientists and Engineers): BOTTLENECK
• End Users (User Ops, Product Managers etc.)
5. Qubole Service
[Architecture diagram]
• Cloud Data Service: Explore, Schedule, SDK, API, ODBC
• Cloud Data Platform (Elastic . Robust . Fast): Connectors, Big Data Technology Stack, EC2 / S3
• Cloud Sources: Logs, Events, Data Marts, DBs, Metrics
6. Cloud vs Bare Metal
• Dynamic vs Fixed Provisioning
• Separation between Compute and Storage
• Purchasing and Budgeting
7. Dynamic Provisioning
• Advantage: Transient Clusters
• Burden: How big of a cluster do I need?
• Solution: Auto-scaled Hadoop
8. Challenges: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Adapting to Burstiness
- Current load alone is not enough; future load also needs to be predicted
• Adapting Statefully
- Removing HDFS nodes is risky without decommissioning
9. Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• TaskTrackers report their launch times to the JobTracker
• JT computes amount of time required to
finish existing workloads
• If the time is above a certain threshold then
more nodes are added
• At hourly boundaries the nodes are removed
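The scale-up and scale-down rules above can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: the function names, the slot-based time estimate, the threshold, and the five-minute window before each hourly boundary are all assumptions made for the example.

```python
import math

def estimated_finish_seconds(remaining_task_seconds, total_slots):
    """Estimate how long the current workload needs with the available task slots."""
    if total_slots <= 0:
        return math.inf
    return remaining_task_seconds / total_slots

def nodes_to_add(remaining_task_seconds, nodes, slots_per_node,
                 threshold_seconds, max_nodes):
    """Return how many nodes to add so the estimate falls back to the threshold."""
    estimate = estimated_finish_seconds(remaining_task_seconds,
                                        nodes * slots_per_node)
    if estimate <= threshold_seconds:
        return 0  # the workload already finishes in time
    needed = math.ceil(remaining_task_seconds /
                       (threshold_seconds * slots_per_node))
    return max(0, min(needed, max_nodes) - nodes)

def near_hourly_boundary(launch_time, now, window_seconds=300):
    """True when a node is within window_seconds of completing another full hour."""
    uptime = now - launch_time
    return uptime > 0 and (3600 - uptime % 3600) <= window_seconds
```

Here `near_hourly_boundary` uses the launch time each node reports, so a node is only considered for removal shortly before another hour of its uptime completes.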
10. Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Restrictions on Deleting Nodes:
- Nodes Containing Task Outputs of Current Jobs
- Fast Decommissioning Done for Data Nodes
- Minimum Cluster Size Threshold
• Fast Decommissioning - possible because
HDFS is a cache for us
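One way to picture the three restrictions is as a filter over candidate nodes. The sketch below is illustrative only: the `Node` fields and the `removable_nodes` helper are hypothetical names, and real cluster state would come from the JobTracker and NameNode rather than from a list of records.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    holds_current_job_output: bool  # task outputs that running jobs still need
    decommissioned: bool            # DataNode has finished fast decommissioning

def removable_nodes(nodes, min_cluster_size):
    """Return the nodes that may be removed without violating the restrictions."""
    candidates = [n for n in nodes
                  if not n.holds_current_job_output and n.decommissioned]
    # Never shrink the cluster below the configured minimum size.
    allowed = max(0, len(nodes) - min_cluster_size)
    return candidates[:allowed]
```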
11. Compute & Storage on the Cloud (EC2/S3)
• On the cloud Compute and Storage are
Separate!!
• Advantage: Don’t Pay for CPU for Storing Data
• Burden: Separation Can Cause Slowness &
Variability
• Solutions:
- Caching File System
12. Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/
13. Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/
• Benefits:
- Masks the performance variance associated with S3 while
reading data
- Columnar caching on the fly enables data to be persisted in
open formats while still giving the benefits of performance
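The slides do not describe the cache internals, so the following is only a toy read-through cache that illustrates the idea: the first read of a file goes to remote storage, the rows are split into columns on the fly, and later column reads are served locally. The class name and the `fetch_rows` callable (standing in for an S3 read) are assumptions.

```python
class ColumnarReadThroughCache:
    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows  # callable: path -> list of row dicts
        self._columns = {}             # (path, column name) -> list of values
        self.remote_reads = 0

    def read_column(self, path, column):
        """Return one column of a file, fetching from remote storage at most once."""
        key = (path, column)
        if key not in self._columns:
            rows = self._fetch_rows(path)
            self.remote_reads += 1
            names = rows[0].keys() if rows else [column]
            # Store every column of the fetched file, not just the requested one.
            for name in names:
                self._columns[(path, name)] = [row[name] for row in rows]
        return self._columns[key]
```

The source data stays in its original row-oriented open format; only the cached copy is columnar.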
14. Masking S3 Latency
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• File Operations in S3 are much slower than
HDFS
• Problem: This leads to bad performance when
data is distributed in a lot of files
• Solution:
- Fast Split Generation Algorithm
- Pipelined File Opens
15. Faster Split Generation
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Directory operations with merging instead of per-file metadata (up to 8x speedup)
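The gain comes from replacing one metadata request per file with one listing request per directory. A minimal sketch of that idea, assuming a hypothetical `list_dir` callable that returns a `{path: size}` mapping for everything under a directory:

```python
import posixpath

def file_sizes_via_listing(paths, list_dir):
    """Look up file sizes with one listing call per directory, not one call per file."""
    by_dir = {}
    for path in paths:
        by_dir.setdefault(posixpath.dirname(path), []).append(path)

    sizes = {}
    for directory, wanted in by_dir.items():
        listing = list_dir(directory)  # single remote call for the whole directory
        for path in wanted:
            sizes[path] = listing[path]
    return sizes
```

With thousands of files spread over a few directories, the number of remote calls drops from the file count to the directory count.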
16. Pipelined File Opens
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Open S3 files before they are read (30% improvement in simple queries)
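Pipelining can be sketched with a small thread pool that opens upcoming files while the current one is being read. This is an illustration rather than the actual Hadoop change; `open_file` is a hypothetical callable that returns a file-like object, and `lookahead` is an assumed tuning knob.

```python
from concurrent.futures import ThreadPoolExecutor

def read_all_pipelined(paths, open_file, lookahead=2):
    """Yield (path, data) in order, opening upcoming files in background threads."""
    paths = list(paths)
    lookahead = max(1, lookahead)
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Start opening the first few files before any read begins.
        pending = [pool.submit(open_file, p) for p in paths[:lookahead]]
        for i, path in enumerate(paths):
            handle = pending[i].result()
            nxt = i + lookahead
            if nxt < len(paths):
                # Kick off the next open so it overlaps with the current read.
                pending.append(pool.submit(open_file, paths[nxt]))
            yield path, handle.read()
            handle.close()
```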
17. Purchasing Instances
• Buying Instances at Spot Prices vs On-Demand Prices
• Benefits: Cheaper on average by 50-60%
• Problems: Spot instances are not guaranteed
and can be taken away anytime
- Bad for MapReduce
- Disastrous for HDFS
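To see how the savings scale when only part of a cluster runs on spot instances, here is a small worked calculation. The prices, the spot fraction, and the function name are illustrative assumptions, not figures from the talk.

```python
def blended_hourly_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Hourly cost of a cluster where a fraction of nodes run at a spot discount."""
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price

# Assumed example: 10 nodes at 1.00 per hour on-demand, half of them on spot
# instances that are 50% cheaper.
cost = blended_hourly_cost(10, 1.00, 0.5, 0.5)
```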
18. Spotted Hadoop Clusters
http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
• Simplified Spot Bidding Strategy
- Configuring Bidding Timeouts
- Configuring % of instances through spot
- Configuring bid prices
• Spot Instance Aware HDFS Block Placement
- Ensures One Replica of Each Block Resides on On-Demand Nodes
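The placement rule can be sketched as: pick the first replica from the on-demand nodes, then fill the remaining replicas from any other node. This is a simplified stand-in for an HDFS block placement policy; the function name, the node-kind labels, and the random choice are assumptions for illustration.

```python
import random

def place_replicas(nodes, replication=3, rng=random):
    """Choose replica nodes so that at least one replica is on an on-demand node.

    nodes maps a node name to "on_demand" or "spot".
    """
    on_demand = [name for name, kind in nodes.items() if kind == "on_demand"]
    if not on_demand:
        raise ValueError("need at least one on-demand node")
    first = rng.choice(on_demand)  # this replica survives spot interruptions
    others = [name for name in nodes if name != first]
    k = max(0, min(replication - 1, len(others)))
    return [first] + rng.sample(others, k)
```

If every spot node is taken away at once, each block still has a surviving copy on an on-demand node.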
19. Conclusion
• Cloud is Different from Bare Metal
• Check out more optimizations that we have
made to run Hadoop and Hive optimally in the
cloud at our blog
http://www.qubole.com/blog/
20. Thank you.
Free Sign up for Qubole at https://api.qubole.com/users/sign_up
Careers at http://www.qubole.com/careers