RubiX: a caching framework for big data engines in the cloud. It transparently adds data caching to engines such as Presto, Spark, and Hadoop, with no user intervention.
RubiX
1. RubiX
A caching framework for big data engines in the
cloud
Strata + Hadoop World
March 2017
Shubham Tagra (stagra@qubole.com)
2. Agenda
● Intro
● Why Caching?
● Path to RubiX
● RubiX Architecture
● Future of RubiX
● Q&A
3. Built for Anyone who Uses Data
Analysts | Data Scientists | Data Engineers | Data Admins
Optimize performance, cost, and scale through automation, control, and orchestration of big data workloads.
A Single Platform for Any Use Case
ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps
Open Source Engines, Optimized for the Cloud
Native Integration with multiple cloud providers
4. Qubole operates at Cloud Scale
● 500 PB of data processed in the cloud monthly (growth over time: 6 PB → 80 PB → 150 PB → 500 PB)
● 500 nodes: largest Spark cluster in the cloud
● 2,000 clusters started per month
5. Why Caching
● Popularity of Cloud Stores like S3
+ Near-infinite capacity
+ Inexpensive
+ Ease of use
- Network Latencies
- Back-offs
7. RubiX's ancestors
● File cache
○ Benefits: as much as 10x performance improvement
○ Problems
■ Huge warm-ups
■ Cache size
■ Tied to Presto
■ Required Presto scheduler changes
8. Requirements for the new cache
● Improve performance
● Abstracted from the user
○ Ease of use
● Support columnar formats
○ Improves speed
● Work well with autoscaling
○ Saves cost
● Easy to extend to new clouds and engines
9. Alternatives Considered: FUSE FileSystem
● Mount S3 paths on EC2
● Rely on the OS for page caching, read-ahead, etc.
● Problems
○ Needs exclusive control over the bucket
○ Data corruption on external updates
○ Not production ready
11. Alternatives Considered: HTTP Caching
● Worked fine for plain-text data
● Problems
○ Columnar formats and Byte-Range based Varnish Keys
■ Poor hit ratio
■ Redundant copies
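The poor hit ratio can be sketched concretely. The snippet below is an illustrative Python model, not Varnish code (`range_key` is a hypothetical helper): with byte-range based cache keys, two readers whose requested ranges overlap still produce distinct keys, so the shared bytes are fetched and stored twice.

```python
def range_key(path: str, start: int, end: int) -> str:
    """Hypothetical Varnish-style cache key: path plus the exact byte range."""
    return f"{path}:{start}-{end}"

# Reader A fetches a column chunk; reader B needs a slightly wider slice of
# the same column. With byte-range keys these are different cache entries.
key_a = range_key("s3://bucket/table/part-0.orc", 4_000_000, 5_000_000)
key_b = range_key("s3://bucket/table/part-0.orc", 4_000_000, 5_500_000)

print(key_a == key_b)  # False: the overlapping 1MB is cached twice
```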
15. Architecture
● Split ownership assignment system
○ Used on the master node during split computation
○ Calculates which node owns a particular split of a file
○ Uses consistent hashing to work well with autoscaling
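The consistent-hashing idea above can be sketched in Python (an illustrative model, not RubiX's actual implementation; class and node names are made up): nodes are hashed onto a ring, each split is owned by the first node clockwise from the split's hash, and adding or removing a node during autoscaling only remaps the splits nearest that node on the ring.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class SplitOwnership:
    def __init__(self, nodes, replicas=64):
        # Virtual nodes (replicas) smooth out the distribution on the ring.
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def owner(self, path: str, split_index: int) -> str:
        # First node clockwise from the split's position on the ring.
        h = _hash(f"{path}:{split_index}")
        i = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[i][1]

cluster = SplitOwnership(["node-1", "node-2", "node-3"])
print(cluster.owner("s3://bucket/table/part-0.orc", 0))
```

Because ownership depends only on the split's hash and the current node set, every node computes the same answer without coordination.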
16. Architecture
● Data caching system
○ Used on worker nodes when data is read
○ Reads from local disk or the remote store as per the metadata
○ Metadata is stored in units of blocks (1MB each)
○ BookKeeper provides the metadata for each block
○ Metadata is also checkpointed to local disk
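The block-level metadata described above can be sketched as follows (an illustrative model; `BlockMetadata` and its methods are assumptions, not RubiX's BookKeeper API): the file is divided into 1MB blocks, and for each requested byte range the reader serves cached blocks from local disk and fetches only the missing ones remotely.

```python
BLOCK_SIZE = 1 << 20  # 1MB blocks, as in the metadata scheme above

class BlockMetadata:
    def __init__(self):
        self.cached = set()  # indices of blocks present on local disk

    def mark_cached(self, block: int):
        self.cached.add(block)

    def plan_read(self, offset: int, length: int):
        """Return (block_index, source) pairs covering [offset, offset+length)."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        return [(b, "local" if b in self.cached else "remote")
                for b in range(first, last + 1)]

meta = BlockMetadata()
meta.mark_cached(1)
print(meta.plan_read(500_000, 2_000_000))
# [(0, 'remote'), (1, 'local'), (2, 'remote')]
```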
17. Architecture
● Plugin
○ Provides two types of information
■ How to get the list of nodes in the system
■ The FileSystem to use for remote reads
○ E.g. Presto plugin, Hadoop1 plugin, Hadoop2 plugin
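The plugin contract above can be sketched as a two-method interface (names are illustrative assumptions, not RubiX's actual Java interfaces): an engine plugin supplies the cluster membership for split ownership and a FileSystem for remote reads, so the caching core stays engine-agnostic.

```python
from abc import ABC, abstractmethod

class CachePlugin(ABC):
    @abstractmethod
    def cluster_nodes(self):
        """List of worker nodes, fed to the split-ownership assignment."""

    @abstractmethod
    def remote_filesystem(self):
        """FileSystem used to read from the cloud store on cache misses."""

class PrestoPlugin(CachePlugin):
    def cluster_nodes(self):
        return ["node-1", "node-2"]  # e.g. obtained from the engine's node manager

    def remote_filesystem(self):
        return "s3-filesystem"  # stand-in for the actual FileSystem object
```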
18. Plugins: Presto
● Presto provides tight control over scheduling splits locally
● This ensured that splits were always scheduled locally
● Worked well for our customers
19. Plugins: Hadoop
● Strict local scheduling was not possible with Hadoop
● This meant a lot of warm-ups and redundant copies of data
● Options:
○ Read directly from the remote store for non-local reads
○ Figure out the correct owner and read from it
○ Implement non-local reads for Hadoop support
● Learnings
○ 100% strict location-based scheduling is not possible in Hadoop 2
20. Using RubiX with Presto
● Configure the disk mount points
○ Assumes disks are mounted at /media/ephemeral0, /media/ephemeral1, etc. by default
● Start the BookKeeper daemon
● Place the RubiX jars in Presto's hive-hadoop2 plugin directory
● Configure Presto to use the RubiX FileSystem for the cloud store
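The last step might look like the fragment below, placed in the Hadoop configuration that Presto's hive connector loads. This is a hypothetical sketch: the property and class names are assumptions and should be checked against the RubiX documentation for your version.

```xml
<!-- Hypothetical sketch: verify property and class names against the RubiX docs -->
<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <!-- Assumed class name shipped in the RubiX jars -->
    <value>com.qubole.rubix.presto.CachingPrestoS3FileSystem</value>
  </property>
</configuration>
```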
22. Using RubiX with Hadoop
● Configure the disk mount points
○ Assumes disks are mounted at /media/ephemeral0, /media/ephemeral1, etc. by default
● Start the BookKeeper daemon
● Place the RubiX jars alongside the Hadoop libraries
● Configure Hadoop to use the RubiX FileSystem for the cloud store
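Analogously to the Presto case, the last step might look like the core-site.xml fragment below. Again a hypothetical sketch: the property and class names are assumptions to be checked against the RubiX documentation.

```xml
<!-- Hypothetical sketch: verify property and class names against the RubiX docs -->
<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <!-- Assumed class name from the RubiX hadoop2 module -->
    <value>com.qubole.rubix.hadoop2.CachingNativeS3FileSystem</value>
  </property>
</configuration>
```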