Unified Big Data Analytics: Any Stack, Any Cloud

2019/01/22 @ CloudHealth by VMware
Follow us | @alluxio
Download Alluxio | www.alluxio.org
Questions? | http://alluxio.org/slack

About Me
• Bin Fan
• PhD CS@CMU
• Founding Engineer@Alluxio
Email: binfan@alluxio.com
Github: apc999
Twitter: @binfan

Company Overview
• Founded Feb. 2015 – Haoyuan Li
• PhD research at UC Berkeley AMPLab
• Initially Tachyon Nexus
• Venture Backed
• Andreessen Horowitz, Seven Seas etc.
• Open Source Business Model
• Tachyon Open Sourced in Dec. 2012
• Open source v1.0 released Feb. 2016
• Commercial product released Oct. 2016
• Office in San Mateo, CA
• Team: Google, Palantir, VMware, AMD, Cisco…

Agenda
1. Technology Trends
2. Data Access Layer
3. Alluxio Architecture
4. Use Cases

Data Transformation
• Pressure in all industries to be “data driven”
• Majority of companies still figuring out the transformation
• Increased collection of large volumes of low-value data
• Challenge of overcoming data silos to convert data into business value
• Limited success of data warehouses, marts, and lakes – cost of copying/moving data is substantial
• Single data plane for business value

Migration to Cloud
• Decoupling of compute and storage
• Enterprises move from turnkey solutions to self-managed data platforms on IaaS
• Lack of agility at the data storage level
• Requires a storage abstraction

Rise of Artificial Intelligence
• New workload targeting the same data used for Hadoop OLAP
• Potentially as valuable as, or more valuable than, existing OLAP workloads
• Challenge of adapting existing architectures to this new workload

The Data Access Layer
• Abstraction layer between applications and storage systems
• Presents a stable storage interface to applications, including semantics, security, and performance
• Eliminates the weaknesses of data silos rather than the silos themselves
• Enables transparent migration of underlying storage systems
• Enables application API to storage API translation in a single layer

Data Access Layer
• Security
• Standard APIs
• High Performance
• Compatibility
• Decoupling
• Transparent Migration

Alluxio
• Our implementation of the data access layer – a virtual distributed file system
• Open source project with over 900 contributors from hundreds of organizations worldwide
• Deployed in many top internet and financial companies

100+ Known Production Deployments
[Figure: logos of companies with known production deployments, "and more"; slide dated 11/16/18]

Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple add/remove
• No app changes
• Highest performance in memory
• No lock-in
[Figure: Alluxio, a Virtual Distributed File System (VDFS). Application-facing interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface. Under-store drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver.]
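The idea of one namespace in front of several storage drivers can be illustrated with a minimal sketch. This is a toy model, not the Alluxio API: the `UnderStore` interface, the `InMemoryStore` stand-in driver, and the `VirtualFileSystem` class with its `mount`/`unmount` methods are all hypothetical names chosen for illustration.

```python
from typing import Dict, Protocol, Tuple


class UnderStore(Protocol):
    """Hypothetical minimal driver interface (not the Alluxio API)."""

    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for an HDFS, S3, Swift, or NFS driver."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data


class VirtualFileSystem:
    """Routes a single namespace onto several mounted under stores."""

    def __init__(self) -> None:
        self._mounts: Dict[str, UnderStore] = {}

    def mount(self, prefix: str, store: UnderStore) -> None:
        self._mounts[prefix.rstrip("/")] = store

    def unmount(self, prefix: str) -> None:
        del self._mounts[prefix.rstrip("/")]

    def _resolve(self, path: str) -> Tuple[UnderStore, str]:
        # The longest matching mount prefix wins.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self._mounts[prefix], path[len(prefix):] or "/"
        raise FileNotFoundError(path)

    def read(self, path: str) -> bytes:
        store, relative = self._resolve(path)
        return store.read(relative)

    def write(self, path: str, data: bytes) -> None:
        store, relative = self._resolve(path)
        store.write(relative, data)


# The application sees one namespace; storage can be added or removed behind it.
vfs = VirtualFileSystem()
hdfs, s3 = InMemoryStore("hdfs"), InMemoryStore("s3")
vfs.mount("/warehouse", hdfs)
vfs.mount("/logs", s3)
vfs.write("/warehouse/t1/part-0", b"rows")
vfs.write("/logs/2019/01/22.log", b"events")
```

In this sketch the application code never names a storage system, which is the sense in which adding or removing a store needs no app changes.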

Alluxio Architecture
[Figure: Alluxio Master, Standby Master, Zookeeper, two Alluxio Workers (each with RAM / SSD / HDD), and an Under Store, connected by control-path and data-path arrows.]

Read Data not Cached in Alluxio + Caching
[Figure: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), and Under Store, with numbered steps 1-4.]

Read Cached Data in Alluxio
[Figure: Application with Alluxio Client and Alluxio Worker (RAM / SSD / HDD), with numbered steps 1-3.]

Write data only to Alluxio
[Figure: Application with Alluxio Client and Alluxio Worker (RAM / SSD / HDD), with numbered steps 1-3.]

Write to Alluxio and Under Store Synchronously
[Figure: Application with Alluxio Client, Alluxio Worker (RAM / SSD / HDD), and Under Store, with numbered steps 1-3.]
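The difference between writing only to Alluxio and writing to Alluxio and the under store synchronously can be modeled in a few lines. This is a rough sketch: the `WriteClient` class and the `sync_to_under_store` flag are hypothetical names, and plain dictionaries stand in for the worker storage and the under store.

```python
class WriteClient:
    """Toy client with two write paths; dictionaries stand in for real storage."""

    def __init__(self) -> None:
        self.worker = {}       # data held by the worker (RAM / SSD / HDD)
        self.under_store = {}  # data held by the under store

    def write(self, path: str, data: bytes, sync_to_under_store: bool = False) -> None:
        # Both modes always write to the worker.
        self.worker[path] = data
        # The synchronous mode also writes to the under store before returning.
        if sync_to_under_store:
            self.under_store[path] = data


client = WriteClient()
client.write("/tmp/scratch", b"intermediate")                        # worker only
client.write("/out/result", b"final", sync_to_under_store=True)      # worker and under store
```

The worker-only write is faster but leaves the data unprotected if the worker is lost, which is the trade-off between the two modes.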

A Common File System Abstraction
• Common interface across apps
  • HDFS-compatible interface: change hdfs://foo/bar to alluxio://foo/bar
  • Other interfaces: native Alluxio Java FS, POSIX, and S3
• Cloud storage becomes “hidden” to apps
  • Less vendor lock-in!
[Figure: compute zone (standalone or managed with Mesos or YARN) running MR, Presto, and TensorFlow over the HDFS API and POSIX API; storage in a different availability zone, either on-prem or cloud.]
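The scheme change from hdfs://foo/bar to alluxio://foo/bar can be sketched as a small path-rewriting helper. The function name `to_alluxio_uri` is hypothetical, and the sketch assumes, as in the slide's example, that only the scheme changes and the rest of the URI stays the same.

```python
def to_alluxio_uri(uri: str) -> str:
    """Rewrite an hdfs:// URI to the alluxio:// scheme; leave other URIs untouched."""
    prefix = "hdfs://"
    if uri.startswith(prefix):
        return "alluxio://" + uri[len(prefix):]
    return uri


input_paths = ["hdfs://foo/bar", "s3://bucket/key"]
rewritten = [to_alluxio_uri(p) for p in input_paths]
```

Because the interface is HDFS-compatible, a job's code stays the same and only its input and output paths change.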

Data Path: Improved I/O Performance
• A new tier above cloud storage for compute
  • Distributed buffer cache
  • Restore locality to compute
• Read:
  • Cache-hit read: served by Alluxio workers (local worker preferred)
  • Cache-miss read: served by cloud storage, then cached on an Alluxio worker
• Write:
  • Burst buffer, then asynchronously propagated to S3 (Alluxio 2.0)
• Challenges:
  • Locality: expose location information to applications; serve local apps through ramdisk (rather than network)
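The cache-hit and cache-miss read paths amount to a read-through cache. The sketch below is a single-process toy, not Alluxio code: `CloudStore` and `CachingWorker` are hypothetical classes, and the counters exist only to make the effect visible.

```python
class CloudStore:
    """Stand-in for cloud storage; counts reads that reach it."""

    def __init__(self, objects: dict) -> None:
        self._objects = dict(objects)
        self.reads = 0

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self._objects[path]


class CachingWorker:
    """Read-through cache in front of the store."""

    def __init__(self, store: CloudStore) -> None:
        self._store = store
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._cache:
            # Cache hit: served from the worker, the store is not touched.
            self.hits += 1
            return self._cache[path]
        # Cache miss: fetch from the store, then keep a copy for later reads.
        self.misses += 1
        data = self._store.read(path)
        self._cache[path] = data
        return data


store = CloudStore({"/data/part-0": b"payload"})
worker = CachingWorker(store)
first = worker.read("/data/part-0")   # miss, goes to the store
second = worker.read("/data/part-0")  # hit, served by the worker
```

Only the first read pays the cloud storage round trip. A real deployment also needs eviction and worker selection, which this sketch leaves out.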

Data Path: Async Persist to S3 (Alluxio 2.0)
[Figure: Application with Alluxio Client, Alluxio Master, Alluxio Worker (RAM / SSD / HDD), and Under Store.]
• Async writes
  • Step 1: App writes to Alluxio
  • Step 2: Alluxio writes to UFS
• Benefits
  • Apps write at Alluxio speed
  • Data gets persisted
• Challenges
  • File rename/delete before persist: 2PC
  • Fault tolerance: journal async requests
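The two-step async write, with pending requests recorded in a journal, can be sketched as follows. This is a simplified single-process model with hypothetical names (`AsyncPersistWorker`, `persist_pending`). The background step is triggered explicitly so the behavior is deterministic, and the 2PC handling of rename/delete before persist is not modeled.

```python
from collections import deque


class AsyncPersistWorker:
    """Toy model: writes return after step 1; step 2 drains a journal later."""

    def __init__(self, under_store: dict) -> None:
        self.cache = {}
        self.under_store = under_store
        self.journal = deque()  # pending persist requests, in arrival order

    def write(self, path: str, data: bytes) -> None:
        # Step 1: the app's write completes once the data is in the cache
        # and the persist request is journaled.
        self.cache[path] = data
        self.journal.append(path)

    def persist_pending(self) -> int:
        # Step 2: copy journaled files to the under store.
        persisted = 0
        while self.journal:
            path = self.journal[0]
            self.under_store[path] = self.cache[path]
            # Drop the journal entry only after the copy succeeded, so a
            # crash before this point leaves the request to be retried.
            self.journal.popleft()
            persisted += 1
        return persisted


ufs = {}
async_worker = AsyncPersistWorker(ufs)
async_worker.write("/out/part-0", b"result")
pending_before = len(async_worker.journal)
persisted_count = async_worker.persist_pending()
```

Keeping the request in the journal until the copy finishes is what lets pending persists be replayed after a failure.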

Metadata Path: Familiar Semantics
• Listing / renaming on an object store can be expensive
  • Common operations for batch or SQL analytics
• Overwriting PUT is eventually consistent
• Alluxio loads and manages metadata in the master
  • Apps can continue assuming HDFS-like semantics and performance implications
• Challenges
  • Data modification bypassing Alluxio: when and how to re-sync
  • Slow lists in object store: batch operations
  • Too many objects: off-heap metadata (Alluxio 2.0)
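The re-sync challenge can be illustrated with a metadata cache that re-lists the object store only after a sync interval has elapsed. This is an illustrative sketch, not the Alluxio implementation: `MetadataMaster`, `list_status`, and the interval-based policy are assumptions, and the clock is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List, Tuple


class MetadataMaster:
    """Caches listings and re-syncs with the store after a fixed interval."""

    def __init__(self, list_store: Callable[[str], List[str]], sync_interval_s: float) -> None:
        self._list_store = list_store
        self._interval = sync_interval_s
        self._listings: Dict[str, Tuple[float, List[str]]] = {}
        self.store_list_calls = 0  # how many expensive listings were issued

    def list_status(self, prefix: str, now: float) -> List[str]:
        cached = self._listings.get(prefix)
        if cached is None or now - cached[0] >= self._interval:
            # Expensive path: list the object store and refresh the cache.
            self.store_list_calls += 1
            cached = (now, sorted(self._list_store(prefix)))
            self._listings[prefix] = cached
        return cached[1]


bucket = {"logs/a", "logs/b"}
master = MetadataMaster(lambda p: [k for k in bucket if k.startswith(p)], sync_interval_s=60)

at_0 = master.list_status("logs/", now=0)
bucket.add("logs/c")                        # modification that bypasses the cache
at_30 = master.list_status("logs/", now=30)  # served from cached metadata
at_60 = master.list_status("logs/", now=60)  # interval elapsed, re-synced
```

The listing at `now=30` is cheap but stale, so the interval sets the trade-off between listing cost and freshness.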

Metadata Path: Efficient Renames
• Renaming files on S3 can be expensive
  • Common operation for MR in the commit phase
  • Write results to tmp paths
  • Rename tmp files to final paths (another copy, slow)
• Rename with Alluxio async writes
  • t0: write to tmp paths in Alluxio: near-compute, fast writes
  • t1: rename tmp paths to final paths in Alluxio: cheap renames
  • t2: persist files in final paths in Alluxio to S3: 2PC to avoid partial data
  • Speculative execution allowed
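The cost difference between the two commit strategies can be made concrete by counting the bytes the object store has to write. This is a toy model with hypothetical names (`ObjectStore`, `commit_directly`, `commit_via_cache`). It assumes a rename on the object store is a copy followed by a delete, and it leaves out the 2PC step.

```python
class ObjectStore:
    """Toy object store without a native rename; counts bytes written."""

    def __init__(self) -> None:
        self.objects = {}
        self.bytes_written = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.bytes_written += len(data)

    def rename(self, src: str, dst: str) -> None:
        # Assumed cost model: rename = copy to the new key + delete the old key.
        self.put(dst, self.objects[src])
        del self.objects[src]


def commit_directly(store: ObjectStore, outputs: dict) -> None:
    """Write to tmp paths on the store, then rename each file to its final path."""
    for name, data in outputs.items():
        store.put("tmp/" + name, data)
    for name in outputs:
        store.rename("tmp/" + name, "final/" + name)


def commit_via_cache(store: ObjectStore, outputs: dict) -> None:
    """t0: write tmp in the cache; t1: rename in the cache; t2: persist final paths."""
    cache = {}
    for name, data in outputs.items():               # t0: fast, near-compute writes
        cache["tmp/" + name] = data
    for name in outputs:                             # t1: metadata-only rename
        cache["final/" + name] = cache.pop("tmp/" + name)
    for key, data in cache.items():                  # t2: one write per final file
        store.put(key, data)


outputs = {"part-0": b"a" * 100, "part-1": b"b" * 50}
direct_store, cached_store = ObjectStore(), ObjectStore()
commit_directly(direct_store, outputs)
commit_via_cache(cached_store, outputs)
```

Both strategies leave the same final objects. In this model the direct commit writes every byte twice and the cached commit writes it once.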

Sandbox: TPC-DS + Alluxio + S3
• Setup
  • TPC-DS reads from S3, vs.
  • TPC-DS reads from Alluxio backed by S3
• To play with the Sandbox, make a request: https://www.alluxio.org/sandbox/request

Case Study:
- Digital marketing SaaS platform
- Hive metastore: Input files from S3
- ~100 TB data on S3
- Pain Point:
  - Slow to list files
  - Limited EC2<->S3 bandwidth
  - No compute-side caching
  - No data locality
https://www.slideshare.net/ThaiBui7/hybrid-collaborative-tiered-storage-with-alluxio

Solution: Hot & Warm Data on Alluxio

Result
- 5-10x read improvement
- Enables easier debugging, with a feedback loop for data analysts

Case Study:
- Leading Online Retailer (NASDAQ: JD)
- Building Ad-hoc SQL Query Engine
- Pain Point:
  - Presto workers may read remotely from HDFS datanodes
  - Large query variance
https://www.slideshare.net/Alluxio/alluxio-in-jd

Solution: Colocate Alluxio with Presto

Case Study:
- Leading Online Gaming Service Company (NASDAQ: NTES)
  - Partners with Blizzard to operate “WoW” and “Hearthstone”
  - Upcoming: “Diablo Immortal”
- Building Ad-hoc SQL Query Engine
  - Large data volume: ~30 TB raw data daily
  - A separate satellite compute cluster
- Pain Point:
  - Response time requirement: < 15s
  - Large startup latency when submitting SQL jobs as YARN apps

Result: Smoother Response During Peak Time
[Figure: response time (ms), Presto w/ Alluxio vs. Presto w/o Alluxio.]

Conclusion
- Alluxio: A New Data Access Layer
  - Between compute and storage
  - Transparent to big data analytics (HDFS-compatible, POSIX)
  - Improves data and metadata performance on cloud storage
- Architecture and Data Flow
  - Master, Worker, Under Storage
  - Cache-{hit, miss} reads, Sync/Async writes
- Use Cases on Alluxio

Thank you
binfan@alluxio.com
info@alluxio.com
www.alluxio.com
twitter.com/alluxio
linkedIn.com/alluxio
zhuanlan.zhihu.com/alluxio
