Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead

September 2016 @ Strata & Hadoop World 2016
Haoyuan (HY) Li, Gene Pang

AGENDA
•  Alluxio Open Source Status and History
•  Alluxio Overview
•  Alluxio Use Cases and Demos
•  What’s Next?

HISTORY
•  Started at UC Berkeley AMPLab in Summer 2012
•  Originally named Tachyon
•  Open Sourced in 2013
•  Apache License 2.0
•  Latest Stable Release: Alluxio 1.2.0
•  Next Release (Alluxio 1.3.0) in Two Weeks
•  Rebranded as Alluxio in 2016

OPEN SOURCE ALLUXIO
•  One of the fastest growing open-source projects in the big data ecosystem
•  Currently over 300 contributors from over 100 organizations
•  Welcome to join our community!
[Chart: Popular Open Source Projects’ Growth. Series: Spark, Kafka, Cassandra, HDFS, Alluxio. X-axis: Year 1 to Year 3. Y-axis ticks: 0 to 350.]

About Us
Team
•  Team members from Google, Palantir, Uber, Yahoo with years of distributed systems development experience
•  Graduated from Stanford University, UC Berkeley, CMU, Peking University, and Tsinghua, with CS masters or PhDs
•  Top 9 committers of the Alluxio open source project
•  Haoyuan Li, CEO & Founder: Co-creator of Alluxio project while working towards Ph.D. at UC Berkeley AMPLab.
•  Gene Pang, Software Engineer, Alluxio Maintainer: Ph.D. from UC Berkeley AMPLab. Previously at Google F1 team.
Investors
•  Andreessen Horowitz

BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Enabling any application to access data from any storage system at memory-speed
•  Application-facing interfaces: FUSE Compatible File System, Hadoop Compatible File System, Native Key-Value Interface, Native File System
•  Storage-facing interfaces: GlusterFS Interface, Amazon S3 Interface, Swift Interface, HDFS Interface

TECHNOLOGY TRENDS
•  Memory is getting Faster, Larger, and Cheaper
•  Memory price is halving every 18 months
•  Disk throughput is increasing slowly

Top left chart: https://lazure2.wordpress.com/2013/07/02/20-years-of-samsung-new-management-as-manifested-by-the-latest-june-20th-galaxy-ativ-innovations/
Top right chart: people.eecs.berkeley.edu/~istoica/classes/cs294/15/notes/02-TechnologyTrends.ppt
Bottom chart: jcmit.com/

[Chart: DDR performance over time, in GB/s, 2005 to 2014. Series: DDR2, DDR3, DDR4. Y-axis ticks: 6.25 to 50.]
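
The claim that memory price halves every 18 months can be turned into a back-of-the-envelope projection. The sketch below is only an illustration of that arithmetic: the function name, the starting price, and the assumption of a perfectly smooth exponential decline are not from the slides.

```python
def projected_price(initial_price: float, months: float,
                    halving_period_months: float = 18.0) -> float:
    """Project a price that halves every `halving_period_months`.

    Assumes an idealized, smooth exponential decline; real prices
    fluctuate, so treat the result as a rough estimate only.
    """
    return initial_price * 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    # Hypothetical starting price of 100 (arbitrary currency units).
    for elapsed in (0, 18, 36, 54):
        print(elapsed, "months:", projected_price(100.0, elapsed))
```

Under this assumption the price drops to one eighth of its starting value after 54 months (three halving periods).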

ATTRIBUTES
Memory-Speed Virtual Distributed Storage
•  Memory-Speed: memory-speed access to data
•  Virtual: virtualized across different storage types under a unified namespace
•  Distributed: scale out architecture
•  File System API
•  Software Only

ALLUXIO SOLUTION DEPLOYMENT
[Diagram: Servers A, B, C, ..., Z, each shown with Applications and Alluxio, above Storage A, B, C, ..., Z.]

BENEFITS
•  Unification: new workflows across any data in any storage system
•  Performance: high performance data access
•  Flexibility: work with the compute and storage frameworks of your choice
•  Cost: grow compute and storage systems independently

USE CASE 1 – Accelerate I/O to/from Remote Storage
•  Compute and Storage Separation
•  Advantages
    •  Meet different compute and storage hardware requirements efficiently
    •  Scale compute and storage independently
    •  Store data in traditional filers/SANs and object stores cost effectively
    •  Compute on data in existing storage via Big Data computational frameworks
•  Disadvantage
    •  Accessing data requires remote I/O

Use Case without Alluxio
[Diagram: Spark and Storage. Labels: "Low latency, memory throughput" and "High latency, network throughput".]

Use Case with Alluxio
[Diagram: Spark, Alluxio, Storage.]
Keeping data in Alluxio accelerates data access

CASE STUDY
Accelerate Access to Remote Storage
"The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds."
- Shaoshan Liu, Baidu
[Diagram label: Baidu File System]
•  200+ nodes deployment
•  2+ petabytes of storage
•  Mix of memory + HDD
RESULTS
•  Data queries are now 30x faster with Alluxio
•  The Alluxio cluster runs stably, providing over 50TB of RAM space
•  By using Alluxio, batch queries usually lasting over 15 minutes were transformed into interactive queries taking less than 30 seconds
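
The timings quoted in this case study imply two different speedup ratios, which is easy to check with a few lines of arithmetic. The sketch below is only illustrative; the function and variable names are made up here, and the inputs are the boundary values quoted on the slide (so "over 15 minutes" and "less than 30 seconds" are treated as exactly 900 s and 30 s).

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' timing is."""
    return before_seconds / after_seconds


# Batch query: over 15 minutes before, under 30 seconds after.
batch_speedup = speedup(15 * 60, 30)

# Quoted Spark SQL query: 100-150 seconds before, 10-15 seconds after.
query_speedup_low = speedup(100, 10)
query_speedup_high = speedup(150, 15)

if __name__ == "__main__":
    print("batch query speedup:", batch_speedup)
    print("quoted query speedup:", query_speedup_low, "to", query_speedup_high)
```

The batch-query figures are consistent with the "30x faster" bullet, while the quoted Spark SQL timings correspond to roughly a 10x improvement.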

USE CASE 2 – Share Data Across Jobs at Memory Speed
•  Architectures Requiring Shared Data
    •  Pipelines: output of one job is input of the next job
    •  Different applications, jobs, or contexts read the same data
•  Disadvantage
    •  Sharing data requires I/O

Use Case without Alluxio
[Diagram: Spark, MapReduce, and Spark exchange data through Storage. Labels: Network I/O, Disk I/O.]
I/O slows down sharing

Use Case with Alluxio
[Diagram: Spark, MapReduce, and Spark above Alluxio, with Storage underneath.]
Sharing data in Alluxio

CASE STUDY
Share Data Across Jobs at Memory-Speed
"Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity."
- Henry Powell, Barclays
[Diagram label: Relational Database]
•  6 node deployment
•  1TB of storage
•  Memory only
RESULTS
•  Barclays workflow iteration time decreased from hours to seconds
•  Alluxio enabled workflows that were impossible before
•  By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

USE CASE 3 - Transparently Manage Data Across Storage Systems
•  Reasons
    •  Most enterprises have multiple storage systems
    •  New (better, faster, cheaper) storage systems arise
•  Disadvantage
    •  Managing data across systems can be difficult

Use Case Explained
[Diagram: Spark, MapReduce, and Spark above Alluxio, with three Storage systems underneath.]
Flexible, simple: no application changes, new mount point
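
The "new mount point" idea can be pictured as a table that maps paths in one unified namespace onto different underlying storage systems. The sketch below is a toy illustration of longest-prefix mount resolution and is not the Alluxio API: the `MountTable` class, its methods, and the example URIs are all hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy unified-namespace mount table (hypothetical, not Alluxio's API)."""

    def __init__(self) -> None:
        # Maps a namespace path (e.g. "/logs") to a storage URI.
        self._mounts: Dict[str, str] = {}

    def mount(self, namespace_path: str, storage_uri: str) -> None:
        """Attach a storage location at a path in the unified namespace."""
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path to a storage URI via longest-prefix match."""
        best: Optional[str] = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


if __name__ == "__main__":
    table = MountTable()
    table.mount("/", "hdfs://namenode:9000/data")  # example URI, assumed
    table.mount("/logs", "s3://bucket/logs")       # example URI, assumed
    print(table.resolve("/logs/2016/09.log"))
    print(table.resolve("/tables/t1"))
```

In this picture, applications keep using the same namespace paths; adding a storage system only adds an entry to the table, which is the sense in which no application changes are needed.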

CASE STUDY
Transparently Manage Data Across Different Storage Systems
"We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times."
- Xueyan Li, Qunar
•  6 billion logs (4.5 TB) daily
•  Mix of Memory + HDD
RESULTS
•  Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems
•  Improved the performance of their system with 15x – 300x speedups
•  Tiered storage feature manages various storage resources including memory, SSD and disk

USE CASE 4 - Compute on Data in Different Storage with Compliance Requirements
•  Motivation
    •  Compliance with local laws restricts data storage location
    •  Global Analytics on this data is not possible

Use Case Explained
[Diagram: Spark, MapReduce, and Spark above Alluxio, with three Storage systems underneath.]
Flexible, simple: no application changes, new mount point

CASE STUDY
Compute on Data in Different Storage with Compliance Requirements
A Global Fortune 500 Enterprise
•  Memory + SSD
RESULTS
•  Alluxio’s unified namespace enables any compute cluster to access data from storage systems at different data centers
•  Enables global analytics that were not possible before
•  No local persistent storage of data

Thank you!
•  Contact: {haoyuan, gene}@alluxio.com or info@alluxio.com
•  Twitter: @Alluxio
•  Websites: www.alluxio.com and www.alluxio.org
