Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017

EFFECTIVE SPARK WITH ALLUXIO
Strata + Hadoop World San Jose 2017
Calvin Jia, Alluxio Inc.

OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation

BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Unifying Data at Memory Speed
• Interfaces offered to applications: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
• Interfaces to underlying storage: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface

FASTEST GROWING BIG DATA PROJECTS
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running in large production clusters
• Community members are welcome!
[Figure: Popular Open Source Projects’ Growth; x-axis: Months, y-axis: Number of Contributors]


ACCELERATE I/O TO/FROM REMOTE STORAGE
"The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds."
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• The Alluxio cluster runs stably, providing over 50 TB of RAM space
• With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram: ALLUXIO, Baidu File System]
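The figures quoted on this slide can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `speedup`, and takes the quoted bounds as inputs. Because the slide says "over 15 minutes" and "less than 30 seconds", the batch-query figure is a lower bound rather than a measured value.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds

# Batch query: over 15 minutes before, under 30 seconds after.
batch = speedup(15 * 60, 30)      # 30.0, in line with the "30x faster" bullet

# Spark SQL query from the quote: 100-150 s before, 10-15 s after.
quote_low = speedup(100, 10)      # 10.0
quote_high = speedup(150, 15)     # 10.0
```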

GUARANTEE PERFORMANCE TO MEET SLAS
RESULTS
• Introducing Alluxio gives the system smooth performance in an environment with highly variable load
• Without Alluxio, critical jobs could take more than 4x the expected time
• With Alluxio’s guaranteed performance, business logic can be executed reliably, leading to an improved user experience
A leading online retailer uses Alluxio to compute conversion attribution
• Real-time machine learning
• 10+ TB of storage
• Mix of memory + HDD
[Diagram: ALLUXIO]

SHARE DATA ACROSS JOBS @ MEMORY SPEED
"Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity."
- Barclays
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query and machine learning to train models for risk management
• 6 node deployment
• 1 TB of storage
• Memory only
[Diagram: ALLUXIO, Relational Database]

TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS
"We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems."
- Qunar
RESULTS
• Efficient data sharing among Spark Streaming, Spark batch, and Flink jobs
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory, SSD, and disk
Qunar uses real-time machine learning for their website ads
• 200+ node deployment
• 6 billion logs (4.5 TB) daily
• Mix of memory + HDD
[Diagram: ALLUXIO]


CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory: double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holds blocks 1-4]

CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Different Process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
[Diagram: two Spark processes (Spark Compute, Spark Storage) alongside Alluxio holding block 1, block 3, and block 4; HDFS / Amazon S3 (HDFS disk) holds blocks 1-4]
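The "double the memory" versus "half the memory" claim can be illustrated with a deliberately simplified model. The function names and the 10 GB dataset size below are assumptions for illustration, not figures from the talk, and the model ignores serialization overhead and partial caching.

```python
def memory_same_process(dataset_gb, num_spark_apps):
    """Each Spark application caches its own copy in its own storage."""
    return dataset_gb * num_spark_apps

def memory_shared_store(dataset_gb, num_spark_apps):
    """One copy lives in a separate storage process shared by all applications."""
    return dataset_gb

# Two applications reading the same (hypothetical) 10 GB dataset.
same = memory_same_process(10, 2)     # 20 GB held in memory in total
shared = memory_shared_store(10, 2)   # 10 GB held in memory in total
```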

DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Same Process
[Diagram, before crash: Spark Compute and Spark Storage (block 1, block 3) in one process; HDFS / Amazon S3 holds blocks 1-4]
[Diagram, crash step 1: CRASH replaces Spark Compute; Spark Storage still lists block 1 and block 3]
[Diagram, crash step 2: CRASH; Spark Storage and its blocks are gone; HDFS / Amazon S3 holds blocks 1-4]
• Process Crash Requires Network and/or Disk I/O to Re-read Data

Storage Engine & Execution Engine: Different Process
[Diagram, before crash: Spark Compute and Spark Storage; Alluxio holds block 1, block 3, and block 4; HDFS / Amazon S3 (HDFS disk) holds blocks 1-4]
[Diagram, CRASH: the Spark process is gone; Alluxio still holds block 1, block 3, and block 4]
• Process Crash: Data is Re-read at Memory Speed
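The difference between the two crash scenarios can be sketched as a toy model. Everything here (class names, the dictionary standing in for HDFS / Amazon S3, the block contents) is hypothetical and only mirrors the diagrams; it is not Alluxio or Spark code.

```python
class InProcessCache:
    """Toy model: cached blocks live inside the compute process."""
    def __init__(self):
        self.blocks = {}

    def crash(self):
        # A process crash takes the in-process cache down with it.
        self.blocks = {}

class ExternalCache:
    """Toy model: cached blocks live in a separate storage process."""
    def __init__(self):
        self.blocks = {}

    def crash(self):
        # Only the compute process dies; the separate store is untouched.
        pass

def read_block(cache, block_id, remote_store):
    """Return (data, source), where source is 'memory' or 'remote'."""
    if block_id in cache.blocks:
        return cache.blocks[block_id], "memory"
    data = remote_store[block_id]      # stands in for network / disk I/O
    cache.blocks[block_id] = data
    return data, "remote"

remote = {"block 1": b"a", "block 3": b"c"}
in_proc, external = InProcessCache(), ExternalCache()
for cache in (in_proc, external):
    read_block(cache, "block 1", remote)   # warm both caches
    cache.crash()                          # then crash the compute process

source_in_proc = read_block(in_proc, "block 1", remote)[1]     # 'remote'
source_external = read_block(external, "block 1", remote)[1]   # 'memory'
```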

ACCESSING ALLUXIO DATA FROM SPARK
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file

CODE EXAMPLE FOR SPARK RDDS
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
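A slightly fuller PySpark sketch of the text-file round trip follows. The helper names and the host `alluxio-master` are hypothetical. Port 19998 is Alluxio's default master port. The sketch assumes the Alluxio client jar is on the Spark driver and executor classpaths. In PySpark the closest counterparts to the object-file calls are `rdd.saveAsPickleFile` and `sc.pickleFile`.

```python
ALLUXIO_DEFAULT_PORT = 19998  # default Alluxio master port

def alluxio_uri(master_host, path, port=ALLUXIO_DEFAULT_PORT):
    """Build an alluxio:// URI such as alluxio://host:19998/dir/file."""
    if not path.startswith("/"):
        path = "/" + path
    return "alluxio://{}:{}{}".format(master_host, port, path)

def roundtrip_text_rdd(sc, master_host, path):
    """Write a small RDD to Alluxio as text and read it back.

    Assumes `sc` is a live SparkContext and that the target path does not
    exist yet (saveAsTextFile refuses to overwrite an existing directory).
    """
    alluxio_path = alluxio_uri(master_host, path)
    sc.parallelize(["a", "b", "c"]).saveAsTextFile(alluxio_path)
    return sc.textFile(alluxio_path).collect()

# Hypothetical host name, used only to show the URI shape.
example_path = alluxio_uri("alluxio-master", "/demo/rdd_text")
```

Only `alluxio_uri` runs without a cluster; `roundtrip_text_rdd` needs a running Spark and Alluxio deployment.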

CODE EXAMPLE FOR SPARK DATAFRAMES
Writing to Alluxio: df.write.parquet(alluxioPath)
Reading from Alluxio: df = spark.read.parquet(alluxioPath)
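The DataFrame round trip can be wrapped in a small helper. This is a sketch that assumes `spark` is a SparkSession (the DataFrame reader entry point in Spark 2.0 and later) and that `alluxio_path` is an `alluxio://` URI reachable from the cluster. The function name and the choice of overwrite mode are illustrative, not taken from the talk.

```python
def roundtrip_parquet(spark, df, alluxio_path):
    """Write `df` to Alluxio as Parquet, then read it back as a new DataFrame.

    `spark` is assumed to be a SparkSession and `df` a Spark DataFrame.
    """
    # Overwrite so the sketch can be re-run against the same path.
    df.write.mode("overwrite").parquet(alluxio_path)
    return spark.read.parquet(alluxio_path)
```

A later job, or a new Spark context, can read the same path with `spark.read.parquet(alluxio_path)` without re-running the job that produced it.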


ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
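The slides do not show the benchmark harness. A minimal sketch of how read times for such a comparison might be collected is given below; `time_action` and the stand-in workload are hypothetical, and a real measurement would pass a Spark action instead.

```python
import time

def time_action(action, repeats=3):
    """Run `action` `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload so the sketch runs without a cluster. With Spark this
# could be, for example: lambda: sc.textFile(alluxio_path).count()
timings = time_action(lambda: sum(range(100000)))
```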

READING CACHED RDD
[Figure: Time [seconds] (0-250) vs. RDD Size [GB] (0-50); series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]

NEW CONTEXT: READ 50 GB RDD (SSD)
[Figure: bar chart of Time [seconds] (0-250) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio]

NEW CONTEXT: READ 50 GB RDD (S3)
[Figure: bar chart of Time [seconds] (0-800) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio; annotations: 7x speedup, 16x speedup]

READING CACHED DATAFRAME (PARQUET)
[Figure: Time [seconds] (0-250) vs. DataFrame Size [GB] (0-50); series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY]

NEW CONTEXT: READ 50 GB DATAFRAME (SSD)
[Figure: bar chart of Time [seconds] (0-250) for Alluxio and No Alluxio]

NEW CONTEXT: READ 50 GB DATAFRAME (S3)
[Figure: bar chart of Time [seconds] (0-1750) for Alluxio and No Alluxio]
10x average speedup, 17x peak speedup

CONCLUSION
• Easy to use with Spark
• Predictable and improved performance
• Easily connect to various storage systems

Contact: calvin@alluxio.com
Twitter: @jiacalvin
Websites: www.alluxio.com and www.alluxio.org
Thank you!

BENEFITS
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage; grow each independently, buy only what is needed
