Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
4. ● Release Manager for Alluxio 2.0.0
● Contributor since Tachyon 0.4 (2012)
● Founding Engineer @ Alluxio
About Me
Calvin Jia
5. Alluxio Overview
• Open source, distributed storage system
• Commonly used for data analytics such as OLAP on Hadoop
• Deployed at Huya, Two Sigma, Tencent, and many others
• Largest deployments of over 1000 nodes
[Architecture diagram] Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
8. Why 2.0
• Alluxio 1.x target use cases are largely addressed
• Three major types of feedback from users
• Want to support POSIX-based workloads, especially ML
• Want better options for data management
• Want to scale to larger clusters
9. Use Cases
Alluxio 1.x
• Burst compute into cloud with data
on-prem
• Enable object stores for data
analytics platforms
• Accelerate OLAP on Hadoop
Example
• As a data scientist, I want to be able
to spin up my own elastic compute
cluster that can easily and efficiently
access my data stores
New in Alluxio 2.x
• Enable ML/DL frameworks on object
stores
• Data lifecycle management and data
migration
Examples
• As a data scientist, I want to run my
existing simulations on larger
datasets stored in S3.
• As a data infrastructure engineer, I
want to automatically tier data
between Alluxio and the under store.
10. ML/DL Workloads
• Alluxio 1.x focuses primarily on Hadoop-based workloads, i.e., OLAP on Hadoop
• Alluxio 2.x will continue to excel for these workloads
• New emphasis on ML frameworks such as TensorFlow
• These workloads primarily access the same data sets Alluxio already serves
• Challenges include a new API and different file characteristics, such as file access patterns and file sizes
11. Data Management
• Finer grained control over Alluxio replication
• Automated and scalable async persistence
• Distributed data loading
• Mechanism for cross-mount data operations
12. Scaling
• Namespace scaling - scale to 1 billion files
• Cluster scaling - scale to 3000 worker nodes
• Client scaling - scale to 30,000 concurrent clients
14. Architectural Innovations in 2.0
• Off heap metadata storage (namespace scaling)
• gRPC transport layer (cluster and client scaling)
• Improved POSIX API (new workloads)
• Job Service (enable data management)
• Embedded Journal and Internal Leader Election (better integration
with object stores, fewer external dependencies)
15. Off Heap Metadata Storage
• Uses an embedded RocksDB to store inode tree
• Internal cache for frequently used inodes
• Performance is comparable to previous on-heap option when
working set can fit in cache
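The caching idea above can be sketched as a small LRU cache in front of a slower backing store; the class and its names are illustrative, with a plain dict standing in for the embedded RocksDB instance:

```python
from collections import OrderedDict

class CachedInodeStore:
    """Sketch: an in-memory LRU cache in front of a slower backing
    store, loosely mirroring how frequently used inodes can stay in
    memory while the full tree lives off heap in RocksDB."""

    def __init__(self, backing, capacity=1000):
        self.backing = backing           # stand-in for a RocksDB handle
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)   # mark as recently used
            return self.cache[inode_id]
        value = self.backing[inode_id]         # slow path: hit the store
        self.cache[inode_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value

    def put(self, inode_id, inode):
        self.backing[inode_id] = inode         # write through to the store
        self.cache[inode_id] = inode
        self.cache.move_to_end(inode_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

store = CachedInodeStore(backing={}, capacity=2)
store.put(1, "/")
store.put(2, "/data")
store.put(3, "/data/file")   # evicts inode 1 from the cache, not the store
assert store.get(1) == "/"   # still served, via the backing store
```

As long as the working set fits in the cache capacity, reads rarely touch the backing store, which is why performance stays comparable to the old on-heap option.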
16. gRPC Transport Layer
• Switch from Thrift (metadata) + Netty (data) transports to a consolidated gRPC-based transport
• Connection multiplexing to reduce the number of connections from
# of application threads to # of applications
• Threading model enables the master to serve concurrent requests
without being limited by internal threadpool size or open file
descriptors on the master
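The multiplexing point above can be sketched with a process-wide pool that hands every thread the same underlying channel per master address; `ChannelPool` and the stand-in channel object are hypothetical, not the real gRPC client:

```python
import threading

class ChannelPool:
    """Sketch of connection multiplexing: all threads in one application
    share a single channel per master address, so the connection count
    scales with the number of applications rather than the number of
    application threads. The object() is a stand-in for a gRPC channel."""

    def __init__(self):
        self._channels = {}
        self._lock = threading.Lock()

    def get(self, address):
        with self._lock:
            if address not in self._channels:
                self._channels[address] = object()  # stand-in for a real channel
            return self._channels[address]

pool = ChannelPool()
results = []

def worker():
    results.append(pool.get("master:19998"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All 8 threads were multiplexed onto the same underlying channel:
assert all(ch is results[0] for ch in results)
```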
17. Improved POSIX API
• Alluxio FUSE based POSIX API
• Limitations such as no random writes, and a file cannot be read until it is completely written
• Validated against TensorFlow’s image recognition and recommendation workloads
• Taking suggestions for other POSIX-based workloads!
18. Job Service
• New process which serves as a lightweight computation framework
for Alluxio specific tasks
• Enables replication factor control without user input
• Enables faster loading/persisting of data in a distributed manner
• Allows users to do cross-mount operations
• Async persistence is handled automatically
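The fan-out pattern described above can be sketched as a tiny task framework; the names here (`JobService`, `replicate`) are illustrative stand-ins, not the real Alluxio Job Service API:

```python
from concurrent.futures import ThreadPoolExecutor

def replicate(path, factor):
    """Stand-in for an Alluxio-specific task (replication, distributed
    load, or async persistence of a file)."""
    return f"replicated {path} x{factor}"

class JobService:
    """Sketch of a lightweight computation framework that fans
    storage-maintenance tasks out across a worker pool, so replication
    and loading happen in a distributed manner without user input."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, task, *args):
        return self.pool.submit(task, *args)

svc = JobService()
futures = [svc.submit(replicate, f"/data/part-{i}", 2) for i in range(3)]
assert [f.result() for f in futures] == [
    "replicated /data/part-0 x2",
    "replicated /data/part-1 x2",
    "replicated /data/part-2 x2",
]
```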
19. Embedded Journal and Internal Leader Election
• New journaling service reliant only on Alluxio master processes
• No longer need an external distributed storage to store the journal
• Greatly benefits environments without a distributed file system
• Uses Raft as the consensus algorithm
• Consensus is used for journal integrity
• Consensus can also be used for leader election in high availability mode
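The quorum rule behind both uses of consensus can be shown with a toy election; real Raft adds terms, logs, and randomized timeouts, so this is only the majority arithmetic:

```python
def majority(n):
    """A cluster of n masters stays available while a strict majority
    can communicate: majority(3) == 2, majority(5) == 3."""
    return n // 2 + 1

def elect_leader(votes):
    """Toy leader election: a candidate wins only with a majority of
    votes. The same quorum rule protects journal integrity, since an
    entry is committed only once a majority has accepted it."""
    tally = {}
    for v in votes:
        tally[v] = tally.get(v, 0) + 1
    need = majority(len(votes))
    for candidate, count in tally.items():
        if count >= need:
            return candidate
    return None   # no quorum -> no leader this round

assert majority(3) == 2 and majority(5) == 3
assert elect_leader(["m1", "m1", "m2"]) == "m1"
assert elect_leader(["m1", "m2", "m3"]) is None
```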
21. Alluxio 2.0.0 Release
• Alluxio 2.0.0-preview is available now
• Any and all feedback is appreciated!
• File bugs and feature requests on our Github issues
• Alluxio 2.0.0 will be released in ~3 months
26. Overview - Big Data systems
• Separate Streaming and Batch platforms with a single data pre-processing pipeline; no longer a pure Lambda architecture
• Typically, streaming data is sunk into Hive tables every 5 minutes
• More ETL jobs are moving toward Near Real Time
[Pipeline diagram] Log → Kafka → Data Cleansing → Kafka → Augmentation → Kafka → Hive Delta / Hive Daily, via Streaming (Storm/Flink/Spark) and Batch ETL (Hive/Spark)
27.
The process of identifying a set of user actions (“events”) across screens and touch
points that contribute in some manner to a product sale, and then assigning value to
each of these events.
[Click-path example] front → today’s new → man’s special → Product A detail → man’s special → Product B detail → add cart → order
28. Near Real-time sales attribution
is a very complex process
• Recompute full day’s data at each iteration:
• ~ 30 minutes, worst case 2-3 hours
• Many data sources involved:
• page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc.
• Several large data sources each contain billions of records and take up 300GB-800GB on disk
• Sales path assignment is a very CPU-intensive computation
• Written by business analysts
• Complex SQL scripts with UDFs
Business expectation: updated results every 5-15 minutes
29. Running performance-sensitive jobs on the current batch platform is not an option
• Around 200K batch jobs executed daily in Hadoop & Spark clusters
• HDFS: 1400+ nodes
• SSD HDFS: 50+ nodes
• Spark clusters: 300+ nodes
• Cluster usage is above 80% on normal days; resources are even more saturated during monthly promotion periods
• Many issues contribute to inconsistent data access times, such as high NameNode RPC load and slow DataNode responses
• Scheduling overhead when running M/R jobs
30.
1. Adding more compute power
• Too expensive - not a real option
2. Improve ETL jobs to process updates incrementally
3. Create a new, relatively isolated environment
• Consistent computing resource allocation
• Intermediate data caching
• Faster read/write
31. 2. Improve ETL jobs to process updates incrementally
• Recompute the click paths for the active users in the current window
• Merge active user paths with the previous full path result
• Less data in computation, but one more read of history data
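The incremental step above can be sketched as follows; the data shapes and function name are illustrative, not the production SQL:

```python
def incremental_update(history, active_events):
    """Sketch of incremental ETL: recompute click paths only for users
    active in the current window, then merge them into the previous
    full-day result instead of recomputing the whole day."""
    # Recompute paths just for the active users in this window
    updated = {}
    for user, event in active_events:
        # One extra read of the history data for each active user
        updated.setdefault(user, list(history.get(user, [])))
        updated[user].append(event)
    # Merge: untouched users keep their previous result
    merged = dict(history)
    merged.update(updated)
    return merged

history = {"u1": ["front", "detail"], "u2": ["front"]}
window = [("u1", "add_cart"), ("u3", "front")]
result = incremental_update(history, window)
assert result["u1"] == ["front", "detail", "add_cart"]
assert result["u2"] == ["front"]   # untouched user, carried over from history
assert result["u3"] == ["front"]
```

The trade-off is exactly the one the slide names: far less data in each computation, at the cost of one additional read over the history result.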
33.
• A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256G memory)
• Alluxio colocated with Spark
• Very consistent read/write I/O time over iterations
• Alluxio Mem + HDD
• Disable multiple copies to save space
• Leave enough memory to the OS to improve stability
34.
A. Remote HDFS cluster: 1-2x slower than Alluxio; the biggest problem is frequent spikes
B. Local HDFS: 30%-100% slower than Alluxio (Mem + HDD)
C. Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency doubled during busy days
D. Dedicated Alluxio cluster: still not as good as the co-located setup (more tests to be done)
E. Spark cache:
• Our daily views, clicks, and path results are too big to fit into the JVM
• Slow to create, and we have lots of “only used twice” data
• Multiple downstream Spark apps need to share the data
35.
• Move the downstream processes closer to the data; avoid duplicating large amounts of data from Alluxio to remote HDFS
• Manage NRT jobs
• A single big Spark Streaming job? Too many inputs and outputs at different stages
• Split into multiple jobs? How to coordinate multiple streaming jobs?
• NRT runs at a much higher frequency and is very sensitive to system hiccups
• Current batch job scheduling
• Process dependencies, executed at every fixed interval
• When there is a severe delay, multiple batch instances for different slots run at the same time
36.
• Report data readiness to a Watermark Service, which manages dependencies between loosely coupled jobs
• The ultimate goal is to get the latest result fast
• A delayed batch might consume unprocessed input blocks spanning multiple cycles
• Output at fixed intervals is not guaranteed
• Not all inputs are mandatory; an iteration is kicked off even when optional input sources are not updated for that particular cycle
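The trigger logic above can be sketched as follows; the function and source names are illustrative, not the actual Watermark Service interface:

```python
def ready_to_run(watermarks, mandatory, optional, last_run):
    """Sketch of a watermark-driven trigger: an iteration kicks off once
    every mandatory input has reported data newer than the last run;
    optional inputs that have not updated this cycle do not block it."""
    if all(watermarks.get(src, 0) > last_run for src in mandatory):
        # Consume every unprocessed input, even when a delay means the
        # new data spans multiple cycles since last_run.
        inputs = [s for s in mandatory + optional
                  if watermarks.get(s, 0) > last_run]
        return True, inputs
    return False, []

marks = {"page_view": 120, "add_cart": 118, "order_cookie_map": 90}
ok, inputs = ready_to_run(marks, ["page_view", "add_cart"],
                          ["order_cookie_map"], last_run=100)
assert ok and inputs == ["page_view", "add_cart"]   # optional source skipped
ok, _ = ready_to_run(marks, ["page_view", "order_cookie_map"], [], last_run=100)
assert not ok   # a mandatory input has not updated yet
```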
37.
• Easy to set up
• Pluggable: just a simple switch from hdfs://xxxx to alluxio://xxxx
• Together with Spark, it either forms a separate satellite cluster or runs on labeled machines in our big clusters
• Within our data centers it is easier to allocate computing resources, but SSD machines are scarce
• Spark and Alluxio on K8s: with over 1k machines, we need to shuffle those machines to run Streaming, Spark ETL, Presto ad hoc queries, or ML on different days or at different times of day
• Very stable in production
• Over two and a half years without any major issue. A big thank-you to the Alluxio engineers!
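The “simple switch” of URI schemes can be sketched like this; the Alluxio master authority shown is a made-up example:

```python
from urllib.parse import urlsplit, urlunsplit

def to_alluxio(uri, alluxio_authority="alluxio-master:19998"):
    """Sketch: pointing a job at Alluxio is just rewriting the URI
    scheme from hdfs:// to alluxio://; the path stays unchanged, so
    existing SQL and Spark code keeps working."""
    parts = urlsplit(uri)
    if parts.scheme == "hdfs":
        return urlunsplit(("alluxio", alluxio_authority, parts.path, "", ""))
    return uri   # non-HDFS URIs pass through untouched

assert to_alluxio("hdfs://nn:8020/warehouse/clicks") == \
    "alluxio://alluxio-master:19998/warehouse/clicks"
```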
38.
• Async persistence to remote HDFS
• Avoids duplicated writes in user code/SQL
• Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NameNode RPC load and DataNode load
• Cache hot/warm data for Presto; heavy-traffic and ad hoc queries are very sensitive to HDFS stability