Alluxio Bay Area Meetup March 14th
Join the Alluxio Meetup group: https://www.meetup.com/Alluxio
Alluxio Community slack: https://www.alluxio.org/slack
4. ● Release Manager for Alluxio 2.0.0
● Contributor since Tachyon 0.4 (2012)
● Founding Engineer @ Alluxio
About Me
Calvin Jia
5. Alluxio Overview
• Open source, distributed storage system
• Commonly used for data analytics such as OLAP on Hadoop
• Deployed at Huya, Two Sigma, Tencent, and many others
• Largest deployments of over 1000 nodes
[Architecture diagram] Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Under store drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
8. Why 2.0
• Alluxio 1.x target use cases are largely addressed
• Three major types of feedback from users
• Want to support POSIX-based workloads, especially ML
• Want better options for data management
• Want to scale to larger clusters
9. Use Cases
Alluxio 1.x
• Burst compute into cloud with data
on-prem
• Enable object stores for data
analytics platforms
• Accelerate OLAP on Hadoop
Example
• As a data scientist, I want to be able
to spin up my own elastic compute
cluster that can easily and efficiently
access my data stores
New in Alluxio 2.x
• Enable ML/DL frameworks on object
stores
• Data lifecycle management and data
migration
Examples
• As a data scientist, I want to run my
existing simulations on larger
datasets stored in S3.
• As a data infrastructure engineer, I
want to automatically tier data
between Alluxio and the under store.
10. ML/DL Workloads
• Alluxio 1.x focuses primarily on Hadoop-based workloads, i.e., OLAP on Hadoop
• Alluxio 2.x will continue to excel for these workloads
• New emphasis on ML frameworks such as TensorFlow
• These workloads primarily access the same data sets Alluxio already serves
• Challenges include a new API and different file characteristics, such as file access patterns and file sizes
11. Data Management
• Finer grained control over Alluxio replication
• Automated and scalable async persistence
• Distributed data loading
• Mechanism for cross-mount data operations
12. Scaling
• Namespace scaling - scale to 1 billion files
• Cluster scaling - scale to 3000 worker nodes
• Client scaling - scale to 30,000 concurrent clients
14. Architectural Innovations in 2.0
• Off heap metadata storage (namespace scaling)
• gRPC transport layer (cluster and client scaling)
• Improved POSIX API (new workloads)
• Job Service (enable data management)
• Embedded Journal and Internal Leader Election (better integration
with object stores, fewer external dependencies)
15. Off Heap Metadata Storage
• Uses an embedded RocksDB to store inode tree
• Internal cache for frequently used inodes
• Performance is comparable to previous on-heap option when
working set can fit in cache
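The caching idea above can be sketched as a small LRU cache in front of a slower backing store; the class and its names are illustrative, with a plain dict standing in for the embedded RocksDB instance:

```python
from collections import OrderedDict

class CachedInodeStore:
    """Sketch: an in-memory LRU cache in front of a slower backing
    store, loosely mirroring how frequently used inodes can stay in
    memory while the full tree lives off heap in RocksDB."""

    def __init__(self, backing, capacity=1000):
        self.backing = backing           # stand-in for a RocksDB handle
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)   # mark as recently used
            return self.cache[inode_id]
        value = self.backing[inode_id]         # slow path: hit the store
        self.cache[inode_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value

    def put(self, inode_id, inode):
        self.backing[inode_id] = inode         # write through to the store
        self.cache[inode_id] = inode
        self.cache.move_to_end(inode_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

store = CachedInodeStore(backing={}, capacity=2)
store.put(1, "/")
store.put(2, "/data")
store.put(3, "/data/file")   # evicts inode 1 from the cache, not the store
assert store.get(1) == "/"   # still served, via the backing store
```

As long as the working set fits in the cache capacity, reads rarely touch the backing store, which is why performance stays comparable to the old on-heap option.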
16. gRPC Transport Layer
• Switch from Thrift (metadata) + Netty (data) transports to a consolidated gRPC-based transport
• Connection multiplexing to reduce the number of connections from
# of application threads to # of applications
• Threading model enables the master to serve concurrent requests
without being limited by internal threadpool size or open file
descriptors on the master
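The multiplexing point above can be sketched with a process-wide pool that hands every thread the same underlying channel per master address; `ChannelPool` and the stand-in channel object are hypothetical, not the real gRPC client:

```python
import threading

class ChannelPool:
    """Sketch of connection multiplexing: all threads in one application
    share a single channel per master address, so the connection count
    scales with the number of applications rather than the number of
    application threads. The object() is a stand-in for a gRPC channel."""

    def __init__(self):
        self._channels = {}
        self._lock = threading.Lock()

    def get(self, address):
        with self._lock:
            if address not in self._channels:
                self._channels[address] = object()  # stand-in for a real channel
            return self._channels[address]

pool = ChannelPool()
results = []

def worker():
    results.append(pool.get("master:19998"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All 8 threads were multiplexed onto the same underlying channel:
assert all(ch is results[0] for ch in results)
```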
17. Improved POSIX API
• Alluxio FUSE based POSIX API
• Limitations such as no random writes, and a file cannot be read until it is completely written
• Validated against TensorFlow’s image recognition and recommendation workloads
• Taking suggestions for other POSIX-based workloads!
18. Job Service
• New process which serves as a lightweight computation framework
for Alluxio specific tasks
• Enables replication factor control without user input
• Enables faster loading/persisting of data in a distributed manner
• Allows users to do cross-mount operations
• Async persistence is handled automatically
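The fan-out pattern described above can be sketched as a tiny task framework; the names here (`JobService`, `replicate`) are illustrative stand-ins, not the real Alluxio Job Service API:

```python
from concurrent.futures import ThreadPoolExecutor

def replicate(path, factor):
    """Stand-in for an Alluxio-specific task (replication, distributed
    load, or async persistence of a file)."""
    return f"replicated {path} x{factor}"

class JobService:
    """Sketch of a lightweight computation framework that fans
    storage-maintenance tasks out across a worker pool, so replication
    and loading happen in a distributed manner without user input."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, task, *args):
        return self.pool.submit(task, *args)

svc = JobService()
futures = [svc.submit(replicate, f"/data/part-{i}", 2) for i in range(3)]
assert [f.result() for f in futures] == [
    "replicated /data/part-0 x2",
    "replicated /data/part-1 x2",
    "replicated /data/part-2 x2",
]
```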
19. Embedded Journal and Internal Leader Election
• New journaling service reliant only on Alluxio master processes
• No longer need an external distributed storage to store the journal
• Greatly benefits environments without a distributed file system
• Uses Raft as the consensus algorithm
• Consensus is used for journal integrity
• Consensus can also be used for leader election in high availability mode
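The quorum rule behind both uses of consensus can be shown with a toy election; real Raft adds terms, logs, and randomized timeouts, so this is only the majority arithmetic:

```python
def majority(n):
    """A cluster of n masters stays available while a strict majority
    can communicate: majority(3) == 2, majority(5) == 3."""
    return n // 2 + 1

def elect_leader(votes):
    """Toy leader election: a candidate wins only with a majority of
    votes. The same quorum rule protects journal integrity, since an
    entry is committed only once a majority has accepted it."""
    tally = {}
    for v in votes:
        tally[v] = tally.get(v, 0) + 1
    need = majority(len(votes))
    for candidate, count in tally.items():
        if count >= need:
            return candidate
    return None   # no quorum -> no leader this round

assert majority(3) == 2 and majority(5) == 3
assert elect_leader(["m1", "m1", "m2"]) == "m1"
assert elect_leader(["m1", "m2", "m3"]) is None
```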
21. Alluxio 2.0.0 Release
• Alluxio 2.0.0-preview is available now
• Any and all feedback is appreciated!
• File bugs and feature requests on our Github issues
• Alluxio 2.0.0 will be released in ~3 months
26. Overview - Big Data systems
• Separate Streaming and Batch platforms with a single data pre-processing pipeline; no longer a pure Lambda architecture
• Typically, streaming data is sunk into Hive tables every 5 minutes
• More ETL jobs are moving toward Near Real Time
[Pipeline diagram] Log → Kafka → Data Cleansing → Kafka → Augmentation → Kafka → Hive Delta / Hive Daily, via Streaming (Storm/Flink/Spark) and Batch ETL (Hive/Spark)
27.
The process of identifying a set of user actions (“events”) across screens and touch
points that contribute in some manner to a product sale, and then assigning value to
each of these events.
[Click-path example] front → today’s new → man’s special → Product A detail → man’s special → Product B detail → add cart → order
28. Near Real-time sales attribution
is a very complex process
• Recompute full day’s data at each iteration:
• ~ 30 minutes, worst case 2-3 hours
• Many data sources involved:
• page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods, etc.
• Several large data sources each contain billions of records and take up 300GB-800GB on disk
• Sales path assignment is a very CPU-intensive computation
• Written by business analysts
• Complex SQL scripts with UDFs
Business expectation: updated results every 5-15 minutes
29. Running performance-sensitive jobs on the current batch platform is not an option
• Around 200K batch jobs executed daily in Hadoop & Spark clusters
• HDFS: 1400+ nodes
• SSD HDFS: 50+ nodes
• Spark clusters: 300+ nodes
• Cluster usage is above 80% on normal days; resources are even more saturated during monthly promotion periods
• Many issues contribute to inconsistent data access times, such as high NameNode RPC load and slow DataNode responses
• Scheduling overhead when running M/R jobs
30.
1. Adding more compute power
• Too expensive - not a real option
2. Improve ETL jobs to process updates incrementally
3. Create a new, relatively isolated environment
• Consistent computing resource allocation
• Intermediate data caching
• Faster read/write
31. 2. Improve ETL jobs to process updates incrementally
• Recompute the click paths for the active users in the current window
• Merge active user paths with the previous full path result
• Less data in computation, but one more read of history data
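The incremental step above can be sketched as follows; the data shapes and function name are illustrative, not the production SQL:

```python
def incremental_update(history, active_events):
    """Sketch of incremental ETL: recompute click paths only for users
    active in the current window, then merge them into the previous
    full-day result instead of recomputing the whole day."""
    # Recompute paths just for the active users in this window
    updated = {}
    for user, event in active_events:
        # One extra read of the history data for each active user
        updated.setdefault(user, list(history.get(user, [])))
        updated[user].append(event)
    # Merge: untouched users keep their previous result
    merged = dict(history)
    merged.update(updated)
    return merged

history = {"u1": ["front", "detail"], "u2": ["front"]}
window = [("u1", "add_cart"), ("u3", "front")]
result = incremental_update(history, window)
assert result["u1"] == ["front", "detail", "add_cart"]
assert result["u2"] == ["front"]   # untouched user, carried over from history
assert result["u3"] == ["front"]
```

The trade-off is exactly the one the slide names: far less data in each computation, at the cost of one additional read over the history result.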
33.
• A satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256G memory)
• Alluxio colocated with Spark
• Very consistent read/write I/O time over iterations
• Alluxio Mem + HDD
• Disable multiple copies to save space
• Leave enough memory to the OS to improve stability
34.
A. Remote HDFS cluster: 1-2x slower than Alluxio; the biggest problem is frequent spikes
B. Local HDFS: 30%-100% slower than Alluxio (Mem + HDD)
C. Dedicated SSD cluster: on par with Alluxio on regular days, but overall read/write latency doubled during busy days
D. Dedicated Alluxio cluster: still not as good as the co-located setup (more tests to be done)
E. Spark cache:
• Our daily views, clicks, and path results are too big to fit into the JVM
• Slow to create, and we have lots of “only used twice” data
• Multiple downstream Spark apps need to share the data
35.
• Move the downstream processes closer to the data; avoid duplicating large amounts of data from Alluxio to remote HDFS
• Manage NRT jobs
• A single big Spark Streaming job? Too many inputs and outputs at different stages
• Split into multiple jobs? How to coordinate multiple streaming jobs?
• NRT runs at a much higher frequency and is very sensitive to system hiccups
• Current batch job scheduling
• Process dependencies, executed at every fixed interval
• When there is a severe delay, multiple batch instances for different slots run at the same time
36.
• Report data readiness to a Watermark Service, which manages dependencies between loosely coupled jobs
• The ultimate goal is to get the latest result fast
• A delayed batch might consume unprocessed input blocks spanning multiple cycles
• Output at fixed intervals is not guaranteed
• Not all inputs are mandatory; an iteration is kicked off even when optional input sources are not updated for that particular cycle
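The trigger logic above can be sketched as follows; the function and source names are illustrative, not the actual Watermark Service interface:

```python
def ready_to_run(watermarks, mandatory, optional, last_run):
    """Sketch of a watermark-driven trigger: an iteration kicks off once
    every mandatory input has reported data newer than the last run;
    optional inputs that have not updated this cycle do not block it."""
    if all(watermarks.get(src, 0) > last_run for src in mandatory):
        # Consume every unprocessed input, even when a delay means the
        # new data spans multiple cycles since last_run.
        inputs = [s for s in mandatory + optional
                  if watermarks.get(s, 0) > last_run]
        return True, inputs
    return False, []

marks = {"page_view": 120, "add_cart": 118, "order_cookie_map": 90}
ok, inputs = ready_to_run(marks, ["page_view", "add_cart"],
                          ["order_cookie_map"], last_run=100)
assert ok and inputs == ["page_view", "add_cart"]   # optional source skipped
ok, _ = ready_to_run(marks, ["page_view", "order_cookie_map"], [], last_run=100)
assert not ok   # a mandatory input has not updated yet
```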
37.
• Easy to set up
• Pluggable: just a simple switch from hdfs://xxxx to alluxio://xxxx
• Together with Spark, it either forms a separate satellite cluster or runs on labeled machines in our big clusters
• Within our data centers it is easier to allocate computing resources, but SSD machines are scarce
• Spark and Alluxio on K8s: with over 1k machines, we need to shuffle those machines to run Streaming, Spark ETL, Presto ad hoc queries, or ML on different days or at different times of day
• Very stable in production
• Over two and a half years without any major issue. A big thank-you to the Alluxio engineers!
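The “simple switch” of URI schemes can be sketched like this; the Alluxio master authority shown is a made-up example:

```python
from urllib.parse import urlsplit, urlunsplit

def to_alluxio(uri, alluxio_authority="alluxio-master:19998"):
    """Sketch: pointing a job at Alluxio is just rewriting the URI
    scheme from hdfs:// to alluxio://; the path stays unchanged, so
    existing SQL and Spark code keeps working."""
    parts = urlsplit(uri)
    if parts.scheme == "hdfs":
        return urlunsplit(("alluxio", alluxio_authority, parts.path, "", ""))
    return uri   # non-HDFS URIs pass through untouched

assert to_alluxio("hdfs://nn:8020/warehouse/clicks") == \
    "alluxio://alluxio-master:19998/warehouse/clicks"
```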
38.
• Async persistence to remote HDFS
• Avoids duplicated writes in user code/SQL
• Put the Hadoop /tmp/ directory on Alluxio over SSD to reduce NameNode RPC load and DataNode load
• Cache hot/warm data for Presto; heavy-traffic and ad hoc queries are very sensitive to HDFS stability