This document provides an overview of Alluxio and JD's contributions to the project. It discusses how Alluxio acts as a virtual distributed storage system that unifies data access at memory speed. It also describes how JD has optimized Alluxio for use with Presto, contributed over 50 pull requests, and hopes to further explore use cases like high availability and a global namespace.
3. Contents
A short Introduction
How to build after modifying Alluxio or Hadoop
Cache the job container log
Using Alluxio to accelerate JobHistory
10x performance improvement
Some of the features contributed by JD
JD Contribution
Expectation of Alluxio & Future plan
Alluxio Future
5. What is Alluxio
• It is the world’s first virtual distributed storage system.
• Alluxio unifies data at memory speed.
• Virtual Data Lake
6. Alluxio is a bridge
• Application interface: Apache Spark, Presto, TensorFlow, Apache HBase, Apache Hive, Apache Flink
• Storage interface: Amazon S3, Google Cloud Storage, OpenStack Swift, GlusterFS, HDFS (various versions), IBM Cleversafe, EMC ECS, Ceph, NFS, Alibaba OSS
7. Powered by Alluxio
https://www.alluxio.io/powered-by-alluxio/
Today, Alluxio is deployed in production by hundreds of organizations, with the largest deployment exceeding 1,500 nodes.
8. Active Open Source Community
Alluxio is one of the fastest-growing open source projects and has attracted more than 1,000 contributors from over 300 institutions, including Alibaba, Alluxio, Baidu, JD.COM, CMU, Google, IBM, Intel, NJU, Red Hat, Tencent, UC Berkeley, and Yahoo.
19. Presto on Alluxio
• Alluxio led to a 10x performance improvement
• 100+ nodes
• More than 2.5 years
When we adopted Alluxio for JDPresto, we made some changes and added some useful features:
• Pluggable: Alluxio can be brought online or updated at any time.
• Fault-tolerant: when Alluxio cannot be accessed, JDPresto can access HDFS directly.
• Locality: reduces remote reads.
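The fault-tolerant behaviour above (fall back to HDFS when Alluxio cannot be reached) can be illustrated with a minimal sketch. This is not JD's or Alluxio's implementation: `CacheUnavailable`, `read_with_fallback`, and the two reader callables are hypothetical names introduced only for illustration.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache layer cannot serve a read."""


def read_with_fallback(path, cache_read, ufs_read):
    """Try the cache first; if it is unavailable, read the under store directly.

    cache_read and ufs_read are caller-supplied callables (assumptions of this
    sketch) that take a path and return bytes. Returns (data, source).
    """
    try:
        return cache_read(path), "cache"
    except CacheUnavailable:
        # The cache layer is down or unreachable: go straight to the under store.
        return ufs_read(path), "ufs"
```

Because the fallback lives in the read path, the cache layer can be taken offline or upgraded without failing queries, which is what makes it pluggable as well as fault-tolerant.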
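27. Watermark Evict Strategy
• Sync Evict Strategy: Start → apply for space → check space → if space is enough, load file from HDFS → End; if there is no space, release space before loading.
• Async Evict Strategy: the client applies for space and loads the file from HDFS. Separately, an async thread starts and checks the high watermark: if it has been reached (Y), it releases space; if not (N), it ends.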
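The two strategies can be contrasted in a small sketch. Everything here is an assumption made for illustration: the `WatermarkCache` class, oldest-first eviction order, the low watermark that the background pass evicts down to, and the choice to reject an async load when the cache is full. The slide itself only names a high watermark and does not specify these details.

```python
from collections import OrderedDict


class WatermarkCache:
    """Toy cache tracking file sizes only (hypothetical, for illustration)."""

    def __init__(self, capacity, high_watermark=0.9, low_watermark=0.7):
        self.capacity = capacity
        self.high = high_watermark * capacity  # background eviction starts here
        self.low = low_watermark * capacity    # assumed target after eviction
        self.files = OrderedDict()             # path -> size, oldest first
        self.used = 0

    def _release(self, target):
        """Release space, oldest file first, until used <= target."""
        while self.used > target and self.files:
            _, size = self.files.popitem(last=False)
            self.used -= size

    def _store(self, path, size):
        self.used -= self.files.pop(path, 0)   # replace any earlier copy
        self.files[path] = size
        self.used += size

    def load_sync(self, path, size):
        """Sync strategy: check space and release inline, then load."""
        if size > self.capacity:
            raise ValueError("file larger than cache")
        if self.used + size > self.capacity:
            self._release(self.capacity - size)
        self._store(path, size)

    def load_async(self, path, size):
        """Async strategy: never evict on the caller's path."""
        if self.used + size > self.capacity:
            return False                       # assumed behaviour when full
        self._store(path, size)
        return True

    def background_pass(self):
        """One iteration of the async thread: evict only above the high watermark."""
        if self.used >= self.high:
            self._release(self.low)
            return True
        return False
```

In a real deployment `background_pass` would run on its own thread or timer; calling it directly keeps the sketch deterministic. The trade-off is that the sync path adds eviction latency to the read that triggers it, while the async path keeps reads fast at the cost of reserving headroom below full capacity.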
29. Alluxio Cache Consistency(2)
Keep Alluxio & HDFS Consistency
Check flow: Start → traverse the path → is it a file? → does it exist in UFS? → is the file size the same? → is the modify time the same? → End. When a check fails (N), the cached metadata is cleaned.
To ensure that dirty data is not read, there are three ways to trigger a file consistency check:
• RPC API: when a client requests metadata via getFileId, getFileInfo, listStatus, etc., the Alluxio master checks file cache consistency.
• RESTful API: calling reloadMetaData triggers Alluxio to reload all metadata.
• Alluxio Master startup: file cache consistency is checked while the master starts up.
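The check flow above can be sketched as a comparison of cached metadata against the under file system (UFS). This is one reading of the flowchart, not Alluxio's code: `FileMeta`, `find_stale`, and `clean_stale` are hypothetical names, metadata is modelled as plain dictionaries keyed by file path, and directories are assumed to be skipped.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileMeta:
    size: int   # file size in bytes
    mtime: int  # last modification time


def find_stale(cached: Dict[str, FileMeta], ufs: Dict[str, FileMeta]) -> List[str]:
    """Return cached file paths that are missing from the UFS or differ from it."""
    stale = []
    for path, meta in cached.items():
        ufs_meta = ufs.get(path)
        if ufs_meta is None:                 # no longer exists in the UFS
            stale.append(path)
        elif ufs_meta.size != meta.size:     # file size differs
            stale.append(path)
        elif ufs_meta.mtime != meta.mtime:   # modify time differs
            stale.append(path)
    return stale


def clean_stale(cached: Dict[str, FileMeta], ufs: Dict[str, FileMeta]) -> List[str]:
    """Drop stale entries from the cached metadata and return the removed paths."""
    removed = find_stale(cached, ufs)
    for path in removed:
        del cached[path]
    return removed
```

Comparing size and modify time is cheap but cannot detect a rewrite that preserves both; a checksum would catch that case at a higher cost.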
35. Core expectations for Alluxio
- HA, stability, High Performance, Confidence
- Global Namespace
- Server-Side API Translation
- Monitorable & Measurable
- Cutability (fs meta, mountTable, distributed cache)
36. Alluxio Exploration
• Exploring more application scenarios: storing MapReduce/Spark shuffle data to reduce disk storage pressure and speed up access to shuffle data.
• Porting HDFS Authentication to Alluxio: we are going to port the custom permissions on our existing HDFS to Alluxio.
• HDFS RBF or Alluxio: we have tried HDFS router-based federation, but its performance does not meet our online requirements. We find that Alluxio also has forwarding capabilities, and we hope that it will perform better.