Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
1. 2019/08/26 Office Hour
Website | www.alluxio.io
Q&A | https://alluxio.io/slack
Bin Fan, Nakkul Sreenivas
2. AWS S3: The Default Data Lake on AWS
▪ Why we love it
▪ Cheap
▪ Highly available
▪ Fully managed
▪ Really large scale
▪ Still, limitations & differences:
▪ Slow object listing
▪ Expensive rename
▪ Throughput throttling
▪ Variable performance
▪ No data locality on computation
▪ No user-managed cache
3. Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, GCS Driver, S3 Driver, Azure Driver
4. Why put Alluxio in AWS
▪ Provide better or consistent performance
▪ Add a data caching tier to S3: cache hot data and metadata
▪ Familiar FS semantics: listing, rename
▪ Keep data local to applications like Spark
▪ Compatible with other existing services like Hadoop, Hive, Presto
▪ Mount multiple data sources into the namespace
▪ Files/objects in other storage systems: GCS, Azure, HDFS
▪ Objects in other S3 buckets
5. The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open Source project established & company to commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud for data driven apps such as Big Data Analytics, ML and AI.
2018
2019
2019: Top 10 Big Data
2019: Top 10 Cloud Software
6. Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
(FAQ for this office hour)
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
7. Data Locality via Intelligent Multi-tiering
▪ Local performance from remote data using multi-tier storage
RAM (Hot), SSD (Warm), HDD (Cold)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
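The tiering policies above can be illustrated with a toy model. This is a minimal sketch, not Alluxio's implementation: the class `TieredCache`, its methods, and the block-count capacities are all hypothetical names chosen for illustration. It assumes LRU ordering within each tier, promotion to the top tier on access, demotion of the least recently used unpinned block when a tier is full, and it leaves TTL out entirely.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block store. Tier 0 is the hottest (e.g. RAM),
    the last tier is the coldest (e.g. HDD). Hypothetical model only."""

    def __init__(self, capacities):
        # capacities[i] is the number of blocks tier i can hold.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]
        self.pinned = set()

    def pin(self, block_id):
        # Pinned blocks are never chosen as demotion victims.
        self.pinned.add(block_id)

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        while len(tier) > self.capacities[level]:
            # Least recently used unpinned block in this tier.
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; the toy model allows overflow
            victim_data = tier.pop(victim)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim, victim_data)  # demote
            # Otherwise the block falls out of the coldest tier (evicted).

    def put(self, block_id, data):
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote on access
                return data
        return None  # not cached in any tier

    def tier_of(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None
```

With capacities `[1, 1, 1]`, writing blocks `a`, `b`, `c` in order leaves `c` in the hot tier and pushes `a` down to the cold tier; reading `a` afterwards promotes it back to tier 0 and demotes the others by one level.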
8. Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
TensorFlow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
9. Data Abstraction via Unified Namespace
Enables effective data management across different Under Stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
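To make the unified namespace idea concrete, here is a small sketch of how a mount table could translate an Alluxio path into an under store URI. It is an illustration only: `MountTable`, `mount`, and `resolve` are hypothetical names, the root under store URI `s3://root-bucket/alluxio` is a made-up placeholder, and the longest-prefix rule is an assumption about how nested mount points would be resolved.

```python
class MountTable:
    """Toy mount table mapping Alluxio paths to under store URIs."""

    def __init__(self, root_ufs):
        # The root of the namespace is always backed by one under store.
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, alluxio_path, ufs_uri):
        # Mirrors the idea of: alluxio fs mount <alluxio_path> <ufs_uri>
        key = alluxio_path.rstrip("/") or "/"
        self.mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path):
        # Pick the longest mount point that is a path prefix of the input.
        best = "/"
        for mount_point in self.mounts:
            if mount_point == "/":
                continue
            is_prefix = (alluxio_path == mount_point
                         or alluxio_path.startswith(mount_point + "/"))
            if is_prefix and len(mount_point) > len(best):
                best = mount_point
        relative = alluxio_path[len(best):].lstrip("/")
        base = self.mounts[best]
        return base + "/" + relative if relative else base


table = MountTable("s3://root-bucket/alluxio")  # placeholder root under store
table.mount("/Data", "s3://bucket/directory")
print(table.resolve("/Data/2019/part-0"))
print(table.resolve("/tmp/scratch"))
```

Paths under `/Data` resolve into the mounted bucket, while everything else falls back to the root under store, so applications only ever see one namespace.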
10. Typical Alluxio Use Cases
• Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
• Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
11. Deployment Approaches
Co-locate Alluxio Workers with Spark for optimal I/O performance
Spark, Alluxio (same instance)
AWS S3
Deploy Alluxio as standalone cluster between Spark and Storage
Spark, Presto
Alluxio (same data center / region)
AWS S3
12. Alluxio-EMR Prerequisites and Design Considerations
▪ IAM Account with the default EMR Roles
▪ S3 Bucket to host Bootstrap script and to act as a UFS
▪ Key Pair for EC2
▪ AWS CLI
▪ Leverage AWS Glue/RDS to persist Hive Metastore State
▪ Bootstrap Scripts
13. Alluxio EMR Service Integration: Bootstrap Actions
▪ EMR provides hooks into the main configuration files for Hadoop services:
▪ hive-site.xml, core-site.xml, hadoop-env.sh, hive.properties
▪ Bootstrap Actions
▪ Up to 16 shell scripts specified by the user
▪ Run before Hadoop service installation
▪ Shutdown actions are offered as well
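As a rough illustration of how a bootstrap action is attached at cluster creation, the sketch below assembles an `aws emr create-cluster` command line in Python without running it. The flags are standard AWS CLI options, but everything else is an assumption: the helper name `build_create_cluster_command`, the script location `s3://my-bucket/bootstrap/alluxio-emr.sh`, the key pair name, the release label, the instance settings, and the convention that the script takes the root UFS URI as its first argument are all placeholders.

```python
import shlex


def build_create_cluster_command(bootstrap_script, root_ufs, key_name,
                                 release_label="emr-5.25.0",
                                 instance_type="r4.xlarge",
                                 instance_count=3):
    """Return the argument list for an EMR cluster that runs one bootstrap
    script. All default values here are illustrative placeholders."""
    # Bootstrap action: an S3 path to a shell script plus its arguments.
    bootstrap = "Path={},Args=[{}]".format(bootstrap_script, root_ufs)
    return [
        "aws", "emr", "create-cluster",
        "--name", "alluxio-spark",
        "--release-label", release_label,
        "--applications", "Name=Spark", "Name=Hive",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",                       # default EMR IAM roles
        "--ec2-attributes", "KeyName={}".format(key_name),
        "--bootstrap-actions", bootstrap,
    ]


cmd = build_create_cluster_command(
    bootstrap_script="s3://my-bucket/bootstrap/alluxio-emr.sh",  # placeholder
    root_ufs="s3://my-bucket/alluxio-root/",                     # placeholder
    key_name="my-key-pair",                                      # placeholder
)
# Print a shell-safe version of the command for review before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each prerequisite from the previous slide (default roles, S3 bucket, key pair) visible as a separate argument.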
16. Read data in Alluxio, on same node as client
Application, Alluxio Client, Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
Memory Speed Read of Data
17. Read data not in Alluxio
Application, Alluxio Client, Alluxio Master
Alluxio Worker (RAM / SSD / HDD), Under Store
Network / Disk Speed Read of Data
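The two read paths (data already in Alluxio versus data that must come from the under store) can be sketched as a read-through cache. This is a simplified model, not Alluxio's client code: the classes `UnderStore`, `Worker`, `Master`, and `Client` and their methods are hypothetical, blocks are reduced to whole files, and the master is reduced to a lookup table of which worker holds a path.

```python
class UnderStore:
    """Stand-in for S3: a dict of path -> bytes that counts reads."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.objects[path]


class Worker:
    """Holds cached data and can load missing data from the under store."""

    def __init__(self, under_store):
        self.cache = {}
        self.under_store = under_store

    def load_from_under_store(self, path):
        data = self.under_store.read(path)
        self.cache[path] = data  # keep a copy for later reads
        return data


class Master:
    """Tracks which worker currently caches each path."""

    def __init__(self):
        self.locations = {}

    def locate(self, path):
        return self.locations.get(path)

    def report(self, path, worker):
        self.locations[path] = worker


class Client:
    def __init__(self, master, local_worker):
        self.master = master
        self.local_worker = local_worker

    def read(self, path):
        holder = self.master.locate(path)
        if holder is not None:
            # Data is in Alluxio: served from the worker's storage.
            return holder.cache[path], "alluxio"
        # Data is not in Alluxio: fetch from the under store, then cache it.
        data = self.local_worker.load_from_under_store(path)
        self.master.report(path, self.local_worker)
        return data, "under store"


ufs = UnderStore({"/myInput": b"hello"})
client = Client(Master(), Worker(ufs))
print(client.read("/myInput"))  # first read goes to the under store
print(client.read("/myInput"))  # second read is served from Alluxio
```

Only the first read touches the under store; repeated reads of hot data stay on the worker, which is where the performance difference between the two slides comes from.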
18. Write data only to Alluxio on same node as client
Application, Alluxio Client, Alluxio Master
Alluxio Worker (RAM / SSD / HDD)
Memory Speed Write of Data
19. Write data to Alluxio and Under Store synchronously
Application, Alluxio Client, Alluxio Master
Alluxio Worker (RAM / SSD / HDD), Under Store
Network / Disk Speed Write of Data
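The two write paths differ only in whether the under store is updated before the write returns. The sketch below models that choice with two modes named after Alluxio's `MUST_CACHE` and `CACHE_THROUGH` write types; the function `write` and the plain dicts standing in for worker storage and the under store are hypothetical simplifications, and other write types are left out.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"        # write to Alluxio only
    CACHE_THROUGH = "CACHE_THROUGH"  # write to Alluxio and the under store


def write(path, data, write_type, worker_storage, under_store):
    """Toy write path: both stores are plain dicts in this model."""
    # Every write lands in Alluxio worker storage first.
    worker_storage[path] = data
    if write_type is WriteType.CACHE_THROUGH:
        # Synchronous persist: the call returns only after the under
        # store has the data, so latency follows network / disk speed.
        under_store[path] = data


worker_storage, under_store = {}, {}
write("/tmp/shuffle", b"temp", WriteType.MUST_CACHE, worker_storage, under_store)
write("/out/result", b"final", WriteType.CACHE_THROUGH, worker_storage, under_store)
print(sorted(worker_storage))  # both paths are in Alluxio
print(sorted(under_store))     # only the CACHE_THROUGH path is persisted
```

Data written with the Alluxio-only mode is fast but exists in one place, which suits temporary output; the synchronous mode trades write speed for durability in the under store.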
20. Alluxio 2.0 & Coming in 2.1 Release
▪ Alluxio 2.0: Released in July
▪ Metadata scales to 1 billion files or more (based on RocksDB)
▪ Self-managed Metadata service based on Quorum
▪ Async writes, distributed load
▪ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
▪ Alluxio 2.1: Scheduled for September
▪ A Presto-Alluxio Connector with Iceberg Integration
▪ Use Alluxio as a caching layer without modifying HMS
21. Next steps - Try it out!
• Getting Started
• Spark Performance Tuning Tips
• Accelerate Spark and Hive Jobs on AWS S3: Use case from Bazaarvoice
• Spark + Alluxio: Tencent Use Case
Questions or Suggestions? Engage with us at alluxio.io/slack!