Alluxio Austin Meetup
Aug 15, 2019
Speakers: Tim Kelly & Thai Bui, Bazaarvoice
At Bazaarvoice, a software-as-a-service digital marketing company, the data engineering team handles data at massive Internet scale to serve more than 1,900 of the biggest Internet retailers and brands.
We built our data pipelines entirely in the cloud, using Apache Spark and Hive on AWS EC2 to access data in S3. AWS lets us scale out infrastructure capacity effortlessly to keep up with Internet-scale data and web traffic, but scaling out also exposes limits on how far we can scale up. Although this cloud-native stack is scalable and elastic, performance suffers because data access is limited by network bandwidth, and the problem is worse for workloads that run repeated queries.
To address these data access challenges, we use Alluxio, an open source data orchestration system for analytics in the cloud. Alluxio serves as a transparent caching layer for hot and warm data, so Hive and Spark jobs can still access all data in S3 transparently. We have seen Spark and Hive jobs on S3 run 10x faster with Alluxio.
2. Inaugural Cloud, Data & Orchestration Meetup!
● Welcome!
● First Meetup
● Looking for future presenters in the Data Engineering/Ops community
● Let us know on the Meetup group or talk to Bin, Thai, & Tim
3. Your Hosts:
● Thai Bui - Senior Staff Big Data Engineer, Bazaarvoice
● Bin Fan - VP, Founding Member, Alluxio
● Tim Kelly - Engineering Manager, Bazaarvoice
● Amelia Wong - Co-Founder, Alluxio
17. AWS S3 : The Good
An object storage service provided by Amazon
● Really cheap
● Highly available
● Fully-managed service
● Scales really well
● Integrates with virtually all tools
18. AWS S3 : The Bad
When you have 100s of TB of data and millions of files
● Even listing objects is slow
● Download speed is limited by network bandwidth
● No concept of cache
● No concept of data locality
19. AWS S3 : The Need For Speed
● Add tiered-storage to S3
○ Hot, warm, cold storage (fastest, fast, and not so fast)
○ Metadata cache
○ Data cache
● Keep data local
○ On the same machine, not over the network
● Compatible with existing services
○ Hadoop, Spark, Hive, Presto, etc.
● Adaptive & highly configurable
○ Symlink for S3
20. Overview
[Diagram labels: Hive, Spark, Hive metastore; Alluxio and ZFS (hot & warm data); S3 (cold data)]
● Alluxio
○ Compatibility layer
○ Tiered storage layer
● ZFS
○ OS-level file system
○ Volume manager
○ Acceleration layer
● Both are open-source
21. Alluxio : The tiered-storage layer
● Support for native filesystem and Hadoop filesystem
● Distributed, and can be installed on every node
○ Provides data locality
● Mount S3, HDFS, etc. to Alluxio
○ Think symlink. No data movement.
● Use RAM, NVMe, SSD, HDD to define hot, warm, and cold data tiers
● LRU, LFU policies for caching data at different layers
● Not enough space -> evict or move least used files to the next tier
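The eviction behavior described above (when a tier is full, the least recently used item moves to the next tier) can be sketched in a few lines. This is a toy model for illustration only, not Alluxio's implementation: the `Tier` and `TieredCache` names, the item-count capacities, and the choice to promote an item to the top tier on every read are all assumptions of the sketch.

```python
from collections import OrderedDict


class Tier:
    """One storage tier with a fixed capacity, kept in LRU order."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # capacity in items (assumption: real tiers use bytes)
        self.items = OrderedDict()    # least recently used entry comes first


class TieredCache:
    """Toy tiered cache: data lands in the top tier, overflow is demoted
    one tier down, and anything pushed past the last tier is dropped."""

    def __init__(self, tiers):
        self.tiers = tiers            # ordered fastest to slowest

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return                    # evicted from the cache entirely
        tier = self.tiers[level]
        tier.items[key] = value
        tier.items.move_to_end(key)   # mark as most recently used
        while len(tier.items) > tier.capacity:
            old_key, old_value = tier.items.popitem(last=False)
            self._insert(level + 1, old_key, old_value)  # demote the LRU item

    def put(self, key, value):
        for tier in self.tiers:
            tier.items.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.tiers:
            if key in tier.items:
                value = tier.items.pop(key)
                self._insert(0, key, value)  # promote on read (sketch assumption)
                return value
        return None                   # miss: caller would fall back to S3

    def location(self, key):
        for tier in self.tiers:
            if key in tier.items:
                return tier.name
        return None
```

With a two-item `MEM` tier over a two-item `SSD` tier, writing three keys pushes the oldest one down to `SSD`, and reading it again brings it back to `MEM` while demoting the next least recently used key.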
22. ZFS : The acceleration layer
● Both a filesystem & a volume manager
○ Works with RAM to accelerate read/write
○ Auto promote/demote blocks from RAM to other storage
○ Used with local NVMe SSD if data is not in RAM
○ Mirror write to 2 SSDs -> 2x read speed
● Works in kernel space
● Extremely reliable
○ Automatic block checksum & repair
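The idea behind checksum-and-repair on a mirror can be illustrated with a small sketch. It is a conceptual model only and does not reflect ZFS internals: the `MirroredBlock` class is hypothetical, and SHA-256 is used here simply because it is available in the Python standard library.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum used by this sketch (SHA-256 chosen for convenience)."""
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """A block stored as two copies plus a checksum recorded at write time."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self) -> bytes:
        """Return the first copy that verifies and repair any copy that does not."""
        good = None
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                break
        if good is None:
            raise IOError("all copies failed checksum verification")
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.expected:
                self.copies[i] = bytearray(good)  # overwrite the bad copy
        return good
```

Because either copy can serve a read, a mirror can also spread reads across both devices, which is where the read-speed gain on the slide comes from.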
27. Example query plan
SELECT ..
FROM ..
WHERE year = 2019
AND month = 1
GROUP BY .. ORDER BY ..
LIMIT 100
28. Without tiered-storage
● 50s for split calculations
○ Listing files on S3
○ Sub-dividing the tasks amongst workers
● 12s for scanning data on S3
● 70s to complete the query
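Most of the time above goes to listing files, which is the kind of work a metadata cache is meant to avoid repeating. A minimal sketch of that idea follows, assuming a caller-supplied `list_fn` that performs the slow listing; the `ListingCache` class, its time-to-live, and the example prefix are hypothetical and are not Alluxio's API.

```python
import time


class ListingCache:
    """Caches file listings per prefix so repeated split calculations
    can skip the slow listing call until the entry expires."""

    def __init__(self, list_fn, ttl_seconds=300.0, clock=time.monotonic):
        self._list_fn = list_fn       # slow lister, e.g. a wrapper around an S3 listing
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}            # prefix -> (expiry time, listing)
        self.misses = 0

    def list(self, prefix):
        now = self._clock()
        entry = self._entries.get(prefix)
        if entry is not None and entry[0] > now:
            return entry[1]           # served from the metadata cache
        self.misses += 1
        listing = tuple(self._list_fn(prefix))
        self._entries[prefix] = (now + self._ttl, listing)
        return listing


if __name__ == "__main__":
    def slow_list(prefix):
        # Stand-in for a remote listing; the file names are made up.
        return [prefix + "part-0000", prefix + "part-0001"]

    cache = ListingCache(slow_list)
    cache.list("year=2019/month=1/")  # first call hits the slow lister
    cache.list("year=2019/month=1/")  # second call is served from the cache
    print(cache.misses)
```

The trade-off is staleness: a longer time-to-live saves more listing calls but delays visibility of newly written files.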
29. With tiered-storage
● 1.7s for split calculations
○ 30x improvement
● 3s for scanning data on tiered-storage
○ About 4x improvement (12s down to 3s)
● 6s to complete the query
○ 10x improvement overall
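The improvement factors are simply the ratio of the timings on the two slides. A small sketch to recompute them, using only the numbers quoted above (the stage names are labels chosen for this example, and the slide figures are rounded):

```python
def speedup(before_s: float, after_s: float) -> float:
    """Speedup factor: time without tiered storage over time with it."""
    return before_s / after_s


# (seconds without tiered storage, seconds with tiered storage)
timings = {
    "split calculation": (50.0, 1.7),
    "data scan": (12.0, 3.0),
    "whole query": (70.0, 6.0),
}

for stage, (before, after) in timings.items():
    print(f"{stage}: {speedup(before, after):.1f}x")
```

The split calculation dominates the original query time, so removing most of it accounts for the bulk of the end-to-end gain.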
30. Result
● 5-10X read improvement in Hive
○ Workers can short-circuit and read directly from ZFS instead of S3
○ Move compute to the data
● Apache Spark should see a similar improvement
● Good for iterating over the same data set multiple times
○ Machine learning
○ Exploratory analysis
● Gives us control over how S3 data is accessed
○ More recent data should be faster to access