Alluxio Community Office Hour
October 20, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics workloads such as Hive, Spark, Presto, and machine learning experience sluggish response times because data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
3. Data Lakes and Silos Abound
Enterprises have organically created a legacy of data silos through short-term, focused projects and mergers & acquisitions.
▪ Data lakes and critical data are often in a silo and challenging to access
▪ Consolidation of data lakes and silos is expensive and slow to complete
▪ Compute is everywhere
[Diagram: data silos (Teradata, POSIX, DB, internal apps, public clouds, S3 object storage, HDFS 1, HDFS 2)]
4. 4 Big Trends Driving the Need for a New Architecture
▪ Separation of compute & storage
▪ Hybrid and multi-cloud environments
▪ Self-service data across the enterprise
▪ Rise of the object store
5. Market Summary
▪ Data volume, velocity, and variety are avalanching: data doubles every two years*
▪ The business knows that data analytics and ML models allow it to compete effectively*
▪ The Hadoop investment is being replaced by object storage (on-prem and cloud)
▪ The enterprise is a multi-cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere to sustain operations, offering users a transparent self-service experience
▪ Technical organizations struggle to keep up with data ingest and business demands
▪ Data is still not fully optimized, yet there are many copies costing $$$$
* “The Fourth Industrial Revolution”, by Klaus Schwab
6. Alluxio’s Vision
"Orchestrate data for analytics and machine learning to enable
companies to grow and be agile regardless of where their data
and compute are located."
Quick-start cloud adoption that optimizes cost and yields 2X to 5X analytics acceleration for:
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on
7. What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions, clouds, and countries.
8. Companies Using Alluxio
[Customer logos grouped by industry: Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services]
9. Companies use Alluxio to …
• Gain faster results that matter to the business, through advanced caching technology
• Dramatically lower OpEx by eliminating data management, cloud egress, and compute costs, through the unified namespace and API translation
• Drop into existing on-prem and cloud environments with zero programming
10. Alluxio - Key Innovations
▪ Unified Namespace: bring all files and objects into a single interface
▪ API Translation: interact with data using any API
▪ Intelligent Caching and Multi-tiering: accelerate & tier data transparently
11. Data Accessibility (via popular APIs and API Translation)
Convert from client-side interface to native storage interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
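The idea of a single namespace that routes each request to a native storage driver can be sketched in a few lines of Python. This is a conceptual illustration only, not Alluxio's implementation or API: the `MOUNTS` table, the host and bucket names, and the `resolve` function are all hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical mount table: unified-namespace prefix -> under-storage URI.
# Hosts, buckets, and containers are made-up placeholders.
MOUNTS = {
    "/": "hdfs://namenode:8020/",
    "/sales": "s3://example-bucket/sales/",
    "/archive": "swift://example-container/archive/",
}


def resolve(path: str) -> Tuple[str, str]:
    """Return (driver scheme, under-storage URI) for a unified-namespace path."""
    # Pick the longest mount point that is a whole-component prefix of the path.
    best = max(
        (m for m in MOUNTS if path == m or path.startswith(m.rstrip("/") + "/")),
        key=len,
    )
    relative = path[len(best):].lstrip("/")
    uri = MOUNTS[best] + relative
    # The URI scheme stands in for "which storage driver handles this request".
    return urlparse(uri).scheme, uri
```

In this sketch a client asks for `/sales/2020/q3.parquet` without knowing it lives in an object store; the lookup selects the S3 mount, while any path with no more specific mount falls through to the HDFS root.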
13. Approaches to Hybrid Cloud
Lift and Shift
▪ Migration may seem easier as no application re-architecture is needed
Issues
▪ If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
▪ If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
Data copy by workload
▪ Simple tools are available, like DistCp
▪ Works for workloads with easily identifiable datasets
Issues
▪ Datasets for many workloads cannot always be identified easily
▪ Significantly more data is transferred than the workload requires
▪ Additional copies are very hard to sync back with master data
▪ Performance can be dramatically impacted due to cloud storage limitations
Compute-driven Data Caching
▪ Data is pulled into the cloud based on compute requests
▪ Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
▪ Less helpful for workloads that don’t read a data set more than once
14. Our Solution: “Zero-Copy Burst”
Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: Google Cloud Platform (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service; GCS) connected to the on-premises datacenter (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service)]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy, e.g. last 7 days
16. Alluxio at Walmart
Architectural Components
• Alluxio is co-located with Presto: for data locality
• Automatic metadata synchronization: to create Hive tables with Alluxio mount points
• Auto-scaling: to maintain a minimum number of Alluxio workers
• Pin frequently used data: to avoid cache evictions
17. Alluxio at Walmart: Takeaways
• 2x Performance: for range queries
• High Concurrency: with Alluxio
• Cost Reduction: half the compute costs, or 2x compute capacity for the same environment
18. Alluxio at Adobe: Cross Data Center Access
PROBLEM
Primary DC with a large Hadoop cluster is out of space; ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
[Diagram label: REMOTE DATA]
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact, and improved response times
● Self-service for end-users
19. Alluxio at Electronic Arts (EA)
Single Cloud with AWS
• Up to 6x Performance: when handling a large number of small files
• Elastic Compute: to reduce infrastructure costs
• Reduce S3 Costs: by eliminating S3 access operations
21. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
• Hot, warm, and cold data map to RAM, SSD, and HDD tiers
• Read & write buffering, transparent to the app
• Policies for pinning, promotion/demotion, TTL
[Diagram: on-premises and public cloud]
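The pinning and promotion/demotion policies named on this slide can be illustrated with a toy in-memory model. The sketch below is an assumption-laden illustration, not Alluxio's eviction logic: the `TieredCache` class, its least-recently-used demotion rule, and the tiny tier capacities are all hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache; tiers are listed fastest first (e.g. RAM, SSD, HDD)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks); sizes are illustrative.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, key):
        # Pinned blocks are never chosen as demotion victims.
        self.pinned.add(key)

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: evicted from the cache
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        while len(store) > cap:
            # Demote the least recently used unpinned block one tier down.
            victim = next((k for k in store if k not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; this toy allows overflow
            self._insert(level + 1, victim, store.pop(victim))

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        for _, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote to the top tier on access
                return value
        return None  # cache miss; a real system would read from remote storage

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None
```

With one block per tier, writing `a`, `b`, then `c` leaves `c` in the top tier and pushes `a` down to the bottom one; reading `a` again promotes it back to the top, and the other blocks shift down to make room.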
22. Metadata Locality with “Active Sync”
Detect on-prem changes and synchronize metadata
• HDFS iNotify-based metadata synchronization
• Policies for pinning, promotion/demotion, TTL
[Diagram: on-premises and public cloud; a mutation replaces the old file at path /file1 with a new file at path /file1, and the Alluxio Master synchronizes the metadata]
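Event-driven metadata synchronization can be pictured as replaying a stream of change notifications against a cached view of the namespace. The following Python sketch assumes a simplified, hypothetical event shape (`kind`, `path`, optional extra field); it does not reproduce the HDFS inotify event format or Alluxio's Active Sync code.

```python
def apply_events(metadata, events):
    """Apply under-storage mutation events to a cached {path: version} map.

    Each event is a tuple: (kind, path) or (kind, path, extra).
    The event kinds used here are illustrative, not a real wire format.
    """
    for kind, path, *rest in events:
        if kind in ("create", "modify"):
            metadata[path] = rest[0]  # record the new file version
        elif kind == "delete":
            metadata.pop(path, None)  # ignore deletes for unknown paths
        elif kind == "rename":
            new_path = rest[0]
            if path in metadata:
                metadata[new_path] = metadata.pop(path)
    return metadata
```

For example, a `modify` event on `/file1` replaces the cached version of that path, so the next reader sees the new file rather than the old one, without rescanning the whole directory tree.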
23. Policy Driven Data Migration
Migrate data to cloud storage based on access policies
[Diagram: hdfs://host:port/directory/ with subdirectories Reports and Sales]
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
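The example policy (migrate data older than 7 days) boils down to an age filter over file metadata. The sketch below is a minimal illustration: the `plan_migration` function is hypothetical, and measuring age by last-modified time is an assumption, since the slide does not say which timestamp the policy uses.

```python
from datetime import datetime, timedelta


def plan_migration(files, now, max_age_days=7):
    """Return the paths whose timestamp is strictly older than max_age_days.

    files: {path: last-modified datetime} (assumed age criterion).
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, mtime in files.items() if mtime < cutoff)


# Hypothetical listing of the Reports and Sales directories.
listing = {
    "/Reports/summary.csv": datetime(2020, 10, 1),
    "/Sales/today.csv": datetime(2020, 10, 18),
    "/Sales/last_week.csv": datetime(2020, 10, 13),
}
to_move = plan_migration(listing, now=datetime(2020, 10, 20))
```

Only the selected paths would be copied to the cloud tier; everything newer stays on HDFS, which is what lets the migration proceed gradually instead of as a single switchover.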
25. Alluxio Catalog Service
[Diagram: Hive Metastore attached as Hive Under Database]
Functionality
• Manages metadata for structured data
• Abstracts other database catalogs as Under Database (UDB)
Benefits
• Schema-aware optimizations
• Simple deployment
26. Transformation Service
Transform data to be compute-optimized independent of the storage format
• Coalesce
• Format conversion: CSV to Parquet
27. Example Results
• Attached existing Hive database into Alluxio Catalog
• Alluxio Catalog served table metadata for Presto
• Transformed store_sales by coalescing and converting CSV to Parquet
Presto without Alluxio: 20s
Alluxio Transformations: 7s
Alluxio Transformations with Caching: 3s
29. How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow-up questions and to discuss your situation, please contact Peter at peter@alluxio.com