Alluxio Tech Talk
Dec 2019
Speaker:
Madan Kumar, Alluxio
If you’re a MapR user, you might have concerns about your existing data stack. Whether it’s the complexity of Hadoop, MapR’s financial instability and lack of a product roadmap, or the inflexibility of co-located storage and compute, MapR may no longer be working for you.
Alluxio can help you migrate to a modern, disaggregated data stack using any object store, with performance similar to Hadoop plus significant cost savings.
Join us for this tech talk where we’ll discuss how to separate your compute and storage on-prem and architect a new data stack that makes your object store the core. We’ll show you how to offload your MapR/HDFS compute to any object store and how to run all of your existing jobs as-is on Alluxio + object store.
2. MapR Future and Offload Solutions
Uncertainty about MapR's future means evaluating alternative
ways to offload HDFS data
Possible Offload Solutions:
Move to a different Hadoop distribution: Cloudera/Hortonworks
Move to the cloud: EMR (AWS), Dataproc (GCP), HDInsight (Azure)
Move to an Object Store
3. Why Object Store
Object storage solves the problem of scalability at a fraction of the
cost of scale-out file systems (HDFS)
Economies of scale mean that it is also cheaper long term
than leveraging public cloud as data sizes grow
Object store providers give the flexibility of being able to
deploy both on-premise and in the cloud
4. Migration Solution Overview w/ Alluxio
[Diagram: Migration to Cloud Object Store. Components shown: Presto/Spark, Alluxio, HDFS, Cloud Object Store]
[Diagram: Migration to On-Prem Object Store. Components shown: Presto/Spark, Alluxio, HDFS, On-Prem Object Store]
5. Data Orchestration for the Cloud
Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Independent scaling of compute & storage
6. Alluxio – Key innovations
Data Elasticity with a unified namespace
Abstract data silos & storage systems to independently scale
data on-demand with compute
Data Accessibility for popular APIs & API translation
Run Spark, Hive, Presto, ML workloads on your data
located anywhere
Data Locality with Intelligent Multi-tiering
Accelerate big data workloads with transparent
tiered local data
8. Environment Setup with Alluxio
[Diagram: components shown: S3, Alluxio, HDFS]
Alluxio with two mount points:
- The first is an AWS S3 bucket
- The second is an on-premises HDFS cluster
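To make the two-mount-point setup concrete, here is a minimal sketch of how a path in a single namespace could map back to an under-store location by longest-prefix match. It is an illustrative model only, not Alluxio's implementation or API. The mount points `/s3` and `/hdfs` follow the paths used in the later slides; the bucket name and the namenode address are hypothetical.

```python
# Illustrative model of a mount table; not Alluxio code.
MOUNTS = {
    "/s3": "s3://example-bucket/data",   # hypothetical S3 bucket
    "/hdfs": "hdfs://namenode:8020/",    # hypothetical HDFS address
}


def resolve(alluxio_path, mounts=MOUNTS):
    """Map a namespace path to an under-store URI using the longest matching mount point."""
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if alluxio_path == mount_point or alluxio_path.startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {alluxio_path}")
    relative = alluxio_path[len(best):].lstrip("/")
    base = mounts[best].rstrip("/")
    return f"{base}/{relative}" if relative else base


print(resolve("/s3/u_user"))
print(resolve("/hdfs/hive/warehouse/u_user"))
```

Applications only see the left-hand paths, which is what lets a table location be switched from one store to another without changing the job.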
9. Data Movement via Alluxio (HDFS to Alluxio)
Map HDFS Tables to Alluxio File Location
Assuming that the table already exists in HDFS and HDFS is
mounted as the root under store, you can move an internal table
to Alluxio as follows (shown via Hive):
hive> alter table u_user set location
"alluxio://master_hostname:port/hdfs/hive/warehouse/u_user";
Once the location is altered, the first read of the data is served from
HDFS; subsequent reads are served from the data cached in
Alluxio.
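The first-read versus later-read behavior described above can be sketched as a simple read-through cache. This is a toy model under stated assumptions, not Alluxio code: a plain dict stands in for HDFS, and the class name and counter are hypothetical.

```python
class ReadThroughCache:
    """Toy model: the first read goes to the under store, later reads hit the cache."""

    def __init__(self, under_store):
        self.under_store = under_store   # dict of path -> bytes, stands in for HDFS
        self.cache = {}                  # stands in for Alluxio-managed storage
        self.under_store_reads = 0       # how many reads reached the under store

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        data = self.under_store[path]    # cold read: fetch from the under store
        self.under_store_reads += 1
        self.cache[path] = data          # keep a copy for subsequent reads
        return data


hdfs = {"/hdfs/hive/warehouse/u_user/part-0": b"1|24|M|technician|85711\n"}
cache = ReadThroughCache(hdfs)
cache.read("/hdfs/hive/warehouse/u_user/part-0")   # served from the under store
cache.read("/hdfs/hive/warehouse/u_user/part-0")   # served from the cache
print(cache.under_store_reads)
```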
10. Data Movement via Alluxio (Alluxio to S3)
Create table against Alluxio file location backed by S3
hive> CREATE EXTERNAL TABLE u_user (
userid INT, age INT, gender CHAR(1), occupation STRING,
zipcode STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '|' LOCATION
'alluxio://master_hostname:port/s3/u_user';
11. Data Movement via Alluxio (HDFS to S3)
Use the Alluxio cp command to copy a file or directory within the
Alluxio file system, or between the local file system and the Alluxio file
system.
Because under stores are mounted into the Alluxio namespace, cp can also
be used to copy files between under storage systems.
$ ./bin/alluxio fs cp /hdfs/hive/warehouse/u_user /s3/u_user
The above copies the files belonging to the u_user table on HDFS to the
location of the table we created against S3.
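When several tables need to move, the same cp command can be generated per table. The sketch below only builds and prints the command lines (a dry run); it reuses the `./bin/alluxio fs cp` form and the `/hdfs/hive/warehouse` and `/s3` paths from this slide. The helper name and any table other than u_user are hypothetical.

```python
import shlex

ALLUXIO_BIN = "./bin/alluxio"   # same relative path as in the command above


def cp_commands(tables, src_root="/hdfs/hive/warehouse", dst_root="/s3"):
    """Build one 'alluxio fs cp' argument list per table, without running anything."""
    return [
        [ALLUXIO_BIN, "fs", "cp", f"{src_root}/{table}", f"{dst_root}/{table}"]
        for table in tables
    ]


for cmd in cp_commands(["u_user"]):
    # Dry run: print the command. To execute it, pass cmd to
    # subprocess.run(cmd, check=True) on a host with the Alluxio client installed.
    print(shlex.join(cmd))
```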
12. Data Movement Options via Alluxio
Through Alluxio you are able to seamlessly move data off of
HDFS and to the object store
Alluxio allows end users continued access to the data without
waiting for the data migration to complete
Alluxio can also set policies to do data movement on demand
(2.0 EE feature)
15. Data Elasticity with a Global Namespace
[Diagram: hdfs://host:port/directory/ with Reports and Sales directories]
16. Interacting with data in Alluxio – flexible app patterns
Reading Data
• From under store
• From a co-located Alluxio node
• From a different Alluxio node
Writing Data
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
• Write to Alluxio and replicate to N other workers
• Write to Alluxio and async write to multiple Under Stores
Applications have great flexibility to read and write data, with many options
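The first four write patterns above can be sketched as a small model of where data lands for each mode. The mode names follow the write types Alluxio documents for the client property alluxio.user.file.writetype.default (MUST_CACHE, THROUGH, CACHE_THROUGH, ASYNC_THROUGH); treat that mapping as an assumption to check against the documentation for your version. Everything else here (dict-backed stores, the pending queue, the function names) is hypothetical and is not Alluxio's API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"         # write only to Alluxio
    THROUGH = "THROUGH"               # write only to the under store
    CACHE_THROUGH = "CACHE_THROUGH"   # write synchronously to both
    ASYNC_THROUGH = "ASYNC_THROUGH"   # write to Alluxio, persist later


def write(path, data, write_type, alluxio, under_store, pending):
    """Place data according to the write type; dicts stand in for the two tiers."""
    if write_type is not WriteType.THROUGH:
        alluxio[path] = data
    if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
        under_store[path] = data
    if write_type is WriteType.ASYNC_THROUGH:
        pending.append(path)          # remembered for a later background flush


def flush(alluxio, under_store, pending):
    """Persist every queued path to the under store (the asynchronous step)."""
    while pending:
        path = pending.pop(0)
        under_store[path] = alluxio[path]


alluxio, under_store, pending = {}, {}, []
write("/reports/q1", b"data", WriteType.ASYNC_THROUGH, alluxio, under_store, pending)
print("/reports/q1" in under_store)   # not yet persisted
flush(alluxio, under_store, pending)
print("/reports/q1" in under_store)   # persisted after the flush
```

Replication to N workers and writes to multiple under stores are left out of this sketch.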
18. Incredible Open Source Momentum with growing community
1000+ contributors & growing
4000+ GitHub stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack
alluxio.io/slack