Accelerating workloads and bursting data with Google Dataproc & Alluxio
1. Accelerating workloads and bursting data with Google Dataproc & Alluxio
Dipti Borkar | VP, Product | Alluxio
Roderick Yao | Strategic Cloud Engineer | Google
2. ▪ What’s Google Dataproc?
▪ What’s Alluxio?
▪ Alluxio in Dataproc
▪ Demo
3. Enterprises are telling us they need:

To respond to different business data needs with different urgency and emphasis
● Create bespoke Hadoop clusters customized for any workload
● Use them for a minute or a year

A faster, more scalable way to get insights from data
● Get up and running without waiting for hardware or software to be installed or configured

To get their people out of owning and monitoring technology and back to innovating
● Design workflows that create clusters, complete jobs end-to-end, and then delete themselves

To spend less money
● Create clusters in seconds
● Pay only for when the cluster is running
● Take advantage of preemptible VM instances
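The create-run-delete workflow above can be sketched with the gcloud CLI. This is a minimal sketch: the cluster name, region, and worker count are placeholder values, and the sample job is the SparkPi example that ships on Dataproc images.

```shell
# Create an ephemeral cluster (billing only while it runs).
# Name, region, and size are placeholder values.
gcloud dataproc clusters create ephemeral-etl \
    --region=us-central1 \
    --num-workers=2

# Run the job end-to-end on that cluster.
gcloud dataproc jobs submit spark \
    --cluster=ephemeral-etl \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Delete the cluster as soon as the job finishes, so billing stops.
gcloud dataproc clusters delete ephemeral-etl --region=us-central1 --quiet
```

Wrapping these three commands in a workflow or script gives exactly the "create, complete jobs end-to-end, then delete" pattern the slide describes.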
4. Enterprise Hadoop cluster woes

You know that managing a Hadoop cluster can be frustrating and time-consuming:
● It’s a hassle to renew the license on your on-premises system
● It’s hard to scale compute or storage on demand
● Maintaining the operations of your Hadoop cluster takes too much time
● Your system can’t keep up with forecasted usage and data growth
● Your legacy system busts your budget
5. What is Cloud Dataproc?

Google Cloud Platform’s fully managed Apache Spark and Apache Hadoop service
● Rapid cluster creation
● Familiar open source tools
● Ephemeral clusters on demand
● Customizable machines
● Tightly integrated with other Google Cloud Platform services
6. Google Cloud Dataproc vision

Fast: things take seconds to minutes, not hours or weeks
Easy: be an expert with your data, not your data infrastructure
Cost-effective: pay for exactly what you use to process your data, not more
7. Disaggregation of storage and compute

[Architecture diagram: separate development, test, and production Cloud Dataproc clusters all read from shared data sources and write to data sinks in Cloud Storage, BigQuery, and Cloud Bigtable; Cloud Datalab is used for analysis and data science, and external applications consume the results. Cluster monitoring and application logs go to Stackdriver Monitoring and Logging.]
8. Ephemeral and long-lived clusters

[Diagram, left: one ephemeral Cloud Dataproc cluster per job, each backed by Cloud Storage, with clients connecting through Compute Engine edge nodes. Right: semi-long-lived clusters grouped and selected by label, e.g. a dev cluster on the Preview image and Prod 1 / Prod 2 clusters on image version 1.2.]
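The label-based grouping shown for semi-long-lived clusters maps directly onto gcloud flags. A minimal sketch, assuming hypothetical cluster names and label keys:

```shell
# Tag semi-long-lived clusters with labels at creation time
# (cluster names and label keys/values are examples).
gcloud dataproc clusters create prod-1 \
    --region=us-central1 \
    --labels=env=prod,team=analytics

# Later, group and select clusters by label.
gcloud dataproc clusters list \
    --region=us-central1 \
    --filter='labels.env=prod'

# Dev clusters can clean themselves up via scheduled deletion:
# delete automatically after one hour of idleness.
gcloud dataproc clusters create dev-cluster \
    --region=us-central1 \
    --max-idle=1h
```

Labels keep the long-lived fleet manageable, while `--max-idle` gives dev clusters the self-deleting behavior of the per-job pattern.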
10. Traditional brick-and-mortar retailer (NDA)

Challenge: build machine learning models focused on fraud detection and inventory management.

How Google helped: partnered with the retailer on both the digital and the in-store customer experience, especially to help them manage major retail events like Black Friday.

What they are running: 67 clusters per day on average, 513 nodes per cluster.

Products & services: Dataproc, Dataflow, BigQuery, Bigtable, Cloud Storage, Compute, Pub/Sub, Stackdriver, PSO & Support.
13. The Alluxio Story

2014: Originated as the Tachyon project at UC Berkeley’s AMPLab, created by then-Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.

2015: Open source project established, and the company founded to commercialize Alluxio.

Goal: orchestrate data for the cloud for data-driven apps such as big data analytics, ML, and AI.

Focus: accelerating modern app frameworks running on HDFS-, S3-, or GCS-based data lakes or warehouses.
14. Data Orchestration for the Cloud

Application interfaces for lines of business: Java File API, HDFS interface, S3 interface, POSIX interface, REST API
Storage drivers underneath: HDFS driver, Swift driver, S3 driver, NFS driver
18. Accelerating analytics in the cloud

Problem: object stores have inconsistent performance for analytics and AI workloads
▪ SLAs are hard to achieve
▪ Metadata operations are expensive
▪ Storage costs for copied data add up, making the solution expensive

Solution: consistent high performance, with the Alluxio Data Orchestration and Control Service sitting between compute frameworks (Spark, Presto, Hive, TensorFlow) and public cloud IaaS storage. Alluxio enables compute!
• Performance increases range from 1.5x to 10x
• AWS EMR and Google Dataproc integrations
• Fewer copies of data means lower costs
19. Walmart | High-performance cloud analytics

Project:
• Use Presto for interactive queries on cloud object store data

Problem:
• Query performance too low to be usable
• Inconsistent query performance

Alluxio solution:
• Alluxio provides an intelligent distributed caching layer between Presto and the object store in the public cloud

Result:
• High-performance queries
• Consistent performance
• Interactive query performance for analysts
20. Presto & Alluxio work well together

[Benchmark charts comparing Presto alone vs. Presto + Alluxio: small-range query response time (lower is better), large-scan query response time (lower is better), and concurrency (higher is better).]

• Query performance bottlenecks: unpredictable network I/O
• Query pattern: datasets modeled in a star schema can benefit from dimension-table caching
• Presto + Alluxio:
  • Avoids unpredictable network I/O
  • Consistent query latency
  • Higher throughput and better concurrency
23. Using Alluxio with Google Dataproc

[Diagram: two Google Dataproc clusters each run Presto and Hive on top of an Alluxio metadata and data cache; each cache stays in compute-driven, continuous sync with Google Cloud Storage.]

A single-command initialization action brings up Alluxio in Dataproc.

Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
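For the open-source edition, attaching the initialization action at cluster creation looks roughly like the sketch below. The init-action path is assumed to follow the public `dataproc-initialization-actions` bucket layout from the repository linked above, and the bucket names and `alluxio_root_ufs_uri` value are placeholders for your own storage.

```shell
# Create a Dataproc cluster with Alluxio installed via the public init action.
# The gs:// paths are placeholders; verify the action path against the repo.
gcloud dataproc clusters create alluxio-demo \
    --region=us-central1 \
    --initialization-actions=gs://dataproc-initialization-actions/alluxio/alluxio.sh \
    --metadata=alluxio_root_ufs_uri=gs://my-bucket/alluxio-root/
```

Once the cluster is up, `alluxio fs ls /` on the master node is a quick way to confirm the root under-store is mounted.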
25. Bursting workloads to the cloud with remote data

Typical restrictions
▪ Data cannot be persisted in a public cloud
▪ Additional I/O capacity cannot be added to existing Hadoop infrastructure
▪ On-prem-level security needs to be maintained
▪ Network bandwidth utilization needs to be minimal

Options: lift and shift, data copy by workload, or “zero-copy” bursting
26. “Zero-copy” bursting to scale to the cloud

Problem: the on-prem HDFS cluster is compute-bound and complex to maintain.

Barrier 1: prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible

Barrier 2: copying data to the cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo

[Diagram: Spark, Presto, Hive, and TensorFlow run on the Alluxio Data Orchestration and Control Service both in the on-prem datacenter and on AWS public cloud IaaS, connected over the datacenter link.]

Step 1: hybrid cloud for burst compute capacity
• Orchestrates compute access to on-prem data
• Caches the working set of data, not the full set
• Local performance
• Scales elastically
• Offloads the on-prem cluster (both compute and I/O)

Step 2: online migration of data per policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves data per policy, e.g. the last 7 days
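Step 1 can be realized with stock Alluxio CLI commands by mounting the on-prem HDFS namespace into the cloud-side Alluxio cluster, so the working set is cached on first access. This is a sketch: the namenode address and paths are hypothetical.

```shell
# Mount an on-prem HDFS directory into the Alluxio namespace
# (namenode host and paths are example values).
alluxio fs mount /trades hdfs://onprem-namenode:8020/trades

# First reads pull data over the WAN and cache it;
# subsequent reads are served locally in the cloud.
alluxio fs ls /trades

# Optionally pre-warm the working set (e.g. recent trades)
# ahead of the burst workload.
alluxio fs distributedLoad /trades/us
```

Because only the working set crosses the WAN, compute can burst to the cloud without first copying the full dataset.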
27. ”Zero-copy” bursting under the hood

[Diagram: Spark, Presto, Hive, and TensorFlow issue data requests to the Alluxio framework, e.g. read file /trades/us, read /trades/us again, read file /trades/top. The link to the under store, which holds the Trades and Customers directories, has variable latency with throttling; repeat reads are served from RAM instead of re-crossing that link.]
28. Feature highlight: policy-driven data management

[Diagram: Spark, Presto, Hive, and TensorFlow run on the Alluxio framework with RAM, SSD, and disk tiers. New trades land in the hot tiers; a defined policy moves data more than 90 days old to GCS. Policy interval: every day, so the policy is applied daily.]
30. Demo: initialization action installs Alluxio in Dataproc

#1 - Access data in Google Cloud Store

[Diagram: the Google Dataproc cluster runs Presto and Hive on an Alluxio metadata and data cache in compute-driven, continuous sync with Google Cloud Store.]
31. Demo: initialization action installs Alluxio in Dataproc

#2 - Access data from a remote Hadoop cluster (simulated as Dataproc)

[Diagram: the same setup, with the Alluxio metadata and data cache in compute-driven, continuous sync against the remote cluster’s data.]
32. Get Started with Alluxio on Dataproc

A single command creates a Dataproc cluster with Alluxio installed:

$ gcloud dataproc clusters create roderickyao-alluxio \
    --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh \
    --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,\
alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",\
alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),\
alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz
Tutorial: Getting started with Dataproc and Alluxio
https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
33. Resources
Alluxio Initialization Action
- https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
Alluxio with Google Cloud Storage documentation
- https://docs.alluxio.io/ee/user/stable/en/ufs/GCS.html