SlideShare une entreprise Scribd logo
1  sur  45
Télécharger pour lire hors ligne
Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach
IBM Research
Serverless Machine
Learning on Modern
Hardware
#Res6SAIS
Serverless Computing
●
No need to setup/manage a
cluster
●
Automatic, dynamic and fine-
grained scaling
●
Sub-second billing
●
Many frameworks: AWS
Lambda, Google Cloud
Functions, Azure Functions,
Databricks Serverless, etc.
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GB
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GB
64 lambda
workers
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
100Gb/s
Ethernet
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
100Gb/s
Ethernet
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
RDMA,
NVMe flash,
NVMeF
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
put your #assignedhashtag here by setting the footer in view-header/footer●
Scheduler: when to best add/remove resources?
●
Container startup: may have to dynamically spin up containers
●
Storage: input data needs to be fetched from remote storage
(e.g., S3)
– As opposed to compute-local storage such as HDFS
●
Data sharing: intermediate needs to be temporarily stored on
remote storage (S3, Redis)
– Affects operations like shuffle, broadcast, etc.,
Why is it so hard?
put your #assignedhashtag here by setting the footer in view-header/footer●
Scheduler: when to best add/remove resources?
●
Container startup: may have to dynamically spin up containers
●
Storage: input data needs to be fetched from remote storage
(e.g., S3)
– As opposed to compute-local storage such as HDFS
●
Data sharing: intermediate needs to be temporarily stored on
remote storage (S3, Redis)
– Affects operations like shuffle, broadcast, etc.,
Why is it so hard?
Example: MapReduce (Cluster)
Map
Stage
Reduce
Stage
Compute &
Store Nodes
Compute &
Store Nodes
Compute &
Store Nodes
data is mostly
written and
read locally
Map
Stage
Reduce
Stage
Dynamically
growing/shrinking
compute cloud
Storage Service
(e.g, S3, Redis)
Shuffle
data is
exclusively
written and
read remotely
Example: MapReduce (Serverless)
I/O Overhead: Sorting 100GB
AWS Lambda Spark Serverless Spark Cloud Spark Cluster Spark HPC
0
100
200
300
400
500 Shuffle I/O
Compute
Input/Output
60%
22%
18%
Spark Cluster Spark HPC
0
10
20
30
40
50
60
Shuffle overheads are significantly higher when intermediate data is stored remotely
19%
49%
32%
38%
3%
59%AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark Spark
On-premise On-premise++
Runtime[seconds]
What about other workloads?
Example: SQL, Query 77 / TPC-DS benchmark
What about other workloads?
Example: SQL, Query 77 / TPC-DS benchmark
Shuffle/Broadcast
(needs to be stored remotely)
What about other workloads?
W W PS W W
could be co-located
with worker nodes
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
Example: Iterative ML (e.g., linear regression)
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
read remote data
scale
out
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
read remote data
scale
out
barrier, need to wait
Can we..
●
..use Spark to run such workloads in a serverless
fashion?
– Dynamic scaling of compute nodes while jobs are running
– No cluster configuration
– No startup time overhead
●
..eliminate the performance overheads?
– Workloads should run as fast as on a dedicated cluster
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
Complex!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
Complex!
1
2
3
1
2
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
Intermediate data
Intermediate
data
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Intermediate
data
Intermediate data
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Intermediate
data
Intermediate data
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
Executors get
dynamically
re-assigned to
apps/drivers
crail.apache.org
Intermediate
data
Intermediate data
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
“The HCl Scheduler:
Going all-in on Heterogeneity”,
Michael Kaufmann et al., HotCloud’17
Video: Putting things together
Application 1:
GridSearch
Application 2:
SQL TPC-DS
Application 3:
SQL TPC-DS
HCL
Scheduler
Resource view:
fraction of resources
each app consumes
Executor view:
Which app an executor
currently runs
Scheduling Snapshot
Executors
Time
Grid Search
SQL
put your #assignedhashtag here by setting the footer in view-header/footer●
Compute cluster size: 8 nodes: IBM Power8 Minsky
●
Storage cluster size: 8 nodes, IBM Power8 Minsky
●
Cluster hardware:
– DRAM: 512 GB
– Storage: 4x 1.2 TB NVMe SSD
– Network: 10Gb/s Ethernert, 100Gb/s RoCE
– GPU: NVIDIA P100, NVLink
●
Workloads
– ML: Logistic Regression using the CoCoa framework
– SQL: TCP-DS
Let’s look at performance...
ML: Logistic Regression
Vanilla Cluster HCS/CRAIL/TCP HCS/CRAIL/100Gb/s HCS/CRAIL/RDMA
0
0.5
1
1.5
2
2.5
Input
Reduce
Broadcast
Compute
KDDA data set
6.5 GB
Runtime[seconds]
Vanilla
Cluster,100Gb/s
Vanilla
Serverless
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML
TPC-DS: Query #87
Spark Cluster Simple Serverless HCS/CRAIL/TCP HCS/CRAIL/RDMA
0
2
4
6
8
10
12
14
Runtime[seconds]
Vanilla
Cluster
HCS/Crail
TCP,10Gb/s
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
We need a fast network and an optimized software stack to eliminate storage overheads
TPC-DS: Query #3
Warm serverless
Crail Serverless
Spark Cluster
0 5 10 15 20 25 30 35
Runtime [seconds]
Executing short running queries on a warm serverless cluster improves performance
put your #assignedhashtag here by setting the footer in view-header/footer●
Efficient serverless computing is challenging
– Local state (e.g. shuffle, cached input, network state) is lost as compute cloud scales up/down
●
This talk: turning Spark into a serverless framework by
– Implementing HCS, a new serverless scheduler
– Consequently storing compute state remotely using Apache Crail
●
Supports arbitrary Spark workloads with almost no performance overhead
– MapReduce, SQL, Iterative Machine Learning
●
Implicit support for fast network and storage hardware
– e.g, RDMA, NVMe, NVMf
Conclusion
put your #assignedhashtag here by setting the footer in view-header/footer
●
Containerize the platform
●
Add support for dynamic re-partitioning on scale
events
●
Add support for automatic caching
●
Add more sophisticated scheduling policies
Future Work
put your #assignedhashtag here by setting the footer in view-header/footer
●
Running Apache Spark on a High-Performance
Cluster Using RDMA and NVMe Flash, Spark
Summit’17, https://tinyurl.com/yd453uzq
●
Apache Crail, http://crail.apache.org
●
HCS Scheduler, github.com/zrlio/hcs
●
Spark-HCS, github.com/zrlio/spark-hcs
●
Spark-IO, github.com/zrlio/spark-io
Links
Thanks to
Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle,
Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen
Wang

Contenu connexe

Tendances

700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentDatabricks
 

Tendances (20)

700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 

Similaire à Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfprevota
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 

Similaire à Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi (20)

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 

Dernier (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 

Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

  • 1. Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach IBM Research Serverless Machine Learning on Modern Hardware #Res6SAIS
  • 2. Serverless Computing ● No need to setup/manage a cluster ● Automatic, dynamic and fine- grained scaling ● Sub-second billing ● Many frameworks: AWS Lambda, Google Cloud Functions, Azure Functions, Databricks Serverless, etc.
  • 3. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GB AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 4. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GB 64 lambda workers AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 5. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 6. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 7. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines 100Gb/s Ethernet AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 8. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines 100Gb/s Ethernet AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ RDMA, NVMe flash, NVMeF Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 9. put your #assignedhashtag here by setting the footer in view-header/footer● Scheduler: when to best add/remove resources? ● Container startup: may have to dynamically spin up containers ● Storage: input data needs to be fetched from remote storage (e.g., S3) – As opposed to compute-local storage such as HDFS ● Data sharing: intermediate needs to be temporarily stored on remote storage (S3, Redis) – Affects operations like shuffle, broadcast, etc., Why is it so hard?
  • 10. put your #assignedhashtag here by setting the footer in view-header/footer● Scheduler: when to best add/remove resources? ● Container startup: may have to dynamically spin up containers ● Storage: input data needs to be fetched from remote storage (e.g., S3) – As opposed to compute-local storage such as HDFS ● Data sharing: intermediate needs to be temporarily stored on remote storage (S3, Redis) – Affects operations like shuffle, broadcast, etc., Why is it so hard?
  • 11. Example: MapReduce (Cluster) Map Stage Reduce Stage Compute & Store Nodes Compute & Store Nodes Compute & Store Nodes data is mostly written and read locally
  • 12. Map Stage Reduce Stage Dynamically growing/shrinking compute cloud Storage Service (e.g, S3, Redis) Shuffle data is exclusively written and read remotely Example: MapReduce (Serverless)
  • 13. I/O Overhead: Sorting 100GB AWS Lambda Spark Serverless Spark Cloud Spark Cluster Spark HPC 0 100 200 300 400 500 Shuffle I/O Compute Input/Output 60% 22% 18% Spark Cluster Spark HPC 0 10 20 30 40 50 60 Shuffle overheads are significantly higher when intermediate data is stored remotely 19% 49% 32% 38% 3% 59%AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark Spark On-premise On-premise++ Runtime[seconds]
  • 14. What about other workloads? Example: SQL, Query 77 / TPC-DS benchmark
  • 15. What about other workloads? Example: SQL, Query 77 / TPC-DS benchmark Shuffle/Broadcast (needs to be stored remotely)
  • 16. What about other workloads? W W PS W W could be co-located with worker nodes *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model Example: Iterative ML (e.g., linear regression)
  • 17. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model
  • 18. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes Needs to be remote W *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model read remote data scale out
  • 19. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes Needs to be remote W *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model read remote data scale out barrier, need to wait
  • 20. Can we.. ● ..use Spark to run such workloads in a serverless fashion? – Dynamic scaling of compute nodes while jobs are running – No cluster configuration – No startup time overhead ● ..eliminate the performance overheads? – Workloads should run as fast as on a dedicated cluster
  • 21. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options 1 2 3 1 2
  • 22. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options High startup Latency! 1 2 3 1 2
  • 23. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! 1 2 3 1 2
  • 24. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! 1 2 3 1 2
  • 25. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! Complex! 1 2 3 1 2
  • 26. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! Complex! 1 2 3 1 2
  • 27. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server Intermediate data Intermediate data send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org
  • 28. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org Intermediate data Intermediate data
  • 29. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org Intermediate data Intermediate data
  • 30. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task Executors get dynamically re-assigned to apps/drivers crail.apache.org Intermediate data Intermediate data
  • 32. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource computes task schedule
  • 33. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource computes task schedule Uses ML to predict task runtimes based on previous runs
  • 34. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource determines resource set (executors) for a given schedule computes task schedule Uses ML to predict task runtimes based on previous runs
  • 35. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource determines resource set (executors) for a given schedule computes task schedule Uses ML to predict task runtimes based on previous runs “The HCl Scheduler: Going all-in on Heterogeneity”, Michael Kaufmann et al., HotCloud’17
  • 36. Video: Putting things together Application 1: GridSearch Application 2: SQL TPC-DS Application 3: SQL TPC-DS HCL Scheduler Resource view: fraction of resources each app consumes Executor view: Which app an executor currently runs
  • 38. put your #assignedhashtag here by setting the footer in view-header/footer● Compute cluster size: 8 nodes: IBM Power8 Minsky ● Storage cluster size: 8 nodes, IBM Power8 Minsky ● Cluster hardware: – DRAM: 512 GB – Storage: 4x 1.2 TB NVMe SSD – Network: 10Gb/s Ethernert, 100Gb/s RoCE – GPU: NVIDIA P100, NVLink ● Workloads – ML: Logistic Regression using the CoCoa framework – SQL: TCP-DS Let’s look at performance...
  • 39. ML: Logistic Regression Vanilla Cluster HCS/CRAIL/TCP HCS/CRAIL/100Gb/s HCS/CRAIL/RDMA 0 0.5 1 1.5 2 2.5 Input Reduce Broadcast Compute KDDA data set 6.5 GB Runtime[seconds] Vanilla Cluster,100Gb/s Vanilla Serverless HCS/Crail TCP,100Gb/s HCS/Crail RDMA,100Gb/s Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML
  • 40. TPC-DS: Query #87 Spark Cluster Simple Serverless HCS/CRAIL/TCP HCS/CRAIL/RDMA 0 2 4 6 8 10 12 14 Runtime[seconds] Vanilla Cluster HCS/Crail TCP,10Gb/s HCS/Crail TCP,100Gb/s HCS/Crail RDMA,100Gb/s We need a fast network and an optimized software stack to eliminate storage overheads
  • 41. TPC-DS: Query #3 Warm serverless Crail Serverless Spark Cluster 0 5 10 15 20 25 30 35 Runtime [seconds] Executing short running queries on a warm serverless cluster improves performance
  • 42. put your #assignedhashtag here by setting the footer in view-header/footer● Efficient serverless computing is challenging – Local state (e.g. shuffle, cached input, network state) is lost as compute cloud scales up/down ● This talk: turning Spark into a serverless framework by – Implementing HCS, a new serverless scheduler – Consequently storing compute state remotely using Apache Crail ● Supports arbitrary Spark workloads with almost no performance overhead – MapReduce, SQL, Iterative Machine Learning ● Implicit support for fast network and storage hardware – e.g, RDMA, NVMe, NVMf Conclusion
  • 43. put your #assignedhashtag here by setting the footer in view-header/footer ● Containerize the platform ● Add support for dynamic re-partitioning on scale events ● Add support for automatic caching ● Add more sophisticated scheduling policies Future Work
  • 44. put your #assignedhashtag here by setting the footer in view-header/footer ● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit’17, https://tinyurl.com/yd453uzq ● Apache Crail, http://crail.apache.org ● HCS Scheduler, github.com/zrlio/hcs ● Spark-HCS, github.com/zrlio/spark-hcs ● Spark-IO, github.com/zrlio/spark-io Links
  • 45. Thanks to Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle, Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen Wang