13th ANNUAL WORKSHOP 2017
ACCELERATING APACHE SPARK WITH RDMA
Yuval Degani, Sr. Manager, Big Data and Machine Learning
March 28th, 2017
Mellanox Technologies
OpenFabrics Alliance Workshop 2017
AGENDA
 Apache Spark 101
 The Potential in Accelerating Spark Shuffle
 Accelerating Spark Shuffle with RDMA – Deep Dive
 Results
 Roadmap
APACHE SPARK 101
APACHE SPARK 101
Spark within the Big Data ecosystem
APACHE SPARK 101
 In a nutshell: Spark is a data analysis platform with implicit data parallelism and fault-tolerance
 Initial release: May, 2014
 Originally developed at UC Berkeley’s AMPLab
 Donated as open source to the Apache Software Foundation
 Most active Apache open source project
 50% of production systems are in public clouds
 Notable users:
Quick facts
APACHE SPARK 101
 The traditional Map-Reduce model is bound to an acyclic data flow from stable storage to stable storage (e.g. HDFS)
But why?
[Diagram: two Map-Reduce jobs, each running Map tasks that feed Reduce tasks through stable storage]
APACHE SPARK 101
 Some applications frequently reuse data, and don’t just digest it once:
• Iterative algorithms (e.g. machine learning, graphs)
• Interactive data-mining and streaming
 The acyclic data flow model becomes very inefficient when data gets repeatedly reused
 The solution: Resilient Distributed Datasets (RDDs)
• In-memory data representation
• The heart of Apache Spark
• Preserves and enhances the appealing properties of Map-Reduce:
• Fault tolerance
• Data locality
• Scalability
But why?
APACHE SPARK 101
 RDDs let us keep data within quick reach, in-memory
 But, cluster computing is all about moving data around the network!
 Data transfer over the network between stages of computation is what we call “Shuffle”
 So, we have memory-to-network-to-memory transfers: RDMA is a perfect fit!
“Everyday I'm Shuffling”
APACHE SPARK 101
Shuffling is the process of redistributing data across partitions (AKA repartitioning)
Shuffle Basics
1. Shuffle Write: worker nodes write their intermediate data blocks (Map output) into local storage, and list the blocks by partition
2. Broadcast block locations: the master node broadcasts a combined list of blocks, grouped by partition
3. Shuffle Read: each reduce partition fetches all of the blocks listed under its partition, from various nodes in the cluster
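The three steps above can be sketched as a toy, single-process model of shuffle repartitioning. This is an illustrative sketch only; the function names (`shuffle_write`, `broadcast_locations`, `shuffle_read`) are made up for the example and are not Spark's actual internals.

```python
# Toy model of Spark's shuffle: map tasks write partitioned blocks to
# "local storage", the master broadcasts block locations grouped by
# partition, and each reduce partition fetches the blocks listed under it.
from collections import defaultdict

def shuffle_write(worker_id, records, num_partitions):
    """Group a worker's map output into per-partition blocks."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[hash(key) % num_partitions].append((key, value))
    # Each block is addressed by (worker, partition), like a BlockID.
    return {(worker_id, p): kvs for p, kvs in blocks.items()}

def broadcast_locations(all_blocks):
    """Master step: a combined list of blocks, grouped by partition."""
    locations = defaultdict(list)
    for (worker_id, p) in all_blocks:
        locations[p].append((worker_id, p))
    return locations

def shuffle_read(partition, locations, all_blocks):
    """Reducer step: fetch every block listed under its partition."""
    return [kv for block_id in locations[partition] for kv in all_blocks[block_id]]

# Two workers writing map output into two partitions (integer keys keep
# hash() deterministic across runs).
store = {}
store.update(shuffle_write(0, [(1, "a"), (2, "b")], 2))
store.update(shuffle_write(1, [(1, "c"), (3, "d")], 2))
locs = broadcast_locations(store)
part1 = shuffle_read(1, locs, store)  # all records whose key hashes to 1
```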
THE POTENTIAL IN ACCELERATING SHUFFLE
THE POTENTIAL IN ACCELERATING SHUFFLE
 Before setting off on the journey of adding RDMA to Spark, it was essential to estimate
the ROI of such work
 What’s better than a controlled experiment?
What’s the opportunity here?
Goal: quantify the potential performance gain in accelerating network transfers
Method: bypass the network in Spark Shuffle and compare to the original (no network = maximum performance gain)
THE POTENTIAL IN ACCELERATING SHUFFLE
 General idea:
• Do not fetch blocks from the network, just reuse a local sampled block over and over
• No copying
• No message passing whatsoever (for Shuffle)
• Everything else stays exactly the same! (no optimizations in Shuffle)
• Compare to TCP
Bypassing the network in Spark Shuffle
Phase: Worker node initialization
• Both: read a sampled shuffle block from disk for reuse in Shuffle
Phase: On block fetch request
• TCP Shuffle: send the request and wait for data to come in to trigger completion
• Network-bypass Shuffle: immediately complete the block fetch request
Phase: On block fetch request completion
• TCP Shuffle: drop the incoming shuffle data block
• Network-bypass Shuffle: no-op
Phase: Deliver shuffle block
• Both: duplicate (shallow copy) the sampled buffer from initialization and publish (trim to desired size)
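The bypass path can be sketched as a block fetcher that never touches the network: a sampled block is read once at initialization, and each "fetch" completes immediately with a trimmed shallow copy of it. This is a hedged sketch; `bypass_fetch_block` and `SAMPLED_BLOCK` are illustrative names, and the real experiment patches Spark's shuffle internals instead.

```python
# Simulate the network-bypass experiment: never fetch a remote block;
# immediately complete the fetch request by shallow-copying a sampled
# block that was read from disk once at worker initialization.

SAMPLED_BLOCK = bytes(range(256)) * 4  # stands in for the on-disk sample (1 KiB)

def bypass_fetch_block(block_id, desired_size):
    """Complete a block fetch request with zero network traffic."""
    # A memoryview shares the underlying buffer (no copying); slicing
    # trims it to the size the requested block would have had.
    return memoryview(SAMPLED_BLOCK)[:desired_size]

blk = bypass_fetch_block("shuffle_0_1_2", 100)  # block ID is irrelevant here
```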
THE POTENTIAL IN ACCELERATING SHUFFLE
 Benchmark:
• HiBench TeraSort
• Workload: 600GB
 Testbed:
• HDFS on Hadoop 2.6.0
• No replication
• Spark 2.0.0
• 1 Master
• 30 Workers
• 28 active Spark cores on each node, 840 total
• Node info:
• Intel Xeon E5-2697 v3 @ 2.60GHz
• 256GB RAM, 128GB of it reserved as RAMDISK
• RAMDISK is used for Spark local directories and HDFS
Testbed
THE POTENTIAL IN ACCELERATING SHUFFLE
 TeraSort:
• Read unsorted data from HDFS
• Sort the data
• Write back sorted data to HDFS
 Workload:
• A collection of K-V pairs
• Keys are random
 Algorithm at high-level (all phases are distributed among all cores):
• Define the partitioner:
• Read random samples from HDFS and define balanced partitions of key ranges
• Partition (map) the data:
• Read data from HDFS, and group by partition, according to the partitions defined in the first step (Shuffle Write)
• Retrieve partitioned data (Shuffle Read)
• Sort the K-V pairs by key
• Write the sorted data to HDFS
TeraSort
Record layout (bytes): Key (10) | Const (2) | Row ID (32) | Const (4) | Filler (48) | Const (4)
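The first phase above, defining balanced key-range partitions from random samples, can be sketched in a few lines. This is a Python illustration of the idea, not Spark's Scala `RangePartitioner`; `make_partitioner` is a made-up name.

```python
# Sampling-based range partitioner in the spirit of TeraSort's first
# phase: sort a random sample of keys, pick P-1 evenly spaced split
# points, then route each key to the range that contains it.
import bisect
import random

def make_partitioner(sampled_keys, num_partitions):
    """Derive balanced key-range boundaries from a random sample."""
    s = sorted(sampled_keys)
    # P-1 split points taken at evenly spaced ranks of the sample.
    bounds = [s[len(s) * i // num_partitions] for i in range(1, num_partitions)]
    return lambda key: bisect.bisect_right(bounds, key)

random.seed(0)
sample = [random.randrange(10_000) for _ in range(1_000)]
part = make_partitioner(sample, 4)  # maps any key to a partition in 0..3
```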
THE POTENTIAL IN ACCELERATING SHUFFLE
Results
Runtime samples    TCP (normal)    Network bypass    Opportunity
Average            777 seconds     356 seconds       54%
Max                815 seconds     389 seconds       52%
Min                755 seconds     318 seconds       58%

[Bar chart: TCP TeraSort vs. Network-bypass TeraSort runtime (Average/Min/Max); lower is better]
Huge opportunity for RDMA!
ACCELERATING SHUFFLE WITH RDMA
ACCELERATING SHUFFLE WITH RDMA
Challenges in adding RDMA to Spark Shuffle
 Spark is written in Java & Scala
• What RDMA API to use?
• Java’s Garbage Collection is a pain for managing low-level buffers
 RDMA connection establishment – how can we make connections as long lived as
possible?
 Spark’s Shuffle Write data is currently saved on the local disk. How can we make the
data available for RDMA?
 Must keep changes to Spark code to a minimum
• Spark is not very pluggable
• Spark keeps changing rapidly – API changes, implementation changes
• Maintain long term functionality
Challenges
ACCELERATING SHUFFLE WITH RDMA
 RDMA API is available in Java through:
• JXIO (AccelIO) – abstraction of RDMA through RPC (https://github.com/accelio/JXIO)
• DiSNI – open-source release of IBM’s jVerbs, an ibverbs-like interface in Java (https://github.com/zrlio/disni)
• We decided to go with DiSNI since it gives us more flexibility
 RDMA buffer management
• Centralized pool of buffers per worker node
• Pool consists of multiple stacks of different sizes (in powers of 2)
• Buffers are allocated off-heap and registered on demand
• When they are no longer in use – they are recycled back to the pool
• Survive for the length of a job (minutes)
 RDMA connection establishment
• For simplicity, couple each TCP connection with a corresponding RDMA connection
• RDMA is used for all block data transfers
• Survive for the length of a job (minutes)
RDMA Facilities – Design Notes
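The buffer-pool design above (per-node pool, stacks of power-of-2 size classes, on-demand allocation, recycling) can be sketched as follows. This is a hedged sketch: real code would allocate off-heap memory and register it with the RDMA NIC (e.g. via DiSNI); plain `bytearray`s stand in for that here, and `BufferPool` is an illustrative name.

```python
# Centralized pool of reusable buffers in power-of-2 size classes:
# acquire() hands out a recycled buffer when one is free (avoiding a
# fresh allocation/registration), release() recycles instead of freeing.

class BufferPool:
    def __init__(self):
        self.stacks = {}  # size class (power of 2) -> stack of free buffers

    @staticmethod
    def size_class(n):
        """Smallest power of 2 that fits n bytes."""
        return 1 << (n - 1).bit_length()

    def acquire(self, n):
        size = self.size_class(n)
        stack = self.stacks.setdefault(size, [])
        # Reuse a recycled buffer if available; otherwise allocate (and,
        # in the real system, register) a new one of the class size.
        return stack.pop() if stack else bytearray(size)

    def release(self, buf):
        # Recycle back to the pool rather than freeing/deregistering.
        self.stacks.setdefault(len(buf), []).append(buf)

pool = BufferPool()
a = pool.acquire(3000)  # served from the 4096-byte size class
pool.release(a)
b = pool.acquire(4096)  # reuses the recycled buffer, no new allocation
```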
ACCELERATING SHUFFLE WITH RDMA
 The challenge – get the Shuffle data in memory, and register as a Memory Region
 We provide two modes:
Shuffle Write – Design Notes
Method: In-memory
• Description: instead of saving Shuffle data to disk, save it directly into RDMA buffers
• Pros:
  • Faster and more consistent than file-mapping unless low on system memory
  • Less overhead for RDMA
  • Simpler indexing (buffer-per-block)
• Cons:
  • High memory footprint
  • Many changes in Spark’s ShuffleWriters and ShuffleSorters
  • Limited functionality (spills, low memory)

Method: File-mapping
• Description: map Spark’s file on local disk to memory, and register it as a Memory Region
• Pros:
  • Minimal changes in Spark code
  • Files are already in buffer-cache – mmap is not expensive
  • Comparable memory footprint to TCP
  • No functional limitations
• Cons:
  • mmap() occasionally demonstrates latency jitter
  • Indexing – multiple blocks per file, as in TCP
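The file-mapping mode can be sketched with a plain `mmap`: map an existing shuffle file into memory and hand out zero-copy views of individual blocks by offset. The real plugin would additionally register the mapping as an RDMA Memory Region (e.g. through DiSNI); this sketch stops at the mapping, and the file layout is made up for the example.

```python
# Map a shuffle data file and index multiple blocks within it by offset,
# as in Spark's per-file shuffle index.
import mmap
import os
import tempfile

# Stand-in for a Spark shuffle data file holding two 8-byte blocks.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"AAAAAAAA" + b"BBBBBBBB")

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Zero-copy view of the second block (offset 8, length 8).
    block1 = memoryview(mapped)[8:16]
    data = bytes(block1)
    block1.release()  # drop the view before unmapping
    mapped.close()
os.remove(path)
```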
ACCELERATING SHUFFLE WITH RDMA
Shuffle Read Protocol in Spark (TCP)

Requestor → Responder: Fetch Request RPC with the list of BlockIDs for a new stream
Responder: looks up the blocks (from memory/disk) and sets up a stream of blocks
Responder → Requestor: Fetch Response RPC with the StreamID
Requestor → Responder: a block fetch request for each block in the StreamID (Fetch Request RPC: BlockID=y)
Responder: looks up each block and sends its data as an RPC (Fetch Response RPC: Data for BlockID=y)
Requestor: collects incoming data chunks and calls the completion handler per block
ACCELERATING SHUFFLE WITH RDMA
Shuffle Read Protocol in Spark (RDMA)

Requestor: allocates an RDMA buffer per block and retrieves the Lkey and vAddress of each block
Requestor → Responder: Fetch Request RPC with the BlockID + Rkey + vAddress list for a new stream
Responder: looks up the blocks (from memory/disk) and sets up a stream of blocks
Responder: RDMA-Writes each local block into the corresponding remote block buffer (one RDMA Write per block)
Responder: sends the last block in the stream as an RDMA Write with Immediate, notifying the requestor of stream fetch completion
Requestor: calls the completion handler on all of the blocks in the stream
ACCELERATING SHUFFLE WITH RDMA
 N = number of blocks
Shuffle Read: TCP vs. RDMA

                                  TCP                       RDMA
Messages per fetch                2(N+1)                    N+1
Copy operations per block         Responder: 1 to 2         0-copy
                                  Requestor: 1
                                  Total: 2 to 3
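The message counts in the table can be made concrete with a small helper (`messages_per_fetch` is an illustrative name, just to show the arithmetic):

```python
# Messages per shuffle-read fetch, per the table above: TCP needs a
# request/response pair for the stream setup plus one pair per block;
# RDMA needs the setup request plus one RDMA Write per block (the last
# write carrying the completion notification via Immediate data).

def messages_per_fetch(num_blocks, transport):
    if transport == "tcp":
        return 2 * (num_blocks + 1)
    if transport == "rdma":
        return num_blocks + 1
    raise ValueError(f"unknown transport: {transport}")
```

For a 1000-block fetch this is 2002 messages over TCP versus 1001 over RDMA, roughly halving the message count on top of eliminating the copies.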
RESULTS
RESULTS
In-house TeraSort Results

Testbed:
• HiBench TeraSort
• Workload: 300GB
• HDFS on Hadoop 2.6.0
• No replication
• Spark 2.0.0
• 1 Master
• 14 Workers
• 28 active Spark cores on each node, 392 total
• Node info: Intel Xeon E5-2697 v3 @ 2.60GHz, RoCE 100GbE, 256GB RAM; HDD is used for Spark local directories and HDFS

Runtime samples    TCP            RDMA           Improvement
Average            301 seconds    247 seconds    18%
Max                319 seconds    264 seconds    17%
Min                284 seconds    237 seconds    17%

[Bar chart: TCP vs. RDMA TeraSort runtime (Average/Min/Max); lower is better]
RESULTS
Real-life Applications – Initial Results

Runtime samples     TCP           RDMA          Improvement    Input Size    Nodes    Cores per node    RAM per node
Customer App #1     138 seconds   114 seconds   17%            5GB           14       24                85GB
Customer App #2     60 seconds    51 seconds    15%            270GB         14       24                85GB
HiBench TeraSort    66 seconds    59 seconds    11%            100GB         16       24                85GB

[Bar chart: TCP vs. RDMA runtime for Customer App #1, Customer App #2, and HiBench TeraSort; lower is better]
ROADMAP
 Spark RDMA in GA expected in 2017
 Open-source Spark plugin
 Push to Spark upstream
What’s next?
13th ANNUAL WORKSHOP 2017
THANK YOU
Yuval Degani, Sr. Manager, Big Data and Machine Learning
Mellanox Technologies

Contenu connexe

Tendances

Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
DataWorks Summit
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
Linaro
 
HPC Network Stack on ARM
HPC Network Stack on ARMHPC Network Stack on ARM
HPC Network Stack on ARM
inside-BigData.com
 
Challenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPIChallenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPI
inside-BigData.com
 

Tendances (20)

High Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & RankingsHigh Performance Interconnects: Landscape, Assessments & Rankings
High Performance Interconnects: Landscape, Assessments & Rankings
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
 
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503The HPE Machine and Gen-Z - BUD17-503
The HPE Machine and Gen-Z - BUD17-503
 
Lenovo HPC Strategy Update
Lenovo HPC Strategy UpdateLenovo HPC Strategy Update
Lenovo HPC Strategy Update
 
Deep Learning: Convergence of HPC and Hyperscale
Deep Learning: Convergence of HPC and HyperscaleDeep Learning: Convergence of HPC and Hyperscale
Deep Learning: Convergence of HPC and Hyperscale
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
HPC Network Stack on ARM
HPC Network Stack on ARMHPC Network Stack on ARM
HPC Network Stack on ARM
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 
Challenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPIChallenges and Opportunities for HPC Interconnects and MPI
Challenges and Opportunities for HPC Interconnects and MPI
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnect
 
Overview of the MVAPICH Project and Future Roadmap
Overview of the MVAPICH Project and Future RoadmapOverview of the MVAPICH Project and Future Roadmap
Overview of the MVAPICH Project and Future Roadmap
 
DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
 
Programming Models for Exascale Systems
Programming Models for Exascale SystemsProgramming Models for Exascale Systems
Programming Models for Exascale Systems
 
Interconnect your future
Interconnect your futureInterconnect your future
Interconnect your future
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 

Similaire à Accelerating apache spark with rdma

October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
DataWorks Summit
 

Similaire à Accelerating apache spark with rdma (20)

October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkThe Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 
Couchbase Data Pipeline
Couchbase Data PipelineCouchbase Data Pipeline
Couchbase Data Pipeline
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark
SparkSpark
Spark
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
Data processing at the speed of 100 Gbps@Apache Crail (Incubating)
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7
 

Plus de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
inside-BigData.com
 

Plus de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Accelerating apache spark with rdma

 So, we have memory-to-network-to-memory transfers: RDMA is a perfect fit!
8
“Everyday I'm Shuffling”
OpenFabrics Alliance Workshop 2017
APACHE SPARK 101
Shuffling is the process of redistributing data across partitions (AKA repartitioning):
• Shuffle Write: worker nodes write their intermediate data blocks (Map output) into local storage, and list the blocks by partition
• Broadcast block locations: the master node broadcasts a combined list of blocks, grouped by partition
• Shuffle Read: each reduce partition fetches all of the blocks listed under its partition – from various nodes in the cluster
9
Shuffle Basics
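The three shuffle steps above can be sketched in code. This is a toy illustration with made-up names, not Spark's actual shuffle machinery: a hash partitioner stands in for Spark's default partitioner, maps stand in for on-disk block files, and the broadcast step is implicit in passing every worker's block map to the reader.

```java
import java.util.*;

// Toy sketch of shuffle: map tasks write blocks grouped by reduce
// partition; each reducer then gathers its partition from every worker.
public class ShuffleSketch {
    // Shuffle Write: a worker groups its records by reduce partition.
    static Map<Integer, List<String>> shuffleWrite(List<String> records, int numPartitions) {
        Map<Integer, List<String>> blocks = new HashMap<>();
        for (String r : records) {
            int p = Math.floorMod(r.hashCode(), numPartitions); // hash partitioner
            blocks.computeIfAbsent(p, k -> new ArrayList<>()).add(r);
        }
        return blocks;
    }

    // Shuffle Read: a reducer collects its partition's blocks from every worker.
    static List<String> shuffleRead(List<Map<Integer, List<String>>> allWorkers, int partition) {
        List<String> out = new ArrayList<>();
        for (Map<Integer, List<String>> worker : allWorkers)
            out.addAll(worker.getOrDefault(partition, List.of()));
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> w1 = shuffleWrite(List.of("a", "b", "c"), 2);
        Map<Integer, List<String>> w2 = shuffleWrite(List.of("d", "e"), 2);
        int total = shuffleRead(List.of(w1, w2), 0).size()
                  + shuffleRead(List.of(w1, w2), 1).size();
        // Every record lands in exactly one partition, so nothing is lost.
        System.out.println(total); // 5
    }
}
```

The point of the sketch: every record crosses a worker boundary unless it happens to be produced on the node that reduces it, which is why shuffle is network-bound at scale.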
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
10
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
 Before setting off on the journey of adding RDMA to Spark, it was essential to estimate the ROI of such work
 What’s better than a controlled experiment?
• Goal: quantify the potential performance gain in accelerating network transfers
• Method: bypass the network in Spark Shuffle and compare to the original – no network = maximum performance gain
11
What’s the opportunity here?
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
 General idea:
• Do not fetch blocks from the network, just reuse a local sampled block over and over
• No copying
• No message passing whatsoever (for Shuffle)
• Everything else stays exactly the same! (no optimizations in Shuffle)
• Compare to TCP
 Phase-by-phase comparison:
• Worker node initialization: read a sampled shuffle block from disk for reuse in Shuffle
• On block fetch request – TCP Shuffle: send request and wait for data to come in to trigger completion; network-bypass Shuffle: immediately complete the block fetch request
• On block fetch request completion – TCP Shuffle: drop the incoming shuffle data block; network-bypass Shuffle: no op
• Deliver shuffle block (both modes): duplicate (shallow copy) the sampled buffer from initialization and publish it (trimmed to the desired size)
12
Bypassing the network in Spark Shuffle
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
 Benchmark:
• HiBench TeraSort
• Workload: 600GB
 Testbed:
• HDFS on Hadoop 2.6.0
• No replication
• Spark 2.0.0
• 1 Master
• 30 Workers
• 28 active Spark cores on each node, 840 total
• Node info:
• Intel Xeon E5-2697 v3 @ 2.60GHz
• 256GB RAM, 128GB of it reserved as RAMDISK
• RAMDISK is used for Spark local directories and HDFS
13
Testbed
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
 TeraSort:
• Read unsorted data from HDFS
• Sort the data
• Write back sorted data to HDFS
 Workload:
• A collection of K-V pairs
• Keys are random
• Record layout (bytes): Key (10), Const (2), Row ID (32), Const (4), Filler (48), Const (4)
 Algorithm at high level (all phases are distributed among all cores):
• Define the partitioner: read random samples from HDFS and define balanced partitions of key ranges
• Partition (map) the data: read data from HDFS, and group by partition, according to the partitions defined in step 1 (Shuffle Write)
• Retrieve partitioned data (Shuffle Read)
• Sort the K-V pairs by key
• Write the sorted data to HDFS
TeraSort
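Step 1 of the outline above – deriving balanced key ranges from a random sample – can be sketched as follows. This is a simplified, hypothetical range partitioner (class and method names invented for the demo), not Spark's `RangePartitioner`: it picks evenly spaced quantiles from a sorted sample as partition boundaries.

```java
import java.util.Arrays;

// Sample-based range partitioner sketch: boundaries are quantiles of a
// random key sample, so partitions get roughly equal amounts of data.
public class RangePartitionerSketch {
    final long[] bounds; // upper bound of each partition except the last

    RangePartitionerSketch(long[] sample, int numPartitions) {
        long[] s = sample.clone();
        Arrays.sort(s);
        bounds = new long[numPartitions - 1];
        for (int i = 0; i < bounds.length; i++)
            bounds[i] = s[(i + 1) * s.length / numPartitions]; // evenly spaced quantiles
    }

    int partitionOf(long key) {
        int p = Arrays.binarySearch(bounds, key);
        // Negative result encodes the insertion point: keys below the first
        // boundary go to partition 0, above the last boundary to the last one.
        return p >= 0 ? p + 1 : -(p + 1);
    }

    public static void main(String[] args) {
        long[] sample = {5, 90, 40, 10, 70, 20, 60, 30}; // random key sample
        RangePartitionerSketch part = new RangePartitionerSketch(sample, 4);
        System.out.println(part.partitionOf(1));  // 0: below the first boundary
        System.out.println(part.partitionOf(95)); // 3: above the last boundary
    }
}
```

Because the boundaries come from a sample of the real key distribution, each reduce partition receives a similar volume of data, which is what keeps the sort phase balanced across cores.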
OpenFabrics Alliance Workshop 2017
THE POTENTIAL IN ACCELERATING SHUFFLE
 Runtime samples (lower is better):
• Average: 777 seconds (TCP) vs. 356 seconds (network bypass) – 54% opportunity
• Max: 815 seconds (TCP) vs. 389 seconds (network bypass) – 52% opportunity
• Min: 755 seconds (TCP) vs. 318 seconds (network bypass) – 58% opportunity
 [Chart: TCP TeraSort vs. Network-bypass TeraSort runtime]
 Huge opportunity for RDMA!
15
Results
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
16
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
Challenges in adding RDMA to Spark Shuffle:
 Spark is written in Java & Scala
• What RDMA API to use?
• Java’s Garbage Collection is a pain for managing low-level buffers
 RDMA connection establishment – how can we make connections as long-lived as possible?
 Spark’s Shuffle Write data is currently saved on the local disk. How can we make the data available for RDMA?
 Must keep changes to Spark code to a minimum
• Spark is not very plug-in-able
• Spark keeps changing rapidly – API changes, implementation changes
• Maintain long-term functionality
17
Challenges
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
 RDMA API is available in Java through:
• JXIO-AccelIO – an abstraction of RDMA through RPC (https://github.com/accelio/JXIO)
• DiSNI – open-source release of IBM’s JVerbs – an ibverbs-like interface in Java (https://github.com/zrlio/disni)
• We have decided to go with DiSNI since it gives us more flexibility
 RDMA buffer management:
• Centralized pool of buffers per worker node
• Pool consists of multiple stacks of different sizes (in powers of 2)
• Buffers are allocated off-heap and registered on demand
• When they are no longer in use – they are recycled back to the pool
• Survive for the length of a job (minutes)
 RDMA connection establishment:
• For simplicity, couple each TCP connection with a corresponding RDMA connection
• RDMA is used for all block data transfers
• Survive for the length of a job (minutes)
RDMA Facilities – Design Notes
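The buffer-pool design above – one stack of recycled off-heap buffers per power-of-2 size class – can be sketched like this. This is a minimal illustration with invented names, not the actual plugin code; in particular, registering each buffer with the RDMA device is only noted in a comment.

```java
import java.nio.ByteBuffer;
import java.util.*;

// Sketch of a per-worker buffer pool: requests are rounded up to the next
// power of two, and freed buffers are recycled into a stack per size class.
public class RdmaBufferPoolSketch {
    private final Map<Integer, Deque<ByteBuffer>> stacks = new HashMap<>();

    // Round up to a power of two so buffers of one class are interchangeable.
    static int roundUpPow2(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    ByteBuffer get(int size) {
        int cap = roundUpPow2(size);
        Deque<ByteBuffer> stack = stacks.computeIfAbsent(cap, k -> new ArrayDeque<>());
        ByteBuffer buf = stack.poll();
        if (buf == null) {
            // Allocate off-heap on demand; the real plugin would also register
            // the buffer with the RDMA device here (Memory Region).
            buf = ByteBuffer.allocateDirect(cap);
        }
        buf.clear();
        return buf;
    }

    void recycle(ByteBuffer buf) { // no longer in use: back to its size class
        stacks.computeIfAbsent(buf.capacity(), k -> new ArrayDeque<>()).push(buf);
    }

    public static void main(String[] args) {
        RdmaBufferPoolSketch pool = new RdmaBufferPoolSketch();
        ByteBuffer a = pool.get(1000);    // rounded up to 1024
        System.out.println(a.capacity()); // 1024
        pool.recycle(a);
        ByteBuffer b = pool.get(900);     // same size class: the buffer is reused
        System.out.println(a == b);       // true
    }
}
```

Recycling matters here because RDMA memory registration is expensive and off-heap buffers are invisible to the garbage collector, so reusing a registered buffer amortizes both costs across the job.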
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
 The challenge – get the Shuffle data in memory, and register it as a Memory Region
 We provide two modes:
 In-memory: instead of saving Shuffle data to disk, save it directly into RDMA buffers
• Pros: faster and more consistent than file-mapping unless low on system memory; less overhead for RDMA; simpler indexing (buffer-per-block)
• Cons: high memory footprint; many changes in Spark’s ShuffleWriters and ShuffleSorters; limited functionality (spills, low memory)
 File-mapping: map Spark’s file on local disk to memory, and register it as a Memory Region
• Pros: minimal changes in Spark code; files are already in buffer-cache – mmap is not expensive; comparable memory footprint to TCP; no functional limitations
• Cons: mmap() occasionally demonstrates latency jitter; indexing – multiple blocks per file as in TCP
19
Shuffle Write – Design Notes
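The file-mapping mode can be sketched with Java's standard mmap facility, `FileChannel.map`. This demo (file names and helper invented) only shows the mapping step; in the real plugin the mapped region would additionally be registered with the RDMA device as a Memory Region, which is omitted here.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of file-mapping an on-disk shuffle file into memory via mmap.
public class ShuffleFileMapSketch {
    // Map the whole file read-only and copy its bytes out of the mapped region.
    static String readViaMap(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // If the file was just written, its pages are likely still in the
            // OS buffer-cache, so the mapping itself is cheap.
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] out = new byte[region.remaining()];
            region.get(out);
            return new String(out, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shuffle-block", ".data");
        Files.write(file, "shuffle payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(readViaMap(file)); // shuffle payload
        Files.delete(file);
    }
}
```

This is why the file-mapping mode needs so few changes in Spark: the ShuffleWriters keep writing ordinary files, and only the read path switches from file I/O to a mapped (and then RDMA-registered) region.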
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
Shuffle Read Protocol in Spark (TCP):
• Requestor → Responder – Fetch Request RPC: BlockID list (fetch a list of BlockIDs for a new stream)
• Responder: lookup blocks (from mem/disk) and set up a stream of blocks
• Responder → Requestor – Fetch Response RPC: StreamID
• Requestor → Responder – Fetch Request RPC: BlockID=y (a block fetch request is sent for each block in the StreamID)
• Responder: lookup block, and send data as RPC
• Responder → Requestor – Fetch Response RPC: data for BlockID=y
• Requestor: collect incoming blocks, call completion handler per block
20
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
Shuffle Read Protocol in Spark (RDMA):
• Requestor: allocate an RDMA buffer per block; retrieve the Lkey and vAddress of each block
• Requestor → Responder – Fetch Request RPC: BlockID + Rkey + vAddress list (for a new stream)
• Responder: lookup blocks (from mem/disk) and set up a stream of blocks
• Responder → Requestor: RDMAWrite each local block to the remote block (one RDMAWrite per block)
• Responder → Requestor: for the last block in the stream, RDMAWrite with Immediate to notify the requestor of stream fetch completion
• Requestor: call the completion handler on all of the blocks in the stream
21
OpenFabrics Alliance Workshop 2017
ACCELERATING SHUFFLE WITH RDMA
 N = number of blocks
 Number of messages per fetch: TCP – 2(N+1); RDMA – N+1
 Number of copy operations per block: TCP – responder: 1 to 2 times, requestor: 1 time, total: 2 to 3 times; RDMA – 0-copy
22
Shuffle Read: TCP vs. RDMA
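The message counts above follow directly from the two protocols: TCP needs a request/response pair for the stream setup plus a request/response pair per block, while RDMA needs one fetch request plus one RDMAWrite per block (the last one carrying the completion immediate). A small worked example:

```java
// Worked example of the per-fetch message counts from the table above.
public class MessageCount {
    // TCP: (stream request + response) + (request + response) per block.
    static int tcpMessages(int n)  { return 2 * (n + 1); }

    // RDMA: 1 fetch request RPC + 1 RDMAWrite per block.
    static int rdmaMessages(int n) { return n + 1; }

    public static void main(String[] args) {
        int n = 100; // blocks per fetch
        System.out.println(tcpMessages(n));  // 202
        System.out.println(rdmaMessages(n)); // 101
    }
}
```

So for any block count the RDMA protocol sends roughly half the messages, and because the writes land directly in the requestor's pre-registered buffers, it also removes the 2-3 copies per block that the TCP path pays.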
OpenFabrics Alliance Workshop 2017
RESULTS
23
OpenFabrics Alliance Workshop 2017
RESULTS
 Testbed:
• HiBench TeraSort
• Workload: 300GB
• HDFS on Hadoop 2.6.0
• No replication
• Spark 2.0.0
• 1 Master
• 14 Workers
• 28 active Spark cores on each node, 392 total
• Node info:
• Intel Xeon E5-2697 v3 @ 2.60GHz
• RoCE 100GbE
• 256GB RAM
• HDD is used for Spark local directories and HDFS
 Runtime samples (lower is better):
• Average: 301 seconds (TCP) vs. 247 seconds (RDMA) – 18% improvement
• Max: 319 seconds (TCP) vs. 264 seconds (RDMA) – 17% improvement
• Min: 284 seconds (TCP) vs. 237 seconds (RDMA) – 17% improvement
 [Chart: TCP vs. RDMA TeraSort runtime]
24
In-house TeraSort Results
OpenFabrics Alliance Workshop 2017
RESULTS
 Runtime samples (lower is better):
• Customer App #1: 138 seconds (TCP) vs. 114 seconds (RDMA) – 17% improvement; input size 5GB, 14 nodes, 24 cores per node, 85GB RAM per node
• Customer App #2: 60 seconds (TCP) vs. 51 seconds (RDMA) – 15% improvement; input size 270GB, 14 nodes, 24 cores per node, 85GB RAM per node
• HiBench TeraSort: 66 seconds (TCP) vs. 59 seconds (RDMA) – 11% improvement; input size 100GB, 16 nodes, 24 cores per node, 85GB RAM per node
 [Chart: TCP vs. RDMA runtimes]
25
Real-life Applications – Initial Results
OpenFabrics Alliance Workshop 2017
ROADMAP
 Spark RDMA GA expected in 2017
 Open-source Spark plugin
 Push to Spark upstream
26
What’s next?
13th ANNUAL WORKSHOP 2017
THANK YOU
Yuval Degani, Sr. Manager, Big Data and Machine Learning
Mellanox Technologies