Magnet Shuffle Service: Push-based Shuffle at LinkedIn

Min Shen
Senior Staff Software Engineer, LinkedIn
Chandni Singh
Senior Software Engineer, LinkedIn

Agenda
§ Spark @ LinkedIn
§ Introducing Magnet
§ Production Results
§ Future Work

Spark @ LinkedIn
A massive-scale infrastructure
11K nodes
40K+ daily Spark apps
~70% of cluster compute resources
~18 PB daily shuffle data
3X+ growth YoY

Shuffle Basics – Spark Shuffle
§ Transfer intermediate data across stages with a mesh connection between mappers and reducers.
§ Shuffle operation is a scaling and performance bottleneck.
§ Issues are especially visible at large scale.
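
A rough back-of-the-envelope sketch of why the mesh hurts at scale: every reducer fetches one block from every mapper, so the block count grows with the product of the two while the average block shrinks. The mapper count, reducer count, and shuffle size below are hypothetical illustration values, not LinkedIn measurements.

```python
def shuffle_mesh_stats(num_mappers: int, num_reducers: int, shuffle_bytes: int) -> dict:
    """Estimate block count and average block size for an all-to-all shuffle."""
    # Each (mapper, reducer) pair exchanges exactly one shuffle block.
    num_blocks = num_mappers * num_reducers
    avg_block_bytes = shuffle_bytes / num_blocks
    return {"blocks": num_blocks, "avg_block_bytes": avg_block_bytes}


# Hypothetical job: 10,000 mappers, 5,000 reducers, 1 TiB of shuffle data.
stats = shuffle_mesh_stats(10_000, 5_000, 1 << 40)
print(stats["blocks"])                  # 50000000
print(round(stats["avg_block_bytes"]))  # 21990
```

In this made-up case a single job issues 50 million block fetches of roughly 21 KiB each, which is the small random read pattern that the rest of the talk sets out to remove.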

Issues with Spark Shuffle at LinkedIn
• Reliability issue: Shuffle service unavailable under heavy load.
• Efficiency issue: Small shuffle blocks hurt disk throughput, which prolongs shuffle wait time.
• Scalability issue: Job runtime can be very inconsistent during peak cluster hours due to shuffle.

Existing Optimizations Not Sufficient
• Broadcast join can reduce shuffle during join.
  • Requires one of the tables to fit into memory.
• Caching RDD/DF after shuffling them can potentially reduce shuffle.
  • Has performance implications and limited applicability.
• Bucketing is like caching by materializing DFs while preserving partitioning.
  • Much more performant, but still subject to limited applicability and requires manual setup effort.
• Adaptive Query Execution in Spark 3.0 with its auto-coalescing feature can optimize shuffle.
  • Slightly increases shuffle block size, but is not sufficient to address the three issues above.

Magnet Shuffle Service
Magnet shuffle service adopts a push-based shuffle mechanism.
M. Shen, Y. Zhou, C. Singh. “Magnet: Push-based Shuffle Service for Large-scale Data Processing.” Proceedings of the VLDB Endowment, 13(12), 2020.
Convert small random reads in shuffle to large sequential reads.
Shuffle intermediate data becomes 2-replicated, improving reliability.
Locality-aware scheduling of reducers, further improving performance.

Push-based Shuffle
§ Best-effort push/merge operation.
§ Complements existing shuffle.
§ Shuffle blocks get pushed to remote shuffle services.
§ Shuffle services merge blocks into per-partition merged shuffle files.
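
The push/merge step above can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not Magnet's implementation: the `MergeService` class and its `push` and `finalize` methods are hypothetical names, and real shuffle services write to files on disk rather than to byte arrays.

```python
from collections import defaultdict


class MergeService:
    """Toy stand-in for a shuffle service that merges pushed blocks (hypothetical API)."""

    def __init__(self):
        self.merged = defaultdict(bytearray)  # partition id -> merged file contents
        self.merged_maps = defaultdict(set)   # partition id -> map ids already merged
        self.finalized = False

    def push(self, map_id: int, partition_id: int, block: bytes) -> bool:
        """Try to append one block; return True only if it was merged."""
        if self.finalized or map_id in self.merged_maps[partition_id]:
            return False  # best effort: late or duplicate pushes are simply dropped
        self.merged[partition_id] += block
        self.merged_maps[partition_id].add(map_id)
        return True

    def finalize(self) -> dict:
        """Stop accepting pushes and report which map outputs each merged file covers."""
        self.finalized = True
        return {p: sorted(m) for p, m in self.merged_maps.items()}


svc = MergeService()
svc.push(0, 7, b"aa")
svc.push(1, 7, b"bb")
svc.push(1, 7, b"bb")          # duplicate push, ignored
meta = svc.finalize()
late = svc.push(2, 7, b"cc")   # arrives after finalization, ignored
print(meta, bytes(svc.merged[7]), late)
```

Because a dropped push only means that block stays unmerged, the original shuffle output remains the source of truth, which is what lets the mechanism complement the existing shuffle rather than replace it.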

Push-based Shuffle
§ Reducers fetch a hybrid of blocks.
  ▪ Fetch merged blocks to improve I/O efficiency.
  ▪ Fetch original blocks if not merged.
§ Improved shuffle data locality.
§ Merged shuffle files create a second replica of shuffle data.
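
A minimal sketch of the hybrid fetch decision on the reducer side, assuming the reducer knows which map outputs were merged for its partition. The `plan_fetch` function and its request tuples are hypothetical names used only for illustration.

```python
def plan_fetch(partition_id: int, all_map_ids: list, merged_map_ids: set) -> list:
    """Return the fetch requests one reducer would issue for its partition."""
    requests = []
    if merged_map_ids:
        # One large sequential read covers every block that was merged.
        requests.append(("merged", partition_id))
    # Blocks that missed the merge are fetched from the mappers as before.
    for map_id in all_map_ids:
        if map_id not in merged_map_ids:
            requests.append(("original", map_id, partition_id))
    return requests


# Four mappers; the block from mapper 2 was not merged in time.
plan = plan_fetch(7, [0, 1, 2, 3], {0, 1, 3})
print(plan)  # [('merged', 7), ('original', 2, 7)]
```

In this example four small fetches collapse into two, and the fallback path keeps the job correct even when the merge is incomplete.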

Magnet Results – Scalability and Reliability
Currently rolled out to 100% of offline Spark workloads at LinkedIn.
That’s about 15-18 PB of shuffle data per day.
Leveraged a ramp-based rollout to reduce risks and did not encounter any major issues during the rollout of Magnet.

Magnet Results – Improve Cluster Shuffle efficiency
• 30X increase in shuffle data fetched locally compared with 6 months ago.
[Figure: Shuffle Locality Ratio]

Magnet Results – Improve Cluster Shuffle efficiency
• Overall shuffle fetch delay time reduced by 84.8%.
[Figure: Avg Shuffle Fetch Delay %]

Magnet Results – Improve Job Performance
Magnet brings performance improvements to Spark jobs at scale without any user intervention.
We analyzed 9628 Spark workflows that were onboarded to Magnet.
Overall compute resource consumption reduction is 16.3%.
Among flows previously heavily impacted by shuffle fetch delays (shuffle fetch delay time > 30% of total task time), overall compute resource consumption reduction is 44.8%. These flows are about 19.5% of the flows we analyzed.

Magnet Results – Improve Job Performance
• 50% of heavily impacted workflows have seen at least a 32% reduction in their job runtime.
• The percentile time-series graph represents thousands of Spark workflows.
[Figure: App Duration Reduction %, with 25th, 50th, and 75th percentile series]

Key Finding
• Benefits from Magnet increase as adoption of Magnet increases.
[Figure: App Duration Reduction %]

Magnet Results on NVMe
• Magnet still achieves significant benefits with NVMe disks for storing shuffle data.
• Results of a benchmark job with and without Magnet on a 10-node cluster with HDD and NVMe disks, respectively:

                  Runtime with HDD (min) /     Runtime with NVMe (min) /
                  comparison with baseline     comparison with baseline
Magnet disabled   16 (baseline)                10 (-37.5%)
Magnet enabled    7.3 (-54.4%)                 4.2 (-73.7%)
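
The percentages in the benchmark results can be recomputed from the runtimes. The short sketch below uses only the numbers reported on the slide; the helper name `change_vs_baseline` is an illustrative choice. The recomputed values agree with the slide to within rounding.

```python
def change_vs_baseline(runtime_min: float, baseline_min: float) -> float:
    """Relative runtime change in percent; negative means faster than the baseline."""
    return (runtime_min - baseline_min) / baseline_min * 100


BASELINE = 16.0  # HDD with Magnet disabled, in minutes
runtimes = {
    ("HDD", "enabled"): 7.3,
    ("NVMe", "disabled"): 10.0,
    ("NVMe", "enabled"): 4.2,
}
for (disk, magnet), minutes in runtimes.items():
    print(f"{disk:5} Magnet {magnet:8} {change_vs_baseline(minutes, BASELINE):6.1f}%")
```

The same helper can compare the two NVMe runs directly: `change_vs_baseline(4.2, 10.0)` is about -58%, so Magnet still more than halves the runtime when faster disks are already in place.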

Future Work
Contribute Magnet back to OSS Spark: SPARK-30602
Cloud-native architecture for Magnet
Bring superior shuffle architecture to a broader set of use cases
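
For readers who want to try the open-source version: the push-based shuffle work tracked in SPARK-30602 shipped in Apache Spark 3.2, where it is supported on YARN with the external shuffle service. The sketch below assumes the upstream property names; LinkedIn's internal deployment may be configured differently, and `to_submit_flags` is a hypothetical helper that only renders the settings as `spark-submit` arguments.

```python
# Upstream Apache Spark 3.2+ property names; the values are illustrative.
PUSH_SHUFFLE_CONF = {
    "spark.shuffle.service.enabled": "true",  # external shuffle service is required
    "spark.shuffle.push.enabled": "true",     # client-side switch for push-based shuffle
}


def to_submit_flags(conf: dict) -> list:
    """Render a config mapping as spark-submit --conf arguments (hypothetical helper)."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_submit_flags(PUSH_SHUFFLE_CONF)
print(" ".join(flags))
```

In upstream Spark the shuffle service side also needs a merged-shuffle file manager, set through `spark.shuffle.push.server.mergedShuffleFileManagerImpl`; the Spark configuration documentation is the authoritative source for the exact server-side setup.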

Optimize Shuffle on Disaggregated Storage
Existing shuffle data storage solutions and their drawbacks
§ Store shuffle data on limited local storage devices attached to VMs.
  ▪ Drawback: It hampers VM elasticity and runs the risk of exhausting local storage capacity.
§ Store shuffle data on remote disaggregated storage.
  ▪ Drawback: It has non-negligible performance overhead.

Optimize Shuffle on Disaggregated Storage
§ Leverage both local and remote storage, such that the local storage acts as a caching layer for shuffle storage.
§ The local elastic storage can tolerate running out of storage space or compute VMs getting decommissioned.
§ Preserve Magnet’s benefit of much improved shuffle data locality while decoupling shuffle storage from compute VMs.
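
The slides do not describe how this design would be built, so the following is only a toy model of the idea of local storage as a cache in front of remote storage. It assumes a write-through policy and least-recently-used eviction, both of which are illustrative choices; `TieredShuffleStore` and its methods are hypothetical names.

```python
from collections import OrderedDict


class TieredShuffleStore:
    """Toy model: bounded local cache in front of a durable remote store."""

    def __init__(self, local_capacity_bytes: int):
        self.local = OrderedDict()  # block id -> bytes, kept in LRU order
        self.local_used = 0
        self.capacity = local_capacity_bytes
        self.remote = {}            # stands in for disaggregated storage

    def write(self, block_id: str, data: bytes) -> None:
        self.remote[block_id] = data  # write-through: remote copy survives VM loss
        self._cache(block_id, data)

    def read(self, block_id: str) -> bytes:
        if block_id in self.local:
            self.local.move_to_end(block_id)  # local hit: no remote round trip
            return self.local[block_id]
        data = self.remote[block_id]          # miss: fall back to remote storage
        self._cache(block_id, data)
        return data

    def _cache(self, block_id: str, data: bytes) -> None:
        if len(data) > self.capacity:
            return  # too large to cache; the remote copy suffices
        if block_id in self.local:
            self.local_used -= len(self.local.pop(block_id))
        while self.local_used + len(data) > self.capacity:
            _, evicted = self.local.popitem(last=False)  # evict least recently used
            self.local_used -= len(evicted)
        self.local[block_id] = data
        self.local_used += len(data)

    def decommission_vm(self) -> None:
        """Simulate losing the compute VM: local cache is gone, remote data remains."""
        self.local.clear()
        self.local_used = 0


store = TieredShuffleStore(local_capacity_bytes=4)
for block_id in ("a", "b", "c"):
    store.write(block_id, b"xx")  # the third write evicts "a" from the local cache
store.decommission_vm()
print(store.read("c"))            # still readable from the remote copy
```

In this model, running out of local space only triggers eviction and losing a VM only empties the cache, so neither event loses shuffle data.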

Optimize Python-centric Data Processing
The surge in AI usage in recent years has driven the adoption of Python for building AI data pipelines.
Magnet’s optimization of shuffle is generic enough to benefit both SQL-centric analytics and Python-centric AI use cases.
Support Magnet in Kubernetes.

Resources
SPARK-30602 (Spark SPIP ticket)
Magnet: A scalable and performant shuffle architecture for Apache Spark (blog post)
Magnet: Push-based Shuffle Service for Large-scale Data Processing (VLDB 2020)
Bringing Next-Gen Shuffle Architecture To Data Infrastructure at LinkedIn Scale (blog post)
