SOS: Optimizing Shuffle I/O
Brian Cho and Ergin Seyfe, Facebook
Haoyu Zhang, Princeton University
1. Shuffle I/O at large scale
Large-scale shuffle
[Figure: shuffle between Stage 1 and Stage 2]
• Shuffle: all-to-all communication between stages
• >10x larger than available memory, strong fault tolerance requirements → on-disk shuffle files
Shuffle I/O grows quadratically with data
[Figure: three plots against number of tasks (0 to 10,000): shuffle time (sec), I/O request count (x10^6), and shuffle fetch size (KB)]
• M * R (number of mappers * number of reducers) shuffle fetches
• Large amount of fragmented I/O requests
• Adversarial workload for hard drives!
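To make the M * R growth concrete, here is a minimal sketch. It assumes every task handles a fixed amount of data and that M equals R, so the number of tasks grows linearly with the data volume. The 256 MiB per task and the 1 to 4 TiB volumes are illustrative values, not figures from the talk.

```python
def shuffle_fetch_stats(data_bytes: int, bytes_per_task: int):
    """Return (tasks per stage, shuffle fetches, average bytes per fetch).

    Assumes M = R = data_bytes // bytes_per_task, so the fetch count is M * R.
    """
    tasks = max(1, data_bytes // bytes_per_task)
    fetches = tasks * tasks            # M * R with M == R
    avg_fetch_bytes = data_bytes / fetches
    return tasks, fetches, avg_fetch_bytes


BYTES_PER_TASK = 256 * 1024**2         # hypothetical 256 MiB per task

for tib in (1, 2, 4):
    tasks, fetches, avg = shuffle_fetch_stats(tib * 1024**4, BYTES_PER_TASK)
    print(f"{tib} TiB: {tasks} tasks, {fetches} fetches, {avg / 1024:.0f} KiB per fetch")
```

Under these assumptions, doubling the data quadruples the number of fetches and halves the size of each fetch, which is the fragmented pattern the bullets describe.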
Strawman: tune number of tasks in a job
• Tasks spill intermediate data to disk if data splits exceed memory capacity
• Larger task execution reduces shuffle I/O, but increases spill I/O
• Need to retune when input data volume changes for each individual job
• Small tasks run into the quadratic I/O problem
• Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
• straggler problems, imbalanced workload, garbage collection overhead
[Figure: shuffle time and spill time (sec) versus number of map tasks (300 to 10,000)]
Small Tasks
Bulky Tasks
Large Amount of Fragmented Shuffle I/O
Fewer, Sequential Shuffle I/O
2. SOS: optimizing shuffle I/O
a.k.a. Riffle, presented at EuroSys 2018
Deployed at Facebook scale
SOS: optimizing shuffle I/O
• Merge map task outputs into larger shuffle files
  1. Combines small shuffle files into larger ones
  2. Keeps partitioned file layout
• Reducers fetch fewer, large blocks instead of many, small blocks
• Number of requests: (M * R) / (merge factor)
[Figure: Optimized Shuffle Service. Map tasks, reduce tasks, merge requests, and the components Application Driver, Merge Scheduler, and Worker-Side Merger]
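The merge idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the SOS implementation: a map output is modeled as an in-memory dictionary from reducer id to bytes, and the names `merge_map_outputs`, `merge_all`, and `fetch_requests` are hypothetical.

```python
from typing import Dict, List

# A map output is modeled as {reducer_id: partition_bytes}. A real shuffle
# file stores the partitions contiguously, with an index of offsets.
MapOutput = Dict[int, bytes]


def merge_map_outputs(outputs: List[MapOutput], num_reducers: int) -> MapOutput:
    """Concatenate partition r of every input into partition r of one merged file."""
    return {
        r: b"".join(out.get(r, b"") for out in outputs)
        for r in range(num_reducers)
    }


def merge_all(outputs: List[MapOutput], num_reducers: int, merge_factor: int) -> List[MapOutput]:
    """Merge consecutive groups of `merge_factor` map outputs."""
    return [
        merge_map_outputs(outputs[i:i + merge_factor], num_reducers)
        for i in range(0, len(outputs), merge_factor)
    ]


def fetch_requests(files: List[MapOutput], num_reducers: int) -> int:
    """Each reducer fetches one block from each file."""
    return len(files) * num_reducers


M, R, MERGE_FACTOR = 8, 4, 4
outputs = [{r: f"m{m}r{r};".encode() for r in range(R)} for m in range(M)]
merged = merge_all(outputs, R, MERGE_FACTOR)

print(fetch_requests(outputs, R))  # M * R
print(fetch_requests(merged, R))   # (M * R) / merge factor
```

Because the merged file keeps one partition per reducer, each reducer reads the same bytes as before, only in fewer and larger blocks.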
SOS: optimizing shuffle I/O
• SOS shuffle service: a long-running instance on each physical node
• SOS scheduler: keeps track of shuffle files and issues merge requests
[Figure: SOS architecture. Each worker machine runs executors with tasks, a file system, and the SOS Shuffle Service. The driver hosts the Job / Task Scheduler (assign, report task statuses) and the SOS Merge Scheduler (send merge requests, report merge statuses)]
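A driver-side scheduler along these lines could be sketched as follows. The policy shown, issuing a merge request as soon as a node has accumulated `merge_factor` unmerged files, is an assumption for illustration. The class and method names are hypothetical and do not come from SOS or Spark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MergeRequest:
    node: str                  # worker whose shuffle service should merge
    map_ids: Tuple[int, ...]   # map outputs to combine into one file


class MergeScheduler:
    """Tracks unmerged shuffle files per node and decides when to merge."""

    def __init__(self, merge_factor: int):
        self.merge_factor = merge_factor
        self._pending: Dict[str, List[int]] = defaultdict(list)

    def on_map_finished(self, node: str, map_id: int) -> Optional[MergeRequest]:
        """Record a finished map task; return a request once enough files exist."""
        pending = self._pending[node]
        pending.append(map_id)
        if len(pending) < self.merge_factor:
            return None
        self._pending[node] = []          # start a new group for this node
        return MergeRequest(node, tuple(pending))


scheduler = MergeScheduler(merge_factor=2)
first = scheduler.on_map_finished("node-a", 0)
second = scheduler.on_map_finished("node-b", 1)
third = scheduler.on_map_finished("node-a", 2)
print(first, second, third)
```

Only files on the same node are grouped, so a merge in this sketch never moves data across the network.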
Results on synthetic workload (unoptimized)
[Figure: map stage and reduce stage time (sec) for no merge and N-way merge (N = 5, 10, 20, 40)]
• SOS reduces number of fetch requests by 10x
• Reduce stage -393s, map stage +169s → job completes 35% faster
[Figure: read block size (KB) and number of reads (request count) for no merge and N-way merge (N = 5, 10, 20, 40)]
Best-effort merge: mixing merged and unmerged files
[Figure: read block size (KB), number of reads (request count), and map/reduce stage time (sec) for no merge and N-way merge (N = 5, 10, 20, 40)]
• Reduce stage -393s, map stage +52s → job completes 53% faster
• SOS finishes job with only ~50% of cluster resources!
[Figure: map stage and reduce stage time (sec) for no merge and N-way merge (N = 5, 10, 20, 40), with best-effort merge (95%)]
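One way to picture best-effort merge is a reducer-side fetch plan that uses a merged file when its merge has finished and the original unmerged files otherwise. The sketch below reads the 95% on the slide as the share of merges that must finish before reducers start, which is an assumption. The function names and file names are hypothetical.

```python
from typing import Dict, List


def ready_to_reduce(merge_done: Dict[str, bool], threshold_percent: int = 95) -> bool:
    """True once at least threshold_percent of the merges have finished."""
    finished = sum(1 for ok in merge_done.values() if ok)
    return finished * 100 >= threshold_percent * len(merge_done)


def fetch_plan(merge_groups: Dict[str, List[str]], merge_done: Dict[str, bool]) -> List[str]:
    """Files one reducer fetches from: merged where available, unmerged otherwise."""
    plan: List[str] = []
    for merged_file, inputs in merge_groups.items():
        if merge_done.get(merged_file, False):
            plan.append(merged_file)
        else:
            plan.extend(inputs)   # pending or failed merge: use the originals
    return plan


# 20 merge groups of 2 map outputs each; one merge has not finished.
groups = {f"merged-{g}": [f"map-{2 * g}", f"map-{2 * g + 1}"] for g in range(20)}
done = {name: True for name in groups}
done["merged-7"] = False

plan = fetch_plan(groups, done)
print(ready_to_reduce(done), len(plan))
```

The same fallback path covers a merge that failed outright, since the unmerged files remain readable.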
Additional details
• Merge operation fault-tolerance
• Handled by falling back to the unmerged files
• Efficient memory management
• Merger reads and writes with large buffers for performance and I/O efficiency
[Figure: Merge. Blocks 65, 66, 67, … of each input file are read with BufferedRead and written with BufferedWrite as Block 65-1 … Block 65-m, Block 66-1 … Block 66-m, … in the merged file]
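A file-based sketch of this buffered merge follows. It assumes each map output file stores its partitions back to back and comes with an index of (offset, length) pairs. The 4 MiB buffer size and the helper names `write_map_output` and `merge_files` are hypothetical choices for illustration.

```python
import os
import tempfile
from typing import List, Tuple

BUFFER_SIZE = 4 * 1024 * 1024          # hypothetical 4 MiB read/write buffers
Index = List[Tuple[int, int]]          # (offset, length) of each partition


def write_map_output(path: str, partitions: List[bytes]) -> Index:
    """Write partitions back to back and return their index."""
    index, offset = [], 0
    with open(path, "wb") as f:
        for part in partitions:
            f.write(part)
            index.append((offset, len(part)))
            offset += len(part)
    return index


def merge_files(inputs: List[Tuple[str, Index]], out_path: str) -> Index:
    """Merge files partition by partition, keeping the partitioned layout."""
    num_partitions = len(inputs[0][1])
    readers = [open(path, "rb", buffering=BUFFER_SIZE) for path, _ in inputs]
    merged_index, out_offset = [], 0
    try:
        with open(out_path, "wb", buffering=BUFFER_SIZE) as out:
            for r in range(num_partitions):
                start = out_offset
                for reader, (_, index) in zip(readers, inputs):
                    offset, length = index[r]
                    reader.seek(offset)
                    out.write(reader.read(length))
                    out_offset += length
                merged_index.append((start, out_offset - start))
    finally:
        for reader in readers:
            reader.close()
    return merged_index


with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for m in range(3):
        path = os.path.join(tmp, f"map-{m}.data")
        parts = [f"m{m}p{r}|".encode() for r in range(2)]
        inputs.append((path, write_map_output(path, parts)))
    merged_path = os.path.join(tmp, "merged.data")
    merged_index = merge_files(inputs, merged_path)
    with open(merged_path, "rb") as f:
        merged_bytes = f.read()

print(merged_index)
print(merged_bytes)
```

Each input file is read front to back because partitions are visited in increasing order, so the large buffers turn many small block copies into a few large sequential reads and writes.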
3. Deployment and observed gains
Deployment
• Started staged rollout late last year
• Completed in April, running stably for over a month
SOS + zstd
• Rollout includes zstd compression with SOS
• Combined, they produce a net gain in I/O and compute efficiency

Metric             SOS         zstd              Net
Spill I/O          Regression  Gain              Small Gain
Shuffle I/O        Gain        Small Gain        Gain
CPU time           No change   Small Regression  Small Regression
Reserved CPU time  Gain        No change         Gain
IO Gains: Request-level
Spark-level I/O requests: number of application-level R/W requests made
• 7.5x fewer
IO Gains: Disk-level
Disk service time: time spent on disks in the storage system
• 2x more efficient
Average IO size: average size of IO request at the disks
• 2.5x increase
Compute Gains
Reserved CPU time: resources allocated for Spark executors
Total 10% Gain
• CPU time: time spent using CPU → 5% Regression
• I/O time: time spent waiting (not using CPU) → 75% Gain
Currently working on increasing these gains
4. Summary
1) Shuffle at large scale induces large fragmented shuffle I/Os
2) SOS provides a solution to optimize these I/Os
3) SOS deployed and running stably at Facebook scale
4) Observed gains of 2x more efficient I/O, which translates to 10% more efficient compute
5) Plan to contribute back to Apache Spark
Questions?
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe

  • 1.
  • 2. SOS: Optimizing Shuffle I/O Brian Cho and Ergin Seyfe, Facebook Haoyu Zhang, Princeton University
  • 3. 1. Shuffle I/O at large scale
  • 4. Large-scale shuffle Stage 1 Stage 2 • Shuffle: all-to-all communication between stages • >10x larger than available memory, strong fault tolerance requirements → on-disk shuffle files
  • 5. Shuffle I/O grows quadratically with data 0 5000 10000 1umber RI TasNs 0 1000 2000 3000 4000 ShuIIOeTime(sec) ShuIIOe Time 0 40 80 120 5eTuestCRunt/106 I/2 5eTuest 0 5000 10000 1umber of TasNs 0 500 1000 1500 Size(KB) SKuffle )etFK Size • M * R (number of mappers * number of reducers) shuffle fetches • Large amount of fragmented I/O requests • Adversarial workload for hard drives!
  • 6. Strawman: tune number of tasks in a job • Tasks spill intermediate data to disk if data splits exceed memory capacity • Larger task execution reduces shuffle I/O, but increases spill I/O
  • 7. Strawman: tune number of tasks in a job • Need to retune when input data volume changes for each individual job • Small tasks run into the quadratic I/O problem • Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17] • straggler problems, imbalanced workload, garbage collection overhead 300 400 500 600 700 800 900 1000 2000 4000 8000 10000 1umber of 0aS 7asNs 0 1000 2000 3000 7ime(sec) 6huffle 6Sill 300 400 500 600 700 800 900 1000 2000 4000 8000 10000 1umber of 0aS 7asNs 0 1000 2000 3000 7ime(sec) 6huffle 6Sill 300 400 500 600 700 800 900 1000 2000 4000 8000 10000 1umber of 0aS 7asNs 0 1000 2000 3000 7ime(sec) 6huffle 6Sill
  • 8. Small Tasks Bulky Tasks Large Amount of Fragmented Shuffle I/O Fewer, Sequential Shuffle I/O
  • 9. 2. SOS: optimizing shuffle I/O
  • 10. a.k.a. Riffle, presented at Eurosys 2018 Deployed at Facebook scale SOS: optimizing shuffle I/O
  • 11. SOS: optimizing shuffle I/O • Merge map task outputs into larger shuffle files 1. Combines small shuffle files into larger ones 2. Keeps partitioned file layout • Reducers fetch fewer, large blocks instead of many, small blocks • Number of requests: (M * R) / (merge factor) Optimized Shuffle Service merge request map map map reduce reduce reduce reduce reduce reduce reduce map map map merge request Application Driver Merge Scheduler Worker-Side Merger
  • 12. SOS: optimizing shuffle I/O • SOS shuffle service: a long running instance on each physical node • SOS scheduler: keeps track of shuffle files and issues merge requests Worker NodeWorker NodeTaskTaskTasks Worker Machine Task Task Task Task File System ExecutorExecutor SOS Shuffle Service Driver Job / Task Scheduler SOS Merge Scheduler assign report task statuses report merge statuses send merge requests
  • 13. Results on synthetic workload (unoptimized) [Figure: map stage and reduce stage time (sec) for no merge vs. 5-, 10-, 20-, and 40-way merge] • SOS reduces number of fetch requests by 10x • Reduce stage -393s, map stage +169s → job completes 35% faster [Figure: read block size (KB) and number of reads (request count) for no merge vs. 5-, 10-, 20-, and 40-way merge]
  • 14. Best-effort merge: mixing merged and unmerged files [Figure: read block size (KB) and number of reads (request count) for no merge vs. 5-, 10-, 20-, and 40-way merge] [Figure: map stage and reduce stage time (sec) for no merge vs. 5-, 10-, 20-, and 40-way merge, with best-effort merge (95%)] • Reduce stage -393s, map stage +52s → job completes 53% faster • SOS finishes job with only ~50% of cluster resources!
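To see why mixing merged and unmerged files still removes most of the reads, the sketch below counts reads per reducer for a given merged fraction. It assumes that the "95%" on the slide is the fraction of merge operations already finished when the reduce stage starts (the slide does not spell this out), and that M is a multiple of the merge factor. The function name is hypothetical.

```python
def reads_per_reducer(num_mappers: int, merge_factor: int, merged_fraction: float) -> int:
    """Reads one reducer issues when only some merge groups have been merged.

    A merged group costs one read; an unmerged group costs one read per
    original map output, i.e. `merge_factor` reads.
    """
    groups = num_mappers // merge_factor          # assumes M is a multiple of the factor
    merged_groups = round(groups * merged_fraction)
    unmerged_groups = groups - merged_groups
    return merged_groups + unmerged_groups * merge_factor


for fraction in (0.0, 0.95, 1.0):
    print(fraction, reads_per_reducer(1000, 10, fraction))
```

Under these assumptions, a 95% complete 10-way merge over 1000 map outputs leaves 145 reads per reducer, compared with 1000 without merging and 100 with a complete merge.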
  • 15. Additional details • Merge operation fault-tolerance • Handled by falling back to the unmerged files • Efficient memory management • Merger reads and writes large buffers for performance and I/O efficiency [Diagram: BufferedRead of Block 65, Block 66, Block 67, … from each input file; Merge; BufferedWrite of Block 65-1 … Block 65-m, Block 66-1 … Block 66-m, …]
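Both points on this slide, large buffered reads and writes plus falling back to the unmerged files, can be sketched together. This is a simplified illustration, not the SOS merger: it concatenates whole files and ignores the per-partition layout, and `BUFFER_SIZE` and `merge_files` are hypothetical names and values.

```python
import os
import tempfile
from typing import List

BUFFER_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB buffer, not a value from the talk


def merge_files(inputs: List[str], merged_path: str) -> List[str]:
    """Concatenate `inputs` into `merged_path` with large buffered reads and writes.

    Returns the paths a reader should fetch: the merged file on success,
    or the original unmerged files if the merge fails.
    """
    try:
        with open(merged_path, "wb", buffering=BUFFER_SIZE) as out:
            for path in inputs:
                with open(path, "rb", buffering=BUFFER_SIZE) as src:
                    while chunk := src.read(BUFFER_SIZE):
                        out.write(chunk)
        return [merged_path]
    except OSError:
        # Fall back to the unmerged files and drop any partial merged output.
        if os.path.exists(merged_path):
            os.remove(merged_path)
        return inputs


with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, data in enumerate([b"aa", b"bb"]):
        part = os.path.join(tmp, f"map_{i}.data")
        with open(part, "wb") as f:
            f.write(data)
        parts.append(part)
    # Successful merge: readers are pointed at the single merged file.
    print(merge_files(parts, os.path.join(tmp, "merged.data")))
    # One input is missing: readers are pointed back at the unmerged files.
    print(merge_files(parts + [os.path.join(tmp, "missing.data")],
                      os.path.join(tmp, "merged_2.data")))
```

The fallback is cheap because the unmerged files are still on disk, so a failed merge costs performance but not correctness.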
  • 17. Deployment • Started staged rollout late last year • Completed in April, running stably for over a month
  • 24. SOS + zstd • Rollout includes zstd compression with SOS • Combined they produce a net gain in IO and Compute efficiency • Effect by component (SOS / zstd / Net) • Spill I/O: Regression / Gain / Small Gain • Shuffle I/O: Gain / Small Gain / Gain • CPU time: No change / Small Regression / Small Regression • Reserved CPU time: Gain / No change / Gain
  • 25. IO Gains: Request-level • Spark-level I/O requests: number of application-level R/W requests made • 7.5x fewer
  • 27. IO Gains: Disk-level • Disk service time: time spent on disks in the storage system • 2x more efficient • Average IO size: average size of IO request at the disks • 2.5x increase
  • 28. Compute Gains • Reserved CPU time: resources allocated for Spark executors • Total 10% Gain • CPU time: time spent using CPU → 5% Regression • I/O time: time spent waiting (not using CPU) → 75% Gain • Currently working on increasing these gains
  • 30. Summary 1) Shuffle at large scale induces a large amount of fragmented shuffle I/O 2) SOS provides a solution to optimize this I/O 3) SOS is deployed and running stably at Facebook scale 4) Observed gains of 2x more efficient I/O, which translates to 10% more efficient compute 5) Plan to contribute back to Apache Spark