SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
SPARK SHUFFLE
INTRODUCTION
天火@蘑菇街
About Me
Spark / Hadoop / Hbase / Phoenix
contributor
For spark mainly contributes in:
•  Yarn
•  Shuffle
•  BlockManager
•  Scala2.10 update
•  Standalone HA
•  Various other fixes.
tianhuo@mogujie.com
Weibo @冷冻蚂蚁
blog.csdn.net/colorant
Why Spark is fast(er)
•  Whom do we compare to?
•  What do we mean by fast?
•  fast to write
•  fast to run
Why Spark is fast(er) cont.
•  But the figure in previous page is some how misleading.
•  The key is the flexible programming mode.
•  Which lead to more reasonable data flow.
•  Which lead to less IO operation.
•  Especially for iterative heavy workloads like ML.
•  Which potentially cut off a lot of shuffle operations needed.
•  But, you won’t always be lucky.
•  Many app logic did need to exchange a lot of data.
•  In the end, you will still need to deal with shuffle
•  And which usually impact performance a lot.
What is shuffle
7
Shuffle overview
Aggregator
Aggregator
AggregatorAggregator
AggregatorAggregator
How does shuffle come into the picture
•  Spark run job stage by stage.
•  Stages are build up by DAGScheduler according to RDD’s
ShuffleDependency
•  e.g. ShuffleRDD / CoGroupedRDD will have a ShuffleDependency
•  Many operator will create ShuffleRDD / CoGroupedRDD under
the hook.
•  Repartition/CombineByKey/GroupBy/ReduceByKey/cogroup
•  many other operator will further call into the above operators
•  e.g. various join operator will call cogroup.
•  Each ShuffleDependency maps to one stage in Spark Job
and then will lead to a shuffle.
So everyone should have seen this before
join	
  
union	
  
groupBy	
  
map	
  
Stage	
  3	
  
Stage	
  1	
  
Stage	
  2	
  
A:	
   B:	
  
C:	
   D:	
  
E:	
  
F:	
  
G:	
  
why shuffle is expensive
•  When doing shuffle, data no longer stay in memory only
•  For spark, shuffle process might involve
•  data partition: which might involve very expensive data sorting
works etc.
•  data ser/deser: to enable data been transfer through network or
across processes.
•  data compression: to reduce IO bandwidth etc.
•  DISK IO: probably multiple times on one single data block
•  E.g. Shuffle Spill, Merge combine
History
•  Spark 0.6-0.7, same code path with RDD’s persistent
method, can choose MEMORY_ONLY and DISK_ONLY
(default).
•  Spark 0.8-0.9:
•  separate shuffle code path from BM and create
ShuffleBlockManager and BlockObjectWriter only for shuffle, now
shuffle data can only be written to disk.
•  Shuffle optimization: Consolidate shuffle write.
•  Spark 1.0, pluggable shuffle framework.
•  Spark 1.1, sort-based shuffle implementation.
•  Spark 1.2 netty transfer service reimplementation. sort-
based shuffle by default
•  Spark 1.2+ on the go: external shuffle service etc.
LOOK
INSIDE
13
Pluggable Shuffle Framework
•  ShuffleManager
•  Manage shuffle related components, registered in SparkEnv,
configured through SparkConf, default is sort (pre 1.2 is hash),
•  ShuffleWriter
•  Handle shuffle data output logics. Will return MapStatus to be
tracked by MapOutputTracker.
•  ShuffleReader
•  Fetch shuffle data to be used by e.g. ShuffleRDD
•  ShuffleBlockManager
•  Manage the mapping relation between abstract bucket and
materialized data block.
High level data flow
BlockManager
HashShuffleManager
DiskBlockManager
FileShuffleBlockManager
Local File System
SortShuffleManager
IndexShuffleBlockManager
GetBlockData
BlockTransferService
GetBlockData
Direct mapping or
mapping by File Groups
Map to One Data File and
One Index File per mapId
Just do one-one File mapping
15
Hash Based Shuffle - Shuffle Writer
•  Basic shuffle writer
Map	
  Task Map	
  Task Map	
  Task Map	
  Task
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
Aggregator AggregatorAggregatorAggregator
Each bucket is mapping to a single file
16
Hash Based Shuffle - Shuffle Writer
•  Consolidate Shuffle Writer
Each bucket is mapping to a segment of file
Aggregator
Aggregator
Aggregator
Aggregator
17
Hash Based Shuffle - Shuffle Writer
•  Basic Shuffle Writer
•  M * R shuffle spill files
•  Concurrent C * R opened shuffle files.
•  If shuffle spill enabled, could generate more tmp spill files say N.
•  Consolidate Shuffle Writer
•  Reduce the total spilled files into C * R if (M >> C)
•  Concurrent opened is the same as the basic shuffle writer.
•  Memory consumption
•  Thus Concurrent C * R + N file handlers.
•  Each file handler could take up to 32~100KB+ Memory for various
buffers across the writer stream chain.
18
Sort Based Shuffle - Shuffle Writer
•  Sort Shuffle Writer
Map	
  Task Map	
  Task Map	
  Task Map	
  Task
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileSegment
FileFile
FileFile FileFile
FileFile
ExternalSorter ExternalSorterExternalSorterExternalSorter
19
Sort Based Shuffle - Shuffle Writer
•  Each map task generates 1 shuffle data file + 1 index file
•  Utilize ExternalSorter to do the sort works.
•  If map-side combine is required, data will be sorted by
key and partition for aggregation. Otherwise data will
only be sorted by partition.
•  If reducer number <= 200 and no need to do aggregation
or ordering, data will not be sorted at all.
•  Will go with hash way and spill to separate files for each reduce
partition, then merge them into one per map for final output.
20
Hash Based Shuffle - Shuffle Reader
•  Actually, at present, Sort Based Shuffle also go with
HashShuffleReader
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Bucket
Reduce	
  TaskReduce	
  Task Reduce	
  Task Reduce	
  Task Reduce	
  Task
Aggregator AggregatorAggregatorAggregator
BLOCK
TRANSFER
SERVICE
Related conceptions
•  BlockTransferService
•  Provide a general interface for ShuffleFetcher and working with
BlockDataManager to get local data.
•  ShuffleClient
•  Wrap up the fetching data process for the client side, say setup
TransportContext, new TransportClient etc.
•  TransportContext
•  Context to setup the transport layer
•  TransportServer
•  low-level streaming service server
•  TransportClient
•  Client for fetching consecutive chunks TransportServer
ShuffleManager
Data Flow
23
BlockManager
NioBlockTransferService
GetBlockData
BlockDataManager
ConnectionManager
NioBlockTransferService
ConnectionManager
GetBlock
GotBlock
BlockStoreShuffleFetcher
ShuffleBlockFetcherIterator
Block
Manager
Local Blocks
Remote Blocks
Local Remote
HashShuffleReader
fetch
ShuffleManager
ShuffleBlockManager
GetBlockData
Can Switch to different BlockTransferService
ShuffleManager
Data Flow
24
BlockManager
NettyBlockTransferService
GetBlockData
BlockDataManager
TransportClient
NettyBlockTransferService
TransportServer
BlockStoreShuffleFetcher
ShuffleBlockFetcherIterator
Block
Manager
Local Blocks
Remote Blocks
Local Remote
HashShuffleReader
fetch
ShuffleManager
ShuffleBlockManager
GetBlockData
clientHandler
TransportChannel
HandlerclientHandler
TransportChannel
Handler
Fetch
Request
Fetch
Results
External Shuffle Service
•  Design goal
•  allow for the service to be long-running
•  possibly much longer-running than Spark
•  support multiple version of Spark simultaneously etc.
•  can be integrated into YARN NodeManager, Standalone Worker, or
on its own
•  The entire service been ported to Java
•  do not include Spark's dependencies
•  full control over the binary compatibility of the components
•  not depend on the Scala runtime or version.
External Shuffle Service
•  Current Status
•  Basic framework seems ready.
•  A Network module extracted from the core module
•  BlockManager could be configured with executor built-in shuffle
service or external standalone shuffle service
•  A standaloneWorkerShuffleService could be launched by worker
•  Disabled by default.
•  How it works
•  Shuffle data is still written by the shuffleWriter to local disks.
•  The external shuffle service knows how to read these files on disks
(executor will registered related info to it, e.g. shuffle manager type,
file dir layout etc.), it follow the same rules applied for written these
file, so it could serve the data correctly.
28
Sort Merge Shuffle Reader
•  Background:
•  Current HashShuffleReader does not utilize the sort result within
partition in map-side.
•  The actual by key sort work is always done at reduce side.
•  While the map side will do by-partition sort anyway ( sort shuffle )
•  Change it to a by-key-and-partition sort does not bring many extra
overhead.
•  Current Status
•  [WIP] https://github.com/apache/spark/pull/3438
Some shuffle related configs
•  spark.shuffle.spill (true)
•  spark.shuffle.memoryFraction (0.2)
•  spark.shuffle.manager [sort]/hash
•  spark.shuffle.sort.bypassMergeThreshold (200)
•  spark.shuffle.blockTransferService [netty]/nio
•  spark.shuffle.consolidateFiles (false)
•  spark.shuffle.service.enabled (false)
What’s next?
•  Other custom shuffle logic?
•  Alternative way to save shuffle data blocks
•  E.g. in memory (again)
•  Other transport mechanism?
•  Break stage barrier?
•  To fetch shuffle data when part of the map tasks are done.
•  Push mode instead of pull mode?
Thanks to Jerry Shao
•  Some of this ppt’s material came from Jerry Shao@Intel
weibo: @saisai_shao
•  Jerry also contributes a lot of essential patches for spark
core / spark streaming etc.
Join Us !
• 加盟 / 合作 / 讨论 统统欢迎
• 数据平台开发,大数据相关技术,只要够
Cool,我们都玩
• tianhuo@mogujie.com
Spark shuffle introduction

Contenu connexe

Tendances

A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoDatabricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 

Tendances (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with CoscoFlash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 

Similaire à Spark shuffle introduction

Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureDatabricks
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Optimization of modern web applications
Optimization of modern web applicationsOptimization of modern web applications
Optimization of modern web applicationsEugene Lazutkin
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Spark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySpark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySyed Hadoop
 

Similaire à Spark shuffle introduction (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Optimization of modern web applications
Optimization of modern web applicationsOptimization of modern web applications
Optimization of modern web applications
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Spark_Intro_Syed_Academy
Spark_Intro_Syed_AcademySpark_Intro_Syed_Academy
Spark_Intro_Syed_Academy
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 

Dernier

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 

Dernier (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 

Spark shuffle introduction

  • 2. About Me Spark / Hadoop / Hbase / Phoenix contributor For spark mainly contributes in: •  Yarn •  Shuffle •  BlockManager •  Scala2.10 update •  Standalone HA •  Various other fixes. tianhuo@mogujie.com Weibo @冷冻蚂蚁 blog.csdn.net/colorant
  • 3. Why Spark is fast(er) •  Whom do we compare to? •  What do we mean by fast? •  fast to write •  fast to run
  • 4. Why Spark is fast(er) cont. •  But the figure in previous page is some how misleading. •  The key is the flexible programming mode. •  Which lead to more reasonable data flow. •  Which lead to less IO operation. •  Especially for iterative heavy workloads like ML. •  Which potentially cut off a lot of shuffle operations needed. •  But, you won’t always be lucky. •  Many app logic did need to exchange a lot of data. •  In the end, you will still need to deal with shuffle •  And which usually impact performance a lot.
  • 5.
  • 8. How does shuffle come into the picture •  Spark run job stage by stage. •  Stages are build up by DAGScheduler according to RDD’s ShuffleDependency •  e.g. ShuffleRDD / CoGroupedRDD will have a ShuffleDependency •  Many operator will create ShuffleRDD / CoGroupedRDD under the hook. •  Repartition/CombineByKey/GroupBy/ReduceByKey/cogroup •  many other operator will further call into the above operators •  e.g. various join operator will call cogroup. •  Each ShuffleDependency maps to one stage in Spark Job and then will lead to a shuffle.
  • 9. So everyone should have seen this before join   union   groupBy   map   Stage  3   Stage  1   Stage  2   A:   B:   C:   D:   E:   F:   G:  
  • 10. why shuffle is expensive •  When doing shuffle, data no longer stay in memory only •  For spark, shuffle process might involve •  data partition: which might involve very expensive data sorting works etc. •  data ser/deser: to enable data been transfer through network or across processes. •  data compression: to reduce IO bandwidth etc. •  DISK IO: probably multiple times on one single data block •  E.g. Shuffle Spill, Merge combine
  • 11. History •  Spark 0.6-0.7, same code path with RDD’s persistent method, can choose MEMORY_ONLY and DISK_ONLY (default). •  Spark 0.8-0.9: •  separate shuffle code path from BM and create ShuffleBlockManager and BlockObjectWriter only for shuffle, now shuffle data can only be written to disk. •  Shuffle optimization: Consolidate shuffle write. •  Spark 1.0, pluggable shuffle framework. •  Spark 1.1, sort-based shuffle implementation. •  Spark 1.2 netty transfer service reimplementation. sort- based shuffle by default •  Spark 1.2+ on the go: external shuffle service etc.
  • 13. 13 Pluggable Shuffle Framework •  ShuffleManager •  Manage shuffle related components, registered in SparkEnv, configured through SparkConf, default is sort (pre 1.2 is hash), •  ShuffleWriter •  Handle shuffle data output logics. Will return MapStatus to be tracked by MapOutputTracker. •  ShuffleReader •  Fetch shuffle data to be used by e.g. ShuffleRDD •  ShuffleBlockManager •  Manage the mapping relation between abstract bucket and materialized data block.
  • 14. High level data flow BlockManager HashShuffleManager DiskBlockManager FileShuffleBlockManager Local File System SortShuffleManager IndexShuffleBlockManager GetBlockData BlockTransferService GetBlockData Direct mapping or mapping by File Groups Map to One Data File and One Index File per mapId Just do one-one File mapping
  • 15. 15 Hash Based Shuffle - Shuffle Writer •  Basic shuffle writer Map  Task Map  Task Map  Task Map  Task File File File File File File File File File File File File File File File File Aggregator AggregatorAggregatorAggregator Each bucket is mapping to a single file
  • 16. 16 Hash Based Shuffle - Shuffle Writer •  Consolidate Shuffle Writer Each bucket is mapping to a segment of file Aggregator Aggregator Aggregator Aggregator
  • 17. 17 Hash Based Shuffle - Shuffle Writer •  Basic Shuffle Writer •  M * R shuffle spill files •  Concurrent C * R opened shuffle files. •  If shuffle spill enabled, could generate more tmp spill files say N. •  Consolidate Shuffle Writer •  Reduce the total spilled files into C * R if (M >> C) •  Concurrent opened is the same as the basic shuffle writer. •  Memory consumption •  Thus Concurrent C * R + N file handlers. •  Each file handler could take up to 32~100KB+ Memory for various buffers across the writer stream chain.
  • 18. 18 Sort Based Shuffle - Shuffle Writer •  Sort Shuffle Writer Map  Task Map  Task Map  Task Map  Task FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileSegment FileFile FileFile FileFile FileFile ExternalSorter ExternalSorterExternalSorterExternalSorter
  • 19. 19 Sort Based Shuffle - Shuffle Writer •  Each map task generates 1 shuffle data file + 1 index file •  Utilize ExternalSorter to do the sort works. •  If map-side combine is required, data will be sorted by key and partition for aggregation. Otherwise data will only be sorted by partition. •  If reducer number <= 200 and no need to do aggregation or ordering, data will not be sorted at all. •  Will go with hash way and spill to separate files for each reduce partition, then merge them into one per map for final output.
  • 20. 20 Hash Based Shuffle - Shuffle Reader •  Actually, at present, Sort Based Shuffle also go with HashShuffleReader Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Reduce  TaskReduce  Task Reduce  Task Reduce  Task Reduce  Task Aggregator AggregatorAggregatorAggregator
  • 22. Related conceptions •  BlockTransferService •  Provide a general interface for ShuffleFetcher and working with BlockDataManager to get local data. •  ShuffleClient •  Wrap up the fetching data process for the client side, say setup TransportContext, new TransportClient etc. •  TransportContext •  Context to setup the transport layer •  TransportServer •  low-level streaming service server •  TransportClient •  Client for fetching consecutive chunks TransportServer
  • 24. ShuffleManager Data Flow 24 BlockManager NettyBlockTransferService GetBlockData BlockDataManager TransportClient NettyBlockTransferService TransportServer BlockStoreShuffleFetcher ShuffleBlockFetcherIterator Block Manager Local Blocks Remote Blocks Local Remote HashShuffleReader fetch ShuffleManager ShuffleBlockManager GetBlockData clientHandler TransportChannel HandlerclientHandler TransportChannel Handler Fetch Request Fetch Results
  • 25.
  • 26. External Shuffle Service •  Design goal •  allow for the service to be long-running •  possibly much longer-running than Spark •  support multiple version of Spark simultaneously etc. •  can be integrated into YARN NodeManager, Standalone Worker, or on its own •  The entire service been ported to Java •  do not include Spark's dependencies •  full control over the binary compatibility of the components •  not depend on the Scala runtime or version.
  • 27. External Shuffle Service •  Current Status •  Basic framework seems ready. •  A Network module extracted from the core module •  BlockManager could be configured with executor built-in shuffle service or external standalone shuffle service •  A standaloneWorkerShuffleService could be launched by worker •  Disabled by default. •  How it works •  Shuffle data is still written by the shuffleWriter to local disks. •  The external shuffle service knows how to read these files on disks (executor will registered related info to it, e.g. shuffle manager type, file dir layout etc.), it follow the same rules applied for written these file, so it could serve the data correctly.
  • 28. 28 Sort Merge Shuffle Reader •  Background: •  Current HashShuffleReader does not utilize the sort result within partition in map-side. •  The actual by key sort work is always done at reduce side. •  While the map side will do by-partition sort anyway ( sort shuffle ) •  Change it to a by-key-and-partition sort does not bring many extra overhead. •  Current Status •  [WIP] https://github.com/apache/spark/pull/3438
  • 29. Some shuffle related configs •  spark.shuffle.spill (true) •  spark.shuffle.memoryFraction (0.2) •  spark.shuffle.manager [sort]/hash •  spark.shuffle.sort.bypassMergeThreshold (200) •  spark.shuffle.blockTransferService [netty]/nio •  spark.shuffle.consolidateFiles (false) •  spark.shuffle.service.enabled (false)
  • 30. What’s next? •  Other custom shuffle logic? •  Alternative way to save shuffle data blocks •  E.g. in memory (again) •  Other transport mechanism? •  Break stage barrier? •  To fetch shuffle data when part of the map tasks are done. •  Push mode instead of pull mode?
  • 31. Thanks to Jerry Shao •  Some of this ppt’s material came from Jerry Shao@Intel weibo: @saisai_shao •  Jerry also contributes a lot of essential patches for spark core / spark streaming etc.
  • 32. Join Us ! • 加盟 / 合作 / 讨论 统统欢迎 • 数据平台开发,大数据相关技术,只要够 Cool,我们都玩 • tianhuo@mogujie.com