Leveraging Endpoint Flexibility in Data-Intensive Clusters

Mosharaf Chowdhury
Srikanth Kandula
Ion Stoica
Presented by Ran Ziv UC Berkeley
Ran Ziv© 2013 1

What’s Ahead?
• Intro: Data-Intensive Clusters
• Proposed solution
• Evaluation
• Conclusion

What is a Data-Intensive Cluster?
• Scalable data storage and processing
• “Core” consists of two main parts
• Distributed File System (DFS)
• Processing (MapReduce)

Motivation
Store and analyze PBs of information

How Did It Originate?
• Heavily influenced by Google’s architecture
• Other Web companies quickly saw the benefits

DFS: How does it work?
• Moore’s law… and not

Disk Capacity and Price
• We’re generating more data than ever before
• Fortunately, the size and cost of storage have kept pace

Disk Capacity and Performance
• Disk performance has also increased in the last 15
years
• Unfortunately, transfer rates haven’t kept pace with
capacity

Architecture of a Typical HPC System

You Don’t Just Need Speed…
• The problem is that we have way more data than
code

You Need Speed At Scale

DISTRIBUTED FILESYSTEM

Benefits of DFS
• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Linear scalability

Collocated Storage and Processing
• Solution: store and process data on the same nodes
• Data Locality: “Bring the computation to the data”
• Reduces I/O and boosts performance

DFS High-Level Architecture
• DFS follows a master-slave architecture
• Master: NameNode
• Responsible for namespace and metadata
• Namespace: file hierarchy
• Metadata: ownership, permissions, block locations, etc.
• Slave: DataNode
• Responsible for storing the actual data blocks

DFS Blocks
• When a file is added to DFS, it’s split into blocks
• DFS uses a much larger block size (>= 64MB), for performance
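As a rough illustration of the splitting step, the sketch below cuts a file of a given size into fixed-size blocks. The 64 MB default comes from the slide; the function name `split_into_blocks` and the (offset, length) representation are assumptions made for this example, not part of any DFS API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the lower bound quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper for illustration only.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # a 200 MB file
print(len(blocks))                    # 4: three full 64 MB blocks plus a remainder
print(blocks[-1][1] // MB)            # 8: the last block holds the remaining 8 MB
```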

DFS Replication
• Those blocks are then replicated across machines
• The first block might be replicated to A, C and D

DFS Replication
• The next block might be replicated to B, D and E

DFS Replication
• The last block might be replicated to A, C and E

DFS Reliability
• Replication helps to achieve reliability
• Even when a node fails, two copies of the block remain
• These will be re-replicated to other nodes automatically
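The re-replication idea can be sketched with the example placement from the replication slides (blocks on A/C/D, B/D/E and A/C/E). The function `re_replicate` and the rule of choosing replacement nodes in alphabetical order are assumptions for this sketch; a real DFS master applies its own placement policy.

```python
REPLICATION_FACTOR = 3


def re_replicate(block_locations, failed_node, live_nodes):
    """Remove failed_node from every block, then restore 3 replicas per block.

    Hypothetical helper: replacement nodes are taken in sorted order so the
    sketch is deterministic.
    """
    for holders in block_locations.values():
        holders.discard(failed_node)
        for candidate in sorted(live_nodes):
            if len(holders) >= REPLICATION_FACTOR:
                break
            if candidate != failed_node:
                holders.add(candidate)  # no effect if it already holds the block
    return block_locations


locations = {
    "block-1": {"A", "C", "D"},
    "block-2": {"B", "D", "E"},
    "block-3": {"A", "C", "E"},
}
re_replicate(locations, failed_node="A", live_nodes={"B", "C", "D", "E"})
for block, holders in sorted(locations.items()):
    print(block, sorted(holders))
```

After node A fails, the two blocks that lost a copy each still have two replicas, and both are topped back up to three.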

MapReduce High-Level Architecture
Like DFS, MapReduce has a master-slave architecture
• Master: JobTracker
• Responsible for dividing, scheduling and monitoring work
• Slave: TaskTracker
• Responsible for actual processing

Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
• One function (Map) processes data
• That output is ultimately input to another function
(Reduce)

The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming

Intermediate Processing
• The Map function’s output is grouped and sorted
• This is the automatic “sort and shuffle” process

The Reduce Function
• Operates on all records in a group
• Often used for sum, average or other aggregate functions
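The three stages (Map, sort and shuffle, Reduce) can be sketched in a single process with a word count. This is only an in-memory illustration of the data flow; the names `map_fn`, `reduce_fn` and `run_job` are made up for this example and are not a MapReduce framework API.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(record):
    """Map: operate on one record, emitting (key, value) pairs."""
    for word in record.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """Reduce: operate on all values that share a key (here, a sum)."""
    return (key, sum(values))


def run_job(records):
    # Map phase: each record is processed individually.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # "Sort and shuffle": sort by key so equal keys become adjacent groups.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per group of records with the same key.
    return [
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


print(run_job(["the cat", "the dog"]))  # [('cat', 1), ('dog', 1), ('the', 2)]
```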

MapReduce Flow
[Diagram: the Job Tracker coordinates the machines; Input Files → Mapper tasks → Intermediate Files → Reducer tasks → Output Files]

Communication is Crucial
• Performance: Facebook analytics jobs spend 33% of their runtime in communication

Cross-Rack Traffic
• Facebook: DFS reads 14%, intermediate data 46%, DFS writes 40%
• Bing: DFS reads 31%, intermediate data 15%, DFS writes 54%

DFS
• Files are divided into blocks
• 64MB to 1GB in size
• Each block is replicated
• To 3 machines for fault tolerance
• In 2 fault domains for partition tolerance
• Synchronous operations
[Diagram: the blocks of a file (F, I, L, E) replicated on machines in Rack 1, Rack 2 and Rack 3, which are connected through the core]
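The two replication constraints (3 machines, 2 fault domains) can be sketched as follows. The code assumes that a fault domain is a rack and that machines are chosen uniformly at random among valid choices; both are assumptions for this illustration, and `place_replicas`, `meets_constraints` and the rack layout are hypothetical names.

```python
import random


def place_replicas(machines_by_rack, rng=random):
    """Pick 3 distinct machines that span exactly 2 racks (hypothetical helper)."""
    # One rack holds two of the replicas, so it needs at least two machines.
    big_racks = [r for r, ms in machines_by_rack.items() if len(ms) >= 2]
    pair_rack = rng.choice(big_racks)
    # The remaining replica goes to any other non-empty rack.
    other_racks = [r for r, ms in machines_by_rack.items() if r != pair_rack and ms]
    single_rack = rng.choice(other_racks)
    single = rng.choice(machines_by_rack[single_rack])
    pair = rng.sample(machines_by_rack[pair_rack], 2)
    return [single] + pair


def meets_constraints(replicas, rack_of):
    """True if there are 3 distinct machines in at least 2 racks."""
    return len(set(replicas)) == 3 and len({rack_of[m] for m in replicas}) >= 2


racks = {"rack1": ["m1", "m2"], "rack2": ["m3", "m4"], "rack3": ["m5", "m6"]}
rack_of = {m: r for r, ms in racks.items() for m in ms}
replicas = place_replicas(racks)
print(replicas, meets_constraints(replicas, rack_of))
```

Any placement that passes `meets_constraints` is equally acceptable to the file system, which is the flexibility the rest of the talk builds on.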

DFS
• Files are divided into blocks
• Each block is replicated
How to handle DFS flows?
• Hedera, VLB, Orchestra, Coflow, MicroTE, DevoFlow, …
• Fixed: sources, destinations
• Flexible: paths, rates

DFS
• Files are divided into blocks
• Each block is replicated
How to handle DFS flows?
• Hedera, VLB, Orchestra, Coflow, …
• Replica locations don’t matter as long as constraints are met
• Fixed: sources
• Flexible: destinations, paths, rates

Sinbad
Steers flexible replication traffic away from hotspots
• Improve write rates
• More balanced network

The Distributed Writing Problem
Given
• Blocks of different size
• Links of different capacities
Place blocks to minimize
• The average block write time
• The average file write time
Analogous to Job Shop Scheduling, which is NP-Hard:
Given
• Jobs of different length, and
• Machines of different speed,
Schedule jobs to minimize
• The average job completion time

How to Make it Easy?
Assumptions:
• All blocks have the same size
• Link utilizations are stable
Theorem:
Greedy placement minimizes
average block/file write times

How to Make it Easy? – In Practice
• Link utilizations are stable
In reality: average link utilizations are temporarily stable (see notes 1 and 2)
• All blocks have the same size
In reality: fixed-size large blocks account for 93% of all bytes written
Note 1: Utilization is considered stable if its average over the next x seconds remains within ±5% of the initial value.
Note 2: Typically, x ranges from 5 to 10 seconds.
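The stability definition in the notes can be written as a small check. The sketch assumes that "within ±5% of the initial value" means a relative deviation and that utilization is sampled at regular intervals over the window; the function name `is_stable` and the sample values are made up for illustration.

```python
def is_stable(samples, initial, tolerance=0.05):
    """True if the mean of samples stays within ±tolerance of initial.

    samples: utilization readings taken over the next x seconds (assumed input).
    initial: utilization at the start of the window.
    """
    average = sum(samples) / len(samples)
    return abs(average - initial) <= tolerance * initial


print(is_stable([0.50, 0.52, 0.49], initial=0.50))  # True
print(is_stable([0.50, 0.70, 0.80], initial=0.50))  # False
```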

Greedy Algorithm
Two-step greedy replica placement:
1. Pick the least-loaded link
2. Send a block from the file with the fewest remaining blocks through the selected link
[Diagram: example block placement on a timeline between times T and T+1]
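A minimal sketch of the two-step rule, under simplifying assumptions that are not from the slides: all pending blocks are known up front, every block adds the same fixed amount (`block_cost`) to the load of the link it uses, and ties are broken by name. The function `greedy_schedule` and its inputs are hypothetical.

```python
def greedy_schedule(link_loads, remaining_blocks, block_cost=1.0):
    """Return (file, link) assignments in the order the greedy rule makes them."""
    loads = dict(link_loads)            # copy so the caller's data is untouched
    remaining = dict(remaining_blocks)
    schedule = []
    while any(count > 0 for count in remaining.values()):
        # Step 1: pick the least-loaded link.
        link = min(loads, key=lambda name: (loads[name], name))
        # Step 2: pick the file with the fewest remaining blocks.
        pending = [f for f, count in remaining.items() if count > 0]
        chosen_file = min(pending, key=lambda f: (remaining[f], f))
        schedule.append((chosen_file, link))
        remaining[chosen_file] -= 1
        loads[link] += block_cost       # assumed load model for the sketch
    return schedule


plan = greedy_schedule(
    link_loads={"A": 0.2, "B": 0.5, "C": 0.9},
    remaining_blocks={"file1": 2, "file2": 1},
    block_cost=0.4,
)
print(plan)  # [('file2', 'A'), ('file1', 'B'), ('file1', 'A')]
```

Serving the file with the fewest remaining blocks first lets short files finish early, which is what keeps the average file write time low.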

Sinbad Overview
Follows a master-slave architecture
• Master:
• Collocated with the DFS master
• Decides where to place each block
• Slave:
• Periodically reports information to the master
[Diagram: the Sinbad master runs alongside the DFS master; each machine runs a Sinbad slave alongside its DFS slave]

Evaluation
A 3000-node trace-driven simulation matched against a
100-node EC2 deployment
1. Does it improve performance?
2. Does it balance the network?
3. Does the storage remain balanced? YES

More Balanced
[Figure: two plots of network imbalance, measured as the coefficient of variation of link utilization (x-axis, 0 to 4), against fraction of time (y-axis, 0 to 1), comparing Default and Network-Aware placement. Panels: EC2 Deployment (load across rack-to-host links) and Facebook Trace Simulation (load across core-to-rack links).]
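The imbalance metric used here, the coefficient of variation, is the standard deviation of the link loads divided by their mean. The sketch below uses the population standard deviation, which is an assumption (the slides do not say which variant was used), and the load values are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(loads):
    """Standard deviation divided by mean; 0 means perfectly balanced links."""
    average = mean(loads)
    if average == 0:
        return 0.0  # no traffic at all: treat as balanced
    return pstdev(loads) / average


print(coefficient_of_variation([10, 10, 10, 10]))          # 0.0
print(round(coefficient_of_variation([0, 0, 0, 40]), 3))   # 1.732
```

A lower value means traffic is spread more evenly across the links.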

What About Storage Balance?
Imbalanced in the short term
But, in the long term,
hotspots are uniformly distributed

Conclusions
Three Approaches Toward Contention Mitigation
#1 Increase Capacity
• Fatter links/interfaces
• Increase bisection B/W
• Fat tree, VL2, DCell, BCube, F10, …
#2 Decrease Load
• Data locality
• Static optimization
• Fair scheduling, Delay scheduling, Mantri, Quincy, PeriSCOPE, RoPE, Rhea, …
#3 Balance Usage
• Manage elephant flows
• Optimize intermediate comm.
• Valiant load balancing (VLB), Hedera, Orchestra, Coflow, …

Sinbad
Greedily steers replication traffic away from hotspots
• Improves job performance by making the network more balanced
• Improves DFS write performance while keeping the storage balanced
• Sinbad will become increasingly important as storage becomes faster
Planning to deploy Sinbad at
