Flat Datacenter Storage (FDS) is, as the intro describes, "a high-performance, fault-tolerant, large-scale, locality-oblivious blob store". It's also a great example of how carefully thought-out co-design of software and hardware for a target workload can yield really impressive performance results, even in the presence of heterogeneity and operating at scale. In my (admittedly biased) opinion, this style of system design doesn't get enough attention outside of academia, and has a lot to teach us about how data-intensive systems should be designed.
Papers We Love January 2015 - Flat Datacenter Storage
1. Flat Datacenter Storage
Presented by Alex Rasmussen
Papers We Love SF #11
2015-01-22
Edmund B. Nightingale, Jeremy Elson, Jinliang Fan,
Owen Hofmann, Jon Howell, and Yutaka Suzue
16. [Figure: common data center interconnect topology with edge, aggregation, and core layers; host-to-switch links are GigE, links between switches are 10 GigE]
[Table: cost per GigE host for a hierarchical switch design vs. a fat-tree, 2002 to 2008]
Aggregate bandwidth above is less than aggregate demand below
Sometimes by 100x or more
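To put a number on that oversubscription claim, here's a quick back-of-the-envelope sketch. The host counts and link speeds below are hypothetical, chosen only to illustrate the idea, not figures from the paper:

```python
# Back-of-the-envelope oversubscription for one edge switch in a
# hierarchical topology. All numbers are hypothetical, for illustration only.

hosts_per_rack = 40        # GigE hosts on one edge switch
host_link_gbps = 1         # 1 Gbps per host
uplinks_per_edge = 1       # one uplink toward the aggregation layer
uplink_gbps = 10           # 10 GigE uplink

demand_below = hosts_per_rack * host_link_gbps     # 40 Gbps of potential demand
bandwidth_above = uplinks_per_edge * uplink_gbps   # 10 Gbps actually available

print(f"edge oversubscription: {demand_below / bandwidth_above:.0f}:1")  # 4:1

# Stack similar ratios at the aggregation and core layers and the end-to-end
# oversubscription multiplies, which is how hierarchical designs end up
# 100x oversubscribed at the top of the tree.
```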
17. What if I told you the network isn’t oversubscribed?
18. Consequences
• No local vs. remote disk distinction
• Simpler work schedulers
• Simpler programming models
30. TLT Construction
• m Permutations of Tractserver List
• Weighted by disk speed
• Served by metadata server to clients
• Only update when cluster changes
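Once a client has the TLT, finding the tractserver for any tract is a stateless computation plus an array index. Here's a minimal sketch of that lookup, roughly the scheme the paper describes; the table contents and helper names are illustrative, not FDS's actual data structures:

```python
import hashlib

# Minimal sketch of a client-side tract lookup against a cached TLT.
# Table contents and names are illustrative, not FDS's real structures.

# One row per tract locator: (version, tractserver address). The full table
# is built from m weighted permutations of the tractserver list, as above.
tlt = [
    (0, "10.0.0.11:9000"),
    (0, "10.0.0.12:9000"),
    (2, "10.0.0.14:9000"),
    (0, "10.0.0.11:9000"),
]

def tract_locator(blob_guid: bytes, tract_number: int) -> int:
    # Locator = (hash(blob GUID) + tract number) mod TLT length.
    h = int.from_bytes(hashlib.sha1(blob_guid).digest()[:8], "big")
    return (h + tract_number) % len(tlt)

def lookup(blob_guid: bytes, tract_number: int):
    version, server = tlt[tract_locator(blob_guid, tract_number)]
    # The client tags its read/write with `version` so the tractserver can
    # reject requests built from a stale table.
    return version, server
```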
31. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 0 | A |
| 2 | 0 | B |
| 3 | 2 | D |
| 4 | 0 | A |
| 5 | 3 | C |
| 6 | 0 | F |
| ... | ... | ... |
32. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 1 | NEW / A |
| 2 | 0 | B |
| 3 | 2 | D |
| 4 | 1 | NEW / A |
| 5 | 4 | NEW / C |
| 6 | 0 | F |
| ... | ... | ... |
33. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 2 | NEW |
| 2 | 0 | A |
| 3 | 2 | A |
| 4 | 2 | NEW |
| 5 | 5 | NEW |
| 6 | 0 | A |
| ... | ... | ... |
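The per-row version is what makes growth safe: clients tag every request with the version of the TLT row they used, and a tractserver that knows about a newer row rejects the request so the client refetches the table before retrying. A small sketch of that check, with invented names:

```python
from collections import namedtuple

# Sketch of the per-row version check a tractserver might apply.
# Field and exception names are invented for illustration.

Request = namedtuple("Request", ["locator", "tlt_version", "payload"])

class StaleTlt(Exception):
    """Signals the client to refetch the TLT from the metadata server."""

def handle_request(row_versions, request):
    # row_versions[i] is the newest version this server has seen for TLT row i.
    current = row_versions[request.locator]
    if request.tlt_version != current:
        # The client built this request from an old row, e.g. from before
        # the cluster grew; make it refresh and retry.
        raise StaleTlt(f"row {request.locator}: client v{request.tlt_version}, server v{current}")
    # ...otherwise serve the read or write against the local disk.
```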
35. Replication

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
36. Replication

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
37. Replication
• Create, Delete, Extend:
- client sends the operation to the primary replica
- primary runs two-phase commit (2PC) with the other replicas
• Writes go to all replicas
• Reads go to a random replica
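In sketch form, the client-side read and write paths look something like this; the RPC helpers here are invented stand-ins for FDS's actual wire protocol:

```python
import random

# Sketch of the client-side replication pattern from the slide above.
# send_write / send_read are stand-ins for FDS's actual RPCs.

def write_tract(replicas, blob_guid, tract_number, data, send_write):
    # Writes go to every replica; the application sees the write as durable
    # only after all replicas have acknowledged it.
    for server in replicas:
        send_write(server, blob_guid, tract_number, data)

def read_tract(replicas, blob_guid, tract_number, send_read):
    # Any replica can serve a read, so pick one at random to spread load.
    return send_read(random.choice(replicas), blob_guid, tract_number)
```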
38. Recovery

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
39. Recovery

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
Recover 1TB from 3000 disks in < 20 seconds
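That number is less magical than it sounds: because every TLT row that named the failed disk pairs it with a different set of other disks, the lost replicas get rebuilt by nearly every disk in the cluster at once. Rough arithmetic, using an assumed ballpark disk throughput rather than a figure from the paper:

```python
# Why parallel recovery is fast: each surviving disk rebuilds only a sliver
# of the failed disk's data. The 100 MB/s figure is an assumed ballpark for
# sequential disk throughput, not a number from the paper.

lost_bytes = 1e12          # ~1 TB stored on the failed tractserver
disks = 3000               # disks participating in recovery
disk_bytes_per_sec = 100e6

per_disk_bytes = lost_bytes / disks                 # ~333 MB each
transfer_seconds = per_disk_bytes / disk_bytes_per_sec

print(f"{per_disk_bytes / 1e6:.0f} MB per disk, ~{transfer_seconds:.1f} s of raw disk time")
# A few seconds of I/O per disk; the rest of the <20 s budget goes to failure
# detection, handing out new TLT versions, and moving the data over the network.
```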
42. Networking
[Figure: simple fat-tree topology with edge, aggregation, and core layers across four pods; two-level prefix/suffix routing tables (TCAM + RAM) steer a packet from 10.0.1.2 to 10.2.0.3 along one specific path through the core]
Clos topology: small switches + ECMP = full bisection bandwidth
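Full bisection bandwidth in a Clos network is a statistical property: ECMP hashes each flow onto one of the equal-cost paths, so two long-lived flows can still collide on the same link and each get half rate. A toy sketch of flow hashing, not any particular switch's implementation:

```python
import hashlib

# Toy ECMP path selection: hash the flow's 5-tuple onto one of the
# equal-cost uplinks. Not any real switch's algorithm, just the idea.

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_paths

# Every packet of a flow takes the same path (no reordering), but two
# elephant flows that hash to the same uplink share it for their lifetime.
# Short flows keep those collisions brief so they average out.
```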
43. Networking
• Network bandwidth = disk bandwidth
• Full bisection bandwidth is stochastic
• Short flows good for ECMP
• TCP hates short flows
• RTS/CTS to mitigate incast; see paper
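The RTS/CTS scheme is essentially receiver-side admission control: a sender announces that it has data ready (RTS), and the receiver grants clear-to-send (CTS) to only a few senders at a time so its switch port queue never overflows. A hedged sketch of the idea; the concurrency limit and names are illustrative, not FDS's exact protocol:

```python
from collections import deque

# Receiver-side sketch of RTS/CTS incast avoidance. The concurrency limit
# and names are illustrative; see the paper for FDS's actual protocol.

class IncastGate:
    def __init__(self, max_in_flight=2):
        self.max_in_flight = max_in_flight
        self.in_flight = set()
        self.waiting = deque()

    def rts(self, sender):
        """Sender announces pending data; returns True if it may send now (CTS)."""
        if len(self.in_flight) < self.max_in_flight:
            self.in_flight.add(sender)
            return True
        self.waiting.append(sender)
        return False

    def done(self, sender):
        """A transfer finished; returns the next sender to grant a CTS, if any."""
        self.in_flight.discard(sender)
        if self.waiting:
            nxt = self.waiting.popleft()
            self.in_flight.add(nxt)
            return nxt
        return None
```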
55. FDS’ Lessons
• Great example of ground-up rethink
- Ambitious but implementable
• Big wins possible with co-design
• Constantly re-examine assumptions
57. TritonSort & Themis
• Balanced hardware architecture
• Full bisection-bandwidth network
• Job-level fault tolerance
• Huge wins possible
- Beat 3000+ node cluster by 35% with 52 nodes
• NSDI 2012, SoCC 2013