Flat Datacenter Storage (FDS) is, as the intro describes, "a high-performance, fault-tolerant, large-scale, locality-oblivious blob store". It's also a great example of how carefully thought-out co-design of software and hardware for a target workload can yield really impressive performance results, even in the presence of heterogeneity and operating at scale. In my (admittedly biased) opinion, this style of system design doesn't get enough attention outside of academia, and has a lot to teach us about how data-intensive systems should be designed.
Papers We Love January 2015 - Flat Datacenter Storage
1. Flat Datacenter Storage
Presented by Alex Rasmussen
Papers We Love SF #11
2015-01-22
Edmund B. Nightingale, Jeremy Elson, Jinliang Fan,
Owen Hofmann, Jon Howell, and Yutaka Suzue
16. [Figure: common data center interconnect topology with edge, aggregation, and core layers; host-to-switch links are GigE, links between switches are 10 GigE]
[Table: cost per GigE host for a hierarchical switch design vs. a fat-tree, 2002 to 2008]
Aggregate bandwidth above is less than aggregate demand below
Sometimes by 100x or more
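To put a number on that oversubscription claim, here's a quick back-of-the-envelope sketch. The host counts and link speeds below are hypothetical, chosen only to illustrate the idea, not figures from the paper:

```python
# Back-of-the-envelope oversubscription for one edge switch in a
# hierarchical topology. All numbers are hypothetical, for illustration only.

hosts_per_rack = 40        # GigE hosts on one edge switch
host_link_gbps = 1         # 1 Gbps per host
uplinks_per_edge = 1       # one uplink toward the aggregation layer
uplink_gbps = 10           # 10 GigE uplink

demand_below = hosts_per_rack * host_link_gbps     # 40 Gbps of potential demand
bandwidth_above = uplinks_per_edge * uplink_gbps   # 10 Gbps actually available

print(f"edge oversubscription: {demand_below / bandwidth_above:.0f}:1")  # 4:1

# Stack similar ratios at the aggregation and core layers and the end-to-end
# oversubscription multiplies, which is how hierarchical designs end up
# 100x oversubscribed at the top of the tree.
```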
17. What if I told you the network isn’t oversubscribed?
18. Consequences
• No local vs. remote disk distinction
• Simpler work schedulers
• Simpler programming models
30. TLT Construction
• m Permutations of Tractserver List
• Weighted by disk speed
• Served by metadata server to clients
• Only update when cluster changes
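Once a client has the TLT, finding the tractserver for any tract is a stateless computation plus an array index. Here's a minimal sketch of that lookup, roughly the scheme the paper describes; the table contents and helper names are illustrative, not FDS's actual data structures:

```python
import hashlib

# Minimal sketch of a client-side tract lookup against a cached TLT.
# Table contents and names are illustrative, not FDS's real structures.

# One row per tract locator: (version, tractserver address). The full table
# is built from m weighted permutations of the tractserver list, as above.
tlt = [
    (0, "10.0.0.11:9000"),
    (0, "10.0.0.12:9000"),
    (2, "10.0.0.14:9000"),
    (0, "10.0.0.11:9000"),
]

def tract_locator(blob_guid: bytes, tract_number: int) -> int:
    # Locator = (hash(blob GUID) + tract number) mod TLT length.
    h = int.from_bytes(hashlib.sha1(blob_guid).digest()[:8], "big")
    return (h + tract_number) % len(tlt)

def lookup(blob_guid: bytes, tract_number: int):
    version, server = tlt[tract_locator(blob_guid, tract_number)]
    # The client tags its read/write with `version` so the tractserver can
    # reject requests built from a stale table.
    return version, server
```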
31. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 0 | A |
| 2 | 0 | B |
| 3 | 2 | D |
| 4 | 0 | A |
| 5 | 3 | C |
| 6 | 0 | F |
| ... | ... | ... |
32. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 1 | NEW / A |
| 2 | 0 | B |
| 3 | 2 | D |
| 4 | 1 | NEW / A |
| 5 | 4 | NEW / C |
| 6 | 0 | F |
| ... | ... | ... |
33. Cluster Growth

| Tract Locator | Version | Tractserver |
|---|---|---|
| 1 | 2 | NEW |
| 2 | 0 | A |
| 3 | 2 | A |
| 4 | 2 | NEW |
| 5 | 5 | NEW |
| 6 | 0 | A |
| ... | ... | ... |
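The per-row version is what makes growth safe: clients tag every request with the version of the TLT row they used, and a tractserver that knows about a newer row rejects the request so the client refetches the table before retrying. A small sketch of that check, with invented names:

```python
from collections import namedtuple

# Sketch of the per-row version check a tractserver might apply.
# Field and exception names are invented for illustration.

Request = namedtuple("Request", ["locator", "tlt_version", "payload"])

class StaleTlt(Exception):
    """Signals the client to refetch the TLT from the metadata server."""

def handle_request(row_versions, request):
    # row_versions[i] is the newest version this server has seen for TLT row i.
    current = row_versions[request.locator]
    if request.tlt_version != current:
        # The client built this request from an old row, e.g. from before
        # the cluster grew; make it refresh and retry.
        raise StaleTlt(f"row {request.locator}: client v{request.tlt_version}, server v{current}")
    # ...otherwise serve the read or write against the local disk.
```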
35. Replication

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
36. Replication

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
37. Replication
• Create, Delete, Extend:
- client sends the operation to the primary replica
- primary runs two-phase commit (2PC) with the other replicas
• Writes go to all replicas
• Reads go to a random replica
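In sketch form, the client-side read and write paths look something like this; the RPC helpers here are invented stand-ins for FDS's actual wire protocol:

```python
import random

# Sketch of the client-side replication pattern from the slide above.
# send_write / send_read are stand-ins for FDS's actual RPCs.

def write_tract(replicas, blob_guid, tract_number, data, send_write):
    # Writes go to every replica; the application sees the write as durable
    # only after all replicas have acknowledged it.
    for server in replicas:
        send_write(server, blob_guid, tract_number, data)

def read_tract(replicas, blob_guid, tract_number, send_read):
    # Any replica can serve a read, so pick one at random to spread load.
    return send_read(random.choice(replicas), blob_guid, tract_number)
```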
38. Recovery

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
39. Recovery

| Tract Locator | Version | Replica 1 | Replica 2 | Replica 3 |
|---|---|---|---|---|
| 1 | 0 | A | B | C |
| 2 | 0 | A | C | Z |
| 3 | 0 | A | D | H |
| 4 | 0 | A | E | M |
| 5 | 0 | A | F | G |
| 6 | 0 | A | G | P |
| ... | ... | ... | ... | ... |
Recover 1TB from 3000 disks in < 20 seconds
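That number is less magical than it sounds: because every TLT row that named the failed disk pairs it with a different set of other disks, the lost replicas get rebuilt by nearly every disk in the cluster at once. Rough arithmetic, using an assumed ballpark disk throughput rather than a figure from the paper:

```python
# Why parallel recovery is fast: each surviving disk rebuilds only a sliver
# of the failed disk's data. The 100 MB/s figure is an assumed ballpark for
# sequential disk throughput, not a number from the paper.

lost_bytes = 1e12          # ~1 TB stored on the failed tractserver
disks = 3000               # disks participating in recovery
disk_bytes_per_sec = 100e6

per_disk_bytes = lost_bytes / disks                 # ~333 MB each
transfer_seconds = per_disk_bytes / disk_bytes_per_sec

print(f"{per_disk_bytes / 1e6:.0f} MB per disk, ~{transfer_seconds:.1f} s of raw disk time")
# A few seconds of I/O per disk; the rest of the <20 s budget goes to failure
# detection, handing out new TLT versions, and moving the data over the network.
```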
42. Networking
[Figure: simple fat-tree topology with edge, aggregation, and core layers across four pods; two-level prefix/suffix routing tables (TCAM + RAM) steer a packet from 10.0.1.2 to 10.2.0.3 along one specific path through the core]
Clos topology: small switches + ECMP = full bisection bandwidth
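Full bisection bandwidth in a Clos network is a statistical property: ECMP hashes each flow onto one of the equal-cost paths, so two long-lived flows can still collide on the same link and each get half rate. A toy sketch of flow hashing, not any particular switch's implementation:

```python
import hashlib

# Toy ECMP path selection: hash the flow's 5-tuple onto one of the
# equal-cost uplinks. Not any real switch's algorithm, just the idea.

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_paths

# Every packet of a flow takes the same path (no reordering), but two
# elephant flows that hash to the same uplink share it for their lifetime.
# Short flows keep those collisions brief so they average out.
```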
43. Networking
• Network bandwidth = disk bandwidth
• Full bisection bandwidth is stochastic
• Short flows good for ECMP
• TCP hates short flows
• RTS/CTS to mitigate incast; see paper
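The RTS/CTS scheme is essentially receiver-side admission control: a sender announces that it has data ready (RTS), and the receiver grants clear-to-send (CTS) to only a few senders at a time so its switch port queue never overflows. A hedged sketch of the idea; the concurrency limit and names are illustrative, not FDS's exact protocol:

```python
from collections import deque

# Receiver-side sketch of RTS/CTS incast avoidance. The concurrency limit
# and names are illustrative; see the paper for FDS's actual protocol.

class IncastGate:
    def __init__(self, max_in_flight=2):
        self.max_in_flight = max_in_flight
        self.in_flight = set()
        self.waiting = deque()

    def rts(self, sender):
        """Sender announces pending data; returns True if it may send now (CTS)."""
        if len(self.in_flight) < self.max_in_flight:
            self.in_flight.add(sender)
            return True
        self.waiting.append(sender)
        return False

    def done(self, sender):
        """A transfer finished; returns the next sender to grant a CTS, if any."""
        self.in_flight.discard(sender)
        if self.waiting:
            nxt = self.waiting.popleft()
            self.in_flight.add(nxt)
            return nxt
        return None
```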
55. FDS’ Lessons
• Great example of ground-up rethink
- Ambitious but implementable
• Big wins possible with co-design
• Constantly re-examine assumptions
57. TritonSort & Themis
• Balanced hardware architecture
• Full bisection-bandwidth network
• Job-level fault tolerance
• Huge wins possible
- Beat 3000+ node cluster by 35% with 52 nodes
• NSDI 2012, SoCC 2013