3. Outline of talk
• Power & Energy in Data Centers
• Network architecture
• Protocol interactions
• Conclusions
4. Power & Energy in the Data Center
5. Data Center Energy Breakdown & Server Peak Power Usage Profile
[Figures: data center energy breakdown (Source: ASHRAE); server peak power usage profile (Source: Google 2007)]
• Power delivery and cooling overheads are quantified in the PUE metric
• CPU power contribution is less than 1/3 of server power
• Cooling is the most significant source of energy inefficiency
6. Energy Efficiency
Source: Barroso, Holzle: Data Center as a Computer, Morgan Claypool (publishers), 2009
• Servers are never completely idle
• Most of the time, server load is around 30%
• But the server is least energy efficient in its most common operating region!
7. Dynamic Power Range
Source: Barroso, Holzle: Data Center as a Computer, Morgan Claypool (publishers), 2009
The CPU power component (peak & idle) in servers has decreased over the years.
Dynamic power range:
• CPU power range is 3x for servers
• DRAM range is 2x
• Disk and networking are < 1.2x
Disk and network switches need to learn from the CPU's power-proportionality gains.
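Here the dynamic power range can be read as the ratio of a component's peak power to its idle power; a 3x CPU range means the CPU still draws roughly a third of its peak power while sitting idle:

\[
\text{dynamic range} \;=\; \frac{P_{\text{peak}}}{P_{\text{idle}}},
\qquad
\frac{P_{\text{peak}}}{P_{\text{idle}}} = 3 \;\Rightarrow\; P_{\text{idle}} \approx \tfrac{1}{3}\,P_{\text{peak}}
\]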
8. Energy Proportionality
Goal: achieve the best energy efficiency (~80%) in the common operating regions (20-30% load); see the sketch after the list below.
Challenges to proportionality:
• Most proportionality tricks in embedded/mobile devices are not usable in the DC due to huge activation penalties
• The distributed structure of data and applications doesn't allow powering down during low use
• Disk drives spin >50% of the time even when there is no activity
  [Sankar et al, ISCA '08]: smaller rotational speeds, multiple heads
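To make the proportionality gap concrete, here is a minimal sketch assuming a simple linear power model and an illustrative 50% idle-power fraction; the numbers are assumptions for illustration, not measurements from the cited sources.

# Minimal sketch: why a server running at ~30% load is energy-inefficient
# when idle power is a large fraction of peak power.
# Assumed linear model: P(u) = P_idle + (P_peak - P_idle) * u

PEAK_W = 300.0        # assumed peak server power (illustrative)
IDLE_FRACTION = 0.5   # assumed idle power ~50% of peak (non-proportional server)

def power(util, idle_fraction=IDLE_FRACTION, peak_w=PEAK_W):
    """Server power draw (watts) at utilization util in [0, 1]."""
    idle_w = idle_fraction * peak_w
    return idle_w + (peak_w - idle_w) * util

def efficiency(util):
    """Work delivered per watt, normalized so efficiency(1.0) == 1.0."""
    return 0.0 if util == 0 else (util * PEAK_W) / power(util)

for u in (0.1, 0.3, 0.5, 1.0):
    print(f"util={u:4.0%}  power={power(u):5.1f} W  relative efficiency={efficiency(u):.2f}")

# With ~50% idle power, a server at 30% load delivers only ~46% of its peak
# energy efficiency; an energy-proportional design (idle power near zero)
# would stay close to 100% across the whole operating range.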
9. Application Behavior in Data Centers
Source: Kozyrakis et al, IEEE Micro 2010
• Cosmos is similar to a data-mining workload
• Bing preloads the web index in memory
• But peak disk bandwidth can be high
Significant variation in disk, memory and network capacity and bandwidth usage across applications.
10. Dynamic Resource Requirements in the Data Center
[Figures: intra-server variation (server memory allocation per TPC-H query Q1-Q12, log scale from 0.1 MB to 100 GB); inter-server variation (memory allocation over time across a rendering farm)]
Huge variations even within a single application running in a large cluster.
11. Motivating Disaggregated Memory*
*Lim et al: Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009
[Figure: conventional blade systems, where each blade couples its CPUs with its own DIMMs and blades are connected over a backplane]
12. Disaggregated Memory*
*Lim et al: Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009
[Figure: blade systems with disaggregated memory, where compute blades with fewer local DIMMs reach a shared memory blade over the backplane]
• Leverage fast, shared communication fabrics
• Break CPU-memory co-location
13. Disaggregated Memory*
*Lim et al: Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009
[Figure: blade systems with disaggregated memory, with compute blades backed by a DIMM-populated memory blade across the backplane]
Authors claim:
• 8x improvement in memory-constrained blade environments
• 80+% improvement in performance per $
• 3x consolidation
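To see why the fast, shared fabric matters, here is a minimal sketch of the average memory access time when part of a blade's working set lives on the remote memory blade; the latency numbers are illustrative assumptions, not figures from Lim et al.

# Minimal sketch: average memory access time when part of a blade's working
# set lives on a remote memory blade reached over the backplane fabric.
# Latency numbers are illustrative assumptions, not figures from Lim et al.

LOCAL_DRAM_NS = 100.0      # assumed local DIMM access latency
REMOTE_BLADE_NS = 2000.0   # assumed backplane + memory-blade access latency

def avg_access_ns(local_fraction):
    """Average latency when local_fraction of accesses hit local DRAM and
    the remainder go to the shared memory blade."""
    return local_fraction * LOCAL_DRAM_NS + (1.0 - local_fraction) * REMOTE_BLADE_NS

for f in (1.0, 0.99, 0.95, 0.90):
    print(f"local hit fraction={f:.2f}  avg access={avg_access_ns(f):7.1f} ns")

# If the hypervisor/OS keeps the hot pages local (99%+ hits), the latency
# penalty stays modest (~20% here) while the blade gains access to a much
# larger, shareable memory pool.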
14. Disaggregated Server
Servers with consolidated power supply, fabric connectivity, DRAM and disk drives.
High-density, low-power SeaMicro SM10000 server*:
• Designed to replace 40 1-RU servers in a single 10-RU system
• 512 1.66 GHz 64-bit x86 Intel Atom cores in 10 RU; 2,048 CPUs/rack
• 1.28 Terabit interconnect fabric
• Up to 64 1 Gbps or 16 10 Gbps uplinks
• 0-64 SATA SSDs/hard disks
• Integrated load balancing, Ethernet switching, and server management
• Uses less than 2.5 kW of power
Claim: achieves 4x space & power consolidation.
*Source: SeaMicro, http://www.seamicro.com/?q=node/102
15. Network Architecture
16. Requirements of a Cloud-enabled Data Center
Economic & technical motivations:
• Use commodity hardware & components → economies of scale
• Dynamically distribute compute resources → capacity re-allocation
17. Status Quo: Conventional DC Network
[Figure: the Internet feeds two core routers (CR) at DC Layer 3, access routers (AR) below, then DC Layer 2 Ethernet switches (S) fanning out to racks of application servers (A)]
Key:
• CR = Core Router (L3)
• AR = Access Router (L3)
• S = Ethernet Switch (L2)
• A = Rack of app. servers
~1,000 servers/pod == IP subnet
Ref: "Data Center: Load Balancing Data Center Services", Cisco 2004
18. Conventional DC Network Problems
[Figure: the same CR/AR/S hierarchy annotated with oversubscription ratios of roughly 5:1 at the edge switches, 40:1 at the aggregation layer, and 200:1 towards the core routers]
• Cost of network equipment is prohibitive
• Limited server-to-server capacity
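A minimal sketch of what those oversubscription ratios mean for server-to-server bandwidth; the 1 Gbps NIC speed and the exact per-layer ratios are assumptions for illustration.

# Minimal sketch: per-server bandwidth available for traffic that must cross
# each layer of an oversubscribed tree. NIC speed and ratios are assumed.

NIC_GBPS = 1.0
OVERSUBSCRIPTION = {
    "within rack (ToR)": 1.0,     # line rate inside the rack
    "edge uplink":       5.0,     # ~5:1
    "aggregation":       40.0,    # ~40:1
    "core":              200.0,   # ~200:1
}

for layer, ratio in OVERSUBSCRIPTION.items():
    worst_case = NIC_GBPS / ratio
    print(f"{layer:20s} worst-case per-server bandwidth ~ {worst_case*1000:6.0f} Mbps")

# When all servers talk across the core at once, each one may see only
# ~5 Mbps of its 1 Gbps NIC -- the "limited server-to-server capacity"
# problem called out above.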
19. And More Problems …
[Figure: the same hierarchy split into IP subnet (VLAN) #1 and IP subnet (VLAN) #2, each confined beneath its own access-router pair]
• Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)
20. And More Problems …
[Figure: the same two-VLAN hierarchy; shifting capacity between subnets requires complicated manual L2/L3 re-configuration]
• Server IP address assignments are topological
• Moving an IP address out of its containing VLAN is hard
21. What We Need is …
1. L2 semantics
2. Uniform high capacity
3. Performance isolation
22. Achieve Uniform High Capacity: Clos Network Topology*
*Ref: A Scalable, Commodity Data Center Network Architecture, Al-Fares et al, SIGCOMM 2008
[Figure: three-stage Clos topology with intermediate (Int) switches at the top, K aggregation (Aggr) switches with D ports each, ToR switches below, 20 servers per ToR, and 20*(DK/4) servers in total]
• Large bisection BW
• Multiple paths at modest cost
• Tolerates fabric failure
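A minimal sketch of the scaling arithmetic implied by the figure; the parameter names follow the slide, while the concrete D and K values below are assumptions for illustration.

# Minimal sketch: server count in the Clos/VL2-style topology above, with
# K aggregation switches of D ports each and 20 servers per ToR, giving
# 20 * (D*K/4) servers in total. D and K values are illustrative.

SERVERS_PER_TOR = 20

def total_servers(d_ports, k_aggr):
    """Servers supported by K aggregation switches with D ports each."""
    tors = d_ports * k_aggr // 4      # DK/4 ToRs, each dual-homed to the aggregation layer
    return SERVERS_PER_TOR * tors

for d, k in ((24, 12), (48, 24), (144, 72)):
    print(f"D={d:3d} ports, K={k:3d} aggr switches -> {total_servers(d, k):7,d} servers")

# e.g. D=144, K=72 gives 20 * (144*72/4) = 51,840 servers, with many
# equal-cost paths through the intermediate switches providing the
# bisection bandwidth and tolerance to fabric failures.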
23. Addressing and Routing: Name-Location Separation*
Switches run link-state routing and maintain only switch-level topology; a Directory Service maps server names to locations (e.g. x → ToR2, y → ToR3, z → ToR4).
[Figure: a server under ToR1 queries the Directory Service (lookup & response) and sends packets encapsulated with the destination's ToR address, e.g. "ToR3 | y | payload", "ToR3 | z | payload"]
Servers use flat names.
*VL2: A Scalable and Flexible Data Center Network, Greenberg et al, SIGCOMM 2009
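A minimal sketch of the lookup-and-encapsulate step described above; the directory contents and addresses are hypothetical, and this illustrates only the idea rather than the VL2 implementation.

# Minimal sketch of name-location separation: flat application-level server
# names are resolved to the ToR switch ("locator") currently hosting them,
# and packets are tunneled to that ToR. Directory entries are hypothetical.

DIRECTORY_SERVICE = {
    "x": "ToR2",
    "y": "ToR3",
    "z": "ToR4",
}

def send(dst_name, payload):
    """Look up the destination's current ToR and encapsulate the packet.

    The fabric routes only on the outer ToR address, so a server can move to
    another rack by updating its directory entry, with no re-addressing."""
    locator = DIRECTORY_SERVICE[dst_name]   # lookup & response
    return (locator, dst_name, payload)     # outer locator | flat name | payload

print(send("y", b"payload"))   # ('ToR3', 'y', b'payload')
print(send("x", b"payload"))   # ('ToR2', 'x', b'payload')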
25. VL2 Fabric: Objectives and Solutions
Objective → Approach → Solution
1. Layer-2 semantics → employ flat addressing → name-location separation & resolution service
2. Uniform high capacity between servers → guarantee bandwidth for hose-model traffic → Clos-based network, Valiant LB flow routing
3. Performance isolation → enforce hose model using existing TCP mechanisms only
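For the "Valiant LB flow routing" entry above, here is a minimal sketch of the idea: each flow is bounced off a randomly chosen intermediate switch so that any traffic matrix is spread evenly over the Clos fabric. The switch names and the hashing choice are illustrative, not the VL2 implementation.

# Minimal sketch of Valiant load balancing over a Clos fabric: every flow is
# sent via a (hash-)chosen intermediate switch, so no single link is
# overloaded regardless of the traffic matrix. Names are illustrative.

import hashlib

INTERMEDIATE_SWITCHES = [f"Int{i}" for i in range(1, 5)]

def pick_intermediate(flow_id):
    """Pick a per-flow intermediate switch; hashing keeps a flow on one path
    (avoiding packet reordering) while spreading flows uniformly."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return INTERMEDIATE_SWITCHES[digest[0] % len(INTERMEDIATE_SWITCHES)]

for flow in ("10.0.1.5:443->10.0.7.9:52114", "10.0.2.8:80->10.0.3.3:40001"):
    print(flow, "->", pick_intermediate(flow))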
26. Protocol Interactions
27. TCP Incast Collapse: Problem
Source: Nagle et al, The Panasas ActiveScale Storage Cluster – Delivering Scalable High Bandwidth Storage, SC2004
Affects key datacenter applications with barrier-synchronization boundaries, e.g. DFS, web search, MapReduce.
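A minimal sketch of why synchronized reads overflow a shallow switch buffer as the number of senders grows; the buffer, block and window sizes below are assumptions for illustration.

# Minimal sketch of the incast setup: a client issues a barrier-synchronized
# read, every server answers with its share at once, and the shared switch
# buffer overflows once the burst exceeds it. All sizes are assumptions.

SWITCH_BUFFER_BYTES = 64 * 1024      # shallow shared output-port buffer
BLOCK_BYTES = 1 * 1024 * 1024        # data block striped across the servers
MSS = 1460                           # bytes per segment

def first_burst_bytes(num_senders, init_cwnd_segments=3):
    """Bytes arriving near-simultaneously in the first synchronized burst."""
    per_sender = min(BLOCK_BYTES // num_senders, init_cwnd_segments * MSS)
    return num_senders * per_sender

for n in (2, 4, 8, 16, 32, 64):
    burst = first_burst_bytes(n)
    overflow = burst > SWITCH_BUFFER_BYTES
    print(f"{n:3d} senders: first burst {burst // 1024:4d} KB "
          f"{'-> drops, 200 ms RTO stalls, goodput collapses' if overflow else '-> fits in buffer'}")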
31. Solution: TCP with ms-RTO*
*Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication, Vasudevan et al, SIGCOMM 2009
• Little adverse effect on WAN traffic
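A rough sketch of why shrinking the minimum RTO restores goodput after an incast drop; the link speed, block size and stall count are assumptions for illustration.

# Minimal sketch: effect of the minimum retransmission timeout on goodput when
# synchronized losses force full RTO stalls. All parameters are assumptions.

LINK_GBPS = 1.0
BLOCK_BYTES = 1 * 1024 * 1024
RTO_STALLS = 2                      # assumed timeout events per block transfer

def goodput_mbps(rto_seconds):
    transfer_s = BLOCK_BYTES * 8 / (LINK_GBPS * 1e9)   # time on the wire
    total_s = transfer_s + RTO_STALLS * rto_seconds    # plus idle time waiting on RTO
    return BLOCK_BYTES * 8 / total_s / 1e6

for label, rto in (("default 200 ms RTO", 0.200), ("1 ms RTO", 0.001), ("200 us RTO", 0.0002)):
    print(f"{label:20s} -> ~{goodput_mbps(rto):7.1f} Mbps")

# With 200 ms stalls the link sits idle almost the whole time (~20 Mbps here);
# fine-grained RTOs keep goodput near line rate.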
32. Incast Collapse: an unsolved problem at scale*
*Understanding TCP Incast Throughput Collapse in Datacenter Networks, Griffith et al, WREN 2009
The solution space is complex:
• Network conditions can impact RTT
• Switch buffer management strategies
• Goodput can be unstable with load / number of senders
34. Data Center Computing
• Opportunities to realize energy efficiency, particularly in I/O sub-systems
• Data center fabrics need to be re-architected for application scalability and cost
• WAN artifacts can create bottlenecks
35. NOCs in the Data Center
• Energy efficiency: local (distributed) energy-management decisions & coordination by the NOC
• Fabric communication: the NOC can reduce intra-chip/socket communication latencies between VMs
• Congestion management: the NOC can assist in traffic orchestration across VMs