4. Background Cont.
Oversubscription:
Ratio of the worst-case achievable aggregate bandwidth among
the end hosts to the total bisection bandwidth of a particular
communication topology
(1:1 indicates that all hosts may communicate with any arbitrary
host at the full bandwidth of their interface, e.g. 1 Gb/s for commodity Ethernet)
Oversubscription lowers the total cost of the design
Typical designs: oversubscribed by a factor of 2.5:1 (400 Mbps per host) to 8:1 (125 Mbps per host)
Cost:
Edge: $7,000 for each 48-port GigE switch
Aggregation and core: $700,000 for each 128-port 10 GigE switch
Cabling costs are not considered!
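A quick back-of-the-envelope check of the numbers above (a minimal sketch; the 1 Gb/s link speed and the 2.5:1 and 8:1 ratios come from the bullets above, the helper name is ours):

```python
# Back-of-the-envelope: effective per-host bandwidth under oversubscription.
# The ratios and the 1 Gb/s host link speed come from the slide above;
# the helper name is illustrative only.

def effective_host_bandwidth_mbps(link_speed_mbps: float, oversubscription: float) -> float:
    """Worst-case bandwidth a host can expect when the fabric is oversubscribed."""
    return link_speed_mbps / oversubscription

for ratio in (1.0, 2.5, 8.0):
    bw = effective_host_bandwidth_mbps(1000, ratio)
    print(f"{ratio}:1 oversubscription -> {bw:.0f} Mb/s per host")
# 1.0:1 -> 1000 Mb/s, 2.5:1 -> 400 Mb/s, 8.0:1 -> 125 Mb/s (matches the slide)
```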
5. Current DC Network Architectures
Two high-level choices for building a communication fabric for large-scale clusters:
Leverage specialized hardware and communication protocols,
such as InfiniBand or Myrinet
– These solutions can scale to clusters of thousands of nodes with high
bandwidth
– Expensive infrastructure, incompatible with TCP/IP applications
Leverage commodity Ethernet switches and routers to
interconnect cluster machines
– Backwards compatible with existing infrastructure, low cost
– Aggregate cluster bandwidth scales poorly with cluster size, and achieving
the highest levels of bandwidth incurs a non-linear cost increase with cluster
size
6. Problems with the Common DC Topology
• Single point of failure
• Oversubscription of links higher up in the
topology
7. Properties of the Solution (Fat-Tree)
• Backwards compatible with existing
infrastructure
– No changes to applications
– Supports layer 2 (Ethernet)
• Cost effective
– Low power consumption & heat emission
– Cheap infrastructure
• Allows host communication at line speed.
8. Clos Networks/Fat-Trees
• Adopt a special instance of a Clos topology:
the “k-ary fat tree”
• A similar trend in telephone switching networks led to
topologies that deliver high aggregate bandwidth by
interconnecting many smaller commodity switches
9. Fat-Tree Based DC Architecture
• Interconnect racks (of servers) using a fat-tree topology
k-ary fat tree: three-layer topology (edge, aggregation and core)
– each pod consists of (k/2)^2 servers & 2 layers of k/2 k-port switches
– each edge switch connects to k/2 servers & k/2 aggregation switches
– each aggregation switch connects to k/2 edge & k/2 core switches
– (k/2)^2 core switches: each connects to k pods
Figure: fat-tree with k = 4
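To make the wiring rules above concrete, here is a minimal sketch that builds the switch sets and links of a k-ary fat tree and checks the counts for k = 4 (identifier names are our own, not the paper's):

```python
# Minimal sketch of k-ary fat-tree construction (identifiers are illustrative,
# not taken from the paper). Builds the switch sets and links described above.
from itertools import product

def build_fat_tree(k: int):
    assert k % 2 == 0, "k must be even"
    edge = [("edge", pod, s) for pod, s in product(range(k), range(k // 2))]
    aggr = [("aggr", pod, s) for pod, s in product(range(k), range(k // 2))]
    core = [("core", i, j) for i, j in product(range(k // 2), range(k // 2))]
    links = []
    # Each edge switch connects to k/2 hosts and to every aggregation switch in its pod.
    for pod in range(k):
        for e in range(k // 2):
            links += [(("edge", pod, e), ("host", pod, e, h)) for h in range(k // 2)]
            links += [(("edge", pod, e), ("aggr", pod, a)) for a in range(k // 2)]
    # The a-th aggregation switch of each pod connects to core switches (a, 0..k/2-1),
    # so every core switch ends up with one link per pod.
    for pod in range(k):
        for a in range(k // 2):
            links += [(("aggr", pod, a), ("core", a, j)) for j in range(k // 2)]
    return edge, aggr, core, links

edge, aggr, core, links = build_fat_tree(4)
print(len(edge), len(aggr), len(core))             # 8 8 4 for k = 4
print(sum(1 for a, b in links if b[0] == "host"))  # 16 host links, i.e. k^3/4 hosts
```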
10. Fat-Tree Based Topology
• Why Fat-Tree?
– A fat tree has identical bandwidth at any bisection
– Each layer has the same aggregate bandwidth
• Can be built using cheap devices with uniform capacity
– Each port supports the same speed as the end hosts
– All devices can transmit at line speed if packets are distributed uniformly along
available paths
• Great scalability: k-port switches support k^3/4 servers/hosts:
(k/2 hosts/switch * k/2 switches/pod * k pods)
Figure: fat-tree network with k = 6 supporting 54 hosts
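A quick check of the k^3/4 scaling formula for a few values of k (a small sketch; the k = 24 and k = 48 figures reappear on the scalability slide later):

```python
# Hosts supported by a k-ary fat tree built from k-port switches: k^3/4.
for k in (4, 6, 24, 48):
    hosts = (k // 2) * (k // 2) * k   # hosts/edge-switch * edge-switches/pod * pods
    print(f"k = {k:2d} -> {hosts} hosts")
# k = 4 -> 16, k = 6 -> 54 (the figure above), k = 24 -> 3456, k = 48 -> 27648
```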
11. The fat-tree DC meets the following goals:
• 1. Scalable interconnection bandwidth:
achieve the full bisection bandwidth of clusters consisting
of tens of thousands of nodes.
• 2. Backward compatibility. No changes to end
hosts.
• 3. Cost saving.
12. Fat-Tree (Addressing)
• Enforce a special (IP) addressing scheme in the
DC
– unused.PodNumber.SwitchNumber.EndHost
(the paper allocates addresses from the private 10.0.0.0/8 block)
– Allows hosts attached to the same switch to route
only through that switch
– Allows intra-pod traffic to stay within the pod
• Existing routing protocols such as OSPF and
OSPF-ECMP are not sufficient here:
– they concentrate traffic on a few of the available paths
– and thus fail to avoid congestion
– switches must be able to recognize and exploit the fat-tree structure quickly
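A small sketch of the addressing convention, assuming the 10.0.0.0/8 layout the paper describes (the helper functions are illustrative, not code from the paper):

```python
# Sketch of the fat-tree addressing scheme using the unused private
# 10.0.0.0/8 block. Helper names are illustrative.

def pod_switch_addr(pod: int, switch: int) -> str:
    """Pod switch addresses have the form 10.pod.switch.1."""
    return f"10.{pod}.{switch}.1"

def host_addr(pod: int, switch: int, host_id: int) -> str:
    """Hosts take 10.pod.switch.ID, with ID in [2, k/2 + 1]."""
    return f"10.{pod}.{switch}.{host_id}"

def core_switch_addr(k: int, i: int, j: int) -> str:
    """Core switches take 10.k.j.i, with i, j in [1, k/2]."""
    return f"10.{k}.{j}.{i}"

print(pod_switch_addr(pod=1, switch=0))       # 10.1.0.1
print(host_addr(pod=1, switch=0, host_id=2))  # 10.1.0.2
print(core_switch_addr(k=4, i=1, j=2))        # 10.4.2.1
```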
13. Fat-Tree (Two-Level Look-ups)
• Use two-level look-ups to distribute traffic
and maintain packet ordering:
– First level is a prefix lookup
• used to route down the topology to the hosts
– Second level is a suffix lookup
• used to route up towards the core
• maintains packet ordering by using the same port for
the same destination host
• diffuses and spreads outgoing traffic across the core
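A minimal sketch of the two-level lookup (our own simplification; the real tables live in the modified switch hardware):

```python
# Simplified two-level routing table: try longest-prefix match first; if the
# packet matches only the default /0 prefix, fall back to a suffix match on the
# host octet to pick an upward port. Illustrative only.
import ipaddress

class TwoLevelTable:
    def __init__(self, prefixes, suffixes):
        # prefixes: list of (network, port), e.g. ("10.2.0.0/24", 0)
        # suffixes: list of (host_octet, port) used for traffic going up
        self.prefixes = [(ipaddress.ip_network(n), p) for n, p in prefixes]
        self.suffixes = dict(suffixes)

    def lookup(self, dst: str) -> int:
        addr = ipaddress.ip_address(dst)
        best = None
        for net, port in self.prefixes:
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, port)
        if best is not None and best[0].prefixlen > 0:
            return best[1]                               # route down towards the host
        return self.suffixes[int(dst.split(".")[-1])]    # route up, spread by host ID

# Example: an aggregation switch in pod 2 of a k = 4 fat tree.
table = TwoLevelTable(
    prefixes=[("10.2.0.0/24", 0), ("10.2.1.0/24", 1), ("0.0.0.0/0", -1)],
    suffixes=[(2, 2), (3, 3)],
)
print(table.lookup("10.2.0.2"))  # 0: downward to edge switch 0
print(table.lookup("10.0.1.3"))  # 3: upward, chosen by the host suffix
```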
14. More on Fat-Tree DC Architecture
Diffusion Optimizations (optional dynamic
routing techniques)
1. Flow classification: define a flow as a sequence
of packets with the same source and destination; pod switches forward subsequent packets
of the same flow to the same outgoing port, and
periodically reassign a minimal number of flows to other output
ports
– Eliminates local congestion
– Assigns traffic to ports on a per-flow basis
instead of a per-host basis, ensuring fairer
distribution across flows
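A rough sketch of the flow-classification idea (heavily simplified; the placement and rebalancing policy below are illustrative stand-ins, not the paper's exact heuristic):

```python
# Simplified dynamic flow classifier: keep per-flow port assignments so that a
# flow's packets stay on one upward port, and periodically move flows off the
# most loaded port. The load-balancing policy below is an illustrative stand-in.
from collections import defaultdict

class FlowClassifier:
    def __init__(self, up_ports):
        self.up_ports = list(up_ports)
        self.flow_to_port = {}             # (src, dst) -> port
        self.port_load = defaultdict(int)  # port -> bytes seen

    def assign(self, src: str, dst: str, size: int) -> int:
        flow = (src, dst)
        if flow not in self.flow_to_port:
            # New flow: place it on the currently least-loaded upward port.
            self.flow_to_port[flow] = min(self.up_ports, key=self.port_load.__getitem__)
        port = self.flow_to_port[flow]
        self.port_load[port] += size
        return port

    def rebalance(self):
        # Periodically move one flow from the most to the least loaded port.
        hot = max(self.up_ports, key=self.port_load.__getitem__)
        cold = min(self.up_ports, key=self.port_load.__getitem__)
        for flow, port in self.flow_to_port.items():
            if port == hot:
                self.flow_to_port[flow] = cold
                break
```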
15. More on Fat-Tree DC Architecture
2. Flow scheduling: pay attention to routing large flows;
edge switches detect any outgoing flow whose size grows
above a predefined threshold, and then send a notification to a
central scheduler. The central scheduler tries to assign non-
conflicting paths to these large flows.
– Eliminates global congestion
– Prevents long-lived flows from sharing the same
links
– Assigns long-lived flows to different links
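A sketch of what the central scheduler does (the greedy core-switch choice below is our own illustration, not the paper's exact algorithm):

```python
# Toy central flow scheduler: when an edge switch reports a large flow, pick a
# core switch (i.e. a path) that no other scheduled large flow is already
# using. The data structures and greedy choice are illustrative only.

class FlowScheduler:
    def __init__(self, k: int):
        self.cores = list(range((k // 2) ** 2))   # one candidate path per core switch
        self.reserved = {}                        # (src, dst) -> core switch

    def schedule_large_flow(self, src: str, dst: str) -> int:
        busy = set(self.reserved.values())
        # Prefer a core switch that carries no other scheduled large flow.
        for core in self.cores:
            if core not in busy:
                self.reserved[(src, dst)] = core
                return core
        # All cores busy: fall back to reusing one (a real scheduler tracks link state).
        core = self.cores[hash((src, dst)) % len(self.cores)]
        self.reserved[(src, dst)] = core
        return core

sched = FlowScheduler(k=4)
print(sched.schedule_large_flow("10.0.0.2", "10.2.1.3"))  # 0
print(sched.schedule_large_flow("10.1.0.2", "10.3.1.3"))  # 1: non-conflicting core
```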
16. Fault-Tolerance
In this scheme, each switch in the network maintains a BFD
(Bidirectional Forwarding Detection) session with each of its
neighbors to determine when a link or neighboring switch fails. Two
kinds of failures can occur:
Failure between an upper-layer and a core switch
Outgoing inter-pod traffic: the local routing table marks
the affected link as unavailable and chooses another
core switch
Incoming inter-pod traffic: the core switch broadcasts a tag
to its directly connected upper-layer switches signifying its
inability to carry traffic to that entire pod; the upper
switches then avoid that core switch when assigning flows
destined to that pod
17. Fault-Tolerance
• Failure between lower- and upper-layer switches:
– Outgoing inter- and intra-pod traffic from the lower layer:
– the local flow classifier sets the link cost to infinity,
does not assign it any new flows, and chooses another
upper-layer switch
– Intra-pod traffic using the upper-layer switch as intermediary:
– the switch broadcasts a tag notifying all lower-level
switches, which check for it when assigning new
flows and avoid the failed link
– Inter-pod traffic coming into the upper-layer switch:
– it sends a tag to all its core switches signifying its inability to carry
traffic to that pod; the core switches mirror this tag to all upper-layer
switches they connect to, and those upper switches avoid the affected core
switch when assigning new flows
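A very rough sketch of the failure-handling bookkeeping (our own simplification; real switches rely on BFD timers and the broadcast tags described above):

```python
# Toy model of failure handling: mark links whose BFD session has expired as
# unavailable (cost = infinity) and skip them when picking an upward next hop.
# Purely illustrative; names and structure are not from the paper.
import math

class SwitchState:
    def __init__(self, up_links):
        self.cost = {link: 1.0 for link in up_links}

    def bfd_down(self, link):
        # Neighbor stopped answering BFD: treat the link as failed.
        self.cost[link] = math.inf

    def bfd_up(self, link):
        self.cost[link] = 1.0

    def pick_up_link(self):
        link = min(self.cost, key=self.cost.get)
        if math.isinf(self.cost[link]):
            raise RuntimeError("no healthy upward link left")
        return link

edge = SwitchState(up_links=["aggr0", "aggr1"])
edge.bfd_down("aggr0")
print(edge.pick_up_link())  # aggr1: new flows avoid the failed link
```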
19. Evaluation
• Benchmark suite of communication mappings
to evaluate the performance of the 4-port fat
tree using the TwoLevelTable switches, the
FlowClassifier and the FlowScheduler, and
compare to a hierarchical tree with a 3.6:1
oversubscription ratio
22. Conclusion
Bandwidth is the scalability bottleneck in large-scale
clusters
Existing solutions are expensive and limit cluster
size
The fat-tree topology offers scalable routing with
backward compatibility with TCP/IP and Ethernet
Large numbers of commodity switches have the
potential to displace high-end switches in the DC,
the same way clusters of commodity PCs have
displaced supercomputers for high-end
computing environments
24. Drawbacks
• Scalability
• Unable to support different traffic types
efficiently
• Simple load balancing
• Specialized hardware
• No inherent support for VLAN traffic
• Ignored connectivity to the Internet
• Waste of address space
25. Scalability
• The size of the data center is fixed once the switch port count is chosen.
– For commonly available 24-, 32-, 48- and 64-port
commodity switches, such a fat-tree structure is
limited to 3456, 8192, 27648 and 65536 hosts,
respectively (k^3/4).
• The cost of incremental deployment is
significant.
26. Unable to support different traffic
types efficiently
• The proposed structure is switch-
oriented, which might not be sufficient to
support 1-to-x traffic.
• A flow can be set up between any pair of
servers with the help of the switching
fabric, but it is difficult to set up 1-to-many and
1-to-all flows efficiently using the proposed
algorithms.
27. Specialized hardware
• This architecture still relies on a customized
routing primitive that does not exist in
commodity switches.
• To implement the two-level lookup, the lookup
routine in the router has to be modified.
– The authors achieve this by modifying a NetFPGA
router.
– This modification is yet to be incorporated into
current commodity switches.
28. No inherent support for VLAN traffic
• Agility
– The capacity to assign any server to any service
• Performance isolation
– Services should not affect each other
29. Waste of address space
• Semantic information is encoded in the IP addresses of
servers and switches.
– e.g. the 10.pod.switch.0/24 prefix per edge switch
– A large portion of the IP address space is wasted.
• NAT is required at the border.
30. Possible Future Research
• Flexible incremental expandability in data
center networks.
• Support Valiant Load Balancing or other load
balancing techniques in the network.
• Reduce cabling, or increase tolerance to mis-
wiring
31. References:
• M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center
Network Architecture," Dept. of CSE, University of California, San Diego
• pages.cs.wisc.edu/~akella/CS740/F08/DataCenters.ppt
• http://networks.cs.northwestern.edu/EECS495-s11/prez/DataCenters-defence.ppt
• HKUST (Hong Kong University of Science and Technology):
http://www.cse.ust.hk/~kaichen/courses/spring2013/comp6611/