Cost-Effective Centralized Adaptive Routing for Networks on Chip
1. A Cost-Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich*, Israel Cidon*, Avinoam Kolodny*, Isask'har (Zigi) Walter* and Shmuel Wimer#
*Technion – Israel Institute of Technology   #Bar-Ilan University
[Figure: QNoC Research Group logo]
May 2, 2011
4. Adaptive Routing in NoCs – Local vs. Global Information
[Figure: a packet routed from the upper-left to the bottom-right corner using local congestion information, compared with the same packet routed using global congestion information. Legend: low / medium / high congestion.]
5. Route Selection – ATDOR
ATDOR – Adaptive Toggle Dimension Ordered Routing.
Keep it simple! Centralized selection: for each source–destination pair, the option whose bottleneck link is less congested is preferred.
Routing tables reside in the sources; one bit per destination.
6. ATDOR Illustration 1
Five identical flows, 100 MB/s each. Initial routing – XY.
Links are modeled as M/M/1 queues. Delay of a single link:
    D_LINK = Traffic / (Capacity − Traffic)
Link capacity is 210 MB/s.
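The delay model above can be sketched in a few lines. This is a minimal illustration, assuming the reconstructed formula D_LINK = Traffic / (Capacity − Traffic); the function name and units are assumptions, not part of the original slides.

```python
# Sketch of the slide's M/M/1-style link-delay model (assumed formula:
# D_LINK = Traffic / (Capacity - Traffic); diverges near saturation).

def link_delay(traffic_mbps, capacity_mbps=210.0):
    """Relative delay of a single link under the assumed M/M/1 model."""
    if traffic_mbps >= capacity_mbps:
        return float("inf")  # saturated link: unbounded delay
    return traffic_mbps / (capacity_mbps - traffic_mbps)

# One 100 MB/s flow alone on a 210 MB/s link vs. two such flows sharing it:
print(link_delay(100.0))  # ~0.91 (lightly loaded)
print(link_delay(200.0))  # 20.0 (close to capacity: delay explodes)
```

The point of the illustration: piling two 100 MB/s flows onto one 210 MB/s link raises the per-link delay by more than an order of magnitude, which is why balancing bottleneck links matters.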
7. Centralized Routing – How?
• Option 1 – Continuous calculation of optimal routing for the active sessions. Trade-offs:
  • Achievable load balancing
  • Speed and computation complexity
  • System complexity
8. Centralized Routing – How?
• Option 2 – Iterative serial selection, based on traffic load measurements, between XY and YX for all source–destination pairs. Trade-offs:
  • Achievable load balancing
  • Speed and computation complexity
  • System complexity
10. What did we just see?
For each flow we:
1. Calculated the better route.
2. Updated the routing table of the source.
3. Waited for the update to take effect and measured the global traffic load.
Performing steps 1–3 for each flow is slow and does not scale.
Steps 2 and 3 are unified for all destinations of a single source, giving:
• Achievable load balancing
• Speed and computation complexity
• Scalability
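Step 1 above, "calculate the better route", can be sketched concretely for a 2D mesh: enumerate the links of the XY and YX dimension-ordered paths and compare their bottleneck (maximum) loads. This is a simplified sketch; the function names, the coordinate representation, and the `loads` dictionary are assumptions for illustration.

```python
# Nodes are (x, y) mesh coordinates; `loads` maps a directed link
# ((x1, y1), (x2, y2)) to its measured load.

def dor_path(src, dst, x_first=True):
    """Links visited by dimension-ordered routing (XY if x_first, else YX)."""
    links, (x, y) = [], src
    dims = ((0, dst[0]), (1, dst[1])) if x_first else ((1, dst[1]), (0, dst[0]))
    for dim, target in dims:
        while (x, y)[dim] != target:
            step = 1 if (x, y)[dim] < target else -1
            nxt = (x + step, y) if dim == 0 else (x, y + step)
            links.append(((x, y), nxt))
            x, y = nxt
    return links

def better_route(src, dst, loads):
    """Step 1: compare the bottleneck (max) link load of XY vs. YX."""
    max_xy = max((loads.get(l, 0) for l in dor_path(src, dst, True)), default=0)
    max_yx = max((loads.get(l, 0) for l in dor_path(src, dst, False)), default=0)
    return "XY" if max_xy <= max_yx else "YX"
```

One iteration of the control loop then walks over all source–destination pairs, writes the chosen one-bit result into the source's routing table, and lets the traffic settle before measuring again.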
11. Back to Illustration 1
[Table: the flow re-routed at each step of the iterative selection. Over steps 1–5 the flows 4→15, 1→15, 2→8 and 2→15 are re-routed; the average delay drops from ∞ to 22 ns.]
12. Problem #1
Changing the routing may increase congestion elsewhere and cause fluctuations.
Solution: change the routing only if the alternative is better by a margin α, 0 < α < 1:

if (Current Route = XY)
    NextRoute = YX if MAX[Load_YX] ≤ α·MAX[Load_XY], else XY
elseif (Current Route = YX)
    NextRoute = XY if MAX[Load_XY] ≤ α·MAX[Load_YX], else YX
14. Problem #2
Coupling among flows sharing the same source.
Solution: re-routing counters C_I,J count the routing changes of the flow F_I,J from source I to destination J. When C_I,J reaches a limit L_I,J, the routing of F_I,J is locked. A possible definition of the limits L_I,J:
    L_I,J = (I + J) mod 3
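The counter-and-lock mechanism can be sketched as a small state object per flow. This is an illustrative sketch only; the class and method names are assumptions, and the limit formula is the reconstructed (I + J) mod 3 from the slide.

```python
def limit(i, j):
    """Per-flow re-routing limit, reconstructed as L_I,J = (I + J) mod 3."""
    return (i + j) % 3

class FlowRerouteLock:
    """Counts routing changes of flow F_I,J and locks it once C_I,J = L_I,J."""
    def __init__(self, i, j):
        self.limit = limit(i, j)
        self.changes = 0  # C_I,J
    def try_reroute(self):
        if self.changes >= self.limit:
            return False  # routing locked: no further changes allowed
        self.changes += 1
        return True

f = FlowRerouteLock(1, 4)       # limit (1 + 4) mod 3 = 2
print(f.try_reroute())          # True  (first change)
print(f.try_reroute())          # True  (second change)
print(f.try_reroute())          # False (locked)
```

Staggering the limits across source–destination pairs breaks the symmetry between coupled flows, so flows from the same source stop reacting to each other indefinitely.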
15. Back to Illustration 2
[Table: remaining re-routing changes ("R. Changes Left") per flow — 1→16, 1→15, 2→14, 1→14 — with limits L_I,J = (I + J) mod 3. Average-delay values shown: ∞, 73 ns, 22 ns.]
16. Bringing It All Together
[Table: remaining re-routing changes per flow — 1→15, 2→8, 2→15, 4→15 — with limits L_I,J = (I + J) mod 3. Average-delay values shown: ∞, 22 ns, 14 ns.]
17. Centralized Adaptive Routing for NoCs – Architecture
• Local traffic load measurements inside the routers.
• Aggregation of the traffic load measurements into Traffic Load Maps.
• Routing control.
18. Load Measurements Aggregation
An illustration of the aggregation of load values in a 4×4 2D mesh. A congestion value is written to each traffic load map every clock cycle.
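The measurement-and-aggregation path can be sketched at a very high level: each router produces a local utilization estimate for its links, and the central unit merges the reports into one traffic load map. This is a behavioral sketch only, not the paper's hardware; all names and the windowed-utilization measurement are assumptions.

```python
def measure_load(flits_sent, window_cycles):
    """Local measurement in a router: link utilization over a sampling window."""
    return flits_sent / window_cycles

def aggregate(per_router_reports):
    """Merge per-router {link: load} reports into a single traffic load map,
    as if collected over the aggregation network each cycle."""
    traffic_load_map = {}
    for report in per_router_reports:
        traffic_load_map.update(report)
    return traffic_load_map

load_map = aggregate([
    {("0,0", "1,0"): measure_load(64, 128)},   # link at 50% utilization
    {("1,0", "2,0"): measure_load(96, 128)},   # link at 75% utilization
])
print(load_map)  # {('0,0', '1,0'): 0.5, ('1,0', '2,0'): 0.75}
```

In hardware the merge is pipelined across the mesh rather than done in one step, which is why a fresh congestion value can land in the map every clock cycle.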
19. ATDOR – Route Selection Circuit
The maximally loaded links of the two alternatives are compared. Next route:

if (Current Route = XY)
    NextRoute = YX if MAX[Load_YX] ≤ α·MAX[Load_XY], else XY
elseif (Current Route = YX)
    NextRoute = XY if MAX[Load_XY] ≤ α·MAX[Load_YX], else YX

0 < α < 1
• Combinational, pipelined implementation.
• A result every ATDOR clock cycle.
20. Hardware Requirements
The whole mechanism was implemented on a Xilinx Virtex-5 XC5VLX50T FPGA. Area was estimated for the 45 nm technology node.
Per-router hardware overhead (%) for a NoC with typical-size (50 kGate) virtual-channel routers.
21. Average Packet Delay – Uniform Traffic
• Average delay vs. average link load, normalized to link capacity. 8×8 2D mesh, uniform traffic pattern.
22. Average Packet Delay – Transpose Traffic
• Average delay vs. average link load, normalized to link capacity. 8×8 2D mesh, transpose traffic pattern.
23. Average Packet Delay – Hotspot Traffic
• Average delay vs. average link load, normalized to link capacity. 8×8 2D mesh, 4-hotspot traffic pattern.
24. Control Iteration Duration
• Number of re-routed flows vs. time.
• 8×8 2D mesh, ATDOR clock of 100 MHz.
• Two margin settings: α = 15/16 and α = 3/4.
25. CMP DNUCA – Architecture
• 8×8 CMP DNUCA (Dynamic Non-Uniform Cache Architecture) with 8 CPUs and 56 cache banks.
26. CMP DNUCA – Saturation Throughput
• Saturation throughput – Splash-2 and Parsec benchmarks on an 8×8 CMP DNUCA with 8 CPUs and 56 cache banks.
27. Conclusions
• Centralized adaptive routing is feasible for NoCs.
• ATDOR: centralized selection between XY and YX for each source–destination pair.
• Hardware overhead: <4% of a typical 8×8 NoC.
• Average saturation throughput improvement:

                                   Vs. O1TURN   Vs. RCA
  Synthetic patterns                  19.3%      12.1%
  Splash-2 and Parsec benchmarks      22.8%      12.8%