This document contains the agenda and slides for a workshop on Multi-Gbps TCP. The workshop includes presentations by researchers from Caltech, CERN, UCLA, and INRIA on high-speed networks, LHC networks, TCP/AQM protocols and their duality model, control theory and the stability of TCP/AQM, FAST simulations, and related TCP kernel projects, with discussion periods following some of the presentations.
1. Multi-Gbps TCP
9:00-10:00 Harvey Newman (Physics, Caltech)
High speed networks & grids
10:00-10:45 Sylvain Ravot (Physics, Caltech/CERN)
LHC networks and TCP Grid
10:45-11:00 Discussion
11:00-11:15 Break
11:15-12:00 Steven Low (CS/EE, Caltech)
TCP/AQM protocols and duality model
12:00-12:15 Dohy Hong (INRIA/Caltech)
Synchronization effect of TCP
12:15-1:00 Lunch
2. Multi-Gbps TCP
1:00-2:00 Fernando Paganini (EE, UCLA)
Control theory and stability of TCP/AQM
2:00-2:30 Steven Low (CS/EE, Caltech)
Stabilized Vegas (& instability of Reno/RED)
2:30-3:00 Zhikui Wang (EE, UCLA)
FAST simulations
3:00-3:15 Break
3:15-4:00 David Wei, Cheng Hu (CS, Caltech)
Some related projects and TCP kernel
4:00-4:15 Break
4:15-5:00 Discussion
4. Acknowledgment
S. Athuraliya, D. Lapsley, V. Li, Q. Yin (UMelb)
S. Adlakha (UCLA), D. Choe (Postech/Caltech), J. Doyle (Caltech), K. Kim (SNU/Caltech), L. Li (Caltech), F. Paganini (UCLA), J. Wang (Caltech)
L. Peterson, L. Wang (Princeton)
6. Window Flow Control
~ W packets per RTT
Lost packet detected by missing ACK
[Figure: source and destination timelines; W packets (1, 2, …, W) sent per RTT, data in one direction and ACKs returning in the other]
7. Source Rate
Limit the number of packets in the network to window W
Source rate = (MSS × W) / RTT bps
If W too small then rate « capacity
If W too big then rate > capacity => congestion
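To make the formula concrete, here is a small back-of-the-envelope sketch in Python (the 1500-byte MSS, 100 ms RTT and 1 Gbps target are illustrative values, not figures from the talk):

MSS = 1500 * 8           # segment size in bits (assumed 1500-byte segments)
rtt = 0.1                # round trip time in seconds (assumed 100 ms path)
target_rate = 1e9        # 1 Gbps
W = target_rate * rtt / MSS          # rate = MSS * W / RTT  =>  W = rate * RTT / MSS
print(f"window needed: {W:.0f} segments (~{W * 1500 / 1e6:.1f} MB in flight)")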
8. TCP Window Flow Controls
Receiver flow control
Avoid overloading receiver
Set by receiver
awnd: receiver (advertised) window
Network flow control
Avoid overloading network
Set by sender
Infer available network capacity
cwnd: congestion window
Set W = min (cwnd, awnd)
9. Receiver Flow Control
Receiver advertises awnd with each ACK
Window awnd
closed when data is received and ack’d
opened when data is read
Size of awnd can be the performance limit
(e.g. on a LAN)
sensible default ~16kB
10. Network Flow Control
Source calculates cwnd from indication of
network congestion
Congestion indications
Losses
Delay
Marks
Algorithms to calculate cwnd
Tahoe, Reno, Vegas, RED, REM …
11. TCP Congestion Controls
Tahoe (Jacobson 1988)
Slow Start
Congestion Avoidance
Fast Retransmit
Reno (Jacobson 1990)
Fast Recovery
Vegas (Brakmo & Peterson 1994)
New Congestion Avoidance
RED (Floyd & Jacobson 1993)
Probabilistic marking
BLUE (Feng, Kandlur, Saha, Shin 1999)
REM/PI (Athuraliya et al 2000/Hollot et al 2001)
Clear buffer, match rate
AVQ (Kunniyur & Srikant 2001)
12. TCP/AQM
Tahoe (Jacobson 1988)
Slow Start
Congestion Avoidance
Fast Retransmit
Reno (Jacobson 1990)
Fast Recovery
Vegas (Brakmo & Peterson 1994)
New Congestion Avoidance
BLUE
RED (Floyd & Jacobson 1993)
REM/PI (Athuraliya et al 2000, Hollot et al 2001)
13. TCP Tahoe (Jacobson 1988)
[Figure: window vs. time — SS followed by CA]
SS: Slow Start
CA: Congestion Avoidance
14. Slow Start
Start with cwnd = 1 (slow start)
On each new ACK, increment cwnd
cwnd ← cwnd + 1
Exponential growth of cwnd
each RTT: cwnd ← 2 x cwnd
Enter CA when cwnd >= ssthresh
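A minimal sketch of this per-RTT doubling (the ssthresh value is an illustrative assumption):

cwnd = 1.0
ssthresh = 64            # illustrative threshold, in segments
rtt = 0
while cwnd < ssthresh:
    for _ in range(int(cwnd)):   # one ACK returns per packet in flight
        cwnd += 1                # cwnd <- cwnd + 1 on each ACK
    rtt += 1
    print(f"after RTT {rtt}: cwnd = {cwnd:.0f}")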
18. Packet Loss
Assumption: loss indicates congestion
Packet loss detected by
Retransmission TimeOuts (RTO timer)
Duplicate ACKs (at least 3)
[Figure: packets 1–7 sent; acknowledgements 1, 2, 3 returned, then duplicate ACKs for 3 after a packet is lost]
19. Fast Retransmit
Waiting for a timeout takes quite long
Retransmit immediately after 3 dupACKs,
without waiting for a timeout
Adjusts ssthresh
flightsize = min(awnd, cwnd)
ssthresh ← max(flightsize/2, 2)
Enter Slow Start (cwnd = 1)
20. Successive Timeouts
When there is a timeout, double the RTO
Keep doing so for each lost retransmission
Exponential back-off
Max 64 seconds¹
Max 12 retransmits¹
¹ Net/3 BSD
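A small sketch of the exponential back-off; the 64 s cap and 12-retransmit limit are the Net/3 BSD figures quoted above, while the 1 s initial RTO is an assumed starting value:

rto = 1.0                  # assumed initial RTO (seconds)
MAX_RTO = 64.0             # cap quoted above (Net/3 BSD)
MAX_RETRANSMITS = 12       # limit quoted above (Net/3 BSD)
for attempt in range(1, MAX_RETRANSMITS + 1):
    print(f"retransmission {attempt}: RTO = {rto:.0f} s")
    rto = min(2 * rto, MAX_RTO)   # double the timer, capped at 64 s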
21. Summary: Tahoe
Basic ideas
Gently probe network for spare capacity
Drastically reduce rate on congestion
Windowing: self-clocking
Other functions: round trip time estimation, error
recovery
for every ACK {
    if (W < ssthresh) then W++      (SS)
    else W += 1/W                   (CA)
}
for every loss {
    ssthresh = W/2
    W = 1
}
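The same Tahoe rules written as a small Python sketch (window counted in segments; timers, retransmission and ACK processing are omitted):

class TahoeWindow:
    """Sketch of Tahoe's window update rules (window in segments)."""
    def __init__(self, ssthresh=16.0):
        self.W = 1.0
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.W < self.ssthresh:
            self.W += 1               # slow start: +1 segment per ACK
        else:
            self.W += 1.0 / self.W    # congestion avoidance: ~+1 per RTT

    def on_loss(self):
        self.ssthresh = self.W / 2
        self.W = 1.0                  # Tahoe goes back to slow start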
23. Fast recovery
Motivation: prevent the 'pipe' from emptying after fast retransmit
Idea: each dupACK represents a packet having
left the pipe (successfully received)
Enter FR/FR after 3 dupACKs
Set ssthresh ← max(flightsize/2, 2)
Retransmit lost packet
Set cwnd ← ssthresh + ndup (window inflation)
Wait till W=min(awnd, cwnd) is large enough; transmit
new packet(s)
On non-dup ACK (1 RTT later), set cwnd ← ssthresh
(window deflation)
Enter CA
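A sketch of the Reno FR/FR window arithmetic described above (only the cwnd/ssthresh bookkeeping; retransmission itself and awnd are omitted):

class RenoFRFR:
    """Sketch of Reno's cwnd/ssthresh changes around FR/FR (no I/O)."""
    def __init__(self, cwnd, flightsize):
        self.cwnd = cwnd
        self.flightsize = flightsize
        self.ssthresh = None

    def on_third_dupack(self):
        # Enter FR/FR: halve the flight size, then inflate by the 3 dupACKs.
        self.ssthresh = max(self.flightsize / 2, 2)
        self.cwnd = self.ssthresh + 3        # window inflation (ndup = 3)

    def on_more_dupacks(self, ndup):
        self.cwnd += ndup                    # each dupACK means a packet left the pipe

    def on_new_ack(self):
        self.cwnd = self.ssthresh            # window deflation; enter CA

With cwnd = flightsize = 8, ssthresh becomes 4 and cwnd is deflated to 4 when the non-dup ACK arrives, which is the halving effect noted in the speaker notes.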
24. Example: FR/FR
Fast retransmit: retransmit on 3 dupACKs
Fast recovery: inflate window while repairing loss to fill pipe
[Figure: sender S / receiver R timelines illustrating FR/FR: 8 packets in flight (cwnd = 8), duplicate ACKs trigger fast retransmit, the window is inflated until new packets (10, 11) go out, and FR/FR is exited with cwnd = ssthresh = 4]
25. Summary: Reno
Basic ideas
Fast recovery avoids slow start
dupACKs: fast retransmit + fast recovery
Timeout: fast retransmit + slow start
[State diagram: slow start → congestion avoidance; dupACKs lead to FR/FR; timeout leads to retransmit + slow start]
26. Active queue management
Idea: provide congestion information by
probabilistically marking packets
Issues
How to measure congestion (p and G)?
How to embed congestion measure?
How to feed back congestion info?
x(t+1) = F( p(t), x(t) )   (source: Reno, Vegas)
p(t+1) = G( p(t), x(t) )   (link: DropTail, RED, REM)
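A toy single-link illustration of this x/p iteration, in the spirit of the basic optimization flow control algorithm (the log utility, capacity and step size are illustrative choices): the source picks the rate maximizing U(x) − p·x at the current price, and the link integrates excess demand into the price.

# Toy single-link instance of the x/p iteration above (illustrative values).
# Source F: with U(x) = log x, the rate maximizing U(x) - p*x is x = 1/p.
# Link G: the price integrates excess demand, rising when x > c.
c = 10.0                   # link capacity
p = 1.0                    # initial price
gamma = 0.005              # price step size, small enough to converge
for t in range(200):
    x = 1.0 / p                          # F(p(t), x(t)): source's optimal rate
    p = max(p + gamma * (x - c), 1e-6)   # G(p(t), x(t)): price update
print(f"rate x = {x:.2f} (capacity {c}), price p = {p:.4f}")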
27. RED (Floyd & Jacobson 1993)
Congestion measure: average queue length
p_l(t+1) = [ p_l(t) + x_l(t) − c_l ]^+
Embedding: p-linear probability function
Feedback: dropping or ECN marking
[Figure: marking probability vs. average queue, increasing linearly up to 1]
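A sketch of RED's piecewise-linear marking profile and averaged queue (min_th, max_th, max_p and the averaging weight are illustrative parameters, not values from the talk):

def red_avg_queue(avg, q, w=0.002):
    """Exponentially weighted moving average of the instantaneous queue q."""
    return (1 - w) * avg + w * q

def red_mark_prob(avg, min_th=20.0, max_th=60.0, max_p=0.1):
    """Piecewise-linear RED profile: 0 below min_th, rising linearly to
    max_p at max_th, and 1 (mark/drop everything) above max_th."""
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    return max_p * (avg - min_th) / (max_th - min_th)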
30. Clear buffer and match rate
Key features
p_l(t+1) = [ p_l(t) + γ( α_l b_l(t) + x̂_l(t) − c_l ) ]^+
  "clear buffer": drive the backlog b_l to zero; "match rate": drive the input x̂_l to the capacity c_l
Marking probability: 1 − φ^(−p_l(t)); end to end the marks compose to 1 − φ^(−p^s(t)), where p^s(t) = Σ_l p_l(t) (sum prices)
Theorem (Paganini 2000)
Global asymptotic stability for general utility function (in the absence of delay)
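A single-link sketch of the REM price update and exponential marking written out from the formulas above (the γ, α and φ values are illustrative):

def rem_price(p, backlog, x_hat, c, gamma=0.01, alpha=0.1):
    """REM price update: integrate weighted backlog plus rate mismatch, so
    in equilibrium the buffer is cleared (backlog -> 0) and rate matches c."""
    return max(p + gamma * (alpha * backlog + x_hat - c), 0.0)

def rem_mark_prob(p, phi=1.1):
    """Marking probability 1 - phi^(-p); along a path the marks compose to
    1 - phi^(-sum of link prices), so the source sees the sum of prices."""
    return 1.0 - phi ** (-p)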
34. Congestion Control
Heavy tail → mice & elephants
[Figure: mice and elephants sharing the Internet]
Congestion control: efficient & fair sharing (elephants), small delay (mice)
Delay = queueing + propagation; propagation delay is addressed by CDNs
41. TCP Reno (Jacobson 1990)
[Figure: window vs. time — SS, then CA with a fast retransmission/fast recovery sawtooth]
SS: Slow Start
CA: Congestion Avoidance, with fast retransmission/fast recovery
42. TCP Vegas (Brakmo & Peterson 1994)
[Figure: window vs. time — SS, then CA converging to a steady window]
Converges, no retransmission
… provided buffer is large enough
43. Vegas
[Figure: queue size over time]
for every RTT
{ if W/RTTmin – W/RTT < α then W ++
  if W/RTTmin – W/RTT > α then W -- }
for every loss
  W := W/2
Vegas model
G_l:  p_l(t+1) = [ p_l(t) + y_l(t)/c_l − 1 ]^+
F_i:  x_i(t+1) = x_i(t) + 1/D_i²   if x_i(t) q_i(t) < α_i d_i
      x_i(t+1) = x_i(t) − 1/D_i²   if x_i(t) q_i(t) > α_i d_i
      x_i(t+1) = x_i(t)            otherwise
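A sketch of the per-RTT Vegas adjustment in the pseudocode above (window in segments; RTTmin stands in for the propagation delay estimate):

def vegas_rtt_update(W, rtt, rtt_min, alpha=2.0):
    """One Vegas adjustment per RTT: keep the gap between the expected
    rate (W/rtt_min) and the actual rate (W/rtt) close to alpha."""
    diff = W / rtt_min - W / rtt
    if diff < alpha:
        return W + 1
    if diff > alpha:
        return W - 1
    return W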
46. Model
[Figure: three sources with rates x1, x2, x3 sharing two links of capacities c1 and c2]
Network
  Links l of capacities c_l
Sources i
  L(i) – links used by source i
  U_i(x_i) – utility if source rate = x_i
Example constraints: x1 + x2 ≤ c1,  x1 + x3 ≤ c2
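In this notation, the underlying optimization (the primal utility-maximization problem that the speaker notes below refer to) can be written as:

\max_{x \ge 0} \; \sum_i U_i(x_i)
\qquad \text{subject to} \qquad
\sum_{i:\, l \in L(i)} x_i \;\le\; c_l \quad \text{for every link } l.

For the two-link example above, the constraints are exactly x1 + x2 ≤ c1 and x1 + x3 ≤ c2.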
50. Duality Model of TCP
x(t+1) = F( p(t), x(t) )
p(t+1) = G( p(t), x(t) )
Primal-dual algorithm:
  F: Reno, Vegas
  G: DropTail, RED, REM
Source algorithm iterates on rates
Link algorithm iterates on prices
With different utility functions
51. Duality Model of TCP
x(t+1) = F( p(t), x(t) )
p(t+1) = G( p(t), x(t) )
Primal-dual algorithm:
  F: Reno, Vegas
  G: DropTail, RED, REM
(x*, p*) primal-dual optimal if and only if
  y_l* ≤ c_l, with equality if p_l* > 0
52. Duality Model of TCP
x(t+1) = F( p(t), x(t) )
p(t+1) = G( p(t), x(t) )
Primal-dual algorithm:
  F: Reno, Vegas
  G: DropTail, RED, REM
Any link algorithm that stabilizes the queue
  generates Lagrange multipliers
  solves the dual problem
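For reference, a standard form of the dual problem mentioned here (following the optimization flow control paper listed on the Papers slide), with q_i = Σ_{l ∈ L(i)} p_l the end-to-end price seen by source i:

D(p) \;=\; \sum_i \max_{x_i \ge 0} \big( U_i(x_i) - x_i q_i \big) \;+\; \sum_l c_l\, p_l,
\qquad
\min_{p \ge 0} D(p).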
57. Performance
Delay
  Congestion measure: end-to-end queueing delay
  Sets rate x_s(t) = α_s d_s / q_s(t)
  Equilibrium condition: Little's Law (each source keeps α_s d_s packets buffered in the network)
Loss
  No loss if converge (with sufficient buffer)
  Otherwise: revert to Reno (loss unavoidable)
60. Persistent congestion
Vegas exploits buffer process to compute prices
(queueing delays)
Persistent congestion due to
Coupling of buffer & price
Error in propagation delay estimation
Consequences
Excessive backlog
Unfairness to older sources
Theorem (Low, Peterson, Wang ’02)
A relative error of εs in propagation delay estimation
distorts the utility function to
Û_s(x_s) = α_s d_s (1 + ε_s) log x_s + ε_s d_s x_s
61. Validation (L. Wang, Princeton)
Single link, capacity = 6 pkt/ms, αs = 2 pkts/ms, ds = 10 ms
With finite buffer: Vegas reverts to Reno
[Figure: two panels — without estimation error, with estimation error]
69. Linearized model
δx_i = − ( a_i / (T_i s + a_i) ) · ( x_i / q_i ) · δq_i
δp_l = ( γ / c_l ) · δy_l
(gain η_i defined in terms of a_i, x_i and T_i; γ controls equilibrium delay)
[Block diagram: TCP sources F_1 … F_N produce rates x; forward routing Rf(s) maps x to link rates y; AQM links G_1 … G_L produce prices p; backward routing Rb'(s) maps p to end-to-end prices q fed back to the sources]
70. Stability
Theorem (Choe & Low, '02)
Locally asymptotically stable if
  (link queueing delay) / (round trip time) > sin(a_i ω_c) / (a_i ω_c) > 0.63
i.e. the queueing delay at the bottleneck must exceed roughly 63% of the round trip time
Cannot be satisfied with > 1 bottleneck link!
71. Stabilized Vegas
x_i(t+1) = x_i(t) + ( 2η_i / (π T_i²(t)) ) · tan⁻¹( α_i d_i − x_i(t) q_i(t) ) − κ q_i(t)
x_i(t+1) = x_i(t) + ( 2η_i / (π T_i²(t)) ) · tan⁻¹( α_i d_i − x_i(t) q_i(t) )
(the first form includes the additional κ q_i(t) term)
73. Linearized model
δx_i = − b_i · ( s + μ a_i ) / ( s + a_i ) · δq_i
δp_l = ( γ / c_l ) · δy_l
[Block diagram: TCP (F_1 … F_N), forward routing Rf(s), AQM (G_1 … G_L), backward routing Rb'(s); signals x, y, q, p]
γ controls equilibrium delay; choose a_i = a, α_i = μ
75. Stability
Theorem (Choe & Low, '02)
Locally asymptotically stable if
  σ_M(a, μ) < (round trip time) / (round trip queueing delay)
Application
  Stabilized TCP with current routers
  Queueing delay as congestion measure has the right scaling
  Incremental deployment with ECN
76. Papers
A duality model of TCP flow controls (ITC, Sept 2000)
Optimization flow control, I: basic algorithm &
convergence (ToN, 7(6), Dec 1999)
Understanding Vegas: a duality model (J. ACM, 2002)
Scalable laws for stable network congestion
control (CDC, 2001)
REM: active queue management (Network, May/June 2001)
netlab.caltech.edu
87. Stability condition
Theorem: TCP/RED stable if a condition relating the RED gain ρ and averaging weight β, the round trip delay τ, the capacity c, and the number of sources N is satisfied
[Nyquist plot: ω = 0 … ω = ∞]
TCP: small τ, small c, large N
RED: small ρ, large delay
92. Role of AQM
(Sufficient) stability condition: K · co{ H(ω); θ } must stay away from −1
TCP: contributes ( c²τ / 2N² ) · e^(−jωτ) / ( jω + w*p* )
AQM: scale down K & shape H
93. Role of AQM
K · co{ H(ω); θ }
TCP: ( c²τ / 2N² ) · e^(−jωτ) / ( jω + w*p* )
RED: ( α ρ τ c / (1 − β) ) · e^(−jωθ) / ( 1 + jω α τ c )
REM/PI: a PI-shaped term in ω (parameters a_1, a_2) with delay e^(−jωθ)
Queue dynamics (windowing)
Problem!!
94. Role of AQM
To stabilize a given TCP F_i with least cost:
  min_{p(·) ≥ 0}  ∫₀^∞ ( xᵀ(t) Q x(t) + pᵀ(t) R p(t) ) dt   subject to  ẋ = F(x, p)
Linearized F, single link, identical sources
Any AQM solves an optimal control problem with appropriate Q and R
  p*(t) = −( k₁ b(t) + k₂ y(t) + k₃ ẏ(t) )
95. Role of AQM
To stabilize a given TCP F_i with least cost:
  min_{p(·) ≥ 0}  ∫₀^∞ ( xᵀ(t) Q x(t) + pᵀ(t) R p(t) ) dt   subject to  ẋ = F(x, p)
Linearized F, single link, identical sources
  p*(t) = −( k₁ b(t) + k₂ y(t) + k₃ ẏ(t) )
  k₁ = 0: nonzero buffer
  k₃ = 0: slower decay
Editor's Notes
Title: Duality model of TCP
ITC Specialist Seminars, Sept 18-20, 2000 Monterey CA
CCW15, Oct 15-18, 2000, Captiva Island, FL
UCLA, EE Dept, February 7, 2001 (with update)
UC Berkeley, EECS Dept, February 8, 2001
NISS (National Institute of Statistical Sciences), Research Triangle Park, NC, March 9-10, 2001
DIMACS, Rutgers University, NJ, March 26-27, 2001
Boston University, CS Dept, June 20, 2001 (John Byers)
Bell Labs, Holmdel, June 21, 2001 (Murali Kodialam & TV Lakshman)
Title: Equilibrium & Dynamics of TCP/AQM
IMA Workshop on “Mathematical challenges in large-scale networks”, Minneapolis, MN, August 6-7, 2001
Allerton Conference, Monticello, IL, Oct 3-5, 2001
Purdue University, ECE (Ness Shroff), Oct, 2001
ISI/USC, Nov 29, 2001 (Joe Bannister)
UCRiverside, EE, Nov 30, 2001 (Yingbo Hua)
USC, EE, Jan 24, 2002 (Daniel Lee)
Stanford, EE, Feb 1, 2002 (Balaji Prabhakar, Kelly)
IPAM (Institute for Pure & Applied Mathematics), UCLA, March 18, 2002
RPI, EE, March 27, 2002 (Shiv)
Berkeley, CS, April 29, 2002 (Umesh Vazirani)
Infocom 2003 (subset), June 25, 2002 (NYCity, NY)
CIMMS (Center for Integrative Multiscale Modeling and Simulation) Workshop on Networks, Optimization and Duality, July 1, 2002 Caltech (Jerry Marsden)
Add Stabilized Vegas
Add TCP/shortest-path routing
After FR/FR, when CA is entered, cwnd is half of the window at the time the loss was detected.
So the effect of a loss is to halve the window.
[Source: RFC 2581, Fall & Floyd, “Simulation based Comparison of Tahoe, Reno, and SACK TCP”]
Extensive research over (almost) the past decade has established that wherever you look and whatever you measure, you see heavy tails on the Internet and at web servers. The implication for us is the 20-80 rule: 20% of the files generate 80% of Internet traffic. 20-80 may not be the exact breakdown, but heavy tails do suggest a simple picture in which there are two types of files, usually referred to as mice and elephants. While most TCP connections are mice, most packets belong to elephants.
Mice and elephants have different requirements. Elephants desire large average bandwidth while mice desire small delay. The goal of end-to-end congestion control is efficient and fair sharing of network bandwidth among elephants, but the way we perform congestion control also has strong implications for the queueing delay experienced by mice. Ideally, we would control the elephants to maximally utilize the network in a way that leaves the network queues mostly empty, so that mice can fly through the network without much queueing delay. As we will see later, the current protocol, Reno with DropTail, does exactly the wrong thing: it maximizes the queueing delay (and loss) for mice.
We believe that if we do our congestion control right, then there should not be much loss or queueing delay. Delay then depends mainly on propagation delay. To reduce that, we need content distribution networks.
The idea of this talk is well captured by Richard Gibbens' talk yesterday, where we can regard flow control as an interaction of source rates at the hosts and some kind of congestion measures in the network. Example congestion measures include packet loss in Reno, end-to-end queueing delay (excluding propagation delay) in Vegas, queue length in RED, and a quantity we call price in REM, a scheme which we will come back to later in the talk.
When source rates increase, this pushes up the congestion measures in the network, say losses in Reno, which in turn suppress the source rates, and the cycle repeats. Duality theory then suggests that we interpret source rates as primal variables, the congestion measures as dual variables, and the flow control process that iterates on these variables as a primal-dual algorithm to maximize aggregate source utility.
The rest of the talk is to make this intuition a bit more precise.
So we have the primal problem that maximizes aggregate utility subject to capacity constraints. Its dual problem is to minimize a certain function we call D(p). We call the vector p the congestion measure and ….
The primal dual algorithm takes the following general form where the source rate vector x(t+1) in the next period is a certain function F of the current price vector p(t) and source rates x(t), and the price p(t+1) is a certain function G of the current prices and rates.
So the source algorithm is this F that iterates on prices and rates to produce a new rate, and the link algorithm is this G that iterates on prices and rates to produce a new price.
In TCP, F is carried out by Reno and Vegas, they are just different F’s. G is carried out by queue managements, active or inactive, such as DropTail, RED, or REM, they are just different G’s.
The different pairs (F, G) all try to solve the primal problem, but with different utility functions. That’s all.