Congestion control algorithms, particularly in TCP, operate through a sequence of phases to efficiently manage the rate at which data is sent into the network, aiming to prevent network collapse due to oversubscription. The three main phases are Slow Start, Congestion Avoidance, and Fast Recovery.
| Phase | Growth Type | cwnd Increase | Triggered By | Goal |
| Slow Start | Exponential | MSS per ACK | New Connection / Timeout | Rapidly probe capacity |
| Congestion Avoidance | Additive (Linear) | MSS per RTT | Sustain flow, cautiously probe higher capacity | |
| Fast Recovery | Additive (Linear) | MSS per duplicate ACK | Triple Duplicate ACK (TDA) | Quick recovery from single loss without Slow Start |
Slow Start (SS)
The Slow Start phase is used when a connection is initially established or after a timeout has occurred. Its primary goal is to quickly find a significant portion of the available network capacity without overloading it initially.
- Mechanism: The sender starts with a small congestion window (cwnd), typically 1 Maximum Segment Size (MSS) or 2−4 MSS, depending on the implementation.
- Growth: For every Acknowledgment (ACK) received, the cwnd is increased by 1 MSS.
- Effect: Since multiple ACKs are received for the segments sent in the previous round, the cwnd effectively doubles with every Round Trip Time (RTT). This exponential growth is rapid but carefully controlled by the incoming ACKs.
- Exit Condition: Slow Start continues until the cwnd reaches a predefined threshold, known as the slow start threshold (ssthresh), at which point the algorithm transitions to Congestion Avoidance. This ssthresh is often initialized to a very large value or set to half of the cwnd that caused the last known congestion event (a segment loss).
Congestion Avoidance (CA)
Once the cwnd reaches ssthresh, the algorithm enters Congestion Avoidance. This phase switches to a more conservative, linear growth to probe for additional available bandwidth more carefully.
- Mechanism: The cwnd is increased by 1 MSS for every RTT (or, equivalently, by cwndMSS2 for every ACK received).
- Growth: The growth is additive and linear. This slow, steady increase helps prevent overshooting the network's capacity and incurring immediate congestion losses.
- Goal: To sustain a high-rate flow while carefully attempting to use any remaining available capacity.
- Trigger for Transition/Exit:
- Loss detected via Timeout: This indicates severe congestion. The algorithm sets ssthresh to half of the current cwnd, sets cwnd to 1 MSS, and returns to the Slow Start phase.
- Loss detected via Triple Duplicate ACKs (TDAs): This indicates moderate congestion and triggers a transition to the Fast Recovery phase.
Fast Recovery (FR)
Fast Recovery is activated when the sender receives three duplicate ACKs (TDAs). This indicates that a single segment has been lost, but subsequent segments have reached the receiver, suggesting the congestion is moderate and the network isn't completely saturated. This phase attempts to recover from the loss quickly without resorting to Slow Start.
- Entry/Initial Actions (on TDA reception):
- Set ssthresh: The threshold is set to 2cwnd.
- Set cwnd: The congestion window is set to ssthresh+3 MSS (where 3 accounts for the 3 segments already successfully received and acknowledged by the TDAs).
- Resend: The segment presumed lost (indicated by the TDAs) is immediately retransmitted (Fast Retransmit).
- During Fast Recovery:
- For each additional duplicate ACK received, the cwnd is increased by 1 MSS. This "inflates" the window to account for the segment that just left the network (the one the TDA is acknowledging).
- The sender continues to transmit new segments if the inflated cwnd allows.
- Exit Condition (Success):
- When a new ACK arrives (acknowledging the retransmitted segment and all intermediate segments), the loss is considered recovered.
- The cwnd is then set back to the current value of ssthresh, and the algorithm returns to the Congestion Avoidance phase. This is because the TDA indicated the network was only moderately congested.
Congestion Control Phases
| Phase | Trigger | Behavior | cwnd Growth |
|---|---|---|---|
| Slow Start | Connection start or after RTO | cwnd starts small (1–10 MSS), doubles each RTT | Exponential |
| Congestion Avoidance | cwnd ≥ ssthresh | cwnd increases linearly (1 MSS per RTT) | Linear |
| Fast Retransmit / Recovery | 3 duplicate ACKs | Resend lost packet, reduce cwnd | cwnd = cwnd/2 |
| Timeout (RTO) | No ACK within timeout | Retransmit all outstanding, reset cwnd | Reset to 1 MSS |
🧠 How Each Algorithm Phase Works Internally
🐣 Slow Start
When: New connection or after timeout recovery.
Goal: Quickly discover available capacity.
🧮 Growth formula: cwnd_next = cwnd + 1 * (ACKs received)
| Step | Action | Kernel Function | Effect |
|---|---|---|---|
| 1️⃣ | Initialize cwnd = 1 MSS | tcp_init_cwnd() | Conservative start |
| 2️⃣ | For each ACK → cwnd += 1 MSS | tcp_cong_avoid_ai() | Exponential growth |
| 3️⃣ | When cwnd ≥ ssthresh → move to Congestion Avoidance | tcp_slow_start() transition | Switch to linear |
🚶 Congestion Avoidance
When: cwnd ≥ ssthresh
Goal: Steady growth, avoid overloading the network.
🧮 Growth formula (Reno-like): cwnd_next = cwnd + MSS * (MSS / cwnd)
| Step | Action | Kernel Function | Effect |
|---|---|---|---|
| 1️⃣ | Increase cwnd slowly | tcp_cong_avoid_ai() | Linear growth per RTT |
| 2️⃣ | cwnd += MSS * (MSS / cwnd) | Implemented in tcp_reno_cong_avoid() | Approximates 1 MSS per RTT |
| 3️⃣ | Continue until packet loss or congestion indication | — | Then trigger recovery |
⚠️ Packet Loss Detection
- Detection Paths: Timeout (RTO) → implies severe congestion.
- 3 Duplicate ACKs → implies single packet loss, network still working.
| Type | Action | Function | Result |
|---|---|---|---|
| Timeout | tcp_retransmit_timer() | Reduces ssthresh = cwnd / 2, resets cwnd = 1 MSS | Re-enter Slow Start |
| 3 Dup ACKs | tcp_fastretrans_alert() | Triggers Fast Retransmit | cwnd reduced but not fully reset |
🚑 Fast Recovery
When: Loss detected by 3 duplicate ACKs (no timeout).
Goal: Avoid stopping transmission completely; maintain partial throughput.
| Step | Action | Kernel Function | Behavior |
|---|---|---|---|
| 1️⃣ | Retransmit missing segment | tcp_retransmit_skb() | Immediate resend |
| 2️⃣ | Temporarily reduce cwnd | tcp_enter_recovery() | cwnd = ssthresh |
| 3️⃣ | Continue sending new segments if allowed | tcp_try_to_open() | Keeps pipeline full |
| 4️⃣ | When recovery ACKed → exit recovery → Congestion Avoidance | tcp_end_cwnd_reduction() | Resume normal growth |
📈 Interaction Diagram (Simplified Timeline)
🧭 Key Kernel Components Mapping
| Layer | Component | Example Function | Description |
|---|---|---|---|
| Application | Socket syscalls | sys_sendmsg(), sys_recvmsg() | Data entry & exit to/from kernel |
| TCP Layer | Congestion control logic | tcp_sendmsg(), tcp_ack(), tcp_cong_avoid() | Manages cwnd, retransmits, ACKs |
| IP Layer | Routing & IP header | ip_queue_xmit() | Determines route, adds IP header |
| Network Device Layer | Queue & send to NIC | dev_queue_xmit(), ndo_start_xmit() | Push to device driver |
| Driver Layer | NIC driver ops | netdev_ops | Manages DMA and ring buffers |
| RX Path | NAPI poll loop | napi_poll(), netif_receive_skb() | Receives packets and calls protocol handlers |
| ACK Handling | TCP receive path | tcp_input.c (tcp_ack(), tcp_clean_rtx_queue()) | Updates cwnd and retransmission state |
🧠 What It Shows:
- Slow Start = exponential growth (fast ramp-up)
- Congestion Avoidance = linear growth (steady pace)
- Loss = congestion signal (cut speed)
- Fast Recovery = retransmit + partial resume
- Cycle repeats as network conditions change
It visually shows Slow Start → Congestion Avoidance → Loss → Fast Recovery → Back to Avoidance.
📈 TCP Congestion Window (cwnd) Growth Timeline
🧠 How to Read It:
- At first (Slow Start) → cwnd doubles rapidly (1, 2, 4, 8, 16...).
- After reaching ssthresh → cwnd grows slowly, linearly (17, 18, 19...).
- When packet loss happens → cwnd drops sharply (network congestion).
- Fast Recovery → cwnd recovers partially, not from zero.
- Congestion Avoidance resumes → steady slow increase again.
“TCP starts fast to test the road, slows down when it gets crowded, and backs off after a jam — then cautiously speeds up again.”
Initial Values
| Parameter | Server | Client |
|---|---|---|
| MSS | 1460 B | 1380 B |
| Window Size | 65160 | 65535 |
| Window Scale | 512 | 32 |
| Effective rwnd | 33,364,992 B (~22,840 MSS) | 2,097,120 B (~1518 MSS) |
| Initial cwnd | 1 MSS (1460 B) | — |
| Initial ssthresh | 16 MSS (example) | — |
| TCP Algorithm | CUBIC / Reno | — |
Example Table Snapshot
| RTT | Event / Phase | cwnd (MSS) | cwnd_next (MSS) | rwnd (MSS, client) | rwnd_next | ssthresh (MSS) | Notes |
|---|---|---|---|---|---|---|---|
| 0 | Initial | 1 | 1 | 1518 | 1518 | 16 | Start connection |
| 1 | Slow Start | 1 | 2 | 1518 | 1518 | 16 | ACK for 1 MSS → cwnd += 1 MSS |
| 2 | Slow Start | 2 | 4 | 1518 | 1518 | 16 | ACKs for 2 MSS → cwnd doubles |
| 3 | Slow Start | 4 | 8 | 1518 | 1518 | 16 | cwnd doubles |
| 4 | Slow Start | 8 | 16 | 1518 | 1518 | 16 | cwnd doubles → reaches ssthresh |
| 5 | Congestion Avoidance | 16 | 17 | 1518 | 1518 | 16 | cwnd ≥ ssthresh → linear growth |
| 6 | Congestion Avoidance | 17 | 18 | 1518 | 1518 | 16 | cwnd += 1 MSS |
| 7 | Congestion Avoidance | 18 | 19 | 1518 | 1518 | 16 | cwnd += 1 MSS |
| 8 | Loss via 3 Dup ACKs | 19 | 22 | 1518 | 1518 | 9 | Fast Retransmit → cwnd = ssthresh+3=12 MSS |
| 9 | Fast Recovery | 12 | 13 | 1518 | 1518 | 9 | Each DUP ACK → cwnd += 1 MSS |
| 10 | Fast Recovery | 13 | 9 | 1518 | 1518 | 9 | ACK for retransmitted segment → cwnd = ssthresh |
| 11 | Congestion Avoidance | 9 | 10 | 1518 | 1518 | 9 | Resume linear growth |
| 12 | Congestion Avoidance | 10 | 11 | 1518 | 1518 | 9 | Linear growth continues |
| 13 | Congestion Avoidance | 11 | 12 | 1518 | 1518 | 9 | Linear growth continues |
| 14 | Timeout | 12 | 1 | 1518 | 1518 | 6 | Timeout → cwnd = 1 MSS, ssthresh = cwnd/2=6 MSS |
| 15 | Slow Start | 1 | 2 | 1518 | 1518 | 6 | Start slow start after timeout |
| 16 | Slow Start | 2 | 4 | 1518 | 1518 | 6 | cwnd doubles |
| 17 | Slow Start | 4 | 8 | 1518 | 1518 | 6 | cwnd doubles |
| 18 | Congestion Avoidance | 8 | 9 | 1518 | 1518 | 6 | cwnd ≥ ssthresh → linear growth |
| 19 | Congestion Avoidance | 9 | 10 | 1518 | 1518 | 6 | Linear growth continues |
| 20 | Congestion Avoidance | 10 | 11 | 1518 | 1518 | 6 | Linear growth continues |
✅ Key Points in This Table
- Slow Start: exponential (cwnd doubles per RTT)
- Linear Growth / Congestion Avoidance: cwnd += 1 MSS per RTT
- Fast Retransmit / Fast Recovery: cwnd temporarily inflated, then reset to ssthresh
- Timeout: cwnd reset to 1 MSS, ssthresh halved
🧩 TCP Parameter Adjustments Table
| Event / Phase | cwnd (Current Window) | cwnd_next (Next Window) | rwnd (Receiver Window) | rwnd_next | ssthresh | Description / Notes |
|---|---|---|---|---|---|---|
| Slow Start (Exponential Growth) | 1 MSS → N MSS | cwnd_next = cwnd + 1 MSS per ACK | Advertised by receiver | rwnd_next = rwnd | Unchanged | Each ACK doubles effective cwnd; very fast growth. |
| Switch to Linear / Congestion Avoidance | cwnd ≥ ssthresh | cwnd_next = cwnd + (1 MSS * MSS / cwnd) ≈ +1 MSS per RTT | rwnd | rwnd_next = rwnd | Unchanged | Linear growth; increase 1 MSS per RTT to avoid congestion. |
| Timeout (RTO) | cwnd | cwnd_next = 1 MSS | rwnd | rwnd_next = rwnd | ssthresh = max(cwnd / 2, 2 MSS) | Conservative reaction to packet loss; restart slow start. |
| 3 Duplicate ACKs (Fast Retransmit / Fast Recovery) | cwnd | cwnd_next = ssthresh + 3 MSS (fast recovery) | rwnd | rwnd_next = rwnd | ssthresh = max(cwnd / 2, 2 MSS) | Retransmit lost packet immediately; cwnd reduced to ssthresh for controlled recovery. |
| Linear Growth (Congestion Avoidance) | cwnd ≥ ssthresh | cwnd_next = cwnd + 1 MSS per RTT | rwnd | rwnd_next = rwnd | ssthresh | Conservative linear increase; prevents new congestion. |
| Exponential Growth (Slow Start) | cwnd < ssthresh | cwnd_next = cwnd + 1 MSS per ACK (doubles per RTT) | rwnd | rwnd_next = rwnd | ssthresh | Aggressive ramp-up of cwnd to probe available bandwidth. |
🔑 Notes on Terms:
- cwnd = congestion window (sender-side limit on unacked data)
- cwnd_next = cwnd for next RTT or after processing ACK/loss
- rwnd = receiver’s advertised window (receiver buffer limit)
- rwnd_next = updated rwnd (usually unchanged, unless receiver buffer changes)
- ssthresh = slow start threshold; boundary between exponential → linear growth
- Timeout = no ACK received within RTO → conservative reset
- 3 Dup ACKs = fast retransmit signal for isolated packet loss
ASCII Timeline Diagram
Check TCP congestion control algorithm
sysctl net.ipv4.tcp_congestion_control
Example output: net.ipv4.tcp_congestion_control = cubic
This shows which algorithm (CUBIC, Reno, BBR, etc.) is used. Each algorithm calculates ssthresh differently.
Check default initial cwnd and ssthresh (via ip route / sysctl)
Linux does not expose ssthresh directly, because it is per-connection.
But you can check TCP buffer defaults:
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
Example Output:
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
Format: min default max (in bytes)
default = default buffer size → indirectly limits cwnd and ssthresh
No comments:
Post a Comment