Sunday, October 26, 2025

🧠 How Each Algorithm Phase Works InternallyCongestion Control Phases - How Each Algorithm Phase Works Internally

Congestion control algorithms, particularly in TCP, operate through a sequence of phases to efficiently manage the rate at which data is sent into the network, aiming to prevent network collapse due to oversubscription. The three main phases are Slow Start, Congestion Avoidance, and Fast Recovery.

 

PhaseGrowth Typecwnd IncreaseTriggered ByGoal
Slow StartExponential MSS per ACKNew Connection / TimeoutRapidly probe capacity
Congestion AvoidanceAdditive (Linear) MSS per RTTSustain flow, cautiously probe higher capacity
Fast RecoveryAdditive (Linear) MSS per duplicate ACKTriple Duplicate ACK (TDA)Quick recovery from single loss without Slow Start

Slow Start (SS)

The Slow Start phase is used when a connection is initially established or after a timeout has occurred. Its primary goal is to quickly find a significant portion of the available network capacity without overloading it initially.

  • Mechanism: The sender starts with a small congestion window (cwnd), typically 1 Maximum Segment Size (MSS) or 2−4 MSS, depending on the implementation.
  • Growth: For every Acknowledgment (ACK) received, the cwnd is increased by 1 MSS.
  • Effect: Since multiple ACKs are received for the segments sent in the previous round, the cwnd effectively doubles with every Round Trip Time (RTT). This exponential growth is rapid but carefully controlled by the incoming ACKs.
  • Exit Condition: Slow Start continues until the cwnd reaches a predefined threshold, known as the slow start threshold (ssthresh), at which point the algorithm transitions to Congestion Avoidance. This ssthresh is often initialized to a very large value or set to half of the cwnd that caused the last known congestion event (a segment loss). 

Congestion Avoidance (CA)

Once the cwnd reaches ssthresh, the algorithm enters Congestion Avoidance. This phase switches to a more conservative, linear growth to probe for additional available bandwidth more carefully.

  • Mechanism: The cwnd is increased by 1 MSS for every RTT (or, equivalently, by cwndMSS2​ for every ACK received).
  • Growth: The growth is additive and linear. This slow, steady increase helps prevent overshooting the network's capacity and incurring immediate congestion losses.
  • Goal: To sustain a high-rate flow while carefully attempting to use any remaining available capacity.
  • Trigger for Transition/Exit:
    • Loss detected via Timeout: This indicates severe congestion. The algorithm sets ssthresh to half of the current cwnd, sets cwnd to 1 MSS, and returns to the Slow Start phase.
    • Loss detected via Triple Duplicate ACKs (TDAs): This indicates moderate congestion and triggers a transition to the Fast Recovery phase. 

Fast Recovery (FR)

Fast Recovery is activated when the sender receives three duplicate ACKs (TDAs). This indicates that a single segment has been lost, but subsequent segments have reached the receiver, suggesting the congestion is moderate and the network isn't completely saturated. This phase attempts to recover from the loss quickly without resorting to Slow Start.

  • Entry/Initial Actions (on TDA reception):
    • Set ssthresh: The threshold is set to 2cwnd​.
    • Set cwnd: The congestion window is set to ssthresh+3 MSS (where 3 accounts for the 3 segments already successfully received and acknowledged by the TDAs).
    • Resend: The segment presumed lost (indicated by the TDAs) is immediately retransmitted (Fast Retransmit).
  • During Fast Recovery:
    • For each additional duplicate ACK received, the cwnd is increased by 1 MSS. This "inflates" the window to account for the segment that just left the network (the one the TDA is acknowledging).
    • The sender continues to transmit new segments if the inflated cwnd allows.
  • Exit Condition (Success):
    • When a new ACK arrives (acknowledging the retransmitted segment and all intermediate segments), the loss is considered recovered.
    • The cwnd is then set back to the current value of ssthresh, and the algorithm returns to the Congestion Avoidance phase. This is because the TDA indicated the network was only moderately congested. 

  Congestion Control Phases

PhaseTriggerBehaviorcwnd Growth
Slow StartConnection start or after RTOcwnd starts small (1–10 MSS), doubles each RTTExponential
Congestion Avoidancecwnd ≥ ssthreshcwnd increases linearly (1 MSS per RTT)Linear
Fast Retransmit / Recovery3 duplicate ACKsResend lost packet, reduce cwndcwnd = cwnd/2
Timeout (RTO)No ACK within timeoutRetransmit all outstanding, reset cwndReset to 1 MSS

 🧠 How Each Algorithm Phase Works Internally

🐣 Slow Start

When: New connection or after timeout recovery.
Goal: Quickly discover available capacity.

🧮 Growth formula: cwnd_next = cwnd + 1 * (ACKs received) 

StepActionKernel FunctionEffect
1️⃣Initialize cwnd = 1 MSStcp_init_cwnd()Conservative start
2️⃣For each ACK → cwnd += 1 MSStcp_cong_avoid_ai()Exponential growth
3️⃣When cwnd ≥ ssthresh → move to Congestion Avoidancetcp_slow_start() transitionSwitch to linear

🚶 Congestion Avoidance

When: cwnd ≥ ssthresh
Goal: Steady growth, avoid overloading the network.

🧮 Growth formula (Reno-like): cwnd_next = cwnd + MSS * (MSS / cwnd) 

StepActionKernel FunctionEffect
1️⃣Increase cwnd slowlytcp_cong_avoid_ai()Linear growth per RTT
2️⃣cwnd += MSS * (MSS / cwnd)Implemented in tcp_reno_cong_avoid()Approximates 1 MSS per RTT
3️⃣Continue until packet loss or congestion indicationThen trigger recovery

⚠️ Packet Loss Detection

  • Detection Paths: Timeout (RTO) → implies severe congestion.
  • 3 Duplicate ACKs → implies single packet loss, network still working.
TypeActionFunctionResult
Timeouttcp_retransmit_timer()Reduces ssthresh = cwnd / 2, resets cwnd = 1 MSSRe-enter Slow Start
3 Dup ACKstcp_fastretrans_alert()Triggers Fast Retransmitcwnd reduced but not fully reset

 🚑 Fast Recovery

 When: Loss detected by 3 duplicate ACKs (no timeout).
Goal: Avoid stopping transmission completely; maintain partial throughput.

StepActionKernel FunctionBehavior
1️⃣Retransmit missing segmenttcp_retransmit_skb()Immediate resend
2️⃣Temporarily reduce cwndtcp_enter_recovery()cwnd = ssthresh
3️⃣Continue sending new segments if allowedtcp_try_to_open()Keeps pipeline full
4️⃣When recovery ACKed → exit recovery → Congestion Avoidancetcp_end_cwnd_reduction()Resume normal growth

 📈 Interaction Diagram (Simplified Timeline)

 

 🧭 Key Kernel Components Mapping

LayerComponentExample FunctionDescription
ApplicationSocket syscallssys_sendmsg(), sys_recvmsg()Data entry & exit to/from kernel
TCP LayerCongestion control logictcp_sendmsg(), tcp_ack(), tcp_cong_avoid()Manages cwnd, retransmits, ACKs
IP LayerRouting & IP headerip_queue_xmit()Determines route, adds IP header
Network Device LayerQueue & send to NICdev_queue_xmit(), ndo_start_xmit()Push to device driver
Driver LayerNIC driver opsnetdev_opsManages DMA and ring buffers
RX PathNAPI poll loopnapi_poll(), netif_receive_skb()Receives packets and calls protocol handlers
ACK HandlingTCP receive pathtcp_input.c (tcp_ack(), tcp_clean_rtx_queue())Updates cwnd and retransmission state

 

  

 

 🧠 What It Shows:

  • Slow Start = exponential growth (fast ramp-up)
  • Congestion Avoidance = linear growth (steady pace)
  • Loss = congestion signal (cut speed)
  • Fast Recovery = retransmit + partial resume
  • Cycle repeats as network conditions change

 It visually shows Slow Start → Congestion Avoidance → Loss → Fast Recovery → Back to Avoidance.

 📈 TCP Congestion Window (cwnd) Growth Timeline

 

 🧠 How to Read It:

  • At first (Slow Start) → cwnd doubles rapidly (1, 2, 4, 8, 16...).
  • After reaching ssthresh → cwnd grows slowly, linearly (17, 18, 19...).
  • When packet loss happens → cwnd drops sharply (network congestion).
  • Fast Recovery → cwnd recovers partially, not from zero.
  • Congestion Avoidance resumes → steady slow increase again.

 “TCP starts fast to test the road, slows down when it gets crowded, and backs off after a jam — then cautiously speeds up again.”

 Initial Values

ParameterServerClient
MSS1460 B1380 B
Window Size6516065535
Window Scale51232
Effective rwnd33,364,992 B (~22,840 MSS)2,097,120 B (~1518 MSS)
Initial cwnd1 MSS (1460 B)
Initial ssthresh16 MSS (example)
TCP AlgorithmCUBIC / Reno

 

Example Table Snapshot  

RTTEvent / Phasecwnd (MSS)cwnd_next (MSS)rwnd (MSS, client)rwnd_nextssthresh (MSS)Notes
0Initial111518151816Start connection
1Slow Start121518151816ACK for 1 MSS → cwnd += 1 MSS
2Slow Start241518151816ACKs for 2 MSS → cwnd doubles
3Slow Start481518151816cwnd doubles
4Slow Start8161518151816cwnd doubles → reaches ssthresh
5Congestion Avoidance16171518151816cwnd ≥ ssthresh → linear growth
6Congestion Avoidance17181518151816cwnd += 1 MSS
7Congestion Avoidance18191518151816cwnd += 1 MSS
8Loss via 3 Dup ACKs1922151815189Fast Retransmit → cwnd = ssthresh+3=12 MSS
9Fast Recovery1213151815189Each DUP ACK → cwnd += 1 MSS
10Fast Recovery139151815189ACK for retransmitted segment → cwnd = ssthresh
11Congestion Avoidance910151815189Resume linear growth
12Congestion Avoidance1011151815189Linear growth continues
13Congestion Avoidance1112151815189Linear growth continues
14Timeout121151815186Timeout → cwnd = 1 MSS, ssthresh = cwnd/2=6 MSS
15Slow Start12151815186Start slow start after timeout
16Slow Start24151815186cwnd doubles
17Slow Start48151815186cwnd doubles
18Congestion Avoidance89151815186cwnd ≥ ssthresh → linear growth
19Congestion Avoidance910151815186Linear growth continues
20Congestion Avoidance1011151815186Linear growth continues

 

✅ Key Points in This Table

  • Slow Start: exponential (cwnd doubles per RTT)
  • Linear Growth / Congestion Avoidance: cwnd += 1 MSS per RTT
  • Fast Retransmit / Fast Recovery: cwnd temporarily inflated, then reset to ssthresh
  • Timeout: cwnd reset to 1 MSS, ssthresh halved

 🧩 TCP Parameter Adjustments Table

Event / Phasecwnd (Current Window)cwnd_next (Next Window)rwnd (Receiver Window)rwnd_nextssthreshDescription / Notes
Slow Start (Exponential Growth)1 MSS → N MSScwnd_next = cwnd + 1 MSS per ACKAdvertised by receiverrwnd_next = rwndUnchangedEach ACK doubles effective cwnd; very fast growth.
Switch to Linear / Congestion Avoidancecwnd ≥ ssthreshcwnd_next = cwnd + (1 MSS * MSS / cwnd) ≈ +1 MSS per RTTrwndrwnd_next = rwndUnchangedLinear growth; increase 1 MSS per RTT to avoid congestion.
Timeout (RTO)cwndcwnd_next = 1 MSSrwndrwnd_next = rwndssthresh = max(cwnd / 2, 2 MSS)Conservative reaction to packet loss; restart slow start.
3 Duplicate ACKs (Fast Retransmit / Fast Recovery)cwndcwnd_next = ssthresh + 3 MSS (fast recovery)rwndrwnd_next = rwndssthresh = max(cwnd / 2, 2 MSS)Retransmit lost packet immediately; cwnd reduced to ssthresh for controlled recovery.
Linear Growth (Congestion Avoidance)cwnd ≥ ssthreshcwnd_next = cwnd + 1 MSS per RTTrwndrwnd_next = rwndssthreshConservative linear increase; prevents new congestion.
Exponential Growth (Slow Start)cwnd < ssthreshcwnd_next = cwnd + 1 MSS per ACK (doubles per RTT)rwndrwnd_next = rwndssthreshAggressive ramp-up of cwnd to probe available bandwidth.

 🔑 Notes on Terms:

  • cwnd = congestion window (sender-side limit on unacked data)
  • cwnd_next = cwnd for next RTT or after processing ACK/loss
  • rwnd = receiver’s advertised window (receiver buffer limit)
  • rwnd_next = updated rwnd (usually unchanged, unless receiver buffer changes)
  • ssthresh = slow start threshold; boundary between exponential → linear growth
  • Timeout = no ACK received within RTO → conservative reset
  • 3 Dup ACKs = fast retransmit signal for isolated packet loss

 

ASCII Timeline Diagram

Check TCP congestion control algorithm

sysctl net.ipv4.tcp_congestion_control

Example output: net.ipv4.tcp_congestion_control = cubic 

This shows which algorithm (CUBIC, Reno, BBR, etc.) is used. Each algorithm calculates ssthresh differently. 

Check default initial cwnd and ssthresh (via ip route / sysctl)

Linux does not expose ssthresh directly, because it is per-connection.
But you can check TCP buffer defaults: 

sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem

Example Output: 

net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304

Format: min default max (in bytes)
default = default buffer size → indirectly limits cwnd and ssthresh

No comments:

Post a Comment