Saturday, October 25, 2025

Congestion control technical jargon

What is the Receiver Window (rwnd)?

The Receiver Window (rwnd) is the amount of free buffer space the receiving TCP stack currently has available for incoming data.

  • Mechanism: The receiver constantly advertises its current rwnd size to the sender using the Window Size field in the TCP header of every ACK segment it sends back.
  • Purpose: The rwnd acts as a limit on the amount of unacknowledged data the sender is allowed to have "in flight" at any given time.
  • Location: The rwnd is maintained by the receiving end of the connection.

Unacknowledged Data

This refers to the data segments that the sender has transmitted but has not yet received a confirmation (ACK) for from the receiver.

  • When the sender transmits a packet, that data is temporarily stored in the sender's buffer.
  • It remains in the sender's buffer (and is considered "unacknowledged") until the receiver sends an ACK indicating that its TCP stack has successfully received that sequence of bytes.

Data In Flight

This is a common term in networking that means the same thing as "unacknowledged data" (or sometimes slightly more specifically, the amount of unacknowledged data currently residing in the network path, which is often called the Flight Size).

  • It is data that has left the sender's machine and is currently traveling across the network, sitting in router queues, or waiting in the receiver's buffer before being processed by the application.
  • It's data that is "out there" and for which the sender is waiting for a positive acknowledgement. 

"The rwnd acts as a limit..."

This is the key point for flow control. The sender must obey the receiver's advertised window.

  • The Rule: The sender calculates the amount of unacknowledged data it currently has "in flight" and must ensure that this amount never exceeds the current rwnd value advertised by the receiver.
    • Bytes In Flight ≤ min(Congestion Window, Receiver Window)
  • The Purpose: This mechanism prevents the sender from sending data faster than the receiver can read it from its buffer. If the sender ignored the rwnd, it would flood the receiver's limited buffer space, causing the receiver to drop packets, which wastes network bandwidth and triggers unnecessary retransmissions. 

The rwnd is the receiver's way of shouting back to the sender, "I have this much room left! Don't send more than that total amount of data until I tell you I've cleared some space!" This is the mechanism for Flow Control. 

The Role of rwnd in Flow Control

While the Congestion Window (cwnd) limits transmission based on the network's capacity (congestion control), the Receiver Window (rwnd) limits transmission based on the receiver's processing capacity and available memory (flow control).

Flow Control Rule:

  • The TCP sender must adhere to the following rule when determining how much data to send:
    • Effective Window = min(cwnd, rwnd)
The sender will never send more data than the smaller of the two values.

Scenario | Limiting Factor | Focus
If cwnd < rwnd | Congestion Window (cwnd) | The sender is limited by what the network can handle. (Congestion Control)
If rwnd < cwnd | Receiver Window (rwnd) | The sender is limited by what the receiver can handle. (Flow Control)
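
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not how any real stack is written; the function names and the byte values are assumptions chosen for the example.

```python
def usable_window(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still transmit right now (never negative)."""
    effective_window = min(cwnd, rwnd)
    return max(0, effective_window - bytes_in_flight)


def limiting_factor(cwnd: int, rwnd: int) -> str:
    """Name which mechanism is currently capping the sender."""
    if cwnd < rwnd:
        return "congestion control (cwnd)"
    if rwnd < cwnd:
        return "flow control (rwnd)"
    return "both (cwnd == rwnd)"


# Illustrative values: 10 segments of cwnd, 6 segments already in flight.
print(usable_window(cwnd=14600, rwnd=65535, bytes_in_flight=8760))  # 5840
print(limiting_factor(cwnd=14600, rwnd=65535))  # congestion control (cwnd)
```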

 Window Size Management:

  • Data Arrives: When data arrives, the receiver places it into its buffer, and the available rwnd decreases.
  • Data Read: When the application layer reads the data from the buffer, that buffer space is freed up, and the available rwnd increases.
  • Feedback: The receiver includes the new, updated rwnd value in the next ACK it sends to the sender.

If the receiver's buffer fills up completely, the rwnd is advertised as zero. This is known as Zero Window, and it stops the sender from transmitting new data until the application reads the buffered data, and the receiver can advertise a non-zero rwnd again.
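
The window bookkeeping described above can be modeled with a toy receive buffer. This is a simplified sketch: the class name `ReceiveBuffer` and its methods are hypothetical, and a real receiver also deals with sequence numbers, window scaling, and window probes.

```python
class ReceiveBuffer:
    """Toy model of a TCP receive buffer and the rwnd it would advertise."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffered = 0  # bytes received but not yet read by the application

    @property
    def rwnd(self) -> int:
        return self.capacity - self.buffered

    def on_data(self, nbytes: int) -> int:
        """Data arrives: the buffer fills and the rwnd shrinks."""
        self.buffered += min(nbytes, self.rwnd)
        return self.rwnd  # value carried in the next ACK

    def on_app_read(self, nbytes: int) -> int:
        """Application reads: space is freed and the rwnd grows."""
        self.buffered -= min(nbytes, self.buffered)
        return self.rwnd


buf = ReceiveBuffer(capacity=4096)
print(buf.on_data(3000))      # 1096
print(buf.on_data(2000))      # 0 (Zero Window: sender must stop)
print(buf.on_app_read(1024))  # 1024 (window reopens)
```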

 What is the Congestion Window (cwnd)?

The cwnd, or Congestion Window, is a crucial state variable in TCP (Transmission Control Protocol) used by a sender to limit the amount of unacknowledged data it can transmit into the network before receiving an acknowledgment (ACK).

The primary purpose of the cwnd is to perform congestion control, which is a set of algorithms TCP uses to prevent network overload. 

How the cwnd Works

  • Sender-Side Limit: The cwnd is maintained by the sending host and is a local estimate of how much capacity is currently available on the path to the receiver without causing congestion (like router buffer overflow).
  • The Transmission Limit: A TCP sender's window size—the maximum amount of data it can have "in flight" (sent but not yet acknowledged)—is determined by the minimum of two values:
    • Sender Window Size = min(Congestion Window (cwnd), Receiver Window (rwnd))
    • The Receiver Window (rwnd) is advertised by the receiver for flow control (preventing the sender from overwhelming the receiver's buffer), while the cwnd is for congestion control (preventing network congestion). 
  • Dynamic Adjustment: The cwnd size changes dynamically based on network feedback:
    • Increase (Probing for Capacity): The cwnd is increased when the sender receives ACKs, which implicitly signal that the network is capable of handling the current data rate. This increase is typically exponential (during the Slow Start phase) and then linear (during the Congestion Avoidance phase).
    • Decrease (Reacting to Congestion): The cwnd is reduced when the sender detects packet loss (often via a timeout or receiving duplicate ACKs), which is interpreted as a sign of network congestion. This is a multiplicative decrease to rapidly reduce the offered load on the network. 
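
The growth and reduction pattern above can be sketched with a per-RTT model. This is a deliberately coarse illustration: cwnd is counted in whole segments, a loss simply halves the window, and the names `next_cwnd` and `simulate` are hypothetical. Real stacks update cwnd per ACK and use more elaborate algorithms.

```python
def next_cwnd(cwnd: int, ssthresh: int, loss: bool) -> tuple[int, int]:
    """Return (cwnd, ssthresh) for the next RTT, both in segments."""
    if loss:
        ssthresh = max(cwnd // 2, 2)  # multiplicative decrease
        return ssthresh, ssthresh
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh  # slow start: doubles per RTT
    return cwnd + 1, ssthresh  # congestion avoidance: +1 segment per RTT


def simulate(rtts: int, loss_at: set[int], ssthresh: int = 16) -> list[int]:
    """cwnd at the start of each RTT; loss_at holds the RTTs that see a loss."""
    cwnd, history = 1, []
    for rtt in range(rtts):
        history.append(cwnd)
        cwnd, ssthresh = next_cwnd(cwnd, ssthresh, rtt in loss_at)
    return history


print(simulate(rtts=10, loss_at={7}))
# [1, 2, 4, 8, 16, 17, 18, 19, 9, 10]
```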

Duplicate ACK (DUP ACK):

  • Sent by the TCP Receiver.
  • A receiver sends a Duplicate ACK when it receives a segment that is out of order, indicating a gap in the sequence numbers (i.e., a segment is missing).
  • The ACK number in the duplicate ACK is the sequence number of the missing segment (the next expected byte).
  • The purpose is to quickly notify the sender of the apparent loss, without waiting for the retransmission timer to expire.

Fast Retransmission (Fast Retransmit):

  • Performed by the TCP Sender.
  • The sender triggers a "Fast Retransmit" when it receives a certain number of identical Duplicate ACKs (typically three, meaning one original ACK and three duplicates, or four ACKs with the same acknowledgment number).
  • Upon receiving the third duplicate ACK, the sender retransmits the missing segment (the one indicated by the ACK number in the DUP ACKs) immediately, without waiting for the retransmission timeout. 
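
The sender-side counting can be sketched as follows. This is a minimal illustration that looks only at the ACK number; the class name `DupAckDetector` is hypothetical, and a real sender applies further checks before counting an ACK as a duplicate (for example, that the ACK carries no data and that there is outstanding data).

```python
DUP_ACK_THRESHOLD = 3  # the usual fast-retransmit threshold


class DupAckDetector:
    """Counts identical ACK numbers and signals when to fast-retransmit."""

    def __init__(self) -> None:
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no: int) -> bool:
        """Return True exactly when the third duplicate ACK arrives."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            return self.dup_count == DUP_ACK_THRESHOLD
        self.last_ack = ack_no  # new cumulative ACK: reset the counter
        self.dup_count = 0
        return False


detector = DupAckDetector()
for ack in (1000, 2000, 2000, 2000, 2000):
    if detector.on_ack(ack):
        print(f"fast retransmit segment starting at byte {ack}")
```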

tcp.analysis.retransmission (Timeout-Based Loss)

This filter identifies a Retransmission Timeout (RTO) loss event, which is the more severe type of loss detection.

  • Mechanism: The sender sends a segment and starts its internal RTO timer. If the sender does not receive an ACK for that data segment before the timer expires, it assumes the packet (or the ACK) is lost and retransmits the segment.
  • Resulting Action (Congestion Control): When an RTO occurs, the TCP stack assumes severe congestion.
    • The congestion window (cwnd) is reset to 1 MSS.
    • The slow start threshold (ssthresh) is set to half of the previous cwnd.
    • The connection enters the Slow Start phase to probe the network cautiously.
  • Trace Signature: A large, often increasing, delay (the RTO value) between the original packet and the retransmitted packet. The RTO starts at a relatively high value (often 1 second or 3 seconds) and doubles with each subsequent timeout (exponential backoff). 

tcp.analysis.fast_retransmission (Duplicate-ACK Based Loss)

This filter identifies a Fast Retransmit loss event, which is a quicker, less severe recovery mechanism.

  • Mechanism: The sender receives three or more duplicate ACKs for the same data segment. A duplicate ACK means the receiver got data out-of-order and is informing the sender of the highest sequence number received in-order. Three duplicates strongly suggest a lost packet without having to wait for the RTO timer.
  • Resulting Action (Congestion Control): When a Fast Retransmit occurs, the TCP stack assumes moderate congestion.
    • The slow start threshold (ssthresh) is set to half of the current cwnd (or cwnd/2).
    • The missing segment is retransmitted immediately (Fast Retransmit).
    • The connection enters Fast Recovery, typically setting cwnd to ssthresh+3 MSS (for the "inflated" cwnd state).
  • Trace Signature: The retransmitted packet appears immediately after the third duplicate ACK is received, which is much faster than an RTO.

Feature | tcp.analysis.retransmission | tcp.analysis.fast_retransmission
Detection | RTO timer expires (no ACK received). | Three or more Duplicate ACKs received.
Severity | Severe Congestion (assumes network saturation). | Moderate Congestion (assumes a single dropped packet).
Response | Slow Start / Exponential Backoff. | Fast Retransmit / Fast Recovery.
Action | Resets cwnd to 1 MSS. | cwnd is halved (set to ssthresh), then inflated.
Timing | Long delay (seconds) that doubles. | Immediate retransmission.
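
The two congestion responses can be put side by side in code. This is a sketch in bytes with an assumed MSS of 1460. The floor of 2 MSS on ssthresh comes from RFC 5681 rather than from the text above, and the function names are hypothetical.

```python
MSS = 1460  # assumed segment size in bytes


def on_rto(cwnd: int) -> tuple[int, int]:
    """Timeout: severe response. Returns (new_cwnd, new_ssthresh)."""
    ssthresh = max(cwnd // 2, 2 * MSS)
    return 1 * MSS, ssthresh  # restart from Slow Start


def on_fast_retransmit(cwnd: int) -> tuple[int, int]:
    """Three duplicate ACKs: moderate response. Returns (new_cwnd, new_ssthresh)."""
    ssthresh = max(cwnd // 2, 2 * MSS)
    return ssthresh + 3 * MSS, ssthresh  # "inflated" cwnd during Fast Recovery


cwnd = 20 * MSS
print(on_rto(cwnd))              # (1460, 14600)
print(on_fast_retransmit(cwnd))  # (18980, 14600)
```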

The key to differentiating between a retransmission that is part of a Fast Recovery (after a Fast Retransmit) and a retransmission due to an RTO (Timeout) lies in two primary factors: the preceding packets and the time delay.

This is the most definitive way to tell them apart in a packet trace.

Differentiating by Preceding Packets  

Type of Retransmission | Preceding Packets to Look For
RTO Retransmission | You will see NO traffic from the receiver (the host that should be acknowledging the data) for the entire duration of the Retransmission Timeout (RTO). The communication simply stops, and after a long silence, the sender resends the data. This implies that the ACKs were likely lost, or the original data segment was lost and the receiver never got it.
Fast Retransmission | You must see three or more identical Duplicate ACKs immediately before the retransmitted data packet. These are the explicit trigger for the Fast Retransmit. The duplicate ACKs prove that the receiver is still active, has received subsequent data (out-of-order), and is requesting the missing segment.

Differentiating by Timing and RTO Value

The time difference between the original segment and the retransmitted segment clearly separates the two. 

Type of Retransmission | Time Delay | RTO Value in Trace | Congestion Action
RTO Retransmission | Long and Exponentially Increasing | The delay will be equal to the current calculated RTO value, and it doubles with each successive timeout. | cwnd reset to 1 MSS (Slow Start).
Fast Retransmission | Immediate | The delay will be very short, far below the RTO. It is dictated only by the time it took for the third Duplicate ACK to arrive at the sender. | cwnd is halved (Fast Recovery).

Practical Validation in Wireshark

Fast Retransmission: Filter for tcp.analysis.duplicate_ack. Look for a packet labeled "Fast Retransmission" immediately following the 3rd duplicate ACK. The time delta will be tiny.

RTO Retransmission: Filter for tcp.analysis.retransmission && !tcp.analysis.fast_retransmission (that is, retransmissions excluding the fast ones). The retransmitted packet's TCP analysis detail will show a long delay, and you can confirm no intervening packets were sent by the receiver.

In summary, a Fast Retransmission is a quick correction based on clear feedback (Duplicate ACKs), while an RTO Retransmission is a last resort based on silence and the expiration of a defensive timer.
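
The two checks (preceding packets, then time delay) can be expressed as a rough heuristic. This is only a sketch for reasoning about a trace, not Wireshark's own classification logic. The `Retransmission` record and its fields are hypothetical, and the 0.2 s default is the Linux minimum RTO of 200 ms, used here as an assumed lower bound.

```python
from dataclasses import dataclass


@dataclass
class Retransmission:
    dup_acks_before: int  # identical duplicate ACKs seen just before it
    delay_s: float        # seconds since the original segment was sent


def classify(r: Retransmission, min_rto_s: float = 0.2) -> str:
    """Apply the preceding-packets check first, then the timing check."""
    if r.dup_acks_before >= 3:
        return "fast retransmission"
    if r.delay_s >= min_rto_s:
        return "RTO retransmission"
    return "unclassified"


print(classify(Retransmission(dup_acks_before=3, delay_s=0.03)))
print(classify(Retransmission(dup_acks_before=0, delay_s=1.0)))
```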

🧠 Round-Trip Time (RTT) 

RTT is the time it takes for a packet to go from sender → receiver → back (ACK).

RTT = Time when ACK received − Time when segment was sent. 

  • The RTT is the basic measurement of network delay for a single data segment.
  • Definition: The time duration from when a TCP sender transmits a data segment until it receives the corresponding acknowledgment (ACK) from the receiver.
  • Purpose: It measures the propagation delay between the two hosts plus any processing time in the network path.
  • Snapshot: It's a raw, single-measurement value that is constantly fluctuating based on network conditions (congestion, route changes, queueing delays, etc.).

⚙️ Smoothed Round-Trip Time (SRTT)

Because individual RTT samples fluctuate (jitter, bursty traffic), TCP doesn’t react to every spike.
Instead, it maintains a smoothed average, called SRTT, using an exponential weighted moving average (EWMA).

SRTT = (1 − α) * SRTT + α * RTT_sample, where α = 1/8 (0.125) (default in Linux).

👉 That means new samples slightly adjust the average but don’t completely replace it.
It “smooths out” temporary spikes in delay.

The SRTT is a weighted average of the measured RTT values.
  • Definition: It's an exponentially weighted moving average (EWMA) of the RTT measurements. It's an estimate of the "normal" RTT for the connection.
  • Purpose: To smooth out temporary spikes in RTT measurements and provide a stable base for estimating the timeout value. It makes the TCP stack less reactive to brief network hiccups.
  • Formula (General Form): SRTT = (1 − α) × SRTT_old + α × RTT_new
    • α (alpha) is the smoothing factor (typically 1/8 or 0.125).
    • A low α means the SRTT changes slowly.
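
A short numeric sketch shows how little a single spike moves the average when α = 1/8. The sample values are made up for illustration and `update_srtt` is a hypothetical helper name.

```python
ALPHA = 1 / 8  # smoothing factor


def update_srtt(srtt: float, sample: float, alpha: float = ALPHA) -> float:
    """One EWMA step: keep (1 - alpha) of the old estimate, blend in the sample."""
    return (1 - alpha) * srtt + alpha * sample


srtt = 100.0
for sample in (100, 100, 300, 100):  # one spike to 300 ms
    srtt = update_srtt(srtt, sample)
    print(f"sample={sample:>3} ms -> SRTT={srtt:.2f} ms")
# The 300 ms spike lifts SRTT only to 125.00 ms, and it then decays to 121.88 ms.
```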

📉 RTT Variance (RTTVAR)

RTTVAR (RTT Variance) measures how much RTT values fluctuate (jitter).
It’s a smoothed estimate of the deviation between recent RTT samples and the SRTT.

 RTTVAR = (1 − β) * RTTVAR + β * |SRTT − RTT_sample|, where β = 1/4 (0.25) by default.

👉 If network delay becomes unstable, RTTVAR increases.
👉 If RTT samples are steady, RTTVAR decreases.

The RTTVAR measures how much the RTT is fluctuating around the SRTT average.
  • Definition: It is an EWMA of the deviation (or variance) between the measured RTT and the SRTT.
  • Purpose: To estimate the volatility of the network connection. A high RTTVAR means the RTT is unstable (variable latency), requiring a larger safety margin for the RTO.
  • Formula (General Form): RTTVAR = (1 − β) × RTTVAR_old + β × |SRTT − RTT_new|
    • β (beta) is the gain for the variance (typically 1/4 or 0.25).

⏱️  Calculating RTO (Retransmission Timeout) 

TCP calculates RTO using both SRTT and RTTVAR: RTO = SRTT + 4 * RTTVAR

This ensures the timeout dynamically adapts:

  • If RTT is stable → smaller RTO (faster retransmits).
  • If RTT is fluctuating → larger RTO (avoid false retransmissions).

Calculate the Actual RTO

The actual value used by the sender can be validated by measuring the time difference between the original segment and its retransmission in the packet trace.

Note: RFC 6298 also says the initial RTO (before any sample) should be 1.0 s. After the first sample, the above formulas apply. Some stacks clamp RTO to a minimum (Linux uses TCP_RTO_MIN = 200 ms).

Formulas used (RFC-style)

  • α = 1/8 = 0.125
  • β = 1/4 = 0.25
  • For the first RTT sample M:
    • SRTT = M
    • RTTVAR = M / 2
  • For each subsequent sample M:
    • D = |SRTT_old − M|
    • RTTVAR = (1 − β) * RTTVAR_old + β * D
    • SRTT = (1 − α) * SRTT_old + α * M
  • RTO = SRTT + 4 * RTTVAR 
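
The formulas above translate almost line for line into a small estimator. This is a minimal sketch in milliseconds. The class name `RtoEstimator` is hypothetical, and the clock-granularity term and any minimum or maximum clamps are deliberately left out.

```python
class RtoEstimator:
    """RFC 6298 style SRTT / RTTVAR / RTO tracking (no clamping)."""

    ALPHA = 1 / 8
    BETA = 1 / 4
    K = 4
    INITIAL_RTO_MS = 1000.0

    def __init__(self) -> None:
        self.srtt = None
        self.rttvar = None
        self.rto = self.INITIAL_RTO_MS  # used until the first sample arrives

    def on_sample(self, m: float) -> float:
        """Feed one RTT sample M (ms) and return the new RTO (ms)."""
        if self.srtt is None:
            self.srtt = m
            self.rttvar = m / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - m)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * m
        self.rto = self.srtt + self.K * self.rttvar
        return self.rto


est = RtoEstimator()
for m in (100, 120, 110):
    print(m, est.on_sample(m))
# 100 300.0
# 120 272.5
# 110 238.4375
```

Updating RTTVAR before SRTT matters: the deviation has to be measured against the estimate that existed before the new sample was blended in.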

⚙️  Default Initial RTO (Linux / RFC Standard)

According to RFC 6298 (the current standard):  Initial RTO = 1 second (1000 ms)
Linux follows this RFC. The initial RTO itself is an internal kernel timer, initialized to 1 second for new connections, and is not exposed under /proc. The related retry limits can be read there:

cat /proc/sys/net/ipv4/tcp_syn_retries
cat /proc/sys/net/ipv4/tcp_retries1
cat /proc/sys/net/ipv4/tcp_retries2

Example Timeline

Step | Event | RTT sample M (ms) | Calculation (intermediate) | SRTT (ms) | RTTVAR (ms) | RTO_raw (ms) | Notes
0 | SYN sent (no RTT yet) | n/a | No sample yet | n/a | n/a | 1000.00 | Initial RTO before any RTT measurement = 1.0 s (RFC 6298)
1 | First sample | 100 | SRTT = M = 100; RTTVAR = M/2 = 50 | 100.00 | 50.00 | 300.00 | First sample; initial RTO computation
2 | Second sample | 120 | RTTVAR = 0.75×50 + 0.25×|100−120| = 42.50; SRTT = 0.875×100 + 0.125×120 = 102.50 | 102.50 | 42.50 | 272.50 | Slight RTT increase
3 | Third sample | 110 | RTTVAR = 0.75×42.50 + 0.25×|102.50−110| = 33.75; SRTT = 0.875×102.50 + 0.125×110 = 103.44 | 103.44 | 33.75 | 238.44 | RTT drops slightly
4 | Fourth sample (spike) | 200 | RTTVAR = 0.75×33.75 + 0.25×|103.44−200| = 49.45; SRTT = 0.875×103.44 + 0.125×200 = 115.51 | 115.51 | 49.45 | 313.32 | RTT spike increases RTO
5 | Fifth sample | 140 | RTTVAR = 0.75×49.45 + 0.25×|115.51−140| = 43.21; SRTT = 0.875×115.51 + 0.125×140 = 118.57 | 118.57 | 43.21 | 291.42 | RTO adjusts downward after spike
6 | Sixth sample | 150 | RTTVAR = 0.75×43.21 + 0.25×|118.57−150| = 40.27; SRTT = 0.875×118.57 + 0.125×150 = 122.50 | 122.50 | 40.27 | 283.57 | Slight RTT increase
7 | Seventh sample | 130 | RTTVAR = 0.75×40.27 + 0.25×|122.50−130| = 32.08; SRTT = 0.875×122.50 + 0.125×130 = 123.44 | 123.44 | 32.08 | 251.74 | RTT drops moderately
8 | Eighth sample (spike) | 170 | RTTVAR = 0.75×32.08 + 0.25×|123.44−170| = 35.70; SRTT = 0.875×123.44 + 0.125×170 = 129.26 | 129.26 | 35.70 | 272.05 | Spike; RTO rises again
9 | Ninth sample | 160 | RTTVAR = 0.75×35.70 + 0.25×|129.26−160| = 34.46; SRTT = 0.875×129.26 + 0.125×160 = 133.10 | 133.10 | 34.46 | 270.94 | Slight decrease in RTT
10 | Tenth sample | 120 | RTTVAR = 0.75×34.46 + 0.25×|133.10−120| = 29.12; SRTT = 0.875×133.10 + 0.125×120 = 131.46 | 131.46 | 29.12 | 247.94 | RTT drop; RTO decreases smoothly

RTO_raw = SRTT + 4 × RTTVAR. Values are carried at full precision and shown rounded to two decimals.

📖 RFC 6298 Rule Summary

 Here’s exactly what happens: 

Step | Event | Variable | Value / Rule
1 | New connection | SRTT | Undefined (no samples yet)
2 | New connection | RTTVAR | Undefined (no samples yet)
3 | Initial RTO | RTO | 1 second (default)
4 | First RTT sample arrives | SRTT | SRTT = RTT_sample (first smoothed estimate)
5 | First RTT sample arrives | RTTVAR | RTTVAR = RTT_sample / 2 (first deviation estimate)
6 | New RTO | RTO | RTO = SRTT + 4 × RTTVAR

So the kernel starts conservatively — waits 1 second before retransmitting the very first packet if no ACK is seen.

⏱️ Backoff Behavior 

If a retransmission is needed and still no ACK is received, the kernel doubles the RTO each time (exponential backoff): 

Attempt | RTO (seconds)
1st | 1.0
2nd | 2.0
3rd | 4.0
4th | 8.0
... | up to system-defined max (usually 120s)
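
The doubling with a cap can be sketched as below. The defaults mirror the values in the table (1.0 s initial RTO, 120 s maximum), and `backoff_schedule` is a hypothetical helper name.

```python
def backoff_schedule(initial_rto_s: float = 1.0,
                     attempts: int = 8,
                     max_rto_s: float = 120.0) -> list[float]:
    """RTO used for each successive attempt: doubles, then sticks at the cap."""
    rto = initial_rto_s
    schedule = []
    for _ in range(attempts):
        schedule.append(min(rto, max_rto_s))
        rto *= 2
    return schedule


print(backoff_schedule())
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0]
```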

 This prevents flooding a congested network.

 There are four common reasons for packet re-transmission

  • The lack of an acknowledgement that data has been received within a reasonable time.
  • The sender discovering that transmission was unsuccessful.
  • The receiver notifying the sender that expected data hasn't been received.
  • The receiver discovering that data has been damaged during initial transmission. 

If no acknowledgment for the data arrives before TCP's retransmission timer expires, the segment is retransmitted. Retransmitting an unacknowledged packet multiple times is the default behavior of the Linux kernel. The following OS parameters decide how many retransmission attempts are made and how long the gaps between them are.

  • TCP_RTO_MIN (200 ms)
  • TCP_RTO_MAX (120 seconds)
  • tcp_retries1 (3)
  • tcp_retries2 (15)


Refer to this link for more details:

https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html

TCP retransmits an unacknowledged packet up to tcp_retries2 times (a sysctl setting that defaults to 15), using an exponential backoff in which each retransmission timeout is between TCP_RTO_MIN (200 ms) and TCP_RTO_MAX (120 seconds). Once the 15th retry expires (by default), the TCP stack notifies the layers above (i.e., the application) of a broken connection.
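
A back-of-the-envelope calculation shows how long this takes in the best case. The sketch assumes the RTO starts at TCP_RTO_MIN (200 ms), which is a lower bound: a connection with a larger measured RTO gives up later. The function name `time_until_giveup` is hypothetical.

```python
def time_until_giveup(rto_s: float = 0.2,
                      retries: int = 15,
                      rto_max_s: float = 120.0) -> float:
    """Seconds from the original send until the connection is declared broken."""
    total = 0.0
    timeout = rto_s
    # One timeout before the first retransmission, plus one after each retry.
    for _ in range(retries + 1):
        total += min(timeout, rto_max_s)
        timeout *= 2
    return total


print(round(time_until_giveup(), 1))  # 924.6 (about 15.4 minutes)
```

The result matches the hypothetical timeout of 924.6 seconds that the kernel's ip-sysctl documentation gives for the default tcp_retries2 value.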

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt 

 
