Let's go in depth: Congestion control technical jargon

🧩 Key Control Variables (Kernel-Level Meaning)

Congestion control governs how fast the sender may send data. The sender learns how the network is behaving through feedback, which arrives as ACKs from the receiver.

Variable	Description	Kernel Symbol / Field	Updated By
cwnd	Congestion window — limits bytes in flight	`tp->snd_cwnd`	`tcp_cong_avoid()`
rwnd	Receiver window — flow control from receiver	`tp->snd_wnd`	From ACK’s Window field
ssthresh	Slow-start threshold — boundary between exponential & linear growth	`tp->snd_ssthresh`	`tcp_enter_recovery()` or `tcp_enter_loss()`
RTT	Round-trip time estimate	`tp->srtt_us`	`tcp_rtt_estimator()`
RTO	Retransmission timeout	`icsk->icsk_rto`	`tcp_set_rto()`

What is the Receiver Window (rwnd)?

The Receiver Window (rwnd) is the amount of free buffer space the receiving TCP stack currently has available for incoming data.

Mechanism: The receiver constantly advertises its current rwnd size to the sender using the Window Size field in the TCP header of every ACK segment it sends back.
Purpose: The rwnd acts as a limit on the amount of unacknowledged data the sender is allowed to have "in flight" at any given time.
Location: The rwnd is maintained by the receiving end of the connection.

Unacknowledged Data

This refers to the data segments that the sender has transmitted but has not yet received a confirmation (ACK) for from the receiver.

When the sender transmits a packet, that data is temporarily stored in the sender's buffer.
It remains in the sender's buffer (and is considered "unacknowledged") until an ACK arrives indicating that the receiver's TCP stack has successfully received that sequence of bytes.

Data In Flight

This is a common term in networking that means the same thing as "unacknowledged data" (or sometimes slightly more specifically, the amount of unacknowledged data currently residing in the network path, which is often called the Flight Size).

It is data that has left the sender's machine and is currently traveling across the network, sitting in router queues, or has reached the receiver but has not yet been acknowledged.
It's data that is "out there" and for which the sender is waiting for a positive acknowledgement.

"The rwnd acts as a limit..."

This is the key point for flow control. The sender must obey the receiver's advertised window.

The Rule: The sender calculates the amount of unacknowledged data it currently has "in flight" and must ensure that this amount never exceeds the current rwnd value advertised by the receiver.

Bytes In Flight ≤ min(Congestion Window, Receiver Window)

The Purpose: This mechanism prevents the sender from sending data faster than the receiver can read it from its buffer. If the sender ignored the rwnd, it would flood the receiver's limited buffer space, causing the receiver to drop packets, which wastes network bandwidth and triggers unnecessary retransmissions.

The rwnd is the receiver's way of shouting back to the sender, "I have this much room left! Don't send more than that total amount of data until I tell you I've cleared some space!" This is the mechanism for Flow Control.

The Role of rwnd in Flow Control

While the Congestion Window (cwnd) limits transmission based on the network's capacity (congestion control), the Receiver Window (rwnd) limits transmission based on the receiver's processing capacity and available memory (flow control).

Flow Control Rule:

The TCP sender must adhere to the following rule when determining how much data to send:

Effective Window = min(cwnd, rwnd)

The sender will never send more data than the smaller of the two values.

Scenario	Limiting Factor	Focus
If $\text{cwnd} < \text{rwnd}$	Congestion Window ( $\text{cwnd}$ )	The sender is limited by what the network can handle. (Congestion Control)
If $\text{rwnd} < \text{cwnd}$	Receiver Window ( $\text{rwnd}$ )	The sender is limited by what the receiver can handle. (Flow Control)
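
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not kernel code: the function name `sendable_bytes` and the example values are assumptions chosen for the demo.

```python
def sendable_bytes(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still put on the wire right now."""
    effective_window = min(cwnd, rwnd)
    # Never negative: if more is in flight than the window allows, send nothing.
    return max(0, effective_window - bytes_in_flight)


MSS = 1460  # example segment size in bytes

# Network-limited: cwnd (10 MSS) is smaller than rwnd (65535 bytes).
print(sendable_bytes(cwnd=10 * MSS, rwnd=65535, bytes_in_flight=4 * MSS))  # 8760

# Receiver-limited: rwnd (2 MSS) is smaller than cwnd, and it is already full.
print(sendable_bytes(cwnd=10 * MSS, rwnd=2 * MSS, bytes_in_flight=2 * MSS))  # 0
```

In the first call congestion control is the limiting factor; in the second, flow control is.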

Window Size Management:

Data Arrives: When data arrives, the receiver places it into its buffer, and the available rwnd decreases.
Data Read: When the application layer reads the data from the buffer, that buffer space is freed up, and the available rwnd increases.
Feedback: The receiver includes the new, updated rwnd value in the next ACK it sends to the sender.

If the receiver's buffer fills up completely, the rwnd is advertised as zero. This is known as Zero Window, and it stops the sender from transmitting new data until the application reads the buffered data, and the receiver can advertise a non-zero rwnd again.
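
The window bookkeeping described above can be modeled with a toy receive buffer. This is a simplified sketch: the class `ReceiveBuffer` and its method names are hypothetical, and real stacks add window scaling, delayed ACKs, and zero-window probes.

```python
class ReceiveBuffer:
    """Toy model of a receiver's socket buffer and its advertised window."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0

    @property
    def rwnd(self) -> int:
        # Free space is what the receiver advertises in its next ACK.
        return self.capacity - self.used

    def on_data(self, nbytes: int) -> int:
        """Data arrives from the network: free space shrinks."""
        self.used += min(nbytes, self.rwnd)
        return self.rwnd

    def on_app_read(self, nbytes: int) -> int:
        """The application reads data: free space grows again."""
        self.used -= min(nbytes, self.used)
        return self.rwnd


buf = ReceiveBuffer(capacity=4000)
print(buf.on_data(1500))      # 2500
print(buf.on_data(2500))      # 0  (Zero Window: the sender must pause)
print(buf.on_app_read(3000))  # 3000 (window reopens)
```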

What is the Congestion Window (cwnd)?

The cwnd, or Congestion Window, is a crucial state variable in TCP (Transmission Control Protocol) used by a sender to limit the amount of unacknowledged data it can transmit into the network before receiving an acknowledgment (ACK).

The primary purpose of the cwnd is to perform congestion control, which is a set of algorithms TCP uses to prevent network overload.

How the cwnd Works

Sender-Side Limit: The cwnd is maintained by the sending host and is a local estimate of how much capacity is currently available on the path to the receiver without causing congestion (like router buffer overflow).
The Transmission Limit: A TCP sender's window size—the maximum amount of data it can have "in flight" (sent but not yet acknowledged)—is determined by the minimum of two values:

Sender Window Size = min(Congestion Window (cwnd), Receiver Window (rwnd))
The Receiver Window (rwnd) is advertised by the receiver for flow control (preventing the sender from overwhelming the receiver's buffer), while the cwnd is for congestion control (preventing network congestion).

Dynamic Adjustment: The cwnd size changes dynamically based on network feedback:

Increase (Probing for Capacity): The cwnd is increased when the sender receives ACKs, which implicitly signal that the network is capable of handling the current data rate. This increase is typically exponential (during the Slow Start phase) and then linear (during the Congestion Avoidance phase).
Decrease (Reacting to Congestion): The cwnd is reduced when the sender detects packet loss (often via a timeout or receiving duplicate ACKs), which is interpreted as a sign of network congestion. This is a multiplicative decrease to rapidly reduce the offered load on the network.

What is the Slow Start Threshold (ssthresh)?

It marks the boundary between Slow Start (exponential growth) and Congestion Avoidance (linear growth).

When cwnd < ssthresh → TCP is in Slow Start (doubles cwnd every RTT)
When cwnd ≥ ssthresh → TCP enters Congestion Avoidance (linear growth)
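
To make the two growth modes concrete, here is a small per-RTT model. It is an idealized sketch: cwnd is counted in whole MSS units, there is no loss, and the helper name `next_cwnd` is an assumption for this example. Capping Slow Start at ssthresh is a simplification so the switch happens exactly at the boundary.

```python
def next_cwnd(cwnd: int, ssthresh: int) -> int:
    """cwnd (in MSS) after one loss-free round trip."""
    if cwnd < ssthresh:
        # Slow Start: double per RTT, but do not overshoot ssthresh.
        return min(cwnd * 2, ssthresh)
    # Congestion Avoidance: grow by one MSS per RTT.
    return cwnd + 1


cwnd, ssthresh = 1, 8
trace = [cwnd]
for _ in range(6):
    cwnd = next_cwnd(cwnd, ssthresh)
    trace.append(cwnd)

print(trace)  # [1, 2, 4, 8, 9, 10, 11]
```

The first three steps double the window; once it reaches 8 MSS, growth becomes linear.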

Initial ssthresh — How it’s chosen

TCP does not know the network capacity at first.

No prior congestion info → conservative start.

Initial ssthresh is typically a “large” value or some default.

Common defaults: 64 KB in older stacks and textbook examples, or an effectively unlimited value in modern Linux. (The 10 MSS figure often quoted for Linux is the initial cwnd, not ssthresh.)
Purpose: let Slow Start keep probing until the network itself signals congestion, instead of switching to linear growth too early.

Sensible default: large enough to let Slow Start probe the network, but not so aggressive that it causes congestion.

Parameter	Example Value	Notes
MSS	1460 B	size of one TCP segment
Initial cwnd	1–2 MSS in classic examples (10 MSS in modern Linux)	start small
Initial ssthresh	8–16 MSS (or larger, e.g., 64 KB / MSS ≈ 44 MSS)	chosen by OS / TCP stack default

This is why in many TCP examples we say “ssthresh = 8 MSS” — it’s a simplified example for teaching, not the exact Linux default.

Realistic Linux Behavior

In modern Linux, initial ssthresh is not fixed at 8 MSS, but often very large, effectively letting Slow Start continue until packet loss occurs.
After timeout or 3 dup ACKs, ssthresh is adjusted dynamically:

ssthresh = max(cwnd / 2, 2 × MSS)

This allows TCP to react to actual network conditions.
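
A quick numeric check of this update rule follows. It is a Reno-style sketch working in bytes; the function name `ssthresh_after_loss` is hypothetical, and other algorithms (CUBIC, for example) use a different reduction factor.

```python
def ssthresh_after_loss(cwnd_bytes: int, mss: int) -> int:
    """New ssthresh after a loss event: half of cwnd, floored at 2 MSS."""
    return max(cwnd_bytes // 2, 2 * mss)


MSS = 1460

# Large window: ssthresh becomes half of cwnd (20 MSS -> 10 MSS).
print(ssthresh_after_loss(20 * MSS, MSS))  # 14600

# Small window: the 2 MSS floor applies (half of 3 MSS would be too small).
print(ssthresh_after_loss(3 * MSS, MSS))   # 2920
```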

💡 Analogy

Think of it like driving a car on an unknown road:

Initial cwnd = 1 MSS → Start slowly, first gear.
Initial ssthresh = 8 MSS → You have permission to shift to higher gears (faster growth) after you’ve confirmed the road is safe.
Once you hit packet loss → reduce speed (ssthresh = half of previous cwnd) → try again.

💡 Key point:

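Initial ssthresh is set internally by the TCP stack when the connection starts; the congestion control algorithm (like CUBIC or Reno) then updates it when loss is detected.
You cannot see it directly via sysctl; use a per-socket tool such as ss -ti or the tcp_probe tracepoint, or observe cwnd behavior after packet loss.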

Duplicate ACK (DUP ACK):

Sent by the TCP Receiver.
A receiver sends a Duplicate ACK when it receives a segment that is out of order, indicating a gap in the sequence numbers (i.e., a segment is missing).
The ACK number in the duplicate ACK is the sequence number of the missing segment (the next expected byte).
The purpose is to quickly notify the sender of the apparent loss, without waiting for the retransmission timer to expire.

Fast Retransmission (Fast Retransmit):

Performed by the TCP Sender.
The sender triggers a "Fast Retransmit" when it receives a certain number of identical Duplicate ACKs (typically three, meaning one original ACK and three duplicates, or four ACKs with the same acknowledgment number).
Upon receiving the third duplicate ACK, the sender retransmits the missing segment (the one indicated by the ACK number in the DUP ACKs) immediately, without waiting for the retransmission timeout.
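
The sender-side counting logic can be sketched as follows. This is a deliberately reduced model: it looks only at acknowledgment numbers, whereas a real stack also checks that the ACK carries no data and that the advertised window is unchanged. The function name `fast_retransmit_points` is an assumption for this example.

```python
DUP_ACK_THRESHOLD = 3  # the classic "three duplicate ACKs" rule


def fast_retransmit_points(ack_numbers):
    """Return the ACK numbers at which a Fast Retransmit would fire."""
    triggered = []
    last_ack = None
    dup_count = 0
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                # Third duplicate: resend the segment starting at `ack`.
                triggered.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return triggered


# One original ACK for 3000 followed by three duplicates, then recovery.
acks = [1000, 2000, 3000, 3000, 3000, 3000, 7000]
print(fast_retransmit_points(acks))  # [3000]
```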

tcp.analysis.retransmission (Timeout-Based Loss)

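This section covers Retransmission Timeout (RTO) loss events, which are the more severe type of loss detection. Note that in Wireshark this filter typically matches retransmissions in general, so fast retransmissions may need to be excluded to isolate RTO events.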

Mechanism: The sender sends a segment and starts its internal RTO timer. If the sender does not receive an ACK for that data segment before the timer expires, it assumes the packet (or the ACK) is lost and retransmits the segment.
Resulting Action (Congestion Control): When an RTO occurs, the TCP stack assumes severe congestion.

The congestion window (cwnd) is reset to 1 MSS.
The slow start threshold (ssthresh) is set to half of the previous cwnd.
The connection enters the Slow Start phase to probe the network cautiously.

Trace Signature: A large, often increasing, delay (the RTO value) between the original packet and the retransmitted packet. Before any RTT sample exists, the RTO starts at a relatively high value (1 second under RFC 6298, 3 seconds in older stacks); afterwards it is derived from SRTT and RTTVAR. It doubles with each subsequent timeout (exponential backoff).

tcp.analysis.fast_retransmission (Duplicate-ACK Based Loss)

This filter identifies a Fast Retransmit loss event, which is a quicker, less severe recovery mechanism.

Mechanism: The sender receives three or more duplicate ACKs for the same data segment. A duplicate ACK means the receiver got data out-of-order and is informing the sender of the highest sequence number received in-order. Three duplicates strongly suggest a lost packet without having to wait for the RTO timer.
Resulting Action (Congestion Control): When a Fast Retransmit occurs, the TCP stack assumes moderate congestion.

The slow start threshold (ssthresh) is set to half of the current cwnd.
The missing segment is retransmitted immediately (Fast Retransmit).
The connection enters Fast Recovery, typically setting cwnd to ssthresh + 3 MSS (the "inflated" cwnd state).

Trace Signature: The retransmitted packet is sent immediately after the third duplicate ACK is received, which is much faster than an RTO.

Feature	tcp.analysis.retransmission	tcp.analysis.fast_retransmission
Detection	$\text{RTO}$ timer expires (no $\text{ACK}$ received).	Three or more Duplicate $\text{ACKs}$ received.
Severity	Severe Congestion (assumes network saturation).	Moderate Congestion (assumes a single dropped packet).
Response	Slow Start / Exponential Backoff.	Fast Retransmit / Fast Recovery.
$\text{cwnd}$ Action	Resets $\text{cwnd}$ to $\mathbf{1 \text{ MSS}}$ .	$\text{cwnd}$ is halved (set to $\text{ssthresh}$ ), then inflated.
Timing	Long delay (seconds) that doubles.	Immediate retransmission.
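
The two congestion responses in the table can be compared side by side in code. This is a Reno-style sketch with cwnd and ssthresh counted in MSS units; `CongestionState`, `on_rto`, and `on_fast_retransmit` are hypothetical names, and modern algorithms such as CUBIC use different constants.

```python
from dataclasses import dataclass


@dataclass
class CongestionState:
    cwnd: int      # congestion window, in MSS
    ssthresh: int  # slow start threshold, in MSS


def on_rto(state: CongestionState) -> CongestionState:
    """Timeout: assume severe congestion and restart from 1 MSS."""
    return CongestionState(cwnd=1, ssthresh=max(state.cwnd // 2, 2))


def on_fast_retransmit(state: CongestionState) -> CongestionState:
    """Three duplicate ACKs: halve, then inflate by the 3 segments that left the network."""
    ssthresh = max(state.cwnd // 2, 2)
    return CongestionState(cwnd=ssthresh + 3, ssthresh=ssthresh)


before = CongestionState(cwnd=20, ssthresh=64)
print(on_rto(before))              # CongestionState(cwnd=1, ssthresh=10)
print(on_fast_retransmit(before))  # CongestionState(cwnd=13, ssthresh=10)
```

Both paths set ssthresh to the same value; the difference is how far cwnd falls.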

The key to differentiating between a retransmission that is part of a Fast Recovery (after a Fast Retransmit) and a retransmission due to an $\text{RTO}$ (Timeout) lies in two primary factors: the preceding packets and the time delay.

This is the most definitive way to tell them apart in a packet trace.

Differentiating by Preceding Packets

Type of Retransmission	Preceding Packets to Look For
RTO Retransmission	You will see NO traffic from the receiver (the host that should be acknowledging the data) for the entire duration of the Retransmission Timeout ( $\text{RTO}$ ). The communication simply stops, and after a long silence, the sender resends the data. This implies that the $\text{ACKs}$ were likely lost, or the original data segment was lost and the receiver never got it.
Fast Retransmission	You must see $\mathbf{3}$ or more identical Duplicate ACKs immediately before the retransmitted data packet. These $\text{ACKs}$ are the explicit trigger for the Fast Retransmit. The duplicate $\text{ACKs}$ prove that the receiver is still active, has received subsequent data (out-of-order), and is requesting the missing segment.

Differentiating by Timing and RTO Value

The time difference between the original segment and the retransmitted segment clearly separates the two.

Type of Retransmission	Time Delay	RTO Value in Trace	Congestion Action
RTO Retransmission	Long and Exponentially Increasing	The delay will be equal to the current calculated RTO value (e.g., $300 \text{ ms}$ , $1 \text{ s}$ , $2 \text{ s}$ , $4 \text{ s}$ , etc.).	$\mathbf{\text{cwnd}}$ reset to $1 \text{ MSS}$ (Slow Start).
Fast Retransmission	Immediate	The delay will be very short, typically in the range of $\mathbf{1 \text{ ms}}$ to $\mathbf{50 \text{ ms}}$ . It is dictated only by the time it took for the third Duplicate $\text{ACK}$ to arrive at the sender.	$\mathbf{\text{cwnd}}$ is halved (Fast Recovery).

Practical Validation in Wireshark

Fast Retransmission: Filter for tcp.analysis.duplicate_ack. Look for a packet labeled "Fast Retransmission" immediately following the 3rd duplicate ACK. The time delta will be tiny.

RTO Retransmission: Filter for tcp.analysis.retransmission and exclude the fast ones (for example, tcp.analysis.retransmission && !tcp.analysis.fast_retransmission). The retransmitted packet's TCP analysis detail will show a long delay, and you can confirm that the receiver sent no intervening packets.

In summary, a Fast Retransmission is a quick, proactive correction based on clear feedback (Duplicate ACKs), while an RTO Retransmission is a last resort based on silence and the expiration of a defensive timer.
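
The two distinguishing factors (preceding duplicate ACKs and time delay) can be turned into a rough classifier. This is only a heuristic sketch for reasoning about traces by hand: the function `classify_retransmission` and its inputs are hypothetical, and it does not reproduce Wireshark's own analysis logic.

```python
def classify_retransmission(dup_acks_before: int, delay_s: float, rto_s: float) -> str:
    """Guess the retransmission type from what preceded it in the trace.

    dup_acks_before: identical duplicate ACKs seen just before the retransmission
    delay_s:         time between the original segment and the retransmission
    rto_s:           the sender's estimated RTO at that moment
    """
    if dup_acks_before >= 3:
        return "fast retransmission"
    if delay_s >= rto_s:
        return "RTO retransmission"
    return "unclassified"


print(classify_retransmission(dup_acks_before=3, delay_s=0.02, rto_s=0.3))  # fast retransmission
print(classify_retransmission(dup_acks_before=0, delay_s=1.0, rto_s=1.0))   # RTO retransmission
```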

🧠 Round-Trip Time (RTT)

RTT is the time it takes for a packet to go from sender → receiver → back (ACK).

RTT = Time when ACK received − Time when segment was sent.

The RTT is the basic measurement of network delay for a single data segment.
Definition: The time duration from when a TCP sender transmits a data segment until it receives the corresponding acknowledgment (ACK) from the receiver.
Purpose: It measures the propagation delay between the two hosts plus any processing time in the network path.
Snapshot: It's a raw, single-measurement value that constantly fluctuates with network conditions (congestion, route changes, queueing delays, etc.).

⚙️ Smoothed Round-Trip Time ( $\text{SRTT}$ )

Because individual RTT samples fluctuate (jitter, bursty traffic), TCP doesn’t react to every spike.
Instead, it maintains a smoothed average, called SRTT, using an exponential weighted moving average (EWMA).

$\text{SRTT} = (1 - \alpha) \times \text{SRTT}_{\text{old}} + \alpha \times \text{RTT}_{\text{new}}$ , where α = 1/8 (0.125) (default in Linux).

👉 That means new samples slightly adjust the average but don’t completely replace it.
It “smooths out” temporary spikes in delay.

The SRTT is a weighted average of the measured RTT values.

Definition: It's an exponentially weighted moving average (EWMA) of the RTT measurements. It's an estimate of the "normal" RTT for the connection.
Purpose: To smooth out temporary spikes in RTT measurements and provide a stable base for estimating the timeout value. It makes the TCP stack less reactive to brief network hiccups.
Formula (General Form): SRTT = (1 − α) × SRTT_old + α × RTT_new

α (alpha) is the smoothing factor (typically 1/8 or 0.125).
A low α means the SRTT changes slowly.
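
A short example shows how little a single spike moves the average. This is a minimal sketch of the EWMA formula only; the helper name `update_srtt` and the sample values are assumptions for illustration.

```python
ALPHA = 1 / 8  # smoothing factor


def update_srtt(srtt_old: float, rtt_new: float, alpha: float = ALPHA) -> float:
    """One EWMA step: keep 7/8 of the old estimate, take 1/8 of the new sample."""
    return (1 - alpha) * srtt_old + alpha * rtt_new


srtt = 100.0
for sample in (100, 300, 100):  # steady, one spike, steady again (ms)
    srtt = update_srtt(srtt, sample)
    print(srtt)
# 100.0
# 125.0
# 121.875
```

A 300 ms spike on a 100 ms path raises SRTT by only 25 ms, and the estimate then drifts back slowly.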

📉 RTT Variance ( $\text{RTTVAR}$ )

RTTVAR (RTT Variance) measures how much RTT values fluctuate (jitter).
It’s a smoothed estimate of the deviation between recent RTT samples and the SRTT.

RTTVAR = (1 − β) * RTTVAR + β * |SRTT − RTT_sample|, where β = 1/4 (0.25) by default.

👉 If network delay becomes unstable, RTTVAR increases.
👉 If RTT samples are steady, RTTVAR decreases.

The RTTVAR measures how much the RTT is fluctuating around the SRTT average.

Definition: It is an EWMA of the absolute deviation between the measured RTT and the SRTT (a mean deviation rather than a statistical variance, despite the name).
Purpose: To estimate the volatility of the network connection. A high RTTVAR means the RTT is unstable (variable latency), requiring a larger safety margin for the RTO.
Formula (General Form): RTTVAR = (1 − β) × RTTVAR_old + β × |SRTT − RTT_new|

β (beta) is the gain for the variance (typically 1/4 or 0.25).
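
The effect of a steady versus an unstable sample on RTTVAR can be checked numerically. This is a minimal sketch of the formula above; the helper name `update_rttvar` and the sample values are assumptions for illustration.

```python
BETA = 1 / 4  # gain for the deviation estimate


def update_rttvar(rttvar_old: float, srtt: float, rtt_new: float, beta: float = BETA) -> float:
    """One EWMA step on the absolute deviation between SRTT and the new sample."""
    return (1 - beta) * rttvar_old + beta * abs(srtt - rtt_new)


# Steady sample (equal to SRTT): RTTVAR decays.
print(update_rttvar(50.0, srtt=100.0, rtt_new=100.0))  # 37.5

# Unstable sample (300 ms away from SRTT): RTTVAR jumps.
print(update_rttvar(50.0, srtt=100.0, rtt_new=400.0))  # 112.5
```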

⏱️ Calculating RTO (Retransmission Timeout)

TCP calculates RTO using both SRTT and RTTVAR: RTO = SRTT + 4 * RTTVAR

This ensures the timeout dynamically adapts:

If RTT is stable → smaller RTO (faster retransmits).
If RTT is fluctuating → larger RTO (avoid false retransmissions).

Calculate the Actual RTO

The actual $\text{RTO}$ value used by the sender can be validated in a packet trace by measuring the time difference between the original packet and its retransmission.

$\text{Actual RTO} = \text{Time of Retransmitted Packet} - \text{Time of Original Packet}$

Note: RFC 6298 also says the initial RTO (before any sample) should be 1.0 s. After the first sample, the above formulas apply. Some stacks clamp RTO to a minimum; Linux, for example, uses TCP_RTO_MIN (200 ms).

Formulas used (RFC-style)

α = 1/8 = 0.125
β = 1/4 = 0.25
For the first RTT sample M:

SRTT = M
RTTVAR = M / 2

For each subsequent sample M:

D = |SRTT_old − M|
RTTVAR = (1 − β) * RTTVAR_old + β * D
SRTT = (1 − α) * SRTT_old + α * M

RTO = SRTT + 4 * RTTVAR
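
Putting these formulas together gives a small estimator. This is a sketch under simplifying assumptions: the class name `RttEstimator` is hypothetical, times are in milliseconds, and the clock-granularity term and min/max clamping are left out. Note the update order: RTTVAR is computed first, using the old SRTT.

```python
class RttEstimator:
    """RFC 6298-style SRTT / RTTVAR / RTO tracking (no clamping)."""

    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self) -> None:
        self.srtt = None
        self.rttvar = None
        self.rto = 1000.0  # initial RTO before any sample: 1 second

    def update(self, m: float) -> float:
        """Feed one RTT sample M (ms) and return the new RTO (ms)."""
        if self.srtt is None:
            # First sample seeds both estimates.
            self.srtt = m
            self.rttvar = m / 2
        else:
            # RTTVAR must use the old SRTT, so update it before SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - m)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * m
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto


est = RttEstimator()
for m in (100, 120, 110):
    print(round(est.update(m), 2))
# 300.0
# 272.5
# 238.44
```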

⚙️ Default Initial RTO (Linux / RFC Standard)

According to RFC 6298 (the current standard): Initial RTO = 1 second (1000 ms)
Linux follows this RFC. The timer value itself is not shown by the sysctls below; they only control how many times the kernel retries:

cat /proc/sys/net/ipv4/tcp_syn_retries
cat /proc/sys/net/ipv4/tcp_retries1
cat /proc/sys/net/ipv4/tcp_retries2

The actual timer is internal to the kernel and is initialized to 1 second for new connections.

Example Timeline

Step	Event	RTT sample M (ms)	Calculation (values carried at full precision, shown rounded)	SRTT (ms)	RTTVAR (ms)	RTO_raw (ms)	Notes
0	SYN sent (no RTT yet)	-	No sample yet	-	-	1000.00	Initial RTO before any RTT measurement = 1.0 s (RFC 6298)
1	First sample	100	SRTT = M = 100; RTTVAR = M/2 = 50	100.00	50.00	100+4*50=300.00	First sample; initial RTO computation
2	Second sample	120	RTTVAR = 0.75×50 + 0.25×|100−120| = 42.5; SRTT = 0.875×100 + 0.125×120 = 102.5	102.50	42.50	102.5+4*42.5=272.50	Slight RTT increase
3	Third sample	110	RTTVAR = 0.75×42.5 + 0.25×|102.5−110| = 33.75; SRTT = 0.875×102.5 + 0.125×110 = 103.44	103.44	33.75	103.44+4*33.75≈238.44	RTT drops slightly
4	Fourth sample (spike)	200	RTTVAR = 0.75×33.75 + 0.25×|103.44−200| = 49.45; SRTT = 0.875×103.44 + 0.125×200 = 115.51	115.51	49.45	115.51+4*49.45≈313.32	RTT spike increases RTO
5	Fifth sample	140	RTTVAR = 0.75×49.45 + 0.25×|115.51−140| = 43.21; SRTT = 0.875×115.51 + 0.125×140 = 118.57	118.57	43.21	118.57+4*43.21≈291.42	RTO adjusts downward after spike
6	Sixth sample	150	RTTVAR = 0.75×43.21 + 0.25×|118.57−150| = 40.27; SRTT = 0.875×118.57 + 0.125×150 = 122.50	122.50	40.27	122.50+4*40.27≈283.57	Slight RTT increase; RTO still falls as RTTVAR decays
7	Seventh sample	130	RTTVAR = 0.75×40.27 + 0.25×|122.50−130| = 32.08; SRTT = 0.875×122.50 + 0.125×130 = 123.44	123.44	32.08	123.44+4*32.08≈251.74	RTT drops moderately
8	Eighth sample (spike)	170	RTTVAR = 0.75×32.08 + 0.25×|123.44−170| = 35.70; SRTT = 0.875×123.44 + 0.125×170 = 129.26	129.26	35.70	129.26+4*35.70≈272.05	Spike raises RTTVAR and RTO
9	Ninth sample	160	RTTVAR = 0.75×35.70 + 0.25×|129.26−160| = 34.46; SRTT = 0.875×129.26 + 0.125×160 = 133.10	133.10	34.46	133.10+4*34.46≈270.94	Slight decrease in RTT
10	Tenth sample	120	RTTVAR = 0.75×34.46 + 0.25×|133.10−120| = 29.12; SRTT = 0.875×133.10 + 0.125×120 = 131.46	131.46	29.12	131.46+4*29.12≈247.94	RTT drop; RTO decreases smoothly

📖 RFC 6298 Rule Summary

Here’s exactly what happens:

Step	Event	Variable	Value / Rule
1	New connection	SRTT	Undefined (no samples yet)
2	New connection	RTTVAR	Undefined (no samples yet)
3	Initial RTO	RTO	1 second (default)
4	First RTT sample arrives	SRTT	SRTT = RTT_sample (first smoothed estimate)
5	First RTT sample arrives	RTTVAR	RTTVAR = RTT_sample / 2 (first deviation estimate)
6	New RTO	RTO	RTO = SRTT + 4 × RTTVAR

So the kernel starts conservatively — waits 1 second before retransmitting the very first packet if no ACK is seen.

⏱️ Backoff Behavior

If a retransmission is needed and still no ACK is received, the kernel doubles the RTO each time (exponential backoff):

Attempt	RTO (seconds)
1st	1.0
2nd	2.0
3rd	4.0
4th	8.0
...	up to system-defined max (usually 120s)

This prevents flooding a congested network.
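
The doubling schedule, including the cap, can be generated with a short loop. This is a sketch: the function name `backoff_schedule` is hypothetical, and the 120-second ceiling is taken from the table above.

```python
def backoff_schedule(initial_rto_s: float, attempts: int, max_rto_s: float = 120.0):
    """RTO used for each successive attempt: doubles each time, capped at max_rto_s."""
    schedule = []
    rto = initial_rto_s
    for _ in range(attempts):
        schedule.append(min(rto, max_rto_s))
        rto *= 2
    return schedule


print(backoff_schedule(1.0, 9))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]
```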

There are four common reasons for packet re-transmission:

The lack of an acknowledgement that data has been received within a reasonable time.
The sender discovering that transmission was unsuccessful.
The receiver notifying the sender that expected data hasn't been received.
The receiver discovering that data has been damaged during initial transmission.

If no acknowledgment for the data arrives before TCP's automatic timer expires, the segment is re-transmitted. Re-transmitting a packet multiple times when no acknowledgment arrives is the default behavior of the Linux kernel. The following OS parameters decide how many re-transmission attempts the OS makes and how long it waits between them.

TCP_RTO_MIN (200 ms)
TCP_RTO_MAX (120 seconds)
tcp_retries1 (3)
tcp_retries2 (15)

Refer to this link for more details:

https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html

TCP retransmits an unacknowledged packet up to tcp_retries2 sysctl setting times (defaults to 15) using an exponential backoff timeout for which each retransmission timeout is between TCP_RTO_MIN (200 ms) and TCP_RTO_MAX (120 seconds). Once the 15th retry expires (by default), the TCP stack will notify the layers above (ie. app) of a broken connection.
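
As a rough worked example, the total time before the stack gives up can be estimated by summing the backoff intervals. This sketch assumes the RTO sits at TCP_RTO_MIN when the loss begins, which is the hypothetical best case; on a real connection the starting RTO depends on the measured RTT. The function name `time_until_give_up` is an assumption for this example.

```python
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds


def time_until_give_up(retries: int = 15, first_rto: float = TCP_RTO_MIN) -> float:
    """Sum the timeouts for the original send plus `retries` retransmissions."""
    total = 0.0
    rto = first_rto
    for _ in range(retries + 1):
        total += min(rto, TCP_RTO_MAX)
        rto *= 2
    return total


print(round(time_until_give_up(), 1))  # 924.6
```

With the defaults, that is roughly 15 minutes before the application is told the connection is broken.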

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Saturday, October 25, 2025
