Saturday, October 25, 2025

Congestion control technical jargon

What is the Receiver Window (rwnd)?

The Receiver Window (rwnd) is the amount of free buffer space the receiving TCP stack currently has available for incoming data.

  • Mechanism: The receiver constantly advertises its current rwnd size to the sender using the Window Size field in the TCP header of every ACK segment it sends back.
  • Purpose: The rwnd acts as a limit on the amount of unacknowledged data the sender is allowed to have "in flight" at any given time.
  • Location: The rwnd is maintained by the receiving end of the connection.

Unacknowledged Data

This refers to the data segments that the sender has transmitted but has not yet received a confirmation (ACK) for from the receiver.

  • When the sender transmits a packet, that data is temporarily stored in the sender's buffer.
  • It remains in the sender's buffer (and is considered "unacknowledged") until the receiver sends an ACK indicating that its TCP stack has successfully received that sequence of bytes.

Data In Flight

This is a common term in networking that means the same thing as "unacknowledged data" (or sometimes slightly more specifically, the amount of unacknowledged data currently residing in the network path, which is often called the Flight Size).

  • It is data that has left the sender's machine and is currently traveling across the network, sitting in router queues, or waiting in the receiver's buffer before being processed by the application.
  • It's data that is "out there" and for which the sender is waiting for a positive acknowledgement. 

"The rwnd acts as a limit..."

This is the key point for flow control. The sender must obey the receiver's advertised window.

  • The Rule: The sender calculates the amount of unacknowledged data it currently has "in flight" and must ensure that this amount never exceeds the current rwnd value advertised by the receiver.
    • Bytes In Flight ≤ min(Congestion Window, Receiver Window)
  • The Purpose: This mechanism prevents the sender from sending data faster than the receiver can read it from its buffer. If the sender ignored the rwnd, it would flood the receiver's limited buffer space, causing the receiver to drop packets, which wastes network bandwidth and triggers unnecessary retransmissions. 

The rwnd is the receiver's way of shouting back to the sender, "I have this much room left! Don't send more than that total amount of data until I tell you I've cleared some space!" This is the mechanism for Flow Control. 
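The rule above can be sketched in a few lines. This is a minimal illustration, not kernel code; the function name usable_window and the byte counts are made up for the example.

```python
def usable_window(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still transmit right now (never negative)."""
    effective_window = min(cwnd, rwnd)
    return max(0, effective_window - bytes_in_flight)


# Receiver-limited case: rwnd is the smaller of the two windows.
print(usable_window(cwnd=64_000, rwnd=16_000, bytes_in_flight=10_000))  # 6000
# Window already full: nothing more may be sent until ACKs arrive.
print(usable_window(cwnd=64_000, rwnd=16_000, bytes_in_flight=16_000))  # 0
```

Clamping at zero covers the case where the receiver shrinks its advertised window while data is still in flight.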

The Role of rwnd in Flow Control

While the Congestion Window (cwnd) limits transmission based on the network's capacity (congestion control), the Receiver Window (rwnd) limits transmission based on the receiver's processing capacity and available memory (flow control).

Flow Control Rule:

  • The TCP sender must adhere to the following rule when determining how much data to send:
    • Effective Window = min(cwnd, rwnd)

The sender will never send more data than the smaller of the two values.

Scenario | Limiting Factor | Focus
If cwnd < rwnd | Congestion Window (cwnd) | The sender is limited by what the network can handle. (Congestion Control)
If rwnd < cwnd | Receiver Window (rwnd) | The sender is limited by what the receiver can handle. (Flow Control)

 Window Size Management:

  • Data Arrives: When data arrives, the receiver places it into its buffer, and the available rwnd decreases.
  • Data Read: When the application layer reads the data from the buffer, that buffer space is freed up, and the available rwnd increases.
  • Feedback: The receiver includes the new, updated rwnd value in the next ACK it sends to the sender.

If the receiver's buffer fills up completely, the rwnd is advertised as zero. This is known as Zero Window, and it stops the sender from transmitting new data until the application reads the buffered data, and the receiver can advertise a non-zero rwnd again.
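The window bookkeeping described above can be modelled with a toy receive buffer. This is a sketch under simplifying assumptions: the class name ReceiveBuffer and its methods are hypothetical, and real stacks also handle window scaling and silly-window avoidance.

```python
class ReceiveBuffer:
    """Toy model of a receiver's socket buffer and its advertised window."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffered = 0  # bytes received but not yet read by the application

    @property
    def rwnd(self) -> int:
        return self.capacity - self.buffered

    def on_data(self, nbytes: int) -> int:
        """Store arriving data; return the rwnd to advertise in the next ACK."""
        accepted = min(nbytes, self.rwnd)  # bytes beyond the window are not stored
        self.buffered += accepted
        return self.rwnd

    def on_app_read(self, nbytes: int) -> int:
        """Application drains the buffer; return the updated rwnd."""
        self.buffered -= min(nbytes, self.buffered)
        return self.rwnd


buf = ReceiveBuffer(capacity=8192)
print(buf.on_data(4096))      # 4096
print(buf.on_data(4096))      # 0 (Zero Window: the sender must stop)
print(buf.on_app_read(2048))  # 2048 (window reopens after the application reads)
```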

 What is the Congestion Window (cwnd)?

The cwnd, or Congestion Window, is a crucial state variable in TCP (Transmission Control Protocol) used by a sender to limit the amount of unacknowledged data it can transmit into the network before receiving an acknowledgment (ACK).

The primary purpose of the cwnd is to perform congestion control, which is a set of algorithms TCP uses to prevent network overload. 

How the cwnd Works

  • Sender-Side Limit: The cwnd is maintained by the sending host and is a local estimate of how much capacity is currently available on the path to the receiver without causing congestion (like router buffer overflow).
  • The Transmission Limit: A TCP sender's window size—the maximum amount of data it can have "in flight" (sent but not yet acknowledged)—is determined by the minimum of two values:
    • Sender Window Size = min(Congestion Window (cwnd), Receiver Window (rwnd))
    • The Receiver Window (rwnd) is advertised by the receiver for flow control (preventing the sender from overwhelming the receiver's buffer), while the cwnd is for congestion control (preventing network congestion). 
  • Dynamic Adjustment: The cwnd size changes dynamically based on network feedback:
    • Increase (Probing for Capacity): The cwnd is increased when the sender receives ACKs, which implicitly signal that the network is capable of handling the current data rate. This increase is typically exponential (during the Slow Start phase) and then linear (during the Congestion Avoidance phase).
    • Decrease (Reacting to Congestion): The cwnd is reduced when the sender detects packet loss (often via a timeout or receiving duplicate ACKs), which is interpreted as a sign of network congestion. This is a multiplicative decrease to rapidly reduce the offered load on the network. 
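The two growth phases can be illustrated with a per-round-trip model. This is a simplified sketch: cwnd is counted in whole MSS units, growth is applied once per loss-free RTT rather than per ACK, and capping slow start exactly at ssthresh is an assumption made to keep the numbers tidy. The function name grow_one_rtt is hypothetical.

```python
def grow_one_rtt(cwnd: int, ssthresh: int) -> int:
    """Return cwnd (in MSS units) after one loss-free round trip."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)  # Slow Start: exponential growth
    return cwnd + 1                     # Congestion Avoidance: linear growth


cwnd, ssthresh = 1, 16
history = [cwnd]
for _ in range(7):
    cwnd = grow_one_rtt(cwnd, ssthresh)
    history.append(cwnd)
print(history)  # [1, 2, 4, 8, 16, 17, 18, 19]
```

The output shows the window doubling until it reaches ssthresh and then growing by one segment per round trip.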

Duplicate ACK (DUP ACK):

  • Sent by the TCP Receiver.
  • A receiver sends a Duplicate ACK when it receives a segment that is out of order, indicating a gap in the sequence numbers (i.e., a segment is missing).
  • The ACK number in the duplicate ACK is the sequence number of the missing segment (the next expected byte).
  • The purpose is to quickly notify the sender of the apparent loss, without waiting for the retransmission timer to expire.

Fast Retransmission (Fast Retransmit):

  • Performed by the TCP Sender.
  • The sender triggers a "Fast Retransmit" when it receives a certain number of identical Duplicate ACKs (typically three, meaning one original ACK and three duplicates, or four ACKs with the same acknowledgment number).
  • Upon receiving the third duplicate ACK, the sender retransmits the missing segment (the one indicated by the ACK number in the DUP ACKs) immediately, without waiting for the retransmission timeout. 
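
The trigger condition can be sketched as a simple counter over the stream of ACK numbers the sender sees. This is an illustration only: the function name find_fast_retransmits is hypothetical, and real stacks apply extra conditions before counting an ACK as a duplicate (for example, that it carries no data and does not change the window).

```python
DUP_ACK_THRESHOLD = 3


def find_fast_retransmits(ack_numbers):
    """Return the ACK numbers that would trigger a Fast Retransmit."""
    triggers = []
    last_ack, dup_count = None, 0
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                triggers.append(ack)  # retransmit the segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0
    return triggers


# One original ACK for 2000 followed by three duplicates.
print(find_fast_retransmits([1000, 2000, 2000, 2000, 2000, 3000]))  # [2000]
```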

tcp.analysis.retransmission (Timeout-Based Loss)

This filter identifies a Retransmission Timeout (RTO) loss event, which is the more severe type of loss detection.

  • Mechanism: The sender sends a segment and starts its internal RTO timer. If the sender does not receive an ACK for that data segment before the timer expires, it assumes the packet (or the ACK) is lost and retransmits the segment.
  • Resulting Action (Congestion Control): When an RTO occurs, the TCP stack assumes severe congestion.
    • The congestion window (cwnd) is reset to 1 MSS.
    • The slow start threshold (ssthresh) is set to half of the previous cwnd.
    • The connection enters the Slow Start phase to probe the network cautiously.
  • Trace Signature: A large, often increasing, delay (the RTO value) between the original packet and the retransmitted packet. The RTO starts at a relatively high value (often 1 second or 3 seconds) and doubles with each subsequent timeout (exponential backoff). 

tcp.analysis.fast_retransmission (Duplicate-ACK Based Loss)

This filter identifies a Fast Retransmit loss event, which is a quicker, less severe recovery mechanism.
  • Mechanism: The sender receives three or more duplicate ACKs for the same data segment. A duplicate ACK means the receiver got data out-of-order and is informing the sender of the highest sequence number received in-order. Three duplicates strongly suggest a lost packet without having to wait for the RTO timer.
  • Resulting Action (Congestion Control): When a Fast Retransmit occurs, the TCP stack assumes moderate congestion.
    • The slow start threshold (ssthresh) is set to half of the current cwnd (or cwnd/2).
    • The missing segment is retransmitted immediately (Fast Retransmit).
    • The connection enters Fast Recovery, typically setting cwnd to ssthresh+3 MSS (for the "inflated" cwnd state).
  • Trace Signature: The retransmitted packet is sent immediately after the sender receives the third duplicate ACK, which is much faster than waiting for an RTO.

Feature | tcp.analysis.retransmission | tcp.analysis.fast_retransmission
Detection | RTO timer expires (no ACK received). | Three or more Duplicate ACKs received.
Severity | Severe Congestion (assumes network saturation). | Moderate Congestion (assumes a single dropped packet).
Response | Slow Start / Exponential Backoff. | Fast Retransmit / Fast Recovery.
Action | Resets cwnd to 1 MSS. | cwnd is halved (set to ssthresh), then inflated.
Timing | Long delay (seconds) that doubles. | Immediate retransmission.
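
The two reactions can be compared side by side in a short sketch. The function names on_rto and on_fast_retransmit are hypothetical. The floor of 2 MSS on ssthresh comes from RFC 5681 and is not stated in the text above; like the text, the sketch uses cwnd where the RFC uses the amount of data in flight.

```python
def on_rto(cwnd: int, mss: int):
    """Timeout: halve ssthresh and restart from one segment (Slow Start)."""
    ssthresh = max(cwnd // 2, 2 * mss)
    return mss, ssthresh  # (new cwnd, new ssthresh)


def on_fast_retransmit(cwnd: int, mss: int):
    """Three duplicate ACKs: halve ssthresh and enter Fast Recovery."""
    ssthresh = max(cwnd // 2, 2 * mss)
    return ssthresh + 3 * mss, ssthresh  # cwnd is inflated by the 3 dup ACKs


MSS = 1460
print(on_rto(20 * MSS, MSS))              # (1460, 14600)
print(on_fast_retransmit(20 * MSS, MSS))  # (18980, 14600)
```

Both paths halve ssthresh; the difference is that a timeout collapses cwnd to a single segment, while Fast Recovery keeps the connection sending at roughly half its previous rate.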

The key to differentiating between a retransmission that is part of a Fast Recovery (after a Fast Retransmit) and a retransmission due to an RTO (Timeout) lies in two primary factors: the preceding packets and the time delay.

This is the most definitive way to tell them apart in a packet trace.

Differentiating by Preceding Packets  

Type of Retransmission | Preceding Packets to Look For
RTO Retransmission | You will see NO traffic from the receiver (the host that should be acknowledging the data) for the entire duration of the Retransmission Timeout (RTO). The communication simply stops, and after a long silence, the sender resends the data. This implies that the ACKs were likely lost, or the original data segment was lost and the receiver never got it.
Fast Retransmission | You must see three or more identical Duplicate ACKs immediately before the retransmitted data packet. These are the explicit trigger for the Fast Retransmit. The duplicate ACKs prove that the receiver is still active, has received subsequent data (out-of-order), and is requesting the missing segment.

Differentiating by Timing and RTO Value

The time difference between the original segment and the retransmitted segment clearly separates the two. 

Type of Retransmission | Time Delay | RTO Value in Trace | Congestion Action
RTO Retransmission | Long and Exponentially Increasing | The delay will be equal to the current calculated RTO value (e.g., 1 s, 2 s, 4 s, 8 s, etc.). | cwnd reset to 1 MSS (Slow Start).
Fast Retransmission | Immediate | The delay will be very short, well below the RTO. It is dictated only by the time it took for the third Duplicate ACK to arrive at the sender. | cwnd is halved (Fast Recovery).

Practical Validation in Wireshark

Fast Retransmission: Filter for tcp.analysis.duplicate_ack. Look for a packet labeled "Fast Retransmission" immediately following the 3rd duplicate ACK. The time delta will be tiny.

RTO Retransmission: Filter for tcp.analysis.retransmission while excluding the fast ones, for example with tcp.analysis.retransmission && !tcp.analysis.fast_retransmission. The retransmitted packet's TCP analysis detail will show a long delay, and you can confirm no intervening packets were sent by the receiver.

In summary, a Fast Retransmission is a quick, proactive correction based on clear feedback (Duplicate ACKs), while an RTO Retransmission is a last resort based on silence and the expiration of a defensive timer.
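
The two checks (preceding packets and time delay) can be combined into a rough classifier for entries exported from a trace. This is a sketch only: the function name and its three inputs are hypothetical and would have to be derived from the capture by hand or by a script.

```python
def classify_retransmission(dup_acks_before: int, delta_s: float, rto_s: float) -> str:
    """Guess the retransmission type from what preceded it in the trace.

    dup_acks_before: identical duplicate ACKs seen just before the retransmission
    delta_s:         seconds between the original segment and the retransmission
    rto_s:           the sender's estimated RTO at that moment, in seconds
    """
    if dup_acks_before >= 3:
        return "fast retransmission"
    if delta_s >= rto_s:
        return "RTO retransmission"
    return "unclassified"


print(classify_retransmission(dup_acks_before=3, delta_s=0.03, rto_s=0.3))  # fast retransmission
print(classify_retransmission(dup_acks_before=0, delta_s=1.0, rto_s=1.0))   # RTO retransmission
```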

🧠 Round-Trip Time (RTT) 

RTT is the time it takes for a packet to go from sender → receiver → back (ACK).

RTT = Time when ACK received − Time when segment was sent. 

  • The RTT is the basic measurement of network delay for a single data segment.
  • Definition: The time duration from when a TCP sender transmits a data segment until it receives the corresponding acknowledgment (ACK) from the receiver.
  • Purpose: It measures the propagation delay between the two hosts plus any processing time in the network path.
  • Snapshot: It's a raw, single-measurement value that is constantly fluctuating based on network conditions (congestion, route changes, queueing delays, etc.).

⚙️ Smoothed Round-Trip Time (SRTT)

Because individual RTT samples fluctuate (jitter, bursty traffic), TCP doesn’t react to every spike.
Instead, it maintains a smoothed average, called SRTT, using an exponentially weighted moving average (EWMA):

 SRTT = (1 − α) * SRTT + α * RTT_sample, where α = 1/8 (0.125) (default in Linux).

👉 That means new samples slightly adjust the average but don’t completely replace it.
It “smooths out” temporary spikes in delay.

The SRTT is a weighted average of the measured RTT values.
  • Definition: It's an exponentially weighted moving average (EWMA) of the RTT measurements. It's an estimate of the "normal" RTT for the connection.
  • Purpose: To smooth out temporary spikes in RTT measurements and provide a stable base for estimating the timeout value. It makes the TCP stack less reactive to brief network hiccups.
  • Formula (General Form): SRTT = (1 − α) × SRTT_old + α × RTT_new
    • α (alpha) is the smoothing factor (typically 1/8 or 0.125).
    • A low α means the SRTT changes slowly.

📉 RTT Variance (RTTVAR)

RTTVAR (RTT Variance) measures how much RTT values fluctuate (jitter).
It’s a smoothed estimate of the deviation between recent RTT samples and the SRTT.

 RTTVAR = (1 − β) * RTTVAR + β * |SRTT − RTT_sample|, where β = 1/4 (0.25) by default.

👉 If network delay becomes unstable, RTTVAR increases.
👉 If RTT samples are steady, RTTVAR decreases.

The RTTVAR measures how much the RTT is fluctuating around the SRTT average.
  • Definition: It is an EWMA of the deviation (or variance) between the measured RTT and the SRTT.
  • Purpose: To estimate the volatility of the network connection. A high RTTVAR means the RTT is unstable (variable latency), requiring a larger safety margin for the RTO.
  • Formula (General Form): RTTVAR = (1 − β) × RTTVAR_old + β × |SRTT − RTT_new|
    • β (beta) is the gain for the variance (typically 1/4 or 0.25).

⏱️  Calculating RTO (Retransmission Timeout) 

TCP calculates RTO using both SRTT and RTTVAR: RTO = SRTT + 4 * RTTVAR

This ensures the timeout dynamically adapts:

  • If RTT is stable → smaller RTO (faster retransmits).
  • If RTT is fluctuating → larger RTO (avoid false retransmissions).

Calculate the Actual RTO

The actual value used by the sender can be validated in a trace by measuring the time difference between the original segment and its retransmission.

Note: RFC 6298 also says the initial RTO (before any sample) should be 1.0 s. After the first sample, the above formulas apply. Some stacks clamp RTO to a minimum (Linux uses TCP_RTO_MIN, 200 ms).

Formulas used (RFC-style)

  • α = 1/8 = 0.125
  • β = 1/4 = 0.25
  • For the first RTT sample M:
    • SRTT = M
    • RTTVAR = M / 2
  • For each subsequent sample M:
    • D = |SRTT_old − M|
    • RTTVAR = (1 − β) * RTTVAR_old + β * D
    • SRTT = (1 − α) * SRTT_old + α * M
  • RTO = SRTT + 4 * RTTVAR 
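
The update rules above translate directly into a small estimator. This is a minimal sketch, assuming times in milliseconds; the class name RtoEstimator is hypothetical, and the minimum/maximum clamps and clock granularity that real stacks apply are deliberately left out.

```python
class RtoEstimator:
    """SRTT / RTTVAR / RTO update using the formulas listed above."""

    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self) -> None:
        self.srtt = None
        self.rttvar = None
        self.rto = 1000.0  # initial RTO in ms, before any RTT sample

    def update(self, sample_ms: float) -> float:
        if self.srtt is None:
            # First sample M: SRTT = M, RTTVAR = M / 2
            self.srtt = sample_ms
            self.rttvar = sample_ms / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            deviation = abs(self.srtt - sample_ms)
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * deviation
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample_ms
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto


est = RtoEstimator()
for sample in (100, 120, 110):
    rto = est.update(sample)
    print(f"RTT={sample} ms  SRTT={est.srtt:.2f}  RTTVAR={est.rttvar:.2f}  RTO={rto:.2f}")
# RTT=100 ms  SRTT=100.00  RTTVAR=50.00  RTO=300.00
# RTT=120 ms  SRTT=102.50  RTTVAR=42.50  RTO=272.50
# RTT=110 ms  SRTT=103.44  RTTVAR=33.75  RTO=238.44
```

The order of the two updates matters: RTTVAR must be computed from the old SRTT before SRTT itself is refreshed.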

⚙️  Default Initial RTO (Linux / RFC Standard)

According to RFC 6298 (the current standard):  Initial RTO = 1 second (1000 ms)
Linux follows this RFC. The initial RTO itself is not exposed under /proc; the related retry limits can be read with:

cat /proc/sys/net/ipv4/tcp_syn_retries
cat /proc/sys/net/ipv4/tcp_retries1
cat /proc/sys/net/ipv4/tcp_retries2

The timer itself is internal to the kernel and is initialized to 1 second for new connections.

Example Timeline

Step | Event / RTT Sample | RTT (ms) | Calculation (intermediate) | SRTT (ms) | RTTVAR (ms) | RTO_raw (ms) | Notes
0 | SYN sent (no RTT yet) | - | No sample yet | - | - | 1000.00 | Initial RTO before any RTT measurement = 1.0 s (RFC 6298)
1 | First sample | 100 | SRTT = M = 100; RTTVAR = M/2 = 50 | 100.00 | 50.00 | 300.00 | First sample; initial RTO computation
2 | Second sample | 120 | RTTVAR = 0.75×50 + 0.25×|100−120| = 42.50; SRTT = 0.875×100 + 0.125×120 = 102.50 | 102.50 | 42.50 | 272.50 | Slight RTT increase
3 | Third sample | 110 | RTTVAR = 0.75×42.50 + 0.25×|102.50−110| = 33.75; SRTT = 0.875×102.50 + 0.125×110 = 103.44 | 103.44 | 33.75 | 238.44 | RTT drops slightly
4 | Fourth sample (spike) | 200 | RTTVAR = 0.75×33.75 + 0.25×|103.44−200| = 49.45; SRTT = 0.875×103.44 + 0.125×200 = 115.51 | 115.51 | 49.45 | 313.32 | RTT spike increases RTO
5 | Fifth sample | 140 | RTTVAR = 0.75×49.45 + 0.25×|115.51−140| = 43.21; SRTT = 0.875×115.51 + 0.125×140 = 118.57 | 118.57 | 43.21 | 291.42 | RTO adjusts downward after spike
6 | Sixth sample | 150 | RTTVAR = 0.75×43.21 + 0.25×|118.57−150| = 40.27; SRTT = 0.875×118.57 + 0.125×150 = 122.50 | 122.50 | 40.27 | 283.57 | Slight RTT increase
7 | Seventh sample | 130 | RTTVAR = 0.75×40.27 + 0.25×|122.50−130| = 32.08; SRTT = 0.875×122.50 + 0.125×130 = 123.44 | 123.44 | 32.08 | 251.74 | RTT drops moderately
8 | Eighth sample (spike) | 170 | RTTVAR = 0.75×32.08 + 0.25×|123.44−170| = 35.70; SRTT = 0.875×123.44 + 0.125×170 = 129.26 | 129.26 | 35.70 | 272.05 | Spike raises RTO again
9 | Ninth sample | 160 | RTTVAR = 0.75×35.70 + 0.25×|129.26−160| = 34.46; SRTT = 0.875×129.26 + 0.125×160 = 133.10 | 133.10 | 34.46 | 270.94 | Slight decrease in RTT
10 | Tenth sample | 120 | RTTVAR = 0.75×34.46 + 0.25×|133.10−120| = 29.12; SRTT = 0.875×133.10 + 0.125×120 = 131.46 | 131.46 | 29.12 | 247.94 | RTT drop; RTO decreases

Values are carried at full precision between steps and rounded to two decimals for display. In each step RTTVAR is computed first, using the SRTT from the previous step.

 📖  RFC 6298 Rule Summary

 Here’s exactly what happens: 

Step | Event | Variable | Value / Rule
1 | New connection | SRTT | Undefined (no samples yet)
2 | New connection | RTTVAR | Undefined (no samples yet)
3 | Initial RTO | RTO | 1 second (default)
4 | First RTT sample arrives | SRTT | SRTT = RTT_sample (first smoothed estimate)
5 | First RTT sample arrives | RTTVAR | RTTVAR = RTT_sample / 2 (first deviation estimate)
6 | New RTO | RTO | RTO = SRTT + 4 × RTTVAR

So the kernel starts conservatively — waits 1 second before retransmitting the very first packet if no ACK is seen.

⏱️ Backoff Behavior 

If a retransmission is needed and still no ACK is received, the kernel doubles the RTO each time (exponential backoff): 

Attempt | RTO (seconds)
1st | 1.0
2nd | 2.0
3rd | 4.0
4th | 8.0
... | up to system-defined max (usually 120s)

 This prevents flooding a congested network.

 There are four common reasons for packet re-transmission

  • The lack of an acknowledgement that data has been received within a reasonable time
  • The sender discovering that transmission was unsuccessful.
  • The receiver notifying the sender that expected data hasn't been received.
  • The receiver discovering that data has been damaged during initial transmission. 

If no acknowledgment for the data arrives before TCP's retransmission timer expires, the segment is re-transmitted. Re-transmitting repeatedly while no acknowledgment arrives is the default behavior of the Linux kernel. The following OS parameters determine how many re-transmission attempts are made and how long the gaps between them are:

  • TCP_RTO_MIN (200 ms)
  • TCP_RTO_MAX (120 seconds)
  • tcp_retries1 (3)
  • tcp_retries2 (15)

 You may refer to this link for more details:

https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html

TCP retransmits an unacknowledged packet up to tcp_retries2 times (a sysctl setting that defaults to 15), using an exponential backoff in which each retransmission timeout is between TCP_RTO_MIN (200 ms) and TCP_RTO_MAX (120 seconds). Once the 15th retry expires (by default), the TCP stack will notify the layers above (i.e., the application) of a broken connection.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt 
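
To get a feel for how long this takes, the backoff schedule can be summed up. This is a back-of-the-envelope sketch that assumes the first timeout equals TCP_RTO_MIN, which is the best case; on a real connection the starting RTO depends on the measured RTT. The function name backoff_schedule is hypothetical.

```python
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds


def backoff_schedule(retries: int, first_rto: float = TCP_RTO_MIN):
    """Timeouts waited before each retry, plus the final wait, in seconds."""
    schedule = []
    rto = first_rto
    for _ in range(retries + 1):
        schedule.append(min(rto, TCP_RTO_MAX))  # clamp to the maximum
        rto *= 2                                # exponential backoff
    return schedule


schedule = backoff_schedule(retries=15)
print([round(t, 1) for t in schedule[:4]])  # [0.2, 0.4, 0.8, 1.6]
print(round(sum(schedule), 1))              # 924.6
```

Under these assumptions the connection is declared broken after roughly 15 minutes, because the last six waits are all capped at 120 seconds.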

 

TX/RX Synchronization Delays in the Kernel: When Full Duplex Becomes “Pseudo Half-Duplex”

In theory, Ethernet and modern NICs operate in full duplex, meaning transmit (TX) and receive (RX) operations can happen simultaneously over separate lanes. However, at the kernel and driver level, TX and RX often share critical hardware structures — such as descriptor rings, DMA channels, and driver locks protecting the socket buffer queues (sk_buffs). During high-burst transmissions, the TX path can monopolize these shared resources. The NIC driver focuses on draining the TX ring using DMA transfers, keeping interrupts masked or deprioritizing RX handling to maintain throughput efficiency.

When this happens, the RX engine becomes temporarily deferred, meaning that even though ACKs or responses arrive on the wire, they sit idle in the NIC buffer until the TX DMA cycle finishes. This can delay ACK processing by just a few microseconds — but in high-frequency trading systems, embedded gateways, or Telium-class firmware, those microseconds add up, causing latency spikes, retransmission triggers, or perceived half-duplex behavior.

Essentially, the system remains electrically full duplex, but kernel-level locking and DMA scheduling create a temporary serialization between TX and RX. This phenomenon explains why, during heavy TX bursts, inbound packets may appear delayed or dropped — not due to congestion, but because the driver and hardware are briefly “busy” on the transmit side, deferring receive processing until resources are released.

 

 SENDER HOST — TX (Transmit) Path (Full Kernel View)

Layer / Box | Description | Key Structures / Functions | Important Parameters
User Space App | Your program calling send() or write() on a socket. | send(), write(), or sendto() syscalls | User buffer size
Socket Layer | Converts user data into kernel-managed buffers (circular TX buffer). Responsible for flow control and segmentation. | tcp_sendmsg(), struct sock, sk_buff (skb) | sndbuf, sk_sndbuf, write_seq, snd_nxt
TCP Layer | Creates TCP segments (MSS-sized), adds headers, maintains retransmission and cwnd logic. | tcp_transmit_skb(), congestion control modules | MSS, cwnd, ssthresh, RTO
IP Layer | Wraps each TCP segment in an IP header; handles routing and fragmentation. | ip_queue_xmit(), dst_entry, rtable | IP TTL, DF bit, TOS
Qdisc Layer | Kernel queuing discipline, responsible for buffering, scheduling, and shaping before the NIC. | struct Qdisc, pfifo_fast, fq_codel | qlen (queue length), backlog, tx_bytes
NIC Driver | Implements ndo_start_xmit() to map packets to DMA descriptors and push to NIC. | netdev_ops, struct sk_buff, struct netdev_queue | TX ring size, tx_timeout, tx_queue_len
NIC Hardware | Performs actual DMA read and transmission. | TX descriptor ring, DMA engine | Descriptor count, TX head/tail index

 

 RECEIVER HOST — RX (Receive) Path (Full Kernel View)

Layer / Box | Description | Key Structures / Functions | Important Parameters
NIC Hardware | Receives Ethernet frames and writes into pre-allocated RX DMA buffers in main memory. | RX ring buffer, DMA engine | Descriptor ring length, head/tail index
NIC Driver | Handles RX interrupt or NAPI poll, allocates sk_buff, fills metadata, pushes packet up. | napi_poll(), netif_receive_skb() | RX budget, RX ring lock, napi_gro_receive()
IP Layer | Extracts and validates IP header, does checksum verification, routing decision. | ip_rcv(), ip_rcv_finish() | IP header len, checksum, TTL
TCP Layer | Reassembles segments, updates ACKs, flow control, and congestion window. | tcp_v4_rcv(), tcp_ack(), tcp_data_queue() | rcv_nxt, snd_una, window size, ACK delay
Socket Layer | Buffers the received payload in a circular receive buffer for user-space reading. | sk_receive_queue, tcp_recvmsg() | rcvbuf, backlog size
User Space App | Reads data with recv() or read(). | recv(), read() syscalls | Buffer length, blocking/non-blocking

 🧩 Where Segmentation and Reassembly Occur

Operation | Direction | Layer | Mechanism
Segmentation | TX (Sender) | TCP Layer | Splits application data into MSS-sized sk_buffs before IP encapsulation
Fragmentation | TX (Optional) | IP Layer | Only if the IP packet is larger than the path MTU (rare with PMTU discovery)
Reassembly | RX (Receiver) | TCP Layer | Combines multiple segments into original stream before copying to recv buffer
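
Segmentation and reassembly can be sketched with sequence-numbered chunks. This is a user-space illustration with hypothetical function names, not how the kernel manipulates sk_buffs; it ignores duplicates and overlapping segments.

```python
def segment(data: bytes, mss: int, initial_seq: int = 0):
    """Split a byte stream into (sequence_number, payload) segments."""
    return [(initial_seq + off, data[off:off + mss])
            for off in range(0, len(data), mss)]


def reassemble(segments, initial_seq: int = 0) -> bytes:
    """Rebuild the in-order stream, stopping at the first gap."""
    stream = bytearray()
    next_seq = initial_seq
    for seq, payload in sorted(segments):
        if seq != next_seq:
            break  # a segment is missing; later data stays queued out of order
        stream += payload
        next_seq += len(payload)
    return bytes(stream)


message = b"x" * 3000
segs = segment(message, mss=1460)
print([len(payload) for _, payload in segs])   # [1460, 1460, 80]
print(reassemble(reversed(segs)) == message)   # True, arrival order does not matter
```

Stopping at the first gap mirrors the fact that the application only ever sees a contiguous byte stream.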

🧠 Concept 

Concept | Explanation
Circular Buffer | Both TX and RX socket buffers are circular, maintaining read/write pointers.
Segmentation | TCP divides data into MSS units before enqueueing for IP.
Qdisc | Queuing logic that schedules skbs to the NIC for fairness or shaping.
DMA Descriptor Ring | Shared memory region where NIC reads (TX) or writes (RX) packets.
Interrupt / NAPI | NIC signals kernel for packet completion or reception.
ACK Clocking | TCP relies on ACK arrival to pace cwnd and send new data.
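
The circular buffer idea (read and write pointers that wrap around a fixed-size array) can be shown with a tiny ring. This is a generic sketch with a hypothetical class name; real descriptor rings hold DMA descriptors and are shared between the driver and the NIC.

```python
class RingBuffer:
    """Fixed-size circular buffer with wrapping read/write pointers."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.head = 0   # next slot to read
        self.tail = 0   # next slot to write
        self.count = 0  # number of occupied slots

    def push(self, item) -> bool:
        if self.count == len(self.slots):
            return False  # ring full: the producer must wait or drop
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None  # ring empty
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item


ring = RingBuffer(size=2)
print(ring.push("pkt1"), ring.push("pkt2"), ring.push("pkt3"))  # True True False
print(ring.pop())         # pkt1
print(ring.push("pkt3"))  # True (written into slot 0, which was just freed)
```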

Duplex Modes — The Basics 

Mode | Meaning | Behavior
Full Duplex | TX (Transmit) and RX (Receive) can operate simultaneously on the same link. | Both sides can send and receive data at the same time without waiting.
Half Duplex | TX and RX share the same medium but cannot operate simultaneously; only one direction is active at a time. | When one side transmits, the other must wait until it finishes before replying.

 

 How It Affects TX/RX Flow (Compared to the Above Full Kernel Flow)

Behavior | Full Duplex (Standard Linux PC / Server) | Half Duplex (Embedded / Firmware-driven)
TX/RX Operation | TX and RX rings work independently and can DMA simultaneously. | TX and RX rings often share DMA channels or hardware resources.
Interrupt Handling | TX completions and RX arrivals can occur in parallel threads (NAPI or IRQ). | RX interrupts might be masked or deferred while TX DMA is active.
Locking | Independent TX/RX locks; contention minimal. | Shared TX/RX lock: RX handler waits for TX release (→ jitter).
Timing | Near zero coupling: ACKs can be received while still sending. | ACKs can be delayed because RX engine sleeps during TX DMA.
Effective Behavior | True duplex link: high throughput and low latency. | Quasi half-duplex: you may see pauses between TX bursts and ACK arrivals.

⚙️ Embedded Devices and “Quasi Half-Duplex”

  • Many embedded NICs or SoCs (System-on-Chip) operate in a pseudo half-duplex fashion because:
    • They share one DMA engine for both TX and RX.
    • TX DMA activity “locks out” RX temporarily to avoid memory contention.
    • Firmware (or RTOS driver) toggles RX/TX enable bits explicitly.

This results in:

  • Short RX blackout windows (few µs–ms).
  • Delayed ACK reception → increased RTT.
  • Possible TCP retransmissions or cwnd collapse if ACKs are delayed too long. 

 🧩 Where TX/RX Synchronization & Flip Timing Actually Happens

Layer | Role | TX/RX Interaction | Impact / Synchronization Point
Application Layer | Calls send() / recv() syscalls | None (user-space calls are independent) | Not applicable
Socket Layer | Manages circular buffers (sk_sndbuf, sk_rcvbuf) | Logically independent per socket | No locking contention with NIC directly
TCP Layer | Controls congestion window (cwnd) and ACK timing | Depends on timely RX of ACKs | Delays appear indirectly if RX interrupts are delayed
IP / Routing Layer | Routes outgoing/incoming sk_buffs | Minimal interaction | Shared packet queues, no hardware contention
Queueing Discipline (qdisc) | Software queue before NIC | TX path only | Independent; no RX involvement
NIC Driver Layer 🧠 (CRITICAL ZONE) | Manages TX and RX rings, DMA, and interrupts | TX and RX share hardware resources, locks, and interrupt context | ⚠️ This is where “flip timing” and half-duplex effects occur
NIC Hardware (MAC / PHY) | Executes DMA transfers, transmits and receives packets | TX and RX engines share PCIe bus, buffers, and DMA channels | Hardware-level contention or serialization possible

 The Layer Where It Actually Happens

TX/RX synchronization and flip timing issues occur at the NIC driver and hardware layer — below the IP stack, inside the kernel’s device driver (netdev) context.

Specifically:

  • In Linux, this involves:
    • ndo_start_xmit() (TX path)
    • napi_poll() or netif_receive_skb() (RX path)
    • Both share spinlocks, ring buffer memory, and interrupt routines.
  • TX completion and RX interrupt handling can block each other if:
    • The driver uses shared locks (e.g., netdev_queue->lock).
    • NAPI polling is deferred while TX IRQs dominate.
  • Firmware-driven NICs (like embedded SoCs or Telium-based stacks) may enforce TX completion before RX enable — effectively half-duplexing the link temporarily. 

 

  • Layer: NIC driver & hardware (bottom of kernel stack).
  • Mechanism: Shared DMA queues, interrupt lines, and spinlocks.
  • Effect: TX bursts monopolize hardware → RX path delayed → late ACKs/timeouts.
  • Analogy: Think of TX and RX as two people sharing a single narrow door — if one keeps walking through (TX burst), the other (RX) must wait. 

⚙️ TX–RX Ring Interaction & Lock Contention Timeline

TX and RX rings share the same NIC DMA engine and driver locks. During TX bursts, the transmit path dominates CPU or PCIe/DMA bandwidth, delaying RX interrupts or NAPI polling.

Shared Rings + Contention Timeline 

 

🕒 Microsecond-Scale Behavior Summary

Time (µs) | Event | Description
0 | TX lock acquired | ndo_start_xmit() grabs TX ring lock
2 | TX DMA starts | Frames begin DMA transfer to NIC
4 | RX interrupt arrives | Incoming ACK/packet; IRQ handler can’t take lock
6 | TX completion IRQ fires | TX done, but same IRQ line shared
8 | Lock still held | RX polling (NAPI) delayed
10 | Lock released | RX handler runs
12 | ACK processed | RTT appears inflated to TCP layer
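
The timeline above boils down to a simple rule: an ACK cannot be handled before the shared lock is released. The sketch below models only that rule; the function name and the 2 µs handling cost are assumptions chosen to match the example timings.

```python
def ack_processed_at(ack_arrival_us: float, lock_release_us: float,
                     handling_us: float = 2.0) -> float:
    """Time (µs) at which the RX handler finishes processing an ACK."""
    start = max(ack_arrival_us, lock_release_us)  # wait while TX holds the lock
    return start + handling_us


print(ack_processed_at(ack_arrival_us=4, lock_release_us=10))  # 12.0 (contended)
print(ack_processed_at(ack_arrival_us=4, lock_release_us=0))   # 6.0 (no contention)
```

The 6 µs difference between the two cases is the extra delay that TCP would attribute to the network, even though it was introduced inside the host.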

📊 Resulting TCP Symptoms 

Observable Effect | Root Cause | Layer Impacted
Increased RTT | RX delay due to TX contention | TCP
Duplicate ACKs | Delayed ACK reception | TCP
cwnd stalls | ACK clock slowed down | Congestion control
Spurious retransmissions | RX DMA backlog | Transport layer
Apparent half-duplex | Firmware defers RX | NIC hardware/driver

 ⚙️ Real-World Example

  • Seen in Telium, NXP, or Intel I225/I226 firmware when large TX bursts occur.
  • Some NICs serialize TX/RX DMA channels to reduce PCIe contention.
  • Linux mitigates this with:
    • NAPI (poll-mode RX)
    • Separate MSI-X interrupts for TX and RX queues
    • RPS/RFS to distribute RX load across CPUs

 Kernel-Level Timing Diagram — TX Burst Lock Contention & ACK Delay (µs-scale)

🧠 Explanation by Phase 

Phase | Time Range (Example) | Kernel/Driver Activity | Effect
1. TX Lock Acquired | 0–5 µs | spin_lock(txrx_lock) acquired by TX path before DMA enqueue. | RX path blocked from accessing shared resources.
2. TX DMA Active | 5–150 µs | NIC continuously transmitting packets from TX ring. RX DMA channel disabled or idle. | ACKs arrive at NIC but remain unprocessed (in HW buffer).
3. RX Deferred | 50–200 µs | Interrupts masked; NAPI polling paused. | ACKs not delivered to TCP layer → congestion window stalls.
4. TX Completion | 150–180 µs | TX interrupt raised → completion handler releases lock. | RX engine re-enabled.
5. RX Resume | 180–250 µs | NIC DMA posts pending ACKs to RX ring → kernel processes via netif_receive_skb(). | TCP sees ACKs late → apparent RTT inflation.
6. Normal Flow | >250 µs | TX and RX operate normally until next burst. | ACK-based pacing normalizes.

 🔩 Affected Components

Component | Description | Relevance
TX/RX Ring Buffers | Shared circular DMA descriptor arrays managed by NIC and driver. | Contention occurs when both directions share descriptors or DMA channel.
TX/RX Spinlock (txrx_lock) | Kernel lock guarding shared NIC resources. | Prevents concurrent DMA setup, serializing TX and RX.
NAPI Poll Loop | Kernel polling mode replacing per-packet interrupts. | Paused or delayed when TX holds lock too long.
TCP Congestion Control | Relies on ACK pacing. | Missed or late ACKs shrink cwnd temporarily.
DMA Engine / SoC Bus | Shared hardware bus between NIC and memory. | TX DMA hogs bandwidth, delaying RX descriptor updates.

 Full-Duplex vs Quasi Half-Duplex Timing

Behavior | Full Duplex | Quasi Half-Duplex (Embedded)
TX & RX DMA | Independent | Shared DMA channel
Lock Contention | None | txrx_lock serializes access
RX Interrupts During TX | Allowed | Masked/deferred
ACK Latency | < 10 µs | 100–300 µs typical
TCP Performance | Stable cwnd growth | Stalled cwnd during bursts

 

RX/TX Path in the Linux Kernel — Deep Dive

Every bit of data that travels across a network — from a simple web page request to a large file transfer — passes through the Linux kernel’s networking stack. Inside this stack, two main paths control the entire data flow: the TX (Transmit) path for sending packets and the RX (Receive) path for receiving them. These two paths act like the arteries and veins of the networking system — one pushes data out, and the other brings data in.

🔹 The TX Path – From Application to Network

When an application sends data (for example, using send() or write()), that data first enters the TX path.
Inside the kernel, it passes through several layers:

  • The TCP layer segments the data into chunks (based on the Maximum Segment Size, MSS).
  • The IP layer adds routing and addressing information.
  • The Queueing discipline (qdisc) layer decides when and how packets are transmitted (handling traffic shaping and prioritization).
  • Finally, the Network Interface Card (NIC) driver uses DMA (Direct Memory Access) to move packets from kernel memory to the network hardware, which then transmits them on the wire.

The TX path ensures efficient use of available bandwidth, obeys congestion control rules, and maintains the order and reliability of packets. 
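
The segmentation step can be pictured with a tiny sketch. This is not how the kernel does it (the kernel works on sk_buff structures, not Python bytes); the function name `segment` and the MSS of 1460 bytes are assumptions for illustration.

```python
def segment(payload: bytes, mss: int) -> list[bytes]:
    """Split an application payload into chunks of at most `mss` bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


# 1460 is a typical MSS for Ethernet with IPv4 (1500 - 20 IP - 20 TCP).
segments = segment(b"x" * 3000, 1460)
print([len(s) for s in segments])  # [1460, 1460, 80]
```

Each chunk would then receive its own TCP and IP headers before being handed to the qdisc.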

🔹 The RX Path – From Network to Application

When packets arrive from the network, the RX path takes over.
The NIC receives incoming frames and places them in the RX ring buffer (a circular buffer in kernel memory). The NAPI (New API) mechanism then retrieves those packets efficiently, avoiding excessive interrupts during heavy load. 

Each packet travels upward through:

  • The Ethernet layer (for frame validation),
  • The IP layer (for routing and integrity checks),
  • And finally, the TCP layer, which reassembles packets into the original data stream.

The application then receives this reassembled data when it calls recv() or read(). 
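
The reassembly step is easier to see in a small model. The sketch below is a simplification under stated assumptions: sequence numbers never wrap, and out-of-order data sits in a plain dictionary. The class name `Reassembler` is hypothetical; only `rcv_nxt` mirrors a real TCP state variable.

```python
class Reassembler:
    """Toy TCP receiver: delivers bytes in order, parks out-of-order segments."""

    def __init__(self, initial_seq: int = 0):
        self.rcv_nxt = initial_seq   # next sequence number expected
        self.out_of_order = {}       # seq -> data waiting for a gap to fill
        self.stream = bytearray()    # in-order data ready for recv()

    def _deliver(self, seq: int, data: bytes) -> None:
        # Append only the part that lies at or beyond rcv_nxt.
        if seq + len(data) > self.rcv_nxt:
            self.stream += data[self.rcv_nxt - seq:]
            self.rcv_nxt = seq + len(data)

    def receive(self, seq: int, data: bytes) -> int:
        """Process one segment and return the cumulative ACK number."""
        if seq > self.rcv_nxt:
            self.out_of_order[seq] = data      # gap before it: hold it back
            return self.rcv_nxt
        self._deliver(seq, data)
        # Drain parked segments that have now become contiguous.
        while True:
            ready = [s for s in self.out_of_order if s <= self.rcv_nxt]
            if not ready:
                break
            s = min(ready)
            self._deliver(s, self.out_of_order.pop(s))
        return self.rcv_nxt


r = Reassembler()
print(r.receive(5, b"world"))   # arrives early, ACK stays at 0
print(r.receive(0, b"hello"))   # fills the gap, ACK jumps to 10
print(bytes(r.stream))
```

The ACK number only advances once the gap is filled, which is exactly the feedback the sender uses to detect missing segments.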

🔹 Why TX/RX Path Matters

The efficiency of these two paths directly determines network performance, latency, and throughput.
A well-optimized TX path reduces CPU load and ensures packets are sent smoothly even under high traffic. An efficient RX path prevents packet drops and keeps the system responsive under heavy inbound load.

Together, the TX and RX paths form the foundation of how Linux handles networking — bridging the gap between user-space applications and physical network interfaces. Without their careful coordination, reliable and high-speed network communication simply wouldn’t be possible. 

 

TX Path : Application → Kernel → TCP/IP Stack → qdisc → NIC driver → DMA → Wire

RX path: Wire → NIC → DMA → NAPI → TCP/IP Stack → Socket Buffer → Application 

These paths operate asynchronously, governed by socket buffers (sk_buff), ring buffers, and TCP flow control (rwnd, cwnd). 

 

  🚀 TX (Transmit) Path — Inside Linux

| Step | Component | Function | Kernel Structures / Functions | Flow Control Impact / Notes |
| --- | --- | --- | --- | --- |
| 1️⃣ | Application write() / sendmsg() | User → kernel transition | sys_sendmsg() / sys_write() | The user-space app writes to a socket. The syscall transitions to kernel mode. |
| 2️⃣ | Socket buffer allocation | Kernel allocates sk_buff, copies user data | struct sk_buff, sock_alloc_send_pskb() | Affects send buffer limits (SO_SNDBUF). Data is copied from the user buffer into an sk_buff (the universal kernel packet structure). |
| 3️⃣ | TCP segmentation | Segments based on MSS, cwnd | tcp_sendmsg(), tcp_write_xmit() | cwnd limits how many packets can be in flight. TCP handles segmentation, sequence numbers, and window checks, and decides how much to send based on congestion control. |
| 4️⃣ | Offloading | Large Send Offload (TSO/GSO) | skb_gso_segment() | Reduces CPU overhead. |
| 5️⃣ | Routing & IP encapsulation | Adds IP header, determines route | ip_queue_xmit() | The IP layer adds an IP header and uses the FIB (Forwarding Information Base) to select the route and next hop. |
| 6️⃣ | Queueing discipline (qdisc) | Manages packet scheduling and shaping | pfifo_fast, fq_codel, sch_fq, etc. | Packets enter the qdisc layer, where scheduling, shaping, or prioritization occurs. |
| 7️⃣ | NIC TX ring enqueue | Adds skb to NIC’s TX descriptor ring | ndo_start_xmit() | Controlled by TX ring depth. The driver puts packets in the NIC’s transmit ring buffer. |
| 8️⃣ | DMA mapping | NIC copies buffer from kernel to device | dma_map_single() | Zero-copy possible (e.g., XDP). The NIC’s DMA engine copies packet data from kernel memory to NIC buffers. |
| 9️⃣ | Transmission | NIC sends on the wire | PHY/MAC | Packet bits go onto the Ethernet medium. Outstanding (in-flight) data increases, so the room left under cwnd shrinks; cwnd itself does not change. |

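TCP Segmentation Offload (TSO) lets the NIC split one large TCP buffer into MSS-sized segments in hardware. Generic Segmentation Offload (GSO) is the software counterpart: the stack carries one large sk_buff for most of the TX path and splits it just before it reaches the driver. Both reduce the per-packet CPU cost of transmitting TCP data.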

📍 Key interactions with TCP congestion control: At Step 3, TCP checks if it’s allowed to send more data (based on cwnd and rwnd). If not, data stays queued in the socket buffer until ACKs arrive. 
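
The check at Step 3 boils down to simple arithmetic. The sketch below is a minimal model, not kernel code: the function name `usable_window` is hypothetical, while `snd_una` and `snd_nxt` follow the usual TCP meaning (oldest unacknowledged byte and next byte to send).

```python
def usable_window(cwnd: int, rwnd: int, snd_una: int, snd_nxt: int) -> int:
    """Bytes the sender may still transmit right now."""
    in_flight = snd_nxt - snd_una            # sent but not yet ACKed
    return max(0, min(cwnd, rwnd) - in_flight)


# rwnd is the tighter limit here: min(10000, 4000) - 2000 in flight.
print(usable_window(cwnd=10_000, rwnd=4_000, snd_una=1_000, snd_nxt=3_000))
```

When the result is 0, data simply stays in the socket send buffer until an ACK advances snd_una or a larger window is advertised.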

📈 Congestion Control in TX Path

  • cwnd limits how much unacknowledged data can be sent.
  • On ACK: cwnd grows (Slow Start → Congestion Avoidance).
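  • On loss: cwnd is reduced (Fast Retransmit / Recovery). Reno halves it; CUBIC, the Linux default, cuts it to about 70%. A retransmission timeout drops cwnd to one segment.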
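
A simplified Reno-style model makes these rules concrete. This is a sketch under assumptions: cwnd is counted in bytes, cwnd stands in for the flight size on loss, and the function names are hypothetical. Other algorithms such as CUBIC use different growth and reduction formulas.

```python
MSS = 1460  # assumed segment size in bytes


def on_ack(cwnd: int, ssthresh: int) -> int:
    """Return the new cwnd after one ACK of new data (Reno-style)."""
    if cwnd < ssthresh:
        return cwnd + MSS                      # slow start: +1 MSS per ACK
    return cwnd + max(1, MSS * MSS // cwnd)    # avoidance: ~+1 MSS per RTT


def on_fast_retransmit(cwnd: int) -> tuple[int, int]:
    """Return (new cwnd, new ssthresh) after loss seen via duplicate ACKs."""
    ssthresh = max(cwnd // 2, 2 * MSS)
    return ssthresh, ssthresh


cwnd, ssthresh = MSS, 10 * MSS
for _ in range(12):
    cwnd = on_ack(cwnd, ssthresh)
print("after 12 ACKs:", cwnd)
cwnd, ssthresh = on_fast_retransmit(cwnd)
print("after loss:   ", cwnd, ssthresh)
```

Growth is fast until cwnd reaches ssthresh, then slows to roughly one segment per round trip, and a loss pulls both values down.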

 📥 RX (Receive) Path — Inside Linux

| Step | Component | Function | Kernel Structures / Functions | Flow Control Impact / Notes |
| --- | --- | --- | --- | --- |
| 1️⃣ | Packet arrival (wire) | Bits received via PHY | NIC hardware | Packet received on wire; NIC stores it in RX ring via DMA. |
| 2️⃣ | DMA to RX ring | NIC writes frame into RX buffer | struct napi_struct + DMA descriptors | Bypasses CPU copy. |
| 3️⃣ | NAPI polling | Kernel polls RX ring (reduces interrupt load) | napi_poll(), netif_receive_skb() | Controls packet batch size. NIC triggers interrupt or NAPI polls RX ring. |
| 4️⃣ | sk_buff creation | Build skb from RX buffer | build_skb() | Per-packet metadata. The driver wraps raw data into an sk_buff structure. |
| 5️⃣ | Protocol stack demux | Ethernet → IP → TCP | eth_type_trans(), ip_rcv(), tcp_v4_rcv() | Routing / filtering applied. Each layer parses headers and hands the packet up. |
| 6️⃣ | TCP reassembly | Handles out-of-order, missing segments | tcp_data_queue() | Manages rcv_nxt, window. TCP verifies checksum and sequence number, and reorders out-of-sequence segments. |
| 7️⃣ | ACK generation | Sends ACK to sender | tcp_send_ack() | Updates sender’s cwnd. TCP acknowledges received data to the sender (critical for congestion control feedback). |
| 8️⃣ | Delivery to app | Copy to user-space buffer | tcp_recvmsg() | Receiver window (rwnd) shrinks/grows. When the application calls recv(), the kernel copies the data from the socket buffer to user space. |

 📍 Key point:  RX path determines feedback to sender via ACKs and advertised window (rwnd).

📉 Flow Control in RX Path

  • rwnd (receiver window) = available socket buffer space.
  • Sender’s in-flight data cannot exceed min(cwnd, rwnd), so even a large cwnd cannot overrun the receiver.
  • As the app reads data (recv()), kernel increases rwnd → sender resumes.
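
The relationship between buffer occupancy and rwnd can be modeled in a few lines. This is a deliberately simplified sketch: real kernels reserve part of the buffer for bookkeeping overhead and apply window scaling, and the class name `ReceiveBuffer` is hypothetical.

```python
class ReceiveBuffer:
    """Toy receive buffer whose free space is the advertised window."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.unread = 0               # bytes queued, not yet read by the app

    @property
    def rwnd(self) -> int:
        return self.capacity - self.unread

    def on_segment(self, nbytes: int) -> int:
        """Accept up to rwnd bytes from the network; return bytes accepted."""
        accepted = min(nbytes, self.rwnd)
        self.unread += accepted
        return accepted

    def app_read(self, nbytes: int) -> int:
        """Application recv(): frees space, so rwnd grows again."""
        taken = min(nbytes, self.unread)
        self.unread -= taken
        return taken


buf = ReceiveBuffer(4096)
buf.on_segment(3000)
print("rwnd after 3000 bytes arrive:", buf.rwnd)
buf.on_segment(2000)
print("rwnd when buffer is full:    ", buf.rwnd)
buf.app_read(1000)
print("rwnd after app reads 1000:   ", buf.rwnd)
```

A window of 0 tells the sender to stop; it reopens only when the application drains the buffer.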

🧩 Key Kernel Structures (for both paths)

| Structure | Layer | Purpose |
| --- | --- | --- |
| struct sk_buff | Core networking | Universal packet descriptor |
| struct sock | Socket layer | Holds socket state |
| struct tcp_sock | TCP layer | Tracks cwnd, ssthresh, seq numbers |
| struct net_device | NIC abstraction | Interface representation |
| struct napi_struct | Driver layer | Used for RX polling via NAPI |
| struct netdev_queue | Link layer | Represents TX queue for NIC |

🧠 How the Flow Works

| Phase | Sender Action | Receiver Response | Control Feedback |
| --- | --- | --- | --- |
| 1. TX initiation | App sends → kernel enqueues data into TX socket buffer. | - | - |
| 2. TCP segmentation | tcp_write_xmit() forms segments (limited by cwnd). | - | - |
| 3. Transmission | Packets queued → NIC TX ring → DMA → Wire. | - | - |
| 4. RX processing | - | NIC RX ring → napi_poll() → tcp_v4_rcv() → data reassembly. TCP sends ACKs. | tcp_send_ack() |
| 5. ACK reception | Sender receives ACKs, advances snd_una, increases cwnd. | - | cwnd++ |
| 6. App delivery | - | Data reaches receiver’s app via recv(). rwnd grows (buffer freed). | rwnd++ advertised |
| 7. Loop continues | New data transmitted as cwnd and rwnd allow. | - | Continuous feedback loop |

⚙️ Core Control Interactions

| Mechanism | Managed By | Direction | Purpose |
| --- | --- | --- | --- |
| cwnd (Congestion Window) | Sender (TCP layer) | Outbound | Controls how much data can be in-flight |
| rwnd (Receive Window) | Receiver (TCP layer) | Inbound (Advertised) | Tells sender how much buffer space is available |
| ACK packets | Receiver → Sender | Reverse | Signals successful receipt & drives cwnd growth |
| NAPI polling | Kernel Driver | Local (RX) | Reduces interrupt overhead during heavy load |
| TSO/GRO/GSO | NIC / Kernel | Local | Offload large segment handling |

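Circular buffers are a useful model for how the kernel manages socket data queues. These queues sit between the application and the TCP/IP stack, i.e., at the Transport Layer (TCP/UDP), and are visible as the socket send and receive buffers. In Linux the queues are implemented as lists of sk_buff bounded by sk_sndbuf and sk_rcvbuf rather than as literal byte rings, but the head/tail picture below is a convenient way to reason about them.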

1️⃣ Sending Circular Buffer (TX)

  • Where: Sender host, inside kernel, part of TCP socket (struct tcp_sock).
  • Layer: Transport layer (TCP)
  • Kernel Structure: sk_buff queues in the send buffer (sk_sndbuf)
  • Function:
    • Holds application data before it is segmented and transmitted.
    • TCP manages flow/congestion control using this buffer (cwnd, snd_una, snd_nxt).
    • Acts as a circular/ring buffer:
      • Head = next free spot to write data from app
      • Tail = oldest byte still waiting to be transmitted or acknowledged

2️⃣ Receiving Circular Buffer (RX)

  • Where: Receiver host, inside kernel, part of TCP socket (struct tcp_sock).
  • Layer: Transport layer (TCP)
  • Kernel Structure: sk_buff queues in the receive buffer (sk_rcvbuf)
  • Function:
    • Holds packets received from NIC (via RX ring → sk_buff) until the application reads them.
    • TCP reorders out-of-sequence segments here.
    • Acts as a circular buffer:
      • Head = last received segment
      • Tail = next byte to deliver to application (recv() call)

| Buffer Type | Location | Layer | Kernel Structure | Purpose |
| --- | --- | --- | --- | --- |
| TX (Send) | Sender kernel | Transport / TCP | sk_sndbuf + sk_buff queue | Stores app data before segmentation & transmission |
| RX (Receive) | Receiver kernel | Transport / TCP | sk_rcvbuf + sk_buff queue | Stores received packets until application reads them |

These circular buffers are separate from NIC hardware rings, but interact with them:

  • TX buffer → NIC TX ring → DMA → wire
  • RX buffer ← NIC RX ring ← DMA ← wire

A circular buffer makes efficient use of a fixed block of memory without moving data: the head and tail pointers simply wrap around as data is sent or consumed.
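
The wrap-around behavior is easiest to see in code. The sketch below is a generic byte ring buffer, not a kernel structure; the class name `RingBuffer` and its methods are hypothetical. Head is the next write position and tail is the next read position, as in the description above.

```python
class RingBuffer:
    """Fixed-size byte ring: head = next write slot, tail = next read slot."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0
        self.tail = 0
        self.count = 0                 # bytes currently stored

    def write(self, data: bytes) -> int:
        """Store as much of `data` as fits; return bytes written."""
        n = min(len(data), self.size - self.count)
        for i in range(n):
            self.buf[(self.head + i) % self.size] = data[i]
        self.head = (self.head + n) % self.size   # wraps past the end
        self.count += n
        return n

    def read(self, n: int) -> bytes:
        """Remove and return up to `n` bytes in FIFO order."""
        n = min(n, self.count)
        out = bytes(self.buf[(self.tail + i) % self.size] for i in range(n))
        self.tail = (self.tail + n) % self.size
        self.count -= n
        return out


rb = RingBuffer(8)
rb.write(b"abcdef")
print(rb.read(4))          # frees the first four slots
rb.write(b"ghijkl")        # wraps around the end of the array
print(rb.head, rb.read(8))
```

No bytes are ever shifted: only the two indices move, which is why this layout suits fixed-size queues.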

Kernel-level view of how TX/RX buffers, NIC rings, and TCP flow control work together in Linux during packet transmission:

 

 Key Technical Points (TX side):

  • TX circular buffer (sk_sndbuf): stores application data that is unsent or sent but not yet acknowledged; sender can’t exceed cwnd bytes in-flight.
  • NIC TX ring: hardware descriptor queue for DMA transfers; enables zero-copy transmission.
  • Flow Control: cwnd limits outstanding bytes; ACKs from receiver advance tail pointer.

 

Key Technical Points (RX side):

  • RX circular buffer (sk_rcvbuf): holds incoming data until application reads it.
  • Flow Control (rwnd): advertised to sender to prevent buffer overflow.
  • ACK Generation: triggers cwnd updates on sender, enabling more data transmission.

 

  • TX buffer: ensures sender can buffer data for NIC without blocking app.
  • RX buffer: ensures receiver can store out-of-order packets and feed app smoothly.
  • NIC Rings: act as fast hardware queues between kernel memory and network.
  • TCP flow/congestion control: cwnd (sender) and rwnd (receiver) regulate transmission rate and buffer usage. 

| Component | Layer | Role |
| --- | --- | --- |
| TX Circular Buffer (sk_sndbuf) | Transport / TCP | Holds unsent and unacknowledged data from the application; transmission from it is limited by cwnd |
| RX Circular Buffer (sk_rcvbuf) | Transport / TCP | Holds received packets until application reads them; free space is advertised via rwnd |
| NIC TX Ring | Link / Driver | Hardware queue for outgoing packets via DMA |
| NIC RX Ring | Link / Driver | Hardware queue for incoming packets via DMA |
| cwnd | TCP Layer | Congestion control: limits in-flight bytes |
| rwnd | TCP Layer | Flow control: prevents receiver buffer overflow |

 🧩 Multiple Buffers in Linux TX/RX Path

There are three main buffers at the sender side, and similar buffers at the receiver side:

  • Socket Layer TX buffer (sk_sndbuf)
  • Queueing Discipline (qdisc) buffer
  • NIC Driver TX ring buffer

1️⃣ Socket Layer Buffer (Circular Buffer)

  • Layer: Transport (TCP)
  • Structure: sk_buff queue inside struct sock → sk_sndbuf
  • Purpose:
    • Holds application data before TCP segments it for transmission.
    • Acts as a circular buffer:
      • Head: next free spot to copy data from app
      • Tail: next byte to transmit or waiting for ACK
    • Works with TCP flow control:
      • cwnd (congestion window) limits the amount of in-flight data
      • snd_una / snd_nxt track acknowledged and unacknowledged bytes
  • Behavior:
    • When app calls send(), data is copied into this buffer.
    • TCP checks cwnd and only moves data from this buffer into the qdisc according to congestion control rules.
    • Provides a back-pressure mechanism for applications: if the buffer is full, send() may block or fail with EAGAIN. 
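
The back-pressure behavior can be observed from user space. The sketch below uses `socket.socketpair()` as a stand-in for a TCP connection (an assumption made so the example needs no network; on Linux this gives a pair of Unix-domain stream sockets). The peer never reads, so a non-blocking sender eventually gets EAGAIN, which Python surfaces as `BlockingIOError`. The helper name `fill_send_buffer` is hypothetical.

```python
import socket


def fill_send_buffer(chunk_size: int = 4096) -> int:
    """Send until the kernel buffer is full; return bytes accepted."""
    sender, receiver = socket.socketpair()
    try:
        sender.setblocking(False)
        sent = 0
        while True:
            try:
                sent += sender.send(b"\0" * chunk_size)
            except BlockingIOError:
                # Buffer full: this is the EAGAIN case described above.
                return sent
    finally:
        sender.close()
        receiver.close()


print("bytes accepted before EAGAIN:", fill_send_buffer())
```

The exact byte count depends on the system's buffer settings. A blocking socket would instead sleep inside send() at the same point.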

2️⃣ Queueing Discipline (qdisc) Buffer

  • Layer: Link Layer, kernel networking subsystem
  • Structure: Kernel-managed packet queue, e.g., pfifo_fast, fq_codel, sch_fq
  • Purpose:
    • Holds packets ready to be transmitted by the NIC.
    • Performs traffic shaping, scheduling, and prioritization.
    • Acts as a buffer between TCP and NIC hardware.
  • Behavior:
    • TCP passes fully formed sk_buff segments to qdisc.
    • Qdisc schedules packets based on policy (FIFO, fair queuing, CoDel, etc.).
    • Helps control burstiness and reduce packet drops at NIC due to congestion.
    • Queue depth can be tuned (default ~1000 packets for pfifo_fast). 
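
As a rough illustration of qdisc scheduling, here is a toy model of a three-band priority FIFO in the style of pfifo_fast. It is a sketch under assumptions: the caller picks the band directly (the real qdisc derives it from the packet priority), and the class name `PfifoFast` is only a label for this model.

```python
from collections import deque


class PfifoFast:
    """Three FIFO bands; band 0 is always served before bands 1 and 2."""

    def __init__(self, limit: int = 1000):
        self.bands = [deque(), deque(), deque()]
        self.limit = limit             # total packets the qdisc may hold

    def __len__(self) -> int:
        return sum(len(band) for band in self.bands)

    def enqueue(self, packet, band: int) -> bool:
        """Queue a packet; return False if the queue is full (drop)."""
        if len(self) >= self.limit:
            return False
        self.bands[band].append(packet)
        return True

    def dequeue(self):
        """Return the next packet for the driver, or None if empty."""
        for band in self.bands:
            if band:
                return band.popleft()
        return None


q = PfifoFast(limit=3)
q.enqueue("bulk", 2)
q.enqueue("normal", 1)
q.enqueue("interactive", 0)
print(q.enqueue("extra", 1))                    # queue full, packet dropped
print([q.dequeue() for _ in range(3)])
```

Fair-queuing disciplines such as fq_codel replace the strict band order with per-flow queues and active drop decisions.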

3️⃣ NIC Driver TX Ring Buffer

  • Layer: Link Layer → Hardware
  • Structure: Circular DMA descriptor array in kernel memory (NIC TX ring)
  • Purpose:
    • Holds packets that are ready to be sent by the NIC hardware.
    • Enables zero-copy DMA transfer: hardware reads packets directly from memory.
  • Behavior:
    • Qdisc or netdev layer enqueues sk_buff into TX ring.
    • NIC hardware DMA engine fetches the packet and transmits it onto the wire.
    • Ring buffer is finite, so if full, upper layers must wait — this is part of back-pressure propagation up to TCP. 
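
The back-pressure from a full TX ring can be sketched as follows. This is a toy model, not driver code: `TxRing`, `start_xmit`, `tx_complete`, and `drain` are hypothetical names, and packets are plain Python objects rather than DMA descriptors.

```python
from collections import deque


class TxRing:
    """Fixed number of descriptors; slots free up only on TX completion."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.size = size
        self.next_to_use = 0       # where the driver places the next packet
        self.next_to_clean = 0     # oldest packet not yet completed
        self.in_use = 0

    def start_xmit(self, packet) -> bool:
        if self.in_use == self.size:
            return False           # ring full: caller must keep the packet
        self.slots[self.next_to_use] = packet
        self.next_to_use = (self.next_to_use + 1) % self.size
        self.in_use += 1
        return True

    def tx_complete(self, n: int = 1) -> list:
        """Hardware finished sending up to n packets; free their slots."""
        done = []
        while n > 0 and self.in_use > 0:
            done.append(self.slots[self.next_to_clean])
            self.slots[self.next_to_clean] = None
            self.next_to_clean = (self.next_to_clean + 1) % self.size
            self.in_use -= 1
            n -= 1
        return done


def drain(qdisc: deque, ring: TxRing) -> int:
    """Move packets from the qdisc into the ring until the ring is full."""
    moved = 0
    while qdisc and ring.start_xmit(qdisc[0]):
        qdisc.popleft()
        moved += 1
    return moved


ring, qdisc = TxRing(2), deque(["p1", "p2", "p3"])
print(drain(qdisc, ring), list(qdisc))   # ring fills, p3 stays queued
print(ring.tx_complete(1))               # hardware finishes one packet
print(drain(qdisc, ring), list(qdisc))   # freed slot lets p3 through
```

Packets that do not fit stay in the layer above, which is how pressure propagates back toward TCP.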

4️⃣ RX Path Buffers 

On the receiver side, the analogous buffers exist:

  • NIC RX Ring
    • DMA writes incoming frames from wire into RX descriptors.
    • Acts as a hardware receive queue.
  • Socket Layer RX buffer (sk_rcvbuf)
    • Holds TCP segments until the application reads them.
    • Works with rwnd (receive window) to control sender flow.
  • Intermediate kernel queues
    • For example, NAPI polling may batch multiple sk_buffs for efficiency.
    • TCP layer reorders out-of-order segments before delivering to socket buffer. 
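
The batching done by NAPI follows a simple budget rule, sketched below. This is a toy model with hypothetical names: the RX ring is a Python deque, and the default budget of 64 is assumed here for illustration.

```python
from collections import deque


def poll_rx_ring(rx_ring: deque, budget: int = 64) -> tuple[list, bool]:
    """Process up to `budget` packets; report whether IRQs can be re-enabled.

    If fewer packets than the budget were found, the ring is empty and the
    driver can go back to interrupt mode. Otherwise polling continues.
    """
    batch = []
    while rx_ring and len(batch) < budget:
        batch.append(rx_ring.popleft())
    reenable_irq = len(batch) < budget
    return batch, reenable_irq


ring = deque(range(100))
batch, irq = poll_rx_ring(ring)
print(len(batch), irq)     # full budget used, stay in polling mode
batch, irq = poll_rx_ring(ring)
print(len(batch), irq)     # ring drained, interrupts come back
```

Under heavy load the kernel stays in polling mode and handles packets in batches; under light load it falls back to one interrupt per burst.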

 

🔹 Key Points About Multiple Buffers

| Buffer | Type | Role | Flow Control |
| --- | --- | --- | --- |
| Socket TX (sk_sndbuf) | Circular | Holds app data | cwnd limits data sent |
| qdisc | Queue | Scheduling & shaping | If full, TCP waits |
| NIC TX ring | Circular DMA | Hardware queue for NIC | NIC full → back-pressure to qdisc/TCP |
| NIC RX ring | Circular DMA | Hardware queue for received frames | N/A, but batch processing via NAPI |
| Socket RX (sk_rcvbuf) | Circular | Holds received segments | Advertised rwnd limits sender |