In theory, Ethernet and modern NICs operate in full duplex, meaning transmit (TX) and receive (RX) operations can happen simultaneously over separate lanes. However, at the kernel and driver level, TX and RX often share critical hardware structures — such as descriptor rings, DMA channels, and driver locks protecting the socket buffer (sk_buff) queues. During high-burst transmissions, the TX path can monopolize these shared resources: the NIC driver focuses on draining the TX ring via DMA, keeping interrupts masked or deprioritizing RX handling to maintain transmit throughput.
When this happens, RX processing is temporarily deferred: even though ACKs or responses arrive on the wire, they sit in the NIC's buffers until the TX DMA cycle finishes. This may delay ACK processing by only a few microseconds — but in high-frequency trading systems, embedded gateways, or Telium-class firmware, those microseconds add up, causing latency spikes, retransmission triggers, or perceived half-duplex behavior.
Essentially, the system remains electrically full duplex, but kernel-level locking and DMA scheduling create a temporary serialization between TX and RX. This phenomenon explains why, during heavy TX bursts, inbound packets may appear delayed or dropped — not due to congestion, but because the driver and hardware are briefly “busy” on the transmit side, deferring receive processing until resources are released.
SENDER HOST — TX (Transmit) Path (Full Kernel View)
| Layer / Box | Description | Key Structures / Functions | Important Parameters |
|---|---|---|---|
| User Space App | Your program calling send() or write() on a socket. | send(), write(), or sendto() syscalls | User buffer size |
| Socket Layer | Converts user data into kernel-managed buffers (circular TX buffer). Responsible for flow control and segmentation. | tcp_sendmsg(), struct sock, sk_buff (skb) | sndbuf, sk_sndbuf, write_seq, snd_nxt |
| TCP Layer | Creates TCP segments (MSS-sized), adds headers, maintains retransmission and cwnd logic. | tcp_transmit_skb(), congestion control modules | MSS, cwnd, ssthresh, RTO |
| IP Layer | Wraps each TCP segment in an IP header; handles routing and fragmentation. | ip_queue_xmit(), dst_entry, rtable | IP TTL, DF bit, TOS |
| Qdisc Layer | Kernel queuing discipline—responsible for buffering, scheduling, and shaping before the NIC. | struct Qdisc, pfifo_fast, fq_codel | qlen (queue length), backlog, tx_bytes |
| NIC Driver | Implements ndo_start_xmit() to map packets to DMA descriptors and push to NIC. | netdev_ops, struct sk_buff, struct netdev_queue | TX ring size, tx_timeout, tx_queue_len |
| NIC Hardware | Performs actual DMA read and transmission. | TX descriptor ring, DMA engine | Descriptor count, TX head/tail index |
RECEIVER HOST — RX (Receive) Path (Full Kernel View)
| Layer / Box | Description | Key Structures / Functions | Important Parameters |
|---|---|---|---|
| NIC Hardware | Receives Ethernet frames and writes into pre-allocated RX DMA buffers in main memory. | RX ring buffer, DMA engine | Descriptor ring length, head/tail index |
| NIC Driver | Handles RX interrupt or NAPI poll, allocates sk_buff, fills metadata, pushes packet up. | napi_poll(), netif_receive_skb() | RX budget, RX ring lock, napi_gro_receive() |
| IP Layer | Extracts and validates IP header, does checksum verification, routing decision. | ip_rcv(), ip_rcv_finish() | IP header len, checksum, TTL |
| TCP Layer | Reassembles segments, updates ACKs, flow control, and congestion window. | tcp_v4_rcv(), tcp_ack(), tcp_data_queue() | rcv_nxt, snd_una, window size, ACK delay |
| Socket Layer | Buffers the received payload in a circular receive buffer for user-space reading. | sk_receive_queue, tcp_recvmsg() | rcvbuf, backlog size |
| User Space App | Reads data with recv() or read(). | recv(), read() syscalls | Buffer length, blocking/non-blocking |
🧩 Where Segmentation and Reassembly Occur
| Operation | Direction | Layer | Mechanism |
|---|---|---|---|
| Segmentation | TX (Sender) | TCP Layer | Splits application data into MSS-sized sk_buffs before IP encapsulation |
| Fragmentation | TX (Optional) | IP Layer | Only if IP MTU smaller than MSS (rare with PMTU discovery) |
| Reassembly | RX (Receiver) | TCP Layer | Combines multiple segments into original stream before copying to recv buffer |
🧠 Concept
| Concept | Explanation |
|---|---|
| Circular Buffer | Both TX and RX socket buffers are circular, maintaining read/write pointers. |
| Segmentation | TCP divides data into MSS units before enqueueing for IP. |
| Qdisc | Queuing logic that schedules skbs to the NIC for fairness or shaping. |
| DMA Descriptor Ring | Shared memory region where NIC reads (TX) or writes (RX) packets. |
| Interrupt / NAPI | NIC signals kernel for packet completion or reception. |
| ACK Clocking | TCP relies on ACK arrival to pace cwnd and send new data. |
Duplex Modes — The Basics
| Mode | Meaning | Behavior |
|---|---|---|
| Full Duplex | TX (Transmit) and RX (Receive) can operate simultaneously on the same link. | Both sides can send and receive data at the same time without waiting. |
| Half Duplex | TX and RX share the same medium but cannot operate simultaneously — only one direction is active at a time. | When one side transmits, the other must wait until it finishes before replying. |
How It Affects TX/RX Flow (Compared to the Above Full Kernel Flow)
| Behavior | Full Duplex (Standard Linux PC / Server) | Half Duplex (Embedded / Firmware-driven) |
|---|---|---|
| TX/RX Operation | TX and RX rings work independently and can DMA simultaneously. | TX and RX rings often share DMA channels or hardware resources. |
| Interrupt Handling | TX completions and RX arrivals can occur in parallel threads (NAPI or IRQ). | RX interrupts might be masked or deferred while TX DMA is active. |
| Locking | Independent TX/RX locks; contention minimal. | Shared TX/RX lock — RX handler waits for TX release (→ jitter). |
| Timing | Near zero coupling — ACKs can be received while still sending. | ACKs can be delayed because RX engine sleeps during TX DMA. |
| Effective Behavior | True duplex link — high throughput and low latency. | Quasi half-duplex — you may see pauses between TX bursts and ACK arrivals. |
⚙️ Embedded Devices and “Quasi Half-Duplex”
- Many embedded NICs or SoCs (system-on-chip designs) operate in a pseudo half-duplex fashion because:
  - They share one DMA engine for both TX and RX.
  - TX DMA activity temporarily “locks out” RX to avoid memory contention.
  - Firmware (or the RTOS driver) toggles the RX/TX enable bits explicitly.
This results in:
- Short RX blackout windows (few µs–ms).
- Delayed ACK reception → increased RTT.
- Possible TCP retransmissions or cwnd collapse if ACKs are delayed too long.
🧩 Where TX/RX Synchronization & Flip Timing Actually Happens
| Layer | Role | TX/RX Interaction | Impact / Synchronization Point |
|---|---|---|---|
| Application Layer | Calls send() / recv() syscalls | None (user-space calls are independent) | Not applicable |
| Socket Layer | Manages circular buffers (sk_sndbuf, sk_rcvbuf) | Logically independent per socket | No locking contention with NIC directly |
| TCP Layer | Controls congestion window (cwnd) and ACK timing | Depends on timely RX of ACKs | Delays appear indirectly if RX interrupts are delayed |
| IP / Routing Layer | Routes outgoing/incoming sk_buffs | Minimal interaction | Shared packet queues, no hardware contention |
| Queueing Discipline (qdisc) | Software queue before NIC | TX path only | Independent; no RX involvement |
| NIC Driver Layer 🧠 (CRITICAL ZONE) | Manages TX and RX rings, DMA, and interrupts | TX and RX share hardware resources, locks, and interrupt context | ⚠️ This is where “flip timing” and half-duplex effects occur |
| NIC Hardware (MAC / PHY) | Executes DMA transfers, transmits and receives packets | TX and RX engines share PCIe bus, buffers, and DMA channels | Hardware-level contention or serialization possible |
The Layer Where It Actually Happens
TX/RX synchronization and flip timing issues occur at the NIC driver and hardware layer — below the IP stack, inside the kernel’s device driver (netdev) context.
Specifically:
- In Linux, this involves:
- ndo_start_xmit() (TX path)
- napi_poll() or netif_receive_skb() (RX path)
- Both share spinlocks, ring buffer memory, and interrupt routines.
- TX completion and RX interrupt handling can block each other if:
- The driver uses shared locks (e.g., netdev_queue->lock).
- NAPI polling is deferred while TX IRQs dominate.
- Firmware-driven NICs (like embedded SoCs or Telium-based stacks) may enforce TX completion before RX enable — effectively half-duplexing the link temporarily.
- Layer: NIC driver & hardware (bottom of kernel stack).
- Mechanism: Shared DMA queues, interrupt lines, and spinlocks.
- Effect: TX bursts monopolize hardware → RX path delayed → late ACKs/timeouts.
- Analogy: Think of TX and RX as two people sharing a single narrow door — if one keeps walking through (TX burst), the other (RX) must wait.
⚙️ TX–RX Ring Interaction & Lock Contention Timeline
TX and RX rings share the same NIC DMA engine and driver locks. During TX bursts, the transmit path dominates CPU or PCIe/DMA bandwidth, delaying RX interrupts or NAPI polling.
Shared Rings + Contention Timeline
🕒 Microsecond-Scale Behavior Summary
| Time (µs) | Event | Description |
|---|---|---|
| 0 | TX lock acquired | ndo_start_xmit() grabs TX ring lock |
| 2 | TX DMA starts | Frames begin DMA transfer to NIC |
| 4 | RX interrupt arrives | Incoming ACK/packet — IRQ handler can’t take lock |
| 6 | TX completion IRQ fires | TX done, but same IRQ line shared |
| 8 | Lock still held | RX polling (NAPI) delayed |
| 10 | Lock released | RX handler runs |
| 12 | ACK processed | RTT appears inflated to TCP layer |
📊 Resulting TCP Symptoms
| Observable Effect | Root Cause | Layer Impacted |
|---|---|---|
| Increased RTT | RX delay due to TX contention | TCP |
| Duplicate ACKs | Delayed ACK reception | TCP |
| cwnd stalls | ACK clock slowed down | Congestion control |
| Spurious retransmissions | RX DMA backlog | Transport layer |
| Apparent half-duplex | Firmware defers RX | NIC hardware/driver |
⚙️ Real-World Example
- Seen in Telium, NXP, or Intel I225/I226 firmware when large TX bursts occur.
- Some NICs serialize TX/RX DMA channels to reduce PCIe contention.
- Linux mitigates this with:
- NAPI (poll-mode RX)
- Separate MSI-X interrupts for TX and RX queues
- RPS/RFS to distribute RX load across CPUs
Kernel-Level Timing Diagram — TX Burst Lock Contention & ACK Delay (µs-scale)
🧠 Explanation by Phase
| Phase | Time Range (Example) | Kernel/Driver Activity | Effect |
|---|---|---|---|
| 1. TX Lock Acquired | 0–5 µs | spin_lock(txrx_lock) acquired by TX path before DMA enqueue. | RX path blocked from accessing shared resources. |
| 2. TX DMA Active | 5–150 µs | NIC continuously transmitting packets from TX ring. RX DMA channel disabled or idle. | ACKs arrive at NIC but remain unprocessed (in HW buffer). |
| 3. RX Deferred | 50–200 µs | Interrupts masked; NAPI polling paused. | ACKs not delivered to TCP layer → congestion window stalls. |
| 4. TX Completion | 150–180 µs | TX interrupt raised → completion handler releases lock. | RX engine re-enabled. |
| 5. RX Resume | 180–250 µs | NIC DMA posts pending ACKs to RX ring → kernel processes via netif_receive_skb(). | TCP sees ACKs late → apparent RTT inflation. |
| 6. Normal Flow | >250 µs | TX and RX operate normally until next burst. | ACK-based pacing normalizes. |
🔩 Affected Components
| Component | Description | Relevance |
|---|---|---|
| TX/RX Ring Buffers | Shared circular DMA descriptor arrays managed by NIC and driver. | Contention occurs when both directions share descriptors or DMA channel. |
| TX/RX Spinlock (txrx_lock) | Kernel lock guarding shared NIC resources. | Prevents concurrent DMA setup — serialization of TX/RX. |
| NAPI Poll Loop | Kernel polling mode replacing per-packet interrupts. | Paused or delayed when TX holds lock too long. |
| TCP Congestion Control | Relies on ACK pacing. | Missed or late ACKs shrink cwnd temporarily. |
| DMA Engine / SoC Bus | Shared hardware bus between NIC and memory. | TX DMA hogs bandwidth, delaying RX descriptor updates. |
Full-Duplex vs Quasi Half-Duplex Timing
| Behavior | Full Duplex | Quasi Half-Duplex (Embedded) |
|---|---|---|
| TX & RX DMA | Independent | Shared DMA channel |
| Lock Contention | None | txrx_lock serializes access |
| RX Interrupts During TX | Allowed | Masked/deferred |
| ACK Latency | < 10 µs | 100–300 µs typical |
| TCP Performance | Stable cwnd growth | Stalled cwnd during bursts |