Saturday, October 25, 2025

TX/RX Synchronization Delays in the Kernel: When Full Duplex Becomes “Pseudo Half-Duplex”

In theory, Ethernet and modern NICs operate in full duplex, meaning transmit (TX) and receive (RX) operations can happen simultaneously over separate wire pairs. However, at the kernel and driver level, TX and RX often share critical hardware structures — such as descriptor rings, DMA channels, and the driver locks protecting socket buffer (sk_buff) queues. During high-burst transmissions, the TX path can monopolize these shared resources: the NIC driver focuses on draining the TX ring with DMA transfers, keeping interrupts masked or deprioritizing RX handling to maintain throughput.

When this happens, the RX engine becomes temporarily deferred, meaning that even though ACKs or responses arrive on the wire, they sit idle in the NIC buffer until the TX DMA cycle finishes. This can delay ACK processing by just a few microseconds — but in high-frequency trading systems, embedded gateways, or Telium-class firmware, those microseconds add up, causing latency spikes, retransmission triggers, or perceived half-duplex behavior.

Essentially, the system remains electrically full duplex, but kernel-level locking and DMA scheduling create a temporary serialization between TX and RX. This phenomenon explains why, during heavy TX bursts, inbound packets may appear delayed or dropped — not due to congestion, but because the driver and hardware are briefly “busy” on the transmit side, deferring receive processing until resources are released.

 

SENDER HOST — TX (Transmit) Path (Full Kernel View)

| Layer / Box | Description | Key Structures / Functions | Important Parameters |
| --- | --- | --- | --- |
| User Space App | Your program calling send() or write() on a socket. | send(), write(), or sendto() syscalls | User buffer size |
| Socket Layer | Converts user data into kernel-managed buffers (circular TX buffer). Responsible for flow control and segmentation. | tcp_sendmsg(), struct sock, sk_buff (skb) | sndbuf, sk_sndbuf, write_seq, snd_nxt |
| TCP Layer | Creates TCP segments (MSS-sized), adds headers, maintains retransmission and cwnd logic. | tcp_transmit_skb(), congestion control modules | MSS, cwnd, ssthresh, RTO |
| IP Layer | Wraps each TCP segment in an IP header; handles routing and fragmentation. | ip_queue_xmit(), dst_entry, rtable | IP TTL, DF bit, TOS |
| Qdisc Layer | Kernel queuing discipline — responsible for buffering, scheduling, and shaping before the NIC. | struct Qdisc, pfifo_fast, fq_codel | qlen (queue length), backlog, tx_bytes |
| NIC Driver | Implements ndo_start_xmit() to map packets to DMA descriptors and push them to the NIC. | netdev_ops, struct sk_buff, struct netdev_queue | TX ring size, tx_timeout, tx_queue_len |
| NIC Hardware | Performs the actual DMA read and transmission. | TX descriptor ring, DMA engine | Descriptor count, TX head/tail index |
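
To make the NIC Driver row concrete, here is a minimal sketch of an ndo_start_xmit() implementation. The kernel calls (netif_stop_queue(), dma_map_single(), writel(), etc.) are the real APIs; everything prefixed mynic_ (the private struct, descriptor layout, and doorbell register offset) is a hypothetical placeholder, and a production driver would add locking, GSO handling, and richer error paths.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

#define TX_RING_SIZE   256
#define MYNIC_TX_TAIL  0x18                 /* hypothetical doorbell register */

struct mynic_desc { __le64 addr; __le16 len; };   /* hypothetical HW layout */

struct mynic_priv {                          /* hypothetical driver state */
	struct device *dma_dev;
	void __iomem *regs;
	struct mynic_desc *tx_ring;
	struct sk_buff *tx_skb[TX_RING_SIZE];
	unsigned int tx_head, tx_tail;
};

static bool mynic_tx_ring_full(struct mynic_priv *p)
{
	return ((p->tx_tail + 1) % TX_RING_SIZE) == p->tx_head;
}

static netdev_tx_t mynic_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct mynic_priv *priv = netdev_priv(dev);
	struct mynic_desc *desc;
	dma_addr_t dma;

	/* Back-pressure: stop the queue when the descriptor ring is full;
	 * the TX completion handler will wake it later. */
	if (mynic_tx_ring_full(priv)) {
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;
	}

	/* Map the packet so the NIC's DMA engine can read it directly. */
	dma = dma_map_single(priv->dma_dev, skb->data, skb->len, DMA_TO_DEVICE);
	if (dma_mapping_error(priv->dma_dev, dma)) {
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	/* Fill the next free descriptor and advance the tail index. */
	desc = &priv->tx_ring[priv->tx_tail];
	desc->addr = cpu_to_le64(dma);
	desc->len  = cpu_to_le16(skb->len);
	priv->tx_skb[priv->tx_tail] = skb;       /* freed on TX completion */
	priv->tx_tail = (priv->tx_tail + 1) % TX_RING_SIZE;

	/* Doorbell write: tell the NIC new descriptors are ready. */
	writel(priv->tx_tail, priv->regs + MYNIC_TX_TAIL);
	return NETDEV_TX_OK;
}
```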

 

RECEIVER HOST — RX (Receive) Path (Full Kernel View)

| Layer / Box | Description | Key Structures / Functions | Important Parameters |
| --- | --- | --- | --- |
| NIC Hardware | Receives Ethernet frames and writes them into pre-allocated RX DMA buffers in main memory. | RX ring buffer, DMA engine | Descriptor ring length, head/tail index |
| NIC Driver | Handles the RX interrupt or NAPI poll, allocates an sk_buff, fills metadata, pushes the packet up. | napi_poll(), netif_receive_skb(), napi_gro_receive() | RX budget, RX ring lock |
| IP Layer | Extracts and validates the IP header, verifies the checksum, makes the routing decision. | ip_rcv(), ip_rcv_finish() | IP header length, checksum, TTL |
| TCP Layer | Reassembles segments; updates ACKs, flow control, and the congestion window. | tcp_v4_rcv(), tcp_ack(), tcp_data_queue() | rcv_nxt, snd_una, window size, ACK delay |
| Socket Layer | Buffers the received payload in a circular receive buffer for user-space reading. | sk_receive_queue, tcp_recvmsg() | rcvbuf, backlog size |
| User Space App | Reads data with recv() or read(). | recv(), read() syscalls | Buffer length, blocking/non-blocking |
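
On the receive side, the NIC Driver row corresponds to a NAPI poll handler. Below is a hedged sketch: napi_gro_receive(), eth_type_trans(), and napi_complete_done() are the standard kernel calls, while the mynic_* structures and the refill/IRQ helpers are hypothetical placeholders.

```c
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

#define RX_RING_SIZE   256
#define MYNIC_RX_DONE  0x0001               /* hypothetical "descriptor done" bit */

struct mynic_rx_desc { __le16 len; __le16 status; };  /* hypothetical layout */

struct mynic_priv {                          /* hypothetical driver state */
	struct napi_struct napi;
	struct net_device *netdev;
	struct mynic_rx_desc *rx_ring;
	struct sk_buff *rx_skb[RX_RING_SIZE];
	unsigned int rx_head;
};

void mynic_refill_rx_slot(struct mynic_priv *p, unsigned int i); /* hypothetical */
void mynic_enable_rx_irq(struct mynic_priv *p);                  /* hypothetical */

static int mynic_napi_poll(struct napi_struct *napi, int budget)
{
	struct mynic_priv *priv = container_of(napi, struct mynic_priv, napi);
	int done = 0;

	while (done < budget) {
		struct mynic_rx_desc *desc = &priv->rx_ring[priv->rx_head];
		struct sk_buff *skb;

		if (!(le16_to_cpu(desc->status) & MYNIC_RX_DONE))
			break;                           /* ring drained */

		/* The NIC has already DMA'd the frame into this skb. */
		skb = priv->rx_skb[priv->rx_head];
		skb_put(skb, le16_to_cpu(desc->len));    /* set payload length */
		skb->protocol = eth_type_trans(skb, priv->netdev);

		napi_gro_receive(napi, skb);             /* hand the packet upward */

		mynic_refill_rx_slot(priv, priv->rx_head);
		priv->rx_head = (priv->rx_head + 1) % RX_RING_SIZE;
		done++;
	}

	/* Processing less than the budget means the ring is empty: leave
	 * polling mode and unmask the RX interrupt again. */
	if (done < budget && napi_complete_done(napi, done))
		mynic_enable_rx_irq(priv);

	return done;
}
```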

🧩 Where Segmentation and Reassembly Occur

| Operation | Direction | Layer | Mechanism |
| --- | --- | --- | --- |
| Segmentation | TX (Sender) | TCP Layer | Splits application data into MSS-sized sk_buffs before IP encapsulation |
| Fragmentation | TX (Optional) | IP Layer | Only if the IP MTU is smaller than the MSS (rare with PMTU discovery) |
| Reassembly | RX (Receiver) | TCP Layer | Combines multiple segments into the original stream before copying to the recv buffer |

🧠 Key Concepts

| Concept | Explanation |
| --- | --- |
| Circular Buffer | Both TX and RX socket buffers are circular, maintaining read/write pointers. |
| Segmentation | TCP divides data into MSS units before enqueueing for IP. |
| Qdisc | Queuing logic that schedules skbs to the NIC for fairness or shaping. |
| DMA Descriptor Ring | Shared memory region where the NIC reads (TX) or writes (RX) packets. |
| Interrupt / NAPI | NIC signals the kernel for packet completion or reception. |
| ACK Clocking | TCP relies on ACK arrival to pace cwnd and send new data. |
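
As a minimal model of the circular-buffer idea (a toy user-space version, not the kernel's actual socket-buffer implementation), the following C ring buffer shows the read/write-pointer mechanics listed above:

```c
#include <stddef.h>

#define RING_SIZE 4096                       /* power of two for cheap wrap-around */

struct ring {
	unsigned char buf[RING_SIZE];
	size_t head;                             /* next byte to read  */
	size_t tail;                             /* next byte to write */
};

static size_t ring_used(const struct ring *r)
{
	return (r->tail - r->head) & (RING_SIZE - 1);
}

/* Accepts up to `len` bytes; returns how many fit (may be less, like a
 * full sk_sndbuf making send() block or return a short count). */
static size_t ring_write(struct ring *r, const unsigned char *data, size_t len)
{
	size_t free_space = RING_SIZE - 1 - ring_used(r); /* one slot kept empty */
	size_t n = len < free_space ? len : free_space;

	for (size_t i = 0; i < n; i++) {
		r->buf[r->tail] = data[i];
		r->tail = (r->tail + 1) & (RING_SIZE - 1);
	}
	return n;
}

/* Drains up to `len` bytes, advancing the read pointer (conceptually
 * like tcp_recvmsg() copying out of sk_receive_queue). */
static size_t ring_read(struct ring *r, unsigned char *out, size_t len)
{
	size_t used = ring_used(r);
	size_t n = len < used ? len : used;

	for (size_t i = 0; i < n; i++) {
		out[i] = r->buf[r->head];
		r->head = (r->head + 1) & (RING_SIZE - 1);
	}
	return n;
}
```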

Duplex Modes — The Basics 

| Mode | Meaning | Behavior |
| --- | --- | --- |
| Full Duplex | TX (Transmit) and RX (Receive) can operate simultaneously on the same link. | Both sides can send and receive data at the same time without waiting. |
| Half Duplex | TX and RX share the same medium but cannot operate simultaneously — only one direction is active at a time. | When one side transmits, the other must wait until it finishes before replying. |

 

How It Affects TX/RX Flow (Compared to the Above Full Kernel Flow)

| Behavior | Full Duplex (Standard Linux PC / Server) | Half Duplex (Embedded / Firmware-driven) |
| --- | --- | --- |
| TX/RX Operation | TX and RX rings work independently and can DMA simultaneously. | TX and RX rings often share DMA channels or hardware resources. |
| Interrupt Handling | TX completions and RX arrivals can occur in parallel (NAPI or IRQ). | RX interrupts might be masked or deferred while TX DMA is active. |
| Locking | Independent TX/RX locks; contention minimal. | Shared TX/RX lock — RX handler waits for TX release (→ jitter). |
| Timing | Near-zero coupling — ACKs can be received while still sending. | ACKs can be delayed because the RX engine sleeps during TX DMA. |
| Effective Behavior | True duplex link — high throughput and low latency. | Quasi half-duplex — you may see pauses between TX bursts and ACK arrivals. |

⚙️ Embedded Devices and “Quasi Half-Duplex”

Many embedded NICs or SoCs (System-on-Chip) operate in a pseudo half-duplex fashion because:

  • They share one DMA engine for both TX and RX.
  • TX DMA activity “locks out” RX temporarily to avoid memory contention.
  • Firmware (or the RTOS driver) toggles the RX/TX enable bits explicitly, as in the sketch below.
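
A hedged sketch of that toggling pattern follows. The register address, bit positions, and DMA helpers are all hypothetical; real SoC register maps differ, but the enable/disable sequencing is the point:

```c
#include <stdint.h>

#define MAC_CTRL        ((volatile uint32_t *)0x40010000u) /* hypothetical MMIO address */
#define MAC_CTRL_TX_EN  (1u << 0)
#define MAC_CTRL_RX_EN  (1u << 1)

void dma_submit_tx(const void *frames, int nframes);  /* hypothetical: queue TX DMA */
void dma_wait_tx_complete(void);                      /* hypothetical: block until done */

static void mac_tx_burst(const void *frames, int nframes)
{
	/* Disable RX while the shared DMA engine drains the TX queue;
	 * this creates the RX blackout window described below. */
	*MAC_CTRL &= ~MAC_CTRL_RX_EN;
	*MAC_CTRL |=  MAC_CTRL_TX_EN;

	dma_submit_tx(frames, nframes);
	dma_wait_tx_complete();

	/* Re-enable RX; ACKs held in the MAC FIFO are only now DMA'd in. */
	*MAC_CTRL |= MAC_CTRL_RX_EN;
}
```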

          

This results in:

  • Short RX blackout windows (a few µs to a few ms).
  • Delayed ACK reception → increased RTT.
  • Possible TCP retransmissions or cwnd collapse if ACKs are delayed too long. 

🧩 Where TX/RX Synchronization & Flip Timing Actually Happens

| Layer | Role | TX/RX Interaction | Impact / Synchronization Point |
| --- | --- | --- | --- |
| Application Layer | Calls send() / recv() syscalls | None (user-space calls are independent) | Not applicable |
| Socket Layer | Manages circular buffers (sk_sndbuf, sk_rcvbuf) | Logically independent per socket | No locking contention with the NIC directly |
| TCP Layer | Controls the congestion window (cwnd) and ACK timing | Depends on timely RX of ACKs | Delays appear indirectly if RX interrupts are delayed |
| IP / Routing Layer | Routes outgoing/incoming sk_buffs | Minimal interaction | Shared packet queues, no hardware contention |
| Queueing Discipline (qdisc) | Software queue before the NIC | TX path only | Independent; no RX involvement |
| NIC Driver Layer 🧠 (CRITICAL ZONE) | Manages TX and RX rings, DMA, and interrupts | TX and RX share hardware resources, locks, and interrupt context | ⚠️ This is where “flip timing” and half-duplex effects occur |
| NIC Hardware (MAC / PHY) | Executes DMA transfers, transmits and receives packets | TX and RX engines share the PCIe bus, buffers, and DMA channels | Hardware-level contention or serialization possible |

 The Layer Where It Actually Happens

TX/RX synchronization and flip timing issues occur at the NIC driver and hardware layer — below the IP stack, inside the kernel’s device driver (netdev) context.

Specifically:

  • In Linux, this involves:
    • ndo_start_xmit() (TX path)
    • napi_poll() or netif_receive_skb() (RX path)
    • Both paths share spinlocks, ring buffer memory, and interrupt routines (see the sketch below).
  • TX completion and RX interrupt handling can block each other if:
    • The driver uses shared locks (e.g., netdev_queue->lock).
    • NAPI polling is deferred while TX IRQs dominate.
  • Firmware-driven NICs (like embedded SoCs or Telium-based stacks) may enforce TX completion before RX enable — effectively half-duplexing the link temporarily. 
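
A compressed sketch of the shared-lock pattern: both directions guard their rings with one txrx_lock, so a long TX burst serializes RX exactly as described. The locking and NAPI calls are real kernel APIs; the mynic_* types and helpers are hypothetical.

```c
#include <linux/netdevice.h>
#include <linux/spinlock.h>

struct mynic_priv {                               /* hypothetical, trimmed */
	spinlock_t txrx_lock;                         /* one lock for both rings */
	struct napi_struct napi;
};

void mynic_queue_tx_dma(struct mynic_priv *p, struct sk_buff *skb); /* hypothetical */
int  mynic_process_rx_ring(struct mynic_priv *p, int budget);       /* hypothetical */
void mynic_enable_rx_irq(struct mynic_priv *p);                     /* hypothetical */

static netdev_tx_t mynic_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct mynic_priv *priv = netdev_priv(dev);
	unsigned long flags;

	spin_lock_irqsave(&priv->txrx_lock, flags);   /* RX poll is now blocked */
	mynic_queue_tx_dma(priv, skb);                /* enqueue the TX burst */
	spin_unlock_irqrestore(&priv->txrx_lock, flags);
	return NETDEV_TX_OK;
}

static int mynic_napi_poll(struct napi_struct *napi, int budget)
{
	struct mynic_priv *priv = container_of(napi, struct mynic_priv, napi);
	int done;

	/* Waits here for the whole TX critical section; meanwhile arriving
	 * ACKs sit unprocessed in the NIC's RX ring. */
	spin_lock(&priv->txrx_lock);
	done = mynic_process_rx_ring(priv, budget);
	spin_unlock(&priv->txrx_lock);

	if (done < budget && napi_complete_done(napi, done))
		mynic_enable_rx_irq(priv);
	return done;
}
```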

 

  • Layer: NIC driver & hardware (bottom of kernel stack).
  • Mechanism: Shared DMA queues, interrupt lines, and spinlocks.
  • Effect: TX bursts monopolize hardware → RX path delayed → late ACKs/timeouts.
  • Analogy: Think of TX and RX as two people sharing a single narrow door — if one keeps walking through (TX burst), the other (RX) must wait. 

⚙️ TX–RX Ring Interaction & Lock Contention Timeline

TX and RX rings share the same NIC DMA engine and driver locks. During TX bursts, the transmit path dominates CPU or PCIe/DMA bandwidth, delaying RX interrupts or NAPI polling.

Shared Rings + Contention Timeline (summarized step by step in the µs-scale table below)

🕒 Microsecond-Scale Behavior Summary

| Time (µs) | Event | Description |
| --- | --- | --- |
| 0 | TX lock acquired | ndo_start_xmit() grabs the TX ring lock |
| 2 | TX DMA starts | Frames begin DMA transfer to the NIC |
| 4 | RX interrupt arrives | Incoming ACK/packet — IRQ handler can’t take the lock |
| 6 | TX completion IRQ fires | TX done, but the same IRQ line is shared |
| 8 | Lock still held | RX polling (NAPI) delayed |
| 10 | Lock released | RX handler runs |
| 12 | ACK processed | RTT appears inflated to the TCP layer |
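
One way to watch this inflation from user space is the standard Linux TCP_INFO socket option, which exposes the kernel's per-connection smoothed RTT in microseconds. Polling it during a TX burst on an affected device should show tcpi_rtt climbing; this sketch only assumes an already-connected TCP socket:

```c
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Print the kernel's RTT estimate and retransmit count for a connected
 * TCP socket; tcpi_rtt and tcpi_rttvar are reported in microseconds. */
void print_rtt(int sockfd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("rtt=%u us rttvar=%u us retrans=%u\n",
		       ti.tcpi_rtt, ti.tcpi_rttvar, ti.tcpi_total_retrans);
}
```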

📊 Resulting TCP Symptoms 

| Observable Effect | Root Cause | Layer Impacted |
| --- | --- | --- |
| Increased RTT | RX delay due to TX contention | TCP |
| Duplicate ACKs | Delayed ACK reception | TCP |
| cwnd stalls | ACK clock slowed down | Congestion control |
| Spurious retransmissions | RX DMA backlog | Transport layer |
| Apparent half-duplex | Firmware defers RX | NIC hardware/driver |

 ⚙️ Real-World Example

  • Seen in Telium, NXP, or Intel I225/I226 firmware when large TX bursts occur.
  • Some NICs serialize TX/RX DMA channels to reduce PCIe contention.
  • Linux mitigates this with:
    • NAPI (poll-mode RX; see the sketch below)
    • Separate MSI-X interrupts for TX and RX queues
    • RPS/RFS to distribute RX load across CPUs
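
To illustrate the NAPI point: the RX interrupt handler does almost no work itself: it masks the device's RX interrupt and schedules the poll loop, so RX processing runs in softirq context and (given its own MSI-X vector) no longer competes with TX in hard-IRQ context. napi_schedule() and the handler signature are the real APIs; the mask helper is hypothetical.

```c
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mynic_priv {                      /* hypothetical, trimmed */
	struct napi_struct napi;
};

void mynic_disable_rx_irq(struct mynic_priv *p);  /* hypothetical device mask */

/* Registered on a dedicated RX MSI-X vector via request_irq(). */
static irqreturn_t mynic_rx_irq(int irq, void *data)
{
	struct mynic_priv *priv = data;

	mynic_disable_rx_irq(priv);          /* no more RX hard IRQs for now */
	napi_schedule(&priv->napi);          /* defer the real work to NAPI poll */
	return IRQ_HANDLED;
}
```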

 Kernel-Level Timing Diagram — TX Burst Lock Contention & ACK Delay (µs-scale)

🧠 Explanation by Phase 

| Phase | Time Range (Example) | Kernel/Driver Activity | Effect |
| --- | --- | --- | --- |
| 1. TX Lock Acquired | 0–5 µs | spin_lock(txrx_lock) acquired by the TX path before DMA enqueue. | RX path blocked from accessing shared resources. |
| 2. TX DMA Active | 5–150 µs | NIC continuously transmitting packets from the TX ring; RX DMA channel disabled or idle. | ACKs arrive at the NIC but remain unprocessed (in the HW buffer). |
| 3. RX Deferred | 50–200 µs | Interrupts masked; NAPI polling paused. | ACKs not delivered to the TCP layer → congestion window stalls. |
| 4. TX Completion | 150–180 µs | TX interrupt raised → completion handler releases the lock. | RX engine re-enabled. |
| 5. RX Resume | 180–250 µs | NIC DMA posts pending ACKs to the RX ring → kernel processes them via netif_receive_skb(). | TCP sees ACKs late → apparent RTT inflation. |
| 6. Normal Flow | > 250 µs | TX and RX operate normally until the next burst. | ACK-based pacing normalizes. |

 🔩 Affected Components

| Component | Description | Relevance |
| --- | --- | --- |
| TX/RX Ring Buffers | Shared circular DMA descriptor arrays managed by the NIC and driver. | Contention occurs when both directions share descriptors or a DMA channel. |
| TX/RX Spinlock (txrx_lock) | Kernel lock guarding shared NIC resources. | Prevents concurrent DMA setup — serialization of TX/RX. |
| NAPI Poll Loop | Kernel polling mode replacing per-packet interrupts. | Paused or delayed when TX holds the lock too long. |
| TCP Congestion Control | Relies on ACK pacing. | Missed or late ACKs shrink cwnd temporarily. |
| DMA Engine / SoC Bus | Shared hardware bus between the NIC and memory. | TX DMA hogs bandwidth, delaying RX descriptor updates. |

 Full-Duplex vs Quasi Half-Duplex Timing

| Behavior | Full Duplex | Quasi Half-Duplex (Embedded) |
| --- | --- | --- |
| TX & RX DMA | Independent | Shared DMA channel |
| Lock Contention | None | txrx_lock serializes access |
| RX Interrupts During TX | Allowed | Masked/deferred |
| ACK Latency | < 10 µs | 100–300 µs typical |
| TCP Performance | Stable cwnd growth | Stalled cwnd during bursts |

 
