In theory, Ethernet and modern NICs operate in full duplex, meaning transmit (TX) and receive (RX) operations can happen simultaneously over separate lanes. However, at the kernel and driver level, TX and RX often share critical hardware structures — such as descriptor rings, DMA channels, and driver locks protecting the socket buffer (sk_buff) queues. During high-burst transmissions, the TX path can monopolize these shared resources: the NIC driver focuses on draining the TX ring via DMA, keeping interrupts masked or deprioritizing RX handling to maintain transmit throughput.
When this happens, RX processing is temporarily deferred: even though ACKs or responses arrive on the wire, they sit in the NIC's buffers until the TX DMA cycle finishes. This may delay ACK processing by only a few microseconds — but in high-frequency trading systems, embedded gateways, or Telium-class firmware, those microseconds add up, causing latency spikes, retransmission triggers, or perceived half-duplex behavior.
Essentially, the system remains electrically full duplex, but kernel-level locking and DMA scheduling create a temporary serialization between TX and RX. This phenomenon explains why, during heavy TX bursts, inbound packets may appear delayed or dropped — not due to congestion, but because the driver and hardware are briefly “busy” on the transmit side, deferring receive processing until resources are released.
SENDER HOST — TX (Transmit) Path (Full Kernel View)
| Layer / Box | Description | Key Structures / Functions | Important Parameters |
|---|---|---|---|
| User Space App | Your program calling send() or write() on a socket. | send(), write(), or sendto() syscalls | User buffer size |
| Socket Layer | Converts user data into kernel-managed buffers (circular TX buffer). Responsible for flow control and segmentation. | tcp_sendmsg(), struct sock, sk_buff (skb) | sndbuf, sk_sndbuf, write_seq, snd_nxt |
| TCP Layer | Creates TCP segments (MSS-sized), adds headers, maintains retransmission and cwnd logic. | tcp_transmit_skb(), congestion control modules | MSS, cwnd, ssthresh, RTO |
| IP Layer | Wraps each TCP segment in an IP header; handles routing and fragmentation. | ip_queue_xmit(), dst_entry, rtable | IP TTL, DF bit, TOS |
| Qdisc Layer | Kernel queuing discipline—responsible for buffering, scheduling, and shaping before the NIC. | struct Qdisc, pfifo_fast, fq_codel | qlen (queue length), backlog, tx_bytes |
| NIC Driver | Implements ndo_start_xmit() to map packets to DMA descriptors and push to NIC. | netdev_ops, struct sk_buff, struct netdev_queue | TX ring size, tx_timeout, tx_queue_len |
| NIC Hardware | Performs actual DMA read and transmission. | TX descriptor ring, DMA engine | Descriptor count, TX head/tail index |
RECEIVER HOST — RX (Receive) Path (Full Kernel View)
| Layer / Box | Description | Key Structures / Functions | Important Parameters |
|---|---|---|---|
| NIC Hardware | Receives Ethernet frames and writes into pre-allocated RX DMA buffers in main memory. | RX ring buffer, DMA engine | Descriptor ring length, head/tail index |
| NIC Driver | Handles RX interrupt or NAPI poll, allocates sk_buff, fills metadata, pushes packet up. | napi_poll(), netif_receive_skb() | RX budget, RX ring lock, napi_gro_receive() |
| IP Layer | Extracts and validates IP header, does checksum verification, routing decision. | ip_rcv(), ip_rcv_finish() | IP header len, checksum, TTL |
| TCP Layer | Reassembles segments, updates ACKs, flow control, and congestion window. | tcp_v4_rcv(), tcp_ack(), tcp_data_queue() | rcv_nxt, snd_una, window size, ACK delay |
| Socket Layer | Buffers the received payload in a circular receive buffer for user-space reading. | sk_receive_queue, tcp_recvmsg() | rcvbuf, backlog size |
| User Space App | Reads data with recv() or read(). | recv(), read() syscalls | Buffer length, blocking/non-blocking |
🧩 Where Segmentation and Reassembly Occur
| Operation | Direction | Layer | Mechanism |
|---|---|---|---|
| Segmentation | TX (Sender) | TCP Layer | Splits application data into MSS-sized sk_buffs before IP encapsulation |
| Fragmentation | TX (Optional) | IP Layer | Only if IP MTU smaller than MSS (rare with PMTU discovery) |
| Reassembly | RX (Receiver) | TCP Layer | Combines multiple segments into original stream before copying to recv buffer |
🧠 Concept
| Concept | Explanation |
|---|---|
| Circular Buffer | Both TX and RX socket buffers are circular, maintaining read/write pointers. |
| Segmentation | TCP divides data into MSS units before enqueueing for IP. |
| Qdisc | Queuing logic that schedules skbs to the NIC for fairness or shaping. |
| DMA Descriptor Ring | Shared memory region where NIC reads (TX) or writes (RX) packets. |
| Interrupt / NAPI | NIC signals kernel for packet completion or reception. |
| ACK Clocking | TCP relies on ACK arrival to pace cwnd and send new data. |
Duplex Modes — The Basics
| Mode | Meaning | Behavior |
|---|---|---|
| Full Duplex | TX (Transmit) and RX (Receive) can operate simultaneously on the same link. | Both sides can send and receive data at the same time without waiting. |
| Half Duplex | TX and RX share the same medium but cannot operate simultaneously — only one direction is active at a time. | When one side transmits, the other must wait until it finishes before replying. |
How It Affects TX/RX Flow (Compared to the Above Full Kernel Flow)
| Behavior | Full Duplex (Standard Linux PC / Server) | Half Duplex (Embedded / Firmware-driven) |
|---|---|---|
| TX/RX Operation | TX and RX rings work independently and can DMA simultaneously. | TX and RX rings often share DMA channels or hardware resources. |
| Interrupt Handling | TX completions and RX arrivals can occur in parallel threads (NAPI or IRQ). | RX interrupts might be masked or deferred while TX DMA is active. |
| Locking | Independent TX/RX locks; contention minimal. | Shared TX/RX lock — RX handler waits for TX release (→ jitter). |
| Timing | Near zero coupling — ACKs can be received while still sending. | ACKs can be delayed because RX engine sleeps during TX DMA. |
| Effective Behavior | True duplex link — high throughput and low latency. | Quasi half-duplex — you may see pauses between TX bursts and ACK arrivals. |
⚙️ Embedded Devices and “Quasi Half-Duplex”
- Many embedded NICs or SoCs (system-on-chip designs) operate in a pseudo half-duplex fashion because:
  - They share one DMA engine for both TX and RX.
  - TX DMA activity temporarily “locks out” RX to avoid memory contention.
  - Firmware (or the RTOS driver) toggles the RX/TX enable bits explicitly.
This results in:
- Short RX blackout windows (few µs–ms).
- Delayed ACK reception → increased RTT.
- Possible TCP retransmissions or cwnd collapse if ACKs are delayed too long.
🧩 Where TX/RX Synchronization & Flip Timing Actually Happens
| Layer | Role | TX/RX Interaction | Impact / Synchronization Point |
|---|---|---|---|
| Application Layer | Calls send() / recv() syscalls | None (user-space calls are independent) | Not applicable |
| Socket Layer | Manages circular buffers (sk_sndbuf, sk_rcvbuf) | Logically independent per socket | No locking contention with NIC directly |
| TCP Layer | Controls congestion window (cwnd) and ACK timing | Depends on timely RX of ACKs | Delays appear indirectly if RX interrupts are delayed |
| IP / Routing Layer | Routes outgoing/incoming sk_buffs | Minimal interaction | Shared packet queues, no hardware contention |
| Queueing Discipline (qdisc) | Software queue before NIC | TX path only | Independent; no RX involvement |
| NIC Driver Layer 🧠 (CRITICAL ZONE) | Manages TX and RX rings, DMA, and interrupts | TX and RX share hardware resources, locks, and interrupt context | ⚠️ This is where “flip timing” and half-duplex effects occur |
| NIC Hardware (MAC / PHY) | Executes DMA transfers, transmits and receives packets | TX and RX engines share PCIe bus, buffers, and DMA channels | Hardware-level contention or serialization possible |
The Layer Where It Actually Happens
TX/RX synchronization and flip timing issues occur at the NIC driver and hardware layer — below the IP stack, inside the kernel’s device driver (netdev) context.
Specifically:
- In Linux, this involves:
- ndo_start_xmit() (TX path)
- napi_poll() or netif_receive_skb() (RX path)
- Both share spinlocks, ring buffer memory, and interrupt routines.
- TX completion and RX interrupt handling can block each other if:
- The driver uses shared locks (e.g., netdev_queue->lock).
- NAPI polling is deferred while TX IRQs dominate.
- Firmware-driven NICs (like embedded SoCs or Telium-based stacks) may enforce TX completion before RX enable — effectively half-duplexing the link temporarily.
- Layer: NIC driver & hardware (bottom of kernel stack).
- Mechanism: Shared DMA queues, interrupt lines, and spinlocks.
- Effect: TX bursts monopolize hardware → RX path delayed → late ACKs/timeouts.
- Analogy: Think of TX and RX as two people sharing a single narrow door — if one keeps walking through (TX burst), the other (RX) must wait.
⚙️ TX–RX Ring Interaction & Lock Contention Timeline
TX and RX rings share the same NIC DMA engine and driver locks. During TX bursts, the transmit path dominates CPU or PCIe/DMA bandwidth, delaying RX interrupts or NAPI polling.
Shared Rings + Contention Timeline
🕒 Microsecond-Scale Behavior Summary
| Time (µs) | Event | Description |
|---|---|---|
| 0 | TX lock acquired | ndo_start_xmit() grabs TX ring lock |
| 2 | TX DMA starts | Frames begin DMA transfer to NIC |
| 4 | RX interrupt arrives | Incoming ACK/packet — IRQ handler can’t take lock |
| 6 | TX completion IRQ fires | TX done, but same IRQ line shared |
| 8 | Lock still held | RX polling (NAPI) delayed |
| 10 | Lock released | RX handler runs |
| 12 | ACK processed | RTT appears inflated to TCP layer |
📊 Resulting TCP Symptoms
| Observable Effect | Root Cause | Layer Impacted |
|---|---|---|
| Increased RTT | RX delay due to TX contention | TCP |
| Duplicate ACKs | Delayed ACK reception | TCP |
| cwnd stalls | ACK clock slowed down | Congestion control |
| Spurious retransmissions | RX DMA backlog | Transport layer |
| Apparent half-duplex | Firmware defers RX | NIC hardware/driver |
⚙️ Real-World Example
- Seen in Telium, NXP, or Intel I225/I226 firmware when large TX bursts occur.
- Some NICs serialize TX/RX DMA channels to reduce PCIe contention.
- Linux mitigates this with:
- NAPI (poll-mode RX)
- Separate MSI-X interrupts for TX and RX queues
- RPS/RFS to distribute RX load across CPUs
Kernel-Level Timing Diagram — TX Burst Lock Contention & ACK Delay (µs-scale)
🧠 Explanation by Phase
| Phase | Time Range (Example) | Kernel/Driver Activity | Effect |
|---|---|---|---|
| 1. TX Lock Acquired | 0–5 µs | spin_lock(txrx_lock) acquired by TX path before DMA enqueue. | RX path blocked from accessing shared resources. |
| 2. TX DMA Active | 5–150 µs | NIC continuously transmitting packets from TX ring. RX DMA channel disabled or idle. | ACKs arrive at NIC but remain unprocessed (in HW buffer). |
| 3. RX Deferred | 50–200 µs | Interrupts masked; NAPI polling paused. | ACKs not delivered to TCP layer → congestion window stalls. |
| 4. TX Completion | 150–180 µs | TX interrupt raised → completion handler releases lock. | RX engine re-enabled. |
| 5. RX Resume | 180–250 µs | NIC DMA posts pending ACKs to RX ring → kernel processes via netif_receive_skb(). | TCP sees ACKs late → apparent RTT inflation. |
| 6. Normal Flow | >250 µs | TX and RX operate normally until next burst. | ACK-based pacing normalizes. |
🔩 Affected Components
| Component | Description | Relevance |
|---|---|---|
| TX/RX Ring Buffers | Shared circular DMA descriptor arrays managed by NIC and driver. | Contention occurs when both directions share descriptors or DMA channel. |
| TX/RX Spinlock (txrx_lock) | Kernel lock guarding shared NIC resources. | Prevents concurrent DMA setup — serialization of TX/RX. |
| NAPI Poll Loop | Kernel polling mode replacing per-packet interrupts. | Paused or delayed when TX holds lock too long. |
| TCP Congestion Control | Relies on ACK pacing. | Missed or late ACKs shrink cwnd temporarily. |
| DMA Engine / SoC Bus | Shared hardware bus between NIC and memory. | TX DMA hogs bandwidth, delaying RX descriptor updates. |
Full-Duplex vs Quasi Half-Duplex Timing
| Behavior | Full Duplex | Quasi Half-Duplex (Embedded) |
|---|---|---|
| TX & RX DMA | Independent | Shared DMA channel |
| Lock Contention | None | txrx_lock serializes access |
| RX Interrupts During TX | Allowed | Masked/deferred |
| ACK Latency | < 10 µs | 100–300 µs typical |
| TCP Performance | Stable cwnd growth | Stalled cwnd during bursts |