Saturday, October 25, 2025

RX/TX Path in the Linux Kernel — Deep Dive

Every bit of data that travels across a network — from a simple web page request to a large file transfer — passes through the Linux kernel’s networking stack. Inside this stack, two main paths control the entire data flow: the TX (Transmit) path for sending packets and the RX (Receive) path for receiving them. These two paths act like the arteries and veins of the networking system — one pushes data out, and the other brings data in.

🔹 The TX Path – From Application to Network

When an application sends data (for example, using send() or write()), that data first enters the TX path.
Inside the kernel, it passes through several layers:

  • The TCP layer segments the data into chunks (based on the Maximum Segment Size, MSS).
  • The IP layer adds routing and addressing information.
  • The Queueing discipline (qdisc) layer decides when and how packets are transmitted (handling traffic shaping and prioritization).
  • Finally, the Network Interface Card (NIC) driver uses DMA (Direct Memory Access) to move packets from kernel memory to the network hardware, which then transmits them on the wire.

The TX path ensures efficient use of available bandwidth, obeys congestion control rules, and maintains the order and reliability of packets. 
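
As a hedged user-space sketch of where the TX path begins, the snippet below opens a TCP connection over loopback, reads the socket send-buffer limit (SO_SNDBUF, the ceiling on sk_sndbuf), and calls send(). Everything after the send() call is handled asynchronously by the kernel layers listed above; the loopback addresses and message are illustrative only.

```python
# Illustrative only: a loopback TCP pair stands in for a real network peer.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # ephemeral port on loopback
server.listen(1)
client = socket.create_connection(server.getsockname())
peer, _ = server.accept()

# SO_SNDBUF is the socket-layer TX buffer ceiling (sk_sndbuf in the kernel).
sndbuf = client.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

# send() copies data into the kernel send buffer; from here the TCP layer,
# IP layer, qdisc, and NIC driver take over asynchronously.
sent = client.send(b"hello, TX path")

client.close(); peer.close(); server.close()
print(sndbuf, sent)
```

Note that a successful send() only means the data reached the kernel buffer, not the wire; that distinction is exactly why the rest of the TX path exists.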

🔹 The RX Path – From Network to Application

When packets arrive from the network, the RX path takes over.
The NIC receives incoming frames and places them in the RX ring buffer (a circular buffer in kernel memory). The NAPI (New API) mechanism then retrieves those packets efficiently, avoiding excessive interrupts during heavy load. 

Each packet travels upward through:

  • The Ethernet layer (for frame validation),
  • The IP layer (for routing and integrity checks),
  • And finally, the TCP layer, which reassembles packets into the original data stream.

The application then receives this reassembled data when it calls recv() or read(). 
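
The byte-stream nature of this reassembly is visible from user space: two separate send() calls arrive as one contiguous stream at the receiver. A minimal loopback sketch (illustrative addresses, not a definitive test of the stack):

```python
# Illustrative: TCP delivers a byte stream, not discrete messages.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.create_connection(server.getsockname())
peer, _ = server.accept()

# Two separate send() calls on the TX side...
client.send(b"hello, ")
client.send(b"RX path")
client.close()                    # close so recv() eventually returns b""

# ...are drained from the socket receive buffer (sk_rcvbuf) as one stream.
chunks = []
while True:
    data = peer.recv(4096)
    if not data:                  # b"" signals the peer closed the connection
        break
    chunks.append(data)
stream = b"".join(chunks)

peer.close(); server.close()
print(stream)
```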

🔹 Why the TX/RX Paths Matter

The efficiency of these two paths directly determines network performance, latency, and throughput.
A well-optimized TX path reduces CPU load and ensures packets are sent smoothly even under high traffic. An efficient RX path prevents packet drops and keeps the system responsive under heavy inbound load.

Together, the TX and RX paths form the foundation of how Linux handles networking — bridging the gap between user-space applications and physical network interfaces. Without their careful coordination, reliable and high-speed network communication simply wouldn’t be possible. 

 

TX path: Application → Kernel → TCP/IP Stack → qdisc → NIC driver → DMA → Wire

RX path: Wire → NIC → DMA → NAPI → TCP/IP Stack → Socket Buffer → Application 

These paths operate asynchronously, governed by socket buffers (sk_buff), ring buffers, and TCP flow control (rwnd, cwnd). 

 

🚀 TX (Transmit) Path — Inside Linux

| Step | Component | Function | Kernel Structures / Functions | Flow Control Impact |
|------|-----------|----------|-------------------------------|---------------------|
| 1️⃣ | Application write() / sendmsg() | User → kernel transition | sys_sendmsg() / sys_write() | The user-space app writes to a socket; the syscall transitions to kernel mode. |
| 2️⃣ | Socket buffer allocation | Kernel allocates sk_buff, copies user data | struct sk_buff, sock_alloc_send_pskb() | Affects send buffer limits (SO_SNDBUF). Data is copied from the user buffer into an sk_buff (the universal kernel packet structure). |
| 3️⃣ | TCP segmentation | Segments based on MSS, cwnd | tcp_sendmsg(), tcp_write_xmit() | cwnd limits how many packets can be in flight. TCP handles segmentation, sequence numbers, and window checks, deciding how much to send based on congestion control. |
| 4️⃣ | Offloading | Large send offload (TSO/GSO) | skb_gso_segment() | Reduces CPU overhead. |
| 5️⃣ | Routing & IP encapsulation | Adds IP header, determines route | ip_queue_xmit() | The routing cache (FIB) is consulted. The IP layer adds an IP header and uses the FIB (Forwarding Information Base) to select the route and next hop. |
| 6️⃣ | Queueing discipline (qdisc) | Manages packet scheduling and shaping | pfifo_fast, fq_codel, sch_fq, etc. | Packets enter the qdisc layer, where scheduling, shaping, or prioritization occurs. |
| 7️⃣ | NIC TX ring enqueue | Adds skb to the NIC's TX descriptor ring | ndo_start_xmit() | Controlled by TX ring depth. The driver places packets in the NIC's transmit ring buffer. |
| 8️⃣ | DMA mapping | NIC copies buffer from kernel to device | dma_map_single() | Zero-copy is possible (e.g., XDP). The NIC's DMA engine copies packet data from kernel memory to NIC buffers. |
| 9️⃣ | Transmission | NIC sends on the wire | PHY/MAC | Outstanding (unacknowledged) data increases, consuming the available window. Packet bits go onto the Ethernet medium. |

With TCP Segmentation Offload (TSO), the NIC hardware performs the segmentation itself; Generic Segmentation Offload (GSO) is its software counterpart, deferring segmentation until just before the driver so that the upper stack handles one large buffer instead of many small segments.

πŸ“ Key interactions with TCP congestion control: At Step 3, TCP checks if it’s allowed to send more data (based on cwnd and rwnd). If not, data stays queued in the socket buffer until ACKs arrive. 

📈 Congestion Control in TX Path

  • cwnd limits how much unacknowledged data can be sent.
  • On ACK: cwnd grows (Slow Start → Congestion Avoidance).
  • On loss: cwnd halves (Fast Retransmit / Recovery). 
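
The rules above can be sketched as a toy model (counting in MSS-sized segments; real Linux congestion control, e.g. CUBIC, is considerably more elaborate):

```python
# Toy model of cwnd dynamics: slow start, congestion avoidance, and
# multiplicative decrease. Units are MSS-sized segments, not bytes.
def on_ack(cwnd, ssthresh):
    """Grow cwnd: exponentially below ssthresh, linearly above it."""
    if cwnd < ssthresh:
        return cwnd + 1, ssthresh        # slow start: +1 per ACK (doubles per RTT)
    return cwnd + 1 / cwnd, ssthresh     # congestion avoidance: ~+1 per RTT

def on_loss(cwnd, ssthresh):
    """Fast retransmit / recovery: halve the window."""
    ssthresh = max(cwnd / 2, 2)
    return ssthresh, ssthresh

cwnd, ssthresh = 1.0, 8.0
for _ in range(8):                       # eight ACKs arrive
    cwnd, ssthresh = on_ack(cwnd, ssthresh)
after_acks = cwnd                        # grew 1 -> 8 in slow start, then +1/8
cwnd, ssthresh = on_loss(cwnd, ssthresh) # a single loss event halves it
print(after_acks, cwnd, ssthresh)
```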

📥 RX (Receive) Path — Inside Linux

| Step | Component | Function | Kernel Structures / Functions | Flow Control Impact |
|------|-----------|----------|-------------------------------|---------------------|
| 1️⃣ | Packet arrival (wire) | Bits received via PHY | NIC hardware | Packet received on the wire; the NIC stores it in the RX ring via DMA. |
| 2️⃣ | DMA to RX ring | NIC writes the frame into an RX buffer | struct napi_struct + DMA descriptors | Bypasses a CPU copy. |
| 3️⃣ | NAPI polling | Kernel polls the RX ring (reduces interrupt load) | napi_poll(), netif_receive_skb() | Controls packet batch size. The NIC triggers an interrupt, then NAPI polls the RX ring. |
| 4️⃣ | sk_buff creation | Builds an skb from the RX buffer | build_skb() | Per-packet metadata. The driver wraps raw data in an sk_buff structure. |
| 5️⃣ | Protocol stack demux | Ethernet → IP → TCP | eth_type_trans(), ip_rcv(), tcp_v4_rcv() | Routing / filtering applied. Each layer parses its headers and hands the packet up. |
| 6️⃣ | TCP reassembly | Handles out-of-order, missing segments | tcp_data_queue() | Manages rcv_nxt and the window. TCP verifies checksum and sequence numbers and reorders out-of-sequence segments. |
| 7️⃣ | ACK generation | Sends an ACK to the sender | tcp_send_ack() | Updates the sender's cwnd. TCP acknowledges received data (critical congestion-control feedback). |
| 8️⃣ | Delivery to app | Copies data to the user-space buffer | tcp_recvmsg() | The receive window (rwnd) shrinks/grows. When the application calls recv(), the kernel copies data from the socket buffer to user space. |

 πŸ“ Key point:  RX path determines feedback to sender via ACKs and advertised window (rwnd).

📉 Flow Control in RX Path

  • rwnd (receiver window) = available socket buffer space.
  • Sender’s cwnd cannot exceed rwnd → prevents overrun.
  • As the app reads data (recv()), kernel increases rwnd → sender resumes.
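
This interplay can be sketched numerically; the buffer size below is hypothetical:

```python
# Toy model of receive-window flow control: the sender may have at most
# min(cwnd, rwnd) unacknowledged bytes in flight, and rwnd is simply the
# free space left in the receiver's socket buffer.
RCVBUF = 1000          # receiver socket buffer size (hypothetical, bytes)
cwnd = 10_000          # large cwnd so rwnd is the binding limit here

queued = 0             # bytes sitting in the receiver buffer, unread by the app

def rwnd():
    return RCVBUF - queued

# The sender transmits as much as the windows allow; the app reads nothing yet.
sent = min(cwnd, rwnd())
queued += sent
stalled = rwnd()       # zero window: the sender must stop

# The application reads 400 bytes -> buffer space is freed -> window reopens.
queued -= 400
reopened = rwnd()

print(sent, stalled, reopened)
```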

🧩 Key Kernel Structures (for both paths)

| Structure | Layer | Purpose |
|-----------|-------|---------|
| struct sk_buff | Core networking | Universal packet descriptor |
| struct sock | Socket layer | Holds socket state |
| struct tcp_sock | TCP layer | Tracks cwnd, ssthresh, sequence numbers |
| struct net_device | NIC abstraction | Interface representation |
| struct napi_struct | Driver layer | Used for RX polling via NAPI |
| struct netdev_queue | Link layer | Represents a TX queue for the NIC |

🧠 How the Flow Works

| Phase | Sender Action | Receiver Response | Control Feedback |
|-------|---------------|-------------------|------------------|
| 1. TX initiation | App sends → kernel enqueues data into the TX socket buffer | | |
| 2. TCP segmentation | tcp_write_xmit() forms segments (limited by cwnd) | | |
| 3. Transmission | Packets queued → NIC TX ring → DMA → wire | | |
| 4. RX processing | | NIC RX ring → napi_poll() → tcp_v4_rcv() → data reassembly; TCP sends ACKs | tcp_send_ack() |
| 5. ACK reception | Sender receives ACKs, advances snd_una, increases cwnd | | cwnd++ |
| 6. App delivery | | Data reaches the receiver's app via recv(); rwnd grows (buffer freed) | rwnd++ advertised |
| 7. Loop continues | New data transmitted as cwnd and rwnd allow | | Continuous feedback loop |

⚙️ Core Control Interactions

| Mechanism | Managed By | Direction | Purpose |
|-----------|------------|-----------|---------|
| cwnd (congestion window) | Sender (TCP layer) | Outbound | Controls how much data can be in flight |
| rwnd (receive window) | Receiver (TCP layer) | Inbound (advertised) | Tells the sender how much buffer space is available |
| ACK packets | Receiver → sender | Reverse | Signal successful receipt and drive cwnd growth |
| NAPI polling | Kernel driver | Local (RX) | Reduces interrupt overhead during heavy load |
| TSO/GRO/GSO | NIC / kernel | Local | Offload large-segment handling |

Circular buffers are used inside the kernel to manage socket data queues. They sit between the application and the TCP/IP stack, at the transport layer (TCP/UDP), and are visible as the socket send and receive buffers.

1️⃣ Sending Circular Buffer (TX)

  • Where: Sender host, inside kernel, part of TCP socket (struct tcp_sock).
  • Layer: Transport layer (TCP)
  • Kernel Structure: sk_buff queues in the send buffer (sk_sndbuf)
  • Function:
    • Holds application data before it is segmented and transmitted.
    • TCP manages flow/congestion control using this buffer (cwnd, snd_una, snd_nxt).
    • Acts as a circular/ring buffer:
      • Head = next free spot to write data from app
      • Tail = next byte to transmit or waiting ACK 

2️⃣ Receiving Circular Buffer (RX)

  • Where: Receiver host, inside kernel, part of TCP socket (struct tcp_sock).
  • Layer: Transport layer (TCP)
  • Kernel Structure: sk_buff queues in the receive buffer (sk_rcvbuf)
  • Function:
    • Holds packets received from NIC (via RX ring → sk_buff) until the application reads them.
    • TCP reorders out-of-sequence segments here.
    • Acts as a circular buffer:
      • Head = where newly received data is written
      • Tail = next byte to deliver to the application (recv() call)

| Buffer Type | Location | Layer | Kernel Structure | Purpose |
|-------------|----------|-------|------------------|---------|
| TX (send) | Sender kernel | Transport / TCP | sk_sndbuf + sk_buff queue | Stores app data before segmentation & transmission |
| RX (receive) | Receiver kernel | Transport / TCP | sk_rcvbuf + sk_buff queue | Stores received packets until the application reads them |

These circular buffers are separate from NIC hardware rings, but interact with them:

  • TX buffer → NIC TX ring → DMA → wire
  • RX buffer ← NIC RX ring ← DMA ← wire

A circular buffer makes efficient use of a fixed memory region without moving data: the head and tail pointers simply wrap around as data is produced and consumed.
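
A minimal ring buffer with wrapping head/tail pointers, in the spirit of the description above (a simplified byte-at-a-time sketch, not kernel code):

```python
# Minimal circular (ring) buffer: fixed storage, wrapping head/tail indices.
class RingBuffer:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0          # next free slot to write (producer side)
        self.tail = 0          # next byte to consume (consumer side)
        self.used = 0

    def write(self, data):
        """Copy as much of `data` as fits; return the number of bytes accepted."""
        n = min(len(data), self.size - self.used)
        for i in range(n):
            self.buf[(self.head + i) % self.size] = data[i]   # wrap around
        self.head = (self.head + n) % self.size
        self.used += n
        return n

    def read(self, n):
        """Consume up to n bytes from the tail."""
        n = min(n, self.used)
        out = bytes(self.buf[(self.tail + i) % self.size] for i in range(n))
        self.tail = (self.tail + n) % self.size
        self.used -= n
        return out

rb = RingBuffer(8)
rb.write(b"abcdef")              # head advances to 6
rb.read(4)                       # tail advances to 4, freeing space
accepted = rb.write(b"XYZW12")   # only 6 bytes are free, so all 6 fit, wrapping
drained = rb.read(8)             # reads across the wrap point
print(accepted, drained)
```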

At the kernel level, the TX/RX buffers, NIC rings, and TCP flow control interact as follows during packet transmission:


Key Technical Points (TX side):

  • TX circular buffer (sk_sndbuf): stores unsent application data; sender can’t exceed cwnd bytes in-flight.
  • NIC TX ring: hardware descriptor queue for DMA transfers; enables zero-copy transmission.
  • Flow Control: cwnd limits outstanding bytes; ACKs from receiver advance tail pointer.

 

Key Technical Points (RX side):

  • RX circular buffer (sk_rcvbuf): holds incoming data until application reads it.
  • Flow Control (rwnd): advertised to sender to prevent buffer overflow.
  • ACK Generation: triggers cwnd updates on sender, enabling more data transmission.

 

  • TX buffer: ensures sender can buffer data for NIC without blocking app.
  • RX buffer: ensures receiver can store out-of-order packets and feed app smoothly.
  • NIC Rings: act as fast hardware queues between kernel memory and network.
  • TCP flow/congestion control: cwnd (sender) and rwnd (receiver) regulate transmission rate and buffer usage. 

| Component | Layer | Role |
|-----------|-------|------|
| TX circular buffer (sk_sndbuf) | Transport / TCP | Holds unsent data from the application, limited by cwnd |
| RX circular buffer (sk_rcvbuf) | Transport / TCP | Holds received packets until the application reads them; advertised via rwnd |
| NIC TX ring | Link / driver | Hardware queue for outgoing packets via DMA |
| NIC RX ring | Link / driver | Hardware queue for incoming packets via DMA |
| cwnd | TCP layer | Congestion control: limits in-flight bytes |
| rwnd | TCP layer | Flow control: prevents receiver buffer overflow |

🧩 Multiple Buffers in Linux TX/RX Path

There are three main buffers at the sender side, and similar buffers at the receiver side:

  • Socket Layer TX buffer (sk_sndbuf)
  • Queueing Discipline (qdisc) buffer
  • NIC Driver TX ring buffer

1️⃣ Socket Layer Buffer (Circular Buffer)

  • Layer: Transport (TCP)
  • Structure: sk_buff queue inside struct sock → sk_sndbuf
  • Purpose:
    • Holds application data before TCP segments it for transmission.
    • Acts as a circular buffer:
      • Head: next free spot to copy data from app
      • Tail: next byte to transmit or waiting for ACK
    • Works with TCP flow control:
      • cwnd (congestion window) limits the amount of in-flight data
      • snd_una / snd_nxt track acknowledged and unacknowledged bytes
  • Behavior:
    • When app calls send(), data is copied into this buffer.
    • TCP checks cwnd and only moves data from this buffer into the qdisc according to congestion control rules.
    • Provides a back-pressure mechanism for applications: if the buffer is full, send() may block or fail with EAGAIN. 
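
This back-pressure is observable from user space: with a non-blocking socket, send() fails with EAGAIN (surfacing as BlockingIOError in Python) once the kernel buffers fill. A small sketch using a local socket pair as a stand-in for a peer that never reads:

```python
# Demonstrating socket-buffer back-pressure with a non-blocking send loop.
import socket

a, b = socket.socketpair()     # connected pair; peer `b` never reads
a.setblocking(False)           # full buffer -> error instead of blocking

buffer_full = False
total = 0
try:
    while True:
        total += a.send(b"x" * 4096)   # keep filling the kernel buffers
except BlockingIOError:                # EAGAIN / EWOULDBLOCK surfaces here
    buffer_full = True

a.close(); b.close()
print(buffer_full, total)
```

The same condition on a blocking socket would simply make send() sleep until ACKs (or, here, reads by the peer) free buffer space.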

2️⃣ Queueing Discipline (qdisc) Buffer

  • Layer: Link Layer, kernel networking subsystem
  • Structure: Kernel-managed packet queue, e.g., pfifo_fast, fq_codel, sch_fq
  • Purpose:
    • Holds packets ready to be transmitted by the NIC.
    • Performs traffic shaping, scheduling, and prioritization.
    • Acts as a buffer between TCP and NIC hardware.
  • Behavior:
    • TCP passes fully formed sk_buff segments to qdisc.
    • Qdisc schedules packets based on policy (FIFO, fair queuing, CoDel, etc.).
    • Helps control burstiness and reduce packet drops at NIC due to congestion.
    • Queue depth can be tuned (pfifo_fast inherits the device's txqueuelen, typically 1000 packets). 
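
A toy tail-drop queue in the spirit of pfifo illustrates the basic contract (enqueue until the limit, then drop); real qdiscs such as fq_codel or sch_fq add fair queuing and delay-based dropping on top of this idea:

```python
# Toy pfifo-style qdisc: FIFO order, fixed packet limit, tail drop when full.
from collections import deque

class PfifoQdisc:
    def __init__(self, limit=1000):       # limit plays the role of txqueuelen
        self.q = deque()
        self.limit = limit
        self.dropped = 0

    def enqueue(self, skb):
        if len(self.q) >= self.limit:
            self.dropped += 1             # tail drop: a back-pressure signal
            return False
        self.q.append(skb)
        return True

    def dequeue(self):
        """Called by the driver when the NIC TX ring has room."""
        return self.q.popleft() if self.q else None

qd = PfifoQdisc(limit=3)
results = [qd.enqueue(f"pkt{i}") for i in range(5)]   # last two don't fit
first = qd.dequeue()                                  # FIFO: oldest packet out
print(results, qd.dropped, first)
```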

3️⃣ NIC Driver TX Ring Buffer

  • Layer: Link Layer → Hardware
  • Structure: Circular DMA descriptor array in kernel memory (NIC TX ring)
  • Purpose:
    • Holds packets that are ready to be sent by the NIC hardware.
    • Enables zero-copy DMA transfer: hardware reads packets directly from memory.
  • Behavior:
    • Qdisc or netdev layer enqueues sk_buff into TX ring.
    • NIC hardware DMA engine fetches the packet and transmits it onto the wire.
    • Ring buffer is finite, so if full, upper layers must wait — this is part of back-pressure propagation up to TCP. 
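
The descriptor-ring mechanics can be sketched with producer/consumer indices; the "keep one slot empty" convention below is a common way to distinguish full from empty, though real drivers vary:

```python
# Sketch of a NIC-style TX descriptor ring: a fixed slot count with a
# producer index (driver) and a consumer index (hardware completions).
class TxRing:
    def __init__(self, slots=4):
        self.desc = [None] * slots
        self.slots = slots
        self.prod = 0          # driver posts descriptors here (ndo_start_xmit)
        self.cons = 0          # hardware "completes" descriptors here

    def full(self):
        # One slot is kept empty so full and empty are distinguishable.
        return (self.prod + 1) % self.slots == self.cons

    def post(self, skb):
        """Driver enqueues; False means upper layers must back off."""
        if self.full():
            return False
        self.desc[self.prod] = skb
        self.prod = (self.prod + 1) % self.slots
        return True

    def complete(self):
        """NIC finished a transmit; the slot is recycled."""
        if self.cons == self.prod:
            return None
        skb, self.desc[self.cons] = self.desc[self.cons], None
        self.cons = (self.cons + 1) % self.slots
        return skb

ring = TxRing(slots=4)                               # 3 usable slots
posted = [ring.post(f"skb{i}") for i in range(4)]    # 4th post is refused
freed = ring.complete()                              # hardware drains one slot
retry = ring.post("skb3")                            # back-pressure relieved
print(posted, freed, retry)
```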

4️⃣ RX Path Buffers 

On the receiver side, the analogous buffers exist:

  • NIC RX Ring
    • DMA writes incoming frames from wire into RX descriptors.
    • Acts as a hardware receive queue.
  • Socket Layer RX buffer (sk_rcvbuf)
    • Holds TCP segments until the application reads them.
    • Works with rwnd (receive window) to control sender flow.
  • Intermediate kernel queues
    • For example, NAPI polling may batch multiple sk_buffs for efficiency.
    • TCP layer reorders out-of-order segments before delivering to socket buffer. 
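
NAPI's budgeted polling can be modeled as "process at most `budget` packets per poll":

```python
# Toy NAPI-style poll loop: each poll handles at most `budget` packets from
# the RX ring. Interrupts stay disabled while polling keeps finding work,
# which is what amortizes interrupt cost under heavy load.
from collections import deque

def napi_poll(rx_ring, budget=64):
    """Process up to `budget` packets; return how many were handled."""
    done = 0
    while done < budget and rx_ring:
        rx_ring.popleft()      # hand the packet up the stack
        done += 1
    return done

rx_ring = deque(range(150))    # 150 frames already DMA'd in by the NIC
polls = []
while rx_ring:
    polls.append(napi_poll(rx_ring, budget=64))
print(polls)                   # batches of at most 64 per poll
```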

 

🔹 Key Points About Multiple Buffers

| Buffer | Type | Role | Flow Control |
|--------|------|------|--------------|
| Socket TX (sk_sndbuf) | Circular | Holds app data | cwnd limits data sent |
| qdisc | Queue | Scheduling & shaping | If full, TCP waits |
| NIC TX ring | Circular (DMA) | Hardware queue for the NIC | NIC full → back-pressure to qdisc/TCP |
| NIC RX ring | Circular (DMA) | Hardware queue for received frames | N/A; batch processing via NAPI |
| Socket RX (sk_rcvbuf) | Circular | Holds received segments | Advertised rwnd limits the sender |

 
