Sunday, September 7, 2025

Troubleshooting ARP Table Overflows: Why Random Connectivity Drops Happen in VLANs

Recently, we have been observing frequent connectivity drops on seemingly random application servers. Packet captures (tcpdump) revealed that some servers stop receiving ARP replies for their destination IP addresses—even though both ends are in the same VLAN and subnet.

This raises an important question: what could cause ARP failures in a flat Layer 2 domain?

A Quick Refresher: What is ARP?

Address Resolution Protocol (ARP) is the mechanism that maps an IP address to its corresponding MAC address on a local Ethernet network.

Without ARP, hosts cannot communicate within the same subnet.

  • Layer 2 to Layer 3 Mapping: NICs don’t understand IP addresses—they only use MAC addresses. ARP provides the “glue” between IP (Layer 3) and Ethernet (Layer 2).
  • MAC Addresses: Every Ethernet NIC has a 48-bit globally unique identifier, burned into ROM, which is used to deliver Ethernet frames.
  • Operation: If host 10.0.0.11 wants to communicate with 10.0.0.22, it broadcasts an ARP request asking “Who has 10.0.0.22?”; the host owning that IP replies with its MAC address. A quick way to watch this exchange is shown below.
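
For reference, the exchange is easy to watch with a packet capture. The interface name and addresses below are placeholders for this environment:

# Watch ARP requests/replies on the segment (who-has / is-at)
tcpdump -n -i eth0 arp

# Trigger a resolution (if arping is installed) and inspect the learned entry
arping -c 3 -I eth0 10.0.0.22
ip neigh show to 10.0.0.22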

Where Things Go Wrong: Possible ARP Table Overflow

In a /24 subnet (255.255.255.0), we can assign up to 254 usable IP addresses. If this limit is fully utilized (or close to it), ARP tables at switches, routers, or even servers may hit capacity constraints.

What we observed:

  • No ARP reply received for some destination IPs.
  • Symptoms appear random—one server works fine while another loses connectivity.
  • Recovery sometimes happens only after resetting the NIC/interface, which forces the host to clear and relearn its ARP entries.

This suggests a case of ARP table overflow, where the device managing ARP entries runs out of space.
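
On a Linux host this is straightforward to confirm: the kernel logs a neighbour table overflow message when the cache hard limit is hit, and the live entry count can be compared against the gc_thresh settings discussed below (paths assume a stock Linux server):

# Look for overflow messages from the neighbour subsystem
dmesg | grep -i "neighbour table overflow"

# Count current IPv4 neighbour (ARP) entries
ip -4 neigh show | wc -l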

 

Why Does ARP Overflow Matter?

When an ARP table is full:

  • The garbage collector may discard ARP entries, sometimes randomly.
  • A discarded entry means that the host can no longer resolve the MAC address for a destination IP.
  • Until the NIC or OS refreshes its ARP cache—or the device itself is rebooted—communication to that destination fails.

Essentially, the host knows where to send packets (IP), but doesn’t know how to send them (MAC).

 

Potential Culprits

  • Router ARP Table Limit – Routers maintain ARP caches for each connected subnet. Hitting the per-interface ARP entry limit can cause drops.
  • Switch ARP/Forwarding Table Overflow – L2 switches may also have finite CAM/ARP tables. Overflow can lead to incomplete lookups.
  • Server-Side ARP Cache Limits – Even Linux/Windows servers have configurable limits for ARP cache entries. 

 

On Linux, the ARP (neighbour) cache is governed by three kernel thresholds (documented in arp(7)):

gc_thresh1 (since Linux 2.2): The minimum number of entries to keep in the ARP cache. The garbage collector will not run if there are fewer than this number of entries in the cache. Defaults to 128.

gc_thresh2 (since Linux 2.2): The soft maximum number of entries to keep in the ARP cache. The garbage collector will allow the number of entries to exceed this for 5 seconds before collection is performed. Defaults to 512.

gc_thresh3 (since Linux 2.2): The hard maximum number of entries to keep in the ARP cache. The garbage collector will always run if there are more than this number of entries in the cache. Defaults to 1024.

Troubleshooting Commands

Here are useful commands to investigate ARP-related issues across different systems:

 # Show current ARP cache entries
arp -n

# Modern replacement command
ip neigh show

# Clear ARP cache
ip -s -s neigh flush all

# Check ARP kernel parameters
cat /proc/sys/net/ipv4/neigh/default/gc_thresh1
cat /proc/sys/net/ipv4/neigh/default/gc_thresh2
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3

# Adjust thresholds (increase ARP cache size if needed)
sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=1024
sysctl -w net.ipv4.neigh.default.gc_thresh3=2048
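
# Note: sysctl -w changes do not survive a reboot. To make the larger thresholds
# permanent, a drop-in file can be used (file name and values are only an example):
cat <<'EOF' > /etc/sysctl.d/90-arp-cache.conf
net.ipv4.neigh.default.gc_thresh1 = 512
net.ipv4.neigh.default.gc_thresh2 = 1024
net.ipv4.neigh.default.gc_thresh3 = 2048
EOF
sysctl --system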

Recommendations

  • Check ARP Table Sizes on routers, switches, and servers in the VLAN. Verify if limits are being hit.
  • Segment Large VLANs – If the VLAN is hosting too many hosts (close to 254), consider subnetting further (e.g., /25, /26) to reduce ARP load.
  • Monitor ARP Cache Usage – Many network devices provide counters/logs for ARP cache utilization.
  • NIC/OS Tuning – Adjust ARP cache timeouts and maximum entries in server OS (Linux: /proc/sys/net/ipv4/neigh/*).
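
To act on the “Monitor ARP Cache Usage” point on Linux servers, a couple of one-liners are usually enough (a simple sketch; a real deployment would feed these numbers into the monitoring system):

# Break down neighbour entries by state (REACHABLE, STALE, FAILED, ...)
ip -4 neigh show | awk '{print $NF}' | sort | uniq -c

# Watch the total count against the configured hard limit
watch -n 10 'ip -4 neigh show | wc -l; cat /proc/sys/net/ipv4/neigh/default/gc_thresh3'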

 

Tuesday, September 2, 2025

The NIC Speed Mismatch Challenge

 

Resolving Intermittent Connection Resets on ESXi: The NIC Speed Mismatch Challenge

Maintaining stable and high-performance network connectivity is critical in modern virtualized environments. Recently, our team encountered an intermittent TCP connection reset issue on the ESXi blade MK-FLEX-127-40-B2, which provided a perfect case study on the importance of proper NIC teaming configurations.

🧩 Issue Overview

During routine connectivity testing on the ESXi host, we observed sporadic TCP connection resets that were difficult to reproduce consistently. Upon investigation, we found that the issue occurred specifically when:

  • vmnic1 (10Gbps) and vmnic3 (1Gbps) were configured together in an active-active NIC teaming setup.

Other combinations, such as vmnic0 + vmnic1 or vmnic2 + vmnic3, exhibited no connectivity issues, highlighting a configuration-specific problem.



🔍 Root Cause Analysis

The underlying cause was a speed mismatch between teamed NICs, which led to asymmetric traffic paths:

  • Traffic could egress over the 10Gbps NIC (vmnic1) but return via the 1Gbps NIC (vmnic3) or vice versa.

  • This path asymmetry confused network devices, such as firewalls and load balancers performing stateful inspection, resulting in intermittent TCP resets.

  • Mismatched NICs in a team can also lead to:

    • Out-of-order packet delivery

    • MTU mismatches, particularly if jumbo frames are enabled on only one NIC

    • Load balancing inconsistencies under certain hashing policies

Key takeaway: All physical NICs in a team should be of the same speed, duplex, and model to avoid unpredictable network behavior.


🛠️ Resolution Steps

To address the issue, the NIC teaming configuration was updated:

  1. Replaced vmnic3 (1Gbps) with vmnic0 (10Gbps) in the team alongside vmnic1.

  2. Ensured consistent MTU, speed, and duplex settings across both NICs.

  3. Verified that traffic symmetry and load balancing consistency were restored under active-active teaming.
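
For reference, the physical NIC link speeds and the active teaming policy can be checked from the ESXi shell; the vSwitch name below is only an example for this environment:

# List physical NICs with their link speed and duplex
esxcli network nic list

# Show the NIC teaming/failover policy of a standard vSwitch (name is illustrative)
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0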



✅ Post-Change Results

After reconfiguration:

  • No further connection resets were observed during testing.

  • Network performance stabilized across all workloads.

  • The NIC team now adheres to best practices: all adapters are of the same speed and type, ensuring link-layer stability.

📌 Lessons Learned

This incident reinforced several key networking principles:

  1. NIC Homogeneity: Only team NICs with the same speed and model.

  2. MTU Consistency: Ensure jumbo frame settings match across all adapters.

  3. Traffic Symmetry: Active-active NIC teams require symmetric egress and ingress paths to maintain session integrity.

  4. Documentation & Audit: Regularly review NIC teaming and ESXi hardening checklists to prevent recurring issues.

🔗 Conclusion

Even in highly virtualized environments, simple configuration mismatches like NIC speed differences can cause elusive connectivity problems. By adhering to NIC teaming best practices, organizations can avoid asymmetric traffic issues, stabilize network performance, and ensure reliable connectivity for critical workloads.


Misusing SO_LINGER with 0 can lead to data loss

What SO_LINGER does

SO_LINGER is a socket option (setsockopt) that controls how close() behaves when unsent data remains on the socket.

  • SO_LINGER decides how the socket behaves when the application calls close() and there is unsent data in the socket’s send buffer.
  • Unsent data = bytes written by the application but not yet transmitted and acknowledged by the peer.

Behavior in Client Scenario

  • Client workflow: Send request → Wait for complete response → Call close().
  • At the point of close():
    • All client request bytes have been transmitted and ACKed.
    • Send buffer is empty.
    • There is no “unsent data” for SO_LINGER to discard.
  • Result:
    • With SO_LINGER(0), the client still sends RST instead of FIN, but no client data is lost.
    • The server may log the abrupt reset, but functionally it is harmless for stateless APIs.

Normal Close (default, no linger set):

  • Client → Server: FIN
  • Server → Client: ACK, then FIN
  • Client → Server: ACK

The connection passes through FIN_WAIT, CLOSE_WAIT, LAST_ACK, and TIME_WAIT.

Characteristics:

  • Graceful 4-way handshake.
  • States traversed: FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT (client) and CLOSE_WAIT → LAST_ACK (server).
  • Guarantees reliable delivery of all data.

SO_LINGER(0) (Abortive Close):

  • Client → Server: RST
  • Server: connection dropped immediately

All intermediate states are skipped → both sides move to CLOSED instantly.

✅ Characteristics:

  • Instant teardown.
  • Skips FIN/ACK handshake, TIME_WAIT, CLOSE_WAIT, LAST_ACK.
  • Peer sees abrupt RST.
  • Any unsent data is discarded (not applicable in our stateless scenario).

Practical Implications of SO_LINGER(0)

  • ✅ No risk of data loss here (request fully sent, response fully received).
  • ✅ Good for short-lived, stateless API calls — avoids lingering sockets.
  • ⚠️ Server logs may show RST instead of FIN.
  • ⚠️ Should not be used in protocols requiring graceful close or guaranteed delivery after close().
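
Whether a given connection actually ended with an abortive close is easy to confirm on the wire. A capture filtered on FIN/RST flags is enough; the interface name and port below are placeholders:

# RST-only teardown = abortive close; a FIN/ACK exchange = graceful close
tcpdump -ni eth0 'tcp port 443 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'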

SO_LINGER(5)

Case A: All data delivered before timeout

✅ Behavior:

  • close() blocks until handshake completes.
  • Graceful 4-way close, just like normal.
  • Application knows delivery succeeded before returning.

Case B: Timeout expires (no ACK from server within 5s)

❌ Behavior:

  • Data not acknowledged → stack aborts with RST.
  • Connection torn down abruptly.
  • Peer sees reset instead of FIN.

🔑 Summary of SO_LINGER(5)

  • Best case: Works like normal close, but close() blocks until data is ACKed.
  • Worst case: After 5s timeout, behaves like SO_LINGER(0) (RST abort).
  • Useful when the application must know if the peer ACKed data before completing close().

✅ Conclusion

  • In stateless client-server flow, SO_LINGER(0) is acceptable.
  • It allows instant connection teardown with no data loss, since the request/response exchange is already complete.
  • The only visible impact: the server sees an RST instead of a normal FIN handshake.

1. Definition of unsent data (in TCP/SO_LINGER context)

  • When a client calls write() (or send()) on a TCP socket, data goes into the socket’s send buffer.
  • Unsent data = bytes that have not yet been transmitted and acknowledged by the peer.
  • SO_LINGER controls what happens to this unsent data when close() is called:
    • SO_LINGER(0) → discard immediately, send RST.
    • SO_LINGER>0 → try to send within timeout.
    • default (SO_LINGER not set) → normal FIN handshake.

2. How it applies to the stateless scenario

  • Client flow: send request → wait → receive full response → close.
  • At the point of close():
    • All client request bytes have been transmitted and acknowledged by the server.
    • There are no pending bytes in the client send buffer.
  • Therefore, the “unsent data” that SO_LINGER refers to does not exist in our scenario.

In the client workflow, the client only calls close() after sending the complete request and receiving the full server response. At this point, the socket’s send buffer is empty, so there is no unsent data. SO_LINGER(0) will still close the socket abruptly, but it does not result in any loss of transmitted data.

  • By default (no linger or SO_LINGER disabled):
    • close() just queues a graceful shutdown.
    • TCP tries to deliver any unsent data and perform the normal 4-way FIN/ACK close.
    • The application returns from close() quickly, but the actual TCP teardown may still be in progress.
  • If SO_LINGER is enabled with a timeout >0 (e.g., 5 sec):
    • close() becomes blocking until either:
      • All unsent data is delivered and ACKed, and connection closes gracefully, or
      • The timeout expires → then connection is reset (RST).
  • If SO_LINGER is set with timeout = 0 (i.e., SO_LINGER(0)):
    • close() causes an immediate abortive close.
    • Any unsent data is discarded, and the stack sends RST instead of FIN.
    • This tears down the connection instantly.

🔹 Can we use SO_LINGER(0)?

  • Yes, it’s a valid, documented use.
  • But it changes semantics: instead of a graceful shutdown, we’re forcing an abortive close.
  • This is typically used when:
    • We don’t care about undelivered data.
    • We want to immediately free up resources / ports.
    • We need to ensure the peer can’t reuse half-open connections.

When the client calls the close() API, the behavior under different SO_LINGER settings, including the packets exchanged and the practical use cases, is summarized below.

Normal close (default, no SO_LINGER)
  • Packets on wire: app calls close() → stack sends FIN → peer ACK → peer FIN → local ACK (4-way close).
  • Behavior of close(): returns immediately; the TCP teardown continues in the background.
  • Pros: ✅ Graceful shutdown ✅ Ensures data delivery ✅ Peer sees a clean close.
  • Cons: ❌ Leaves the socket in TIME_WAIT ❌ Connection cleanup takes longer.
  • Typical use: general case (most apps).

SO_LINGER enabled, timeout > 0 (e.g., 5 sec)
  • Packets on wire: on close(), TCP waits until unsent data is ACKed, then performs the FIN/ACK exchange; if the timeout expires → sends RST.
  • Behavior of close(): blocks until either the data is delivered or the timeout expires.
  • Pros: ✅ App knows whether data was delivered ✅ Useful in transactional protocols.
  • Cons: ❌ Blocks the calling thread ❌ On timeout, abrupt RST.
  • Typical use: when you must confirm data delivery before returning from close().

SO_LINGER(0) (timeout = 0)
  • Packets on wire: immediately sends RST, skipping the FIN/ACK handshake.
  • Behavior of close(): returns immediately; the connection is torn down instantly.
  • Pros: ✅ Frees resources instantly ✅ Avoids half-open states.
  • Cons: ❌ Any unsent data is discarded ❌ Peer sees an abrupt reset (may log an error) ❌ Not graceful.
  • Typical use: emergency cleanup, abortive close, broken peers (like the M400 not ACKing FIN).

Explanation

  • As soon as the client calls close() with SO_LINGER(0):
    • TCP stack sends RST immediately, discarding any unsent data.
    • Client socket transitions instantly to CLOSED.
  • Server receives RST:
    • Drops the half-open connection immediately.
    • Moves directly to CLOSED.
  • No FIN/ACK handshake occurs; there is no FIN_WAIT, CLOSE_WAIT, LAST_ACK, or TIME_WAIT on either side.

✅ Key difference vs normal 4-way close:

  • All intermediate states like FIN_WAIT-1, FIN_WAIT-2, TIME_WAIT, CLOSE_WAIT, LAST_ACK are skipped.
  • Connection is torn down immediately.
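
One operational side effect of the normal close is easy to see on a busy client host: sockets pile up in TIME_WAIT, which the abortive close avoids. A quick check (ss is part of iproute2 on Linux; the header line may add one to the count on older versions):

# Count sockets currently parked in TIME_WAIT
ss -tn state time-wait | wc -l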

TCP Buffer: When TCP documentation says “unsent data is discarded,” it refers to data in the client’s send buffer that the TCP stack has not yet put on the wire (or has sent but not yet had acknowledged).

In the stateless scenario, using SO_LINGER(0) is acceptable because:

  • The client has already sent the request (the full transaction payload) and received the response, so there is no risk of losing client data.
  • The client has no pending writes in the TCP send buffer, so there is no unsent data at the moment close() is called.
  • The connection is stateless, and each transaction opens a new connection anyway, so skipping the graceful FIN/ACK handshake does not break application logic.
  • The only downside is that the server may log an RST instead of a normal FIN, which is usually harmless for stateless APIs.

SO_LINGER(0) Impact

  • Causes an immediate TCP reset (RST) instead of the normal FIN handshake.
  • “Unsent data” refers to pending client-side writes, which in our case have already been sent, so nothing is actually lost.
  • The client sees no issue; the server may log an abrupt reset.

Wireshark TLS Decryption Guide For Java

 Introduction

Sometimes during project development, it is necessary to use tools like Wireshark to analyze the underlying network communication. However, as SSL/TLS has become the standard for secure network communication, all data is encrypted, making it impossible to directly observe the actual payload.

To decrypt TLS traffic in Wireshark, we need a way for the client or server to export the session secrets established during the SSL/TLS handshake. Once the certificate exchange and verification are complete, the communication channel switches from asymmetric cryptography (RSA/ECDH), used for the key exchange, to symmetric encryption (AES-GCM, AES-CBC, etc.) for the bulk data.

The rationale for this design is that public-key cryptography is computationally expensive and unnecessary for encrypting the bulk of the data; symmetric encryption is much faster and provides the same level of security for ongoing communication. By capturing the session secrets, we can decrypt the TLS traffic in Wireshark and inspect the actual content exchanged between client and server.

This document explains how TLS encryption/decryption works in a Java client and how to use the extract-tls-secrets-4.0.0.jar agent to export TLS secrets for analysis, both standalone and with a Payara application server.

TLS Decryption & extract-tls-secrets Usage in Java

When a Java application (e.g., HttpsURLConnection, SSLSocket) communicates over HTTPS/TLS, the responsibilities are split as follows:

  • OS (Kernel/TCP Stack): Handles TCP/IP: packet receipt, checksums, segmentation, reassembly. Does not decrypt TLS.
  • JVM TLS Library (JSSE / Conscrypt / BouncyCastle): Performs the TLS handshake, key derivation, encryption, decryption, and integrity verification. Produces plaintext for the app.
  • Application Server / Client App: Receives plaintext data after JVM decryption and executes business logic.
  • TLS Agent (extract-tls-secrets): Hooks into the JVM TLS APIs to capture pre-master/master secrets for external decryption. Does not modify the payload.

At each step of TLS communication between a Java client/server and the network, the roles in encryption, decryption, and secret logging are cleanly separated:

  • Decryption is performed by the JVM TLS library.
  • The TLS Agent only logs secrets.
  • The OS only handles transport of encrypted bytes.
  • The Application Server works purely with plaintext after decryption.

Step-by-Step Explanation:

1. OS Lane
   • Receives encrypted TCP segments from the network.
   • Validates checksums and TCP flags.
   • Reassembles TLS records into a continuous stream for the JVM.

2. JVM TLS Library Lane
   • Reads encrypted bytes from the OS.
   • Performs the TLS handshake (ClientHello, ServerHello).
   • Validates server certificates.
   • Generates pre-master and master secrets.
   • Expands keys and decrypts ApplicationData records.
   • Verifies integrity and produces plaintext for the application server.
   • Encrypts response data before sending it back to the OS.

3. TLS Agent Lane
   • Hooks into the JVM TLS library.
   • Captures pre-master and master secrets during the handshake.
   • Logs secrets to a file for use with Wireshark.
   • Does not perform decryption or modify TLS data.

4. Application Server Lane
   • Receives plaintext HTTP requests from the JVM.
   • Parses headers and body, validates request data.
   • Executes business logic.
   • Generates the HTTP response, which is then encrypted by the JVM before being sent to the OS.


extract-tls-secrets Overview

Decrypt HTTPS/TLS connections on-the-fly. Extract the shared secrets from secure TLS connections for use with Wireshark. Attach to a Java process on either side of the connection to start decrypting.

·       Java agent to extract TLS secrets from running JVM processes.

·       Can be used standalone (attach to HttpURLConnectionExample) or with application servers like Payara.

·       Output secrets can be used by Wireshark to decrypt TLS traffic.

Using extract-tls-secrets Standalone

Download this extract-tls-secrets-4.0.0.jar from https://repo1.maven.org/maven2/name/neykov/extract-tls-secrets/4.0.0/extract-tls-secrets-4.0.0.jar

Attach on startup

Add a startup argument to the JVM options: -javaagent:<path to jar>/extract-tls-secrets-4.0.0.jar=<path to secrets log file>

For example, to launch an application from a standalone Java class, run:

java -javaagent:/path/to/extract-tls-secrets-4.0.0.jar=/path/to/secrets.log HttpURLConnectionExample

·       /path/to/secrets.log will contain TLS session secrets.

·       These can then be configured in Wireshark to decrypt the traffic.

Using extract-tls-secrets with Payara Server

JVM Startup Option : Captures TLS secrets for all JVM-initiated connections after startup.

Add the Java agent in Payara JVM options:

asadmin create-jvm-options "-javaagent:/path/to/extract-tls-secrets-4.0.0.jar=/path/to/secrets.log"

Hot-Attach to Running Payara

Hot-attaching only captures new TLS sessions established after attachment. There is no runtime toggle; to “disable” the agent, restart the JVM without the -javaagent option.

Attach agent to running process:

 java -jar /path/to/extract-tls-secrets-4.0.0.jar <PID> /path/to/secrets.log
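
The target JVM's process id can be found with the JDK's jps tool (or pgrep); which entry corresponds to Payara depends on how the server was started:

# List running JVMs with their main class/jar to identify the Payara PID
jps -l

# Alternative if jps is not on the PATH
pgrep -f payara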

TLS secret key logs

TLS 1.3 (with traffic secrets)

In modern TLS 1.3, tools like extract-tls-secrets or SSLKEYLOGFILE-based logging produce logs with named traffic secrets. Example (secrets.log):

TLS 1.3 no longer uses a single “master key.” Instead, it derives multiple secrets (handshake, application traffic, etc.) from the initial key exchange.

 Each line has: <Secret_Type> <ClientRandom> <Secret_Value_Hex>

Examples of Secret_Type:

  • CLIENT_HANDSHAKE_TRAFFIC_SECRET
  • SERVER_HANDSHAKE_TRAFFIC_SECRET
  • CLIENT_TRAFFIC_SECRET_0
  • SERVER_TRAFFIC_SECRET_0

TLS 1.2 and earlier (RSA / Master Secret)

 Older SSL/TLS (RSA key exchange) used a single Master-Key.

  • TLS 1.2 (RSA) logs include Session-ID and Master-Key.
  • The Master-Key is used to derive session keys for encryption.
  • Only one line per session; no multiple traffic secrets like TLS 1.3.


Using TLS Secrets in Wireshark

  • Open Wireshark and load the .pcap file.
  • Go to: Edit → Preferences → Protocols → TLS.
  • Set “(Pre)-Master-Secret log filename” to the path of the secrets log.

TLS packets will now decrypt automatically.
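
The same can be done non-interactively with tshark. The preference is named tls.keylog_file in current Wireshark releases (older 2.x builds used ssl.keylog_file); the file names below are placeholders:

# Decrypt the capture with the logged secrets and show the HTTP traffic
tshark -r capture.pcap -o tls.keylog_file:/path/to/secrets.log -Y http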

TLS version and log style:

  • TLS 1.3: traffic secrets per direction/stage. Example: CLIENT_HANDSHAKE_TRAFFIC_SECRET <ClientRandom> <HexKey>
  • TLS 1.2/SSL: a single master key per session. Example: RSA Session-ID:<id> followed by Master-Key:<hex>


TLSv1.2 Example

  • Before decryption, Wireshark shows:
    • Request payload: Frame 23 (Application Data)
    • Response payload: Frame 29 (Application Data)
  • After decryption:
    • Request payload: Frame 23 (POST /iCNow…. HTTP/1.1, …)
    • Response payload: Frame 29 (HTTP/1.1 200 (text/html))


TLSv1.3 Example

  • Before decryption, Wireshark shows:
    • Request payload: Frame 5986 (Application Data)
    • Response payload: Frame 6143 (Application Data)
  • After decryption:
    • Request payload: Frame 5986 (GET /auruspay/api/dev/status HTTP/1.1)
    • Response payload: Frame 6143 (HTTP/1.1 200 OK, (application/json))

Friday, June 20, 2025

Network Path MTU Issues: PMTUD Black Holes, ICMP, and MSS Optimization

Efficient network communication relies on understanding and managing the Maximum Transmission Unit (MTU) across the entire end-to-end path. MTU mismatches or improperly handled Path MTU Discovery (PMTUD) can result in silent drops (black holes), degraded performance, or outright connection failures.

This post explains:

  • What MTU and MSS are
  • How PMTUD works
  • Why PMTUD black holes occur
  • The role of ICMP in PMTUD
  • Best practices for avoiding issues, including MSS clamping

Maximum Transmission Unit (MTU)

  • The largest packet size (in bytes) that can be transmitted without fragmentation over a link.
  • Common MTUs:
    • Ethernet: 1500 bytes
    • IPsec VPN (ESP): ~1400 bytes (due to encapsulation overhead)
    • GRE tunnels: ~1476 bytes

Maximum Segment Size (MSS)

  • The maximum amount of TCP payload data a device is willing to receive in a single segment.
  • Calculated as: MSS = MTU - IP header (20 bytes) - TCP header (20 bytes)
    • For MTU 1500, MSS is typically 1460.

Typical encapsulation overhead and suggested MSS values:

  • IPsec (ESP): ~60 bytes overhead → suggested MSS 1380
  • GRE: ~24 bytes overhead → suggested MSS 1436
  • SD-WAN (overhead varies): 60–100 bytes overhead → suggested MSS 1300–1360

Common Symptoms of MTU Issues

  • TCP connections hang during TLS handshake (often during Server Hello).
  • Long delays followed by timeouts or retransmissions.
  • Specific applications fail while others succeed.
  • Only large payloads are affected (e.g., HTTP POSTs, file uploads)

MSS Overshoot:
  • If the MSS plus the IP/TCP headers exceeds the path MTU, packets are fragmented or dropped.
  • Example: MSS=1434 on a path with MTU=1383 → 1434 + 40 = 1474 bytes, so fragmentation is needed.

What Is MSS Clamping?

  • Router/firewall modifies the TCP MSS option in SYN packets.
  • Ensures TCP sessions agree on a safe MSS that fits below the true path MTU.

When to Use

  • When PMTUD is unreliable or ICMP cannot be guaranteed.
  • In environments with tunnels, IPsec, or MPLS which reduce effective MTU.
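
On a Linux-based router or firewall, MSS clamping is typically applied in the iptables mangle table; the interface name and the 1343 value below are illustrative for a 1383-byte tunnel MTU:

# Clamp the MSS on forwarded SYNs to the discovered path MTU
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

# Or pin an explicit value for a specific tunnel interface
iptables -t mangle -A FORWARD -o tun0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1343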

Payload Fragmentation

  • When it happens: if DF=0 and the segment plus headers exceeds the path MTU.
  • Risks:
    • Increases latency.
    • Some networks/firewalls drop fragments (security policies).

Does MSS Change Based on Internet Speed/Bandwidth?

MSS is determined by MTU (Maximum Transmission Unit) and protocol overhead, not by bandwidth or speed fluctuations.
  • Example: Whether your link is 10 Mbps or 1 Gbps, the MSS remains fixed at MTU - 40 bytes (TCP/IPv4 headers).
  • Exception: If the path MTU changes (e.g., due to VPN tunnel adjustments), the MSS advertised for new connections changes accordingly, and PMTUD can further reduce the effective segment size.

Factors That Can Influence MSS (Client-Side)

  1. MTU of the interface (Wi-Fi, Ethernet, LTE) – the primary determinant: MSS = MTU - 40 (TCP/IPv4 headers). Example: Ethernet MTU 1500 → MSS 1460.
  2. Tunnel overhead – reduces the effective MTU (and thus MSS). Example: IPsec adds 50 bytes → MSS = 1500 - 50 - 40 = 1410.
  3. MSS clamping (by a local router/firewall) – firewalls/SD-WAN can enforce MSS limits to prevent fragmentation. Example: force MSS ≤ 1343 for VPN tunnels.
  4. Path MTU Discovery (PMTUD) – dynamically adjusts the segment size if intermediate links have smaller MTUs. Example: a router with MTU 1400 → MSS 1360.
  5. TCP stack settings and dual-stack (IPv4 vs IPv6) – the OS/kernel can override the default MSS (e.g., via Linux sysctl or per-route settings), and IPv6's larger header yields a different MSS. Example: a manual MSS setting for POS terminals.

What Does Not Influence MSS?

  • Bandwidth/speed (e.g., a 1 Mbps vs 100 Mbps internet link): MSS is a size limit, not throughput-related.
  • Latency/jitter (ping time): affects performance but not segment size.
  • Packet loss: triggers retransmissions but doesn't change the MSS.
  • Encryption (TLS/SSL): adds payload overhead but doesn't alter the TCP MSS (handled above the transport layer).

PMTUD (Path MTU Discovery)

Path MTU Discovery (PMTUD) is a mechanism that helps a sender find the maximum IP packet size (MTU) that can traverse the network without fragmentation. Each link in a network path can have a different MTU.  PMTUD helps avoid:

  • Sending packets too big, which would get dropped if DF (Don't Fragment) is set
  • The overhead of IP fragmentation, which can hurt performance

How PMTUD Works:

  • Sender sends packets with the "Don’t Fragment" (DF) flag set.
  • If a router along the path encounters a packet larger than its MTU, it:
    • Drops the packet.
    • Sends an ICMP Type 3 (Code 4: "Fragmentation Needed") message back to the sender, including the next-hop MTU.
  • The sender then reduces its MSS/MTU to match.
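
This is easy to reproduce manually with ping by setting DF and varying the payload size (remember that ping's size excludes the 28 bytes of IP + ICMP headers; [host] is a placeholder):

# 1472 + 28 = 1500 bytes: should pass on a clean 1500-byte path
ping -M do -s 1472 [host]

# For a 1383-byte tunnel, anything above a 1355-byte payload should either return
# "Frag needed" (ICMP feedback working) or time out silently (ICMP blocked)
ping -M do -s 1356 [host]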

PMTUD Black Holes

  • Occurs when:
    • Intermediate routers drop packets with DF bit set.
    • ICMP "Fragmentation Needed" messages are blocked or filtered.
  • Result: Sender never learns to reduce packet size → packets silently dropped.

 Blocking ICMP Type 3 Breaks PMTUD:

  • If firewalls block these ICMP messages, the sender never learns it needs to reduce MTU/MSS.
  • Result: Packets are silently dropped, causing timeouts and retries.

Best Practices for Enabling PMTUD

On Firewalls/Routers: For proper Path MTU Discovery (PMTUD) to work, "ICMP Type 3 (Destination Unreachable)- Code 4 (Fragmentation Needed but DF set)" must be allowed in both directions (inbound and outbound) across firewalls, routers, and hosts.

  • Allow outbound ICMP Type 3, Code 4 (from routers to senders).
  • Allow inbound ICMP Type 3, Code 4 (if hosts need to receive PMTUD messages).
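
On Linux-based firewalls this translates to rules along the following lines (iptables shown as an example; the equivalent policy object has to be permitted on commercial firewalls):

# Allow "fragmentation needed" (ICMP Type 3, Code 4) through the firewall
iptables -A INPUT   -p icmp --icmp-type fragmentation-needed -j ACCEPT
iptables -A OUTPUT  -p icmp --icmp-type fragmentation-needed -j ACCEPT
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT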

How enabling ICMP Type 3 helps our scenario

Enabling ICMP Type 3 (“Fragmentation Needed”) on firewalls is critical for proper Path MTU Discovery (PMTUD) to work. Here’s why it resolves our MSS/MTU issues:

  • Before (ICMP blocked): MSS=1380 fails with no feedback; the client blindly retries with MSS=1250 (guessing); the result is inefficient retries and added latency.
  • After (ICMP allowed): the router sends ICMP Type 3 telling the client to use MTU=1290; the client immediately adjusts its MSS to 1250 (1290 - 40); the first attempt succeeds with the correct MSS.


Comparison of the two solutions:

Solution 1: MSS Clamping (manual adjustment)
  • Mechanism: forcefully sets the TCP MSS to a fixed value (e.g., 1250) via the TCP MSS option in the handshake.
  • Trigger: a pre-configured MTU mismatch (e.g., a tunnel).
  • When applied: during the TCP handshake (SYN/SYN-ACK).
  • Implementation: configured on data center firewalls / the store network.
  • Pros: guaranteed packet size reduction; always works (no dependency on ICMP); prevents fragmentation upfront.
  • Cons: static (fails if the path MTU changes); may fragment; suboptimal for dynamic networks.
  • Verification: check SYN packets for the clamped MSS (e.g., with tcpdump).
  • Recommended? Fallback option.

Solution 2: PMTUD Enabled (automatic detection)
  • Mechanism: relies on ICMP Type 3, Code 4 to dynamically adjust the MTU (ICMP error plus a PMTUD cache update).
  • Trigger: ICMP “Frag Needed” (Type 3, Code 4).
  • When applied: after packet loss (retransmission).
  • Implementation: requires allowing ICMP “Fragmentation Needed” end-to-end.
  • Pros: auto-adapts to path changes; RFC-compliant (RFC 1191); industry best practice for reliability.
  • Cons: fails if ICMP is blocked; slight initial delay; may cause delays due to retransmissions.
  • Verification: test with ping -M do -s 1400 for ICMP responses.
  • Recommended? Primary solution (enable ICMP Type 3, Code 4).

Why PMTUD (Option 2) is the Right Approach

  • SD-WAN Confirmed the Root Cause: ICMP blocking at 38.97.129.101 is breaking PMTUD.
  • "Black Hole" Issue Resolved: Unblocking ICMP ensures the server receives MTU feedback, preventing silent failures.
  • Future-Proof: Works seamlessly even if the tunnel MTU changes.

Key Decision Factors

  • If ICMP can be unblocked: PMTUD (Preferred) – Self-healing and scalable.
  • If ICMP must stay blocked: MSS Clamping – Static but predictable (set MSS=1250)

Immediate Actions Requested

  • Unblock ICMP Type 3, Code 4 on all firewalls/routers between the server and SD-WAN tunnel.
  • Monitor: Confirm that the sender (client/server/router/firewall) auto-adjusts its MSS after receiving ICMP feedback.

 Why MSS Changes Despite Fixed Tunnel MTU

Intermediate Device Restrictions: A router/firewall along the path may have a smaller MTU (e.g., 1290), forcing TCP to adjust the MSS dynamically. The tunnel MTU is 1383, but the router caps packets at 1290 → MSS = 1290 - 40 = 1250.

PMTUD Behavior: If the initial packet (MSS=1380) is dropped due to fragmentation, ICMP “Fragmentation Needed” messages (“Packet Too Big” in ICMPv6) force the client to retry with a smaller MSS. Some networks block ICMP, breaking PMTUD and causing persistent failures.

Asymmetric Paths: Outbound/inbound paths may differ (e.g., traffic shaping on one leg). The client sees the strictest MTU.

TCP Stack Heuristics: Modern OSes (Linux/Windows) may aggressively reduce the MSS after failures, even if the root cause isn't MTU.

Why the client might reduce its MSS from 1380 to 1250 despite both tunnels having the same MTU (1383)

      
  • First attempt (MSS=1380) fails: Path MTU Discovery (PMTUD) detects fragmentation and triggers an MSS reduction.
  • Retry (MSS=1250) succeeds: the client adapts to a narrower bottleneck (e.g., an intermediate device with a smaller MTU).
  • Same tunnel MTU (1383): the tunnel endpoints support 1383, but the path may have a stricter limit (e.g., 1290).

The client reduces MSS because the path MTU is narrower than the tunnel MTU.

What is ICMP Type 3?

ICMP (Internet Control Message Protocol) Type 3 is a "Destination Unreachable" message sent by a router or host to indicate that a packet cannot be delivered to its intended destination. It includes various codes (subtypes) that specify the reason, such as:

  • Code 0 (Net Unreachable) – Network is not accessible.
  • Code 1 (Host Unreachable) – Host is not reachable.
  • Code 3 (Port Unreachable) – The requested port is closed.
  • Code 4 (Fragmentation Needed but DF set) – the packet was too large for a link and could not be fragmented; this is the message Path MTU Discovery (PMTUD) relies on.

Is Enabling ICMP Type 3 Recommended?

Yes, in most cases, ICMP Type 3 should be enabled because:
  • Helps with troubleshooting – Without it, connectivity issues become harder to diagnose (e.g., "Request timed out" instead of "Destination unreachable").
  • Supports Path MTU Discovery (PMTUD) – Code 4 (Fragmentation Needed) is critical for TCP performance; blocking it can cause broken connections for large packets.
  • Prevents "black holes" – Without ICMP Type 3, a sender may keep retransmitting packets indefinitely, unaware that the destination is unreachable.

Default Value
  • Most firewalls and operating systems allow ICMP Type 3 by default since it is essential for proper network operation.
  • Some restrictive security policies may block it, but this can cause network issues.

  


Key Benefits

  • Prevents silent packet drops: Devices adjust MSS/MTU proactively.
  • Eliminates guesswork: No more arbitrary MSS fallbacks (e.g., 1380 → 1250).
  • Improves performance: Reduces TCP retransmissions and latency.



Avoid:

  • Blocking all ICMP (breaks PMTUD and troubleshooting).
  • Filtering ICMP Type 3, Code 4 (causes "black hole" connections).

Real-world use case:

Client-side TCP MSS values ranged across the following: 1243, 1250, 1259, 1261, 1273, 1291, 1322, 1323, 1331, 1343, 1380.

The server always responds with MSS=1460, since its interface MTU is 1500. However, the SD-WAN tunnel has an effective MTU of 1383 bytes due to 117 bytes of encapsulation overhead.

When a client advertises MSS=1380, the corresponding packet size becomes:

  • 1380 (MSS) + 40 (TCP/IP headers) = a 1420-byte IP packet
    • This exceeds the SD-WAN tunnel MTU (1383), leading to packet drops.

Connections where the client MSS was set to 1380 consistently failed to complete the TLS handshake. The root cause is that the TLS ServerHello segment sent by the server was dropped in the SD-WAN tunnel, preventing it from reaching the client. As a result, the client receives only out-of-order segments, while the first segment (which is required to continue the TLS exchange) is missing, triggering retransmissions and eventually connection failure.

The server is advertising MSS correctly in the SYN/ACK (typically 1460), which aligns with our internal MTU configuration of 1500 bytes. However, the issue we're observing is that the client is not adhering to this in many cases and continues to send MSS = 1380 (MTU=1420), which causes packet drops due to overshoot beyond the SD-WAN tunnel MTU of 1383.

The MSS negotiation during the TCP handshake looks like this:

  1. Client → Server, SYN, MSS = 1380: “Hey server, I can receive up to 1380-byte TCP payloads from you.”
  2. Server → Client, SYN-ACK, MSS = 1460: “Okay, I acknowledge. I can receive up to 1460-byte TCP payloads from you.”
  3. Client → Server, ACK: connection established. Now both sides know each other’s limits.

⚠️ Thumb Rule for SD-WAN Tunnel Compatibility

  • TCP Payload (MSS) + IP Header (20B) + TCP Header (20B) ≤ SD-WAN Tunnel MTU
    • If SD-WAN tunnel MTU = 1383 bytes, then: Max Safe MSS = 1383 – 20 (IP) – 20 (TCP) = 1343 bytes
    • If the client advertises MSS = 1380, total packet size becomes: 1380 (MSS) + 40 (headers) = 1420 bytes
      • This exceeds the tunnel MTU (1383) → packet will be dropped unless fragmented (which doesn't happen with DF=1).

Recommended Solutions

To resolve this, the client-side MSS must be clamped to a value that results in packets smaller than the tunnel MTU.

Option 1: MSS/MTU Clamping at the Network Edge (POS Client)

  • 1.a – Clamp MSS/MTU on the client OS. Feasibility question: is this doable at every client OS endpoint?
  • 1.b – Clamp MSS/MTU on the router/firewall. Feasibility question: is this manageable across the network infrastructure?

If feasible, MSS should be clamped such that:

  • TCP segment size + headers ≤ 1383
    • Example: Clamp MTU between 1290–1383 bytes, or clamp MSS to 1240–1343

Option 2: Increase SD-WAN Tunnel MTU

If edge control is not viable, another option is:

  • Increase the SD-WAN/MPLS tunnel MTU from 1383 to 1420
    • This reduces tunnel overhead from 117 bytes to 80 bytes

   

Visual Flow Summary:

  • Frame 107711 – Client → Server, SYN, 0 bytes: 🔹 TCP handshake initiation (MSS = 1380)
  • Frame 107712 – Server → Client, SYN-ACK, 0 bytes: 🔹 Server responds (MSS = 1460)
  • Frame 107713 – Client → Server, ACK, 0 bytes: TCP 3-way handshake completed
  • Frame 107714 – Client → Server, Seq=1 Ack=1, 216 bytes: 🚀 TLS ClientHello
  • Frame 107715 – Server → Client, Seq=1 Ack=217, 0 bytes: 🔄 ACK for ClientHello
  • Frame 107716 – Server → Client, Seq=1, 2820 bytes: TLS ServerHello + Certificate (likely dropped due to MSS/MTU overshoot)
  • Frame 107717 – Server → Client, Seq=2761, 1336 bytes: TLS Certificate chunk (received)
  • Frame 107718 – Server → Client, Seq=4097, 1575 bytes: ServerKeyExchange + ServerHelloDone (received)
  • Frame 107719 – Client → Server, Ack=1, 0 bytes: 🛑 Dup ACK #1 (SACK: 2761–4097)
  • Frame 107720 – Client → Server, Ack=1, 0 bytes: 🛑 Dup ACK #2 (SACK: 2761–4097, 5477–5672)
  • Frame 107721 – Server → Client, Seq=1, 1380 bytes: Fast retransmit of the missing segment (MTU-safe)


Key Observations :

  • Frame 107716 contains the first TLS ServerHello response from the server, starting at TCP sequence number Seq=1.
    • This segment is critical for initiating the TLS handshake and must be received by the client before any further TLS processing can occur.
  • However, the client's ACKs indicate it never received this segment :
    • In Frame 107719, the client responds with Ack=1, meaning it is still waiting for the segment starting at Seq=1.
  • The client does include SACK (Selective Acknowledgment) blocks in its duplicate ACKs, such as:
    • SLE=2761, SRE=4097, which means:
      • “I did not receive your segment starting at Seq=1, but I have received the out-of-order segment from 2761 to 4096.”

This behavior is consistent with a packet drop of the ServerHello segment, likely due to an MSS/MTU overshoot.

Diagnostic Approach

Tools

  • Wireshark: inspect TCP MSS values, observe SYN/SYN-ACK, and identify dropped packets or retransmissions.
  • ping -M do -s [size] [host]: manually probe the path MTU.
  • tracepath / traceroute --mtu: discover where along the path the MTU drops and fragmentation would occur.
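
A concrete probe run might look like this; the host and interface names are placeholders:

# Discover the path MTU hop by hop (tracepath reports "pmtu" when it shrinks)
tracepath -n [host]

# Verify which MSS is actually advertised in SYN / SYN-ACK packets
tcpdump -nv -i eth0 'tcp[tcpflags] & tcp-syn != 0'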

Recommendations 

Immediate

  • Enable MSS clamping on all WAN/tunnel-facing interfaces, e.g. with the Cisco IOS interface command:
    • ip tcp adjust-mss 1360
  • Verify that ICMP Type 3, Code 4 is allowed on firewalls and middleboxes.

Long-Term
  • Perform regular MTU path testing across all critical network paths.
  • Document MTU constraints for each WAN circuit, tunnel, and overlay path.
  • Avoid blindly increasing interface MTU without end-to-end validation.

Proper MTU management and MSS optimization are essential for reliable network communication, especially across complex SD-WAN and VPN architectures. By understanding and mitigating PMTUD black holes, enabling ICMP feedback, and applying MSS clamping, organizations can prevent silent failures and ensure stable connectivity.