Recently, we have been observing frequent connectivity drops on seemingly random application servers. Packet captures (tcpdump) revealed that some servers stop receiving ARP replies for their destination IP addresses, even though source and destination are in the same VLAN and subnet.
This raises an important question: what could cause ARP failures in a flat Layer 2 domain?
A Quick Refresher: What is ARP?
Address Resolution Protocol (ARP) is the mechanism that maps an IP address to its corresponding MAC address on a local Ethernet network.
Without ARP, hosts cannot communicate within the same subnet.
- Layer 2 to Layer 3 Mapping: NICs don’t understand IP addresses—they only use MAC addresses. ARP provides the “glue” between IP (Layer 3) and Ethernet (Layer 2).
- MAC Addresses: Every Ethernet NIC has a 48-bit MAC address, assigned by the manufacturer and intended to be globally unique (although it can be overridden in software), which is used to deliver Ethernet frames.
- Operation: If host 10.0.0.11 wants to communicate with 10.0.0.22, it broadcasts an ARP request asking: “Who has 10.0.0.22?”. The host owning that IP replies with its MAC address.
Where Things Go Wrong: Possible ARP Table Overflow
In a /24 subnet (255.255.255.0), we can assign up to 254 usable IP addresses. If the subnet is fully utilized (or close to it), ARP tables on routers, Layer 3 switches, or even servers may hit capacity limits, particularly on devices that also hold entries for other subnets.
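The 254 figure follows from the prefix length: a /24 leaves 8 host bits, giving 2^8 = 256 addresses, minus the network and broadcast addresses. A minimal sketch of that arithmetic, where `usable_hosts` is a hypothetical helper name chosen for illustration:

```shell
# usable_hosts PREFIX: print the number of assignable host addresses in an
# IPv4 subnet with the given prefix length (meaningful for /1 through /30).
usable_hosts() {
    prefix=$1
    # Total addresses is 2^(32 - prefix); subtract the network and
    # broadcast addresses, which cannot be assigned to hosts.
    echo $(( (1 << (32 - prefix)) - 2 ))
}

for p in 24 25 26; do
    echo "/$p: $(usable_hosts "$p") usable addresses"
done
```

This prints 254, 126, and 62 usable addresses for /24, /25, and /26 respectively.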
What we observed:
- No ARP reply received for some destination IPs.
- Symptoms appear random—one server works fine while another loses connectivity.
- Recovery sometimes only happens after restarting the network interface (which flushes the operating system's ARP entries for that interface).
This suggests a case of ARP table overflow, where the device managing ARP entries runs out of space.
 
Why Does ARP Overflow Matter?
When an ARP table is full:
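- The garbage collector may discard ARP entries, and from the outside the choice can look random. If nothing can be reclaimed, new entries are refused (Linux typically logs a "neighbour table overflow" kernel message when this happens).
- Without a usable entry, the host can no longer resolve the MAC address for that destination IP.
- Until the OS manages to re-resolve the address, the ARP cache is flushed, or the device itself is rebooted, communication to that destination fails.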
Essentially, the host knows where to send packets (IP), but doesn’t know how to send them (MAC).
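On Linux, a full neighbor table usually shows up in the kernel log as a "neighbour table overflow" message, though the exact wording and spelling vary by kernel version. The sketch below counts such lines in whatever log text is piped to it. The function name `count_neigh_overflow` is hypothetical, and the pattern is an assumption meant to cover the common spellings rather than an exhaustive match.

```shell
# count_neigh_overflow: read kernel log text on stdin and print how many
# lines report a neighbor table overflow. The pattern is case-insensitive
# and accepts both the "neighbour" and "neighbor" spellings.
count_neigh_overflow() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function from failing in that case.
    grep -Eic 'neighbou?r table overflow' || true
}

# Example (reading the kernel log may require root):
#   dmesg | count_neigh_overflow
```

A non-zero count on a server that is losing connectivity would support the overflow theory; a count of zero points toward the router or switch instead.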
 
Potential Culprits
- Router ARP Table Limit – Routers maintain ARP caches for each connected subnet. Hitting the per-interface ARP entry limit can cause drops.
- Switch ARP/Forwarding Table Overflow – Switches have finite tables too: MAC address (CAM) tables on L2 switches and ARP tables on L3 switches. Overflow can lead to failed lookups or to unknown-unicast flooding.
- Server-Side ARP Cache Limits – Even Linux/Windows servers have configurable limits for ARP cache entries.
gc_thresh1 (since Linux 2.2) : The minimum number of entries to keep in the ARP cache. The garbage collector will not run if there are fewer than this number of entries in the cache. Defaults to 128.
gc_thresh2 (since Linux 2.2) : The soft maximum number of entries to keep in the ARP cache. The garbage collector will allow the number of entries to exceed this for 5 seconds before collection will be performed. Defaults to 512.
gc_thresh3 (since Linux 2.2) : The hard maximum number of entries to keep in the ARP cache. The garbage collector will always run if there are more than this number of entries in the cache. Defaults to 1024.
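To make the three thresholds concrete, the sketch below classifies an entry count against them. `gc_zone` is a hypothetical helper, and the zone descriptions are a simplified reading of the definitions above rather than an exact model of the kernel's garbage collector.

```shell
# gc_zone COUNT THRESH1 THRESH2 THRESH3: describe where an entry count
# falls relative to the three garbage-collection thresholds.
gc_zone() {
    count=$1; t1=$2; t2=$3; t3=$4
    if [ "$count" -lt "$t1" ]; then
        echo "below gc_thresh1: garbage collector does not run"
    elif [ "$count" -le "$t2" ]; then
        echo "within soft limit: normal garbage collection"
    elif [ "$count" -lt "$t3" ]; then
        echo "above gc_thresh2: tolerated briefly, then collected"
    else
        echo "at or above gc_thresh3: new entries may be refused"
    fi
}

# Example with the default thresholds:
gc_zone 700 128 512 1024

# With live values on a Linux host (entry count is approximate):
#   d=/proc/sys/net/ipv4/neigh/default
#   gc_zone "$(ip -4 neigh show | wc -l)" \
#       "$(cat $d/gc_thresh1)" "$(cat $d/gc_thresh2)" "$(cat $d/gc_thresh3)"
```

With the defaults, 700 entries falls between the soft and hard maximums, so the example reports the "above gc_thresh2" zone.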
Troubleshooting Commands
Here are useful commands to investigate ARP-related issues on Linux servers:
# Show current ARP cache entries
arp -n
# Modern replacement command
ip neigh show
# Clear ARP cache (requires root; permanent entries are not flushed)
ip -s -s neigh flush all
# Check ARP kernel parameters
cat /proc/sys/net/ipv4/neigh/default/gc_thresh1
cat /proc/sys/net/ipv4/neigh/default/gc_thresh2
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3
# Adjust thresholds (increase ARP cache size if needed; requires root and does not persist across reboots)
sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=1024
sysctl -w net.ipv4.neigh.default.gc_thresh3=2048
Recommendations
- Check ARP Table Sizes on routers, switches, and servers in the VLAN. Verify if limits are being hit.
- Segment Large VLANs – If the VLAN is hosting too many hosts (close to 254), consider splitting it into smaller subnets (e.g., /25, /26), each on its own VLAN, to shrink the broadcast domain and reduce ARP load.
- Monitor ARP Cache Usage – Many network devices provide counters/logs for ARP cache utilization.
- NIC/OS Tuning – Adjust ARP cache timeouts and maximum entries in server OS (Linux: /proc/sys/net/ipv4/neigh/*).
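For server-side monitoring, one lightweight option is to summarize neighbor entries by state. The sketch below assumes the usual `ip neigh show` output format, in which the state (REACHABLE, STALE, FAILED, and so on) is the last field of each line; `neigh_state_summary` is a hypothetical helper name.

```shell
# neigh_state_summary: read `ip neigh show` output on stdin and print a
# count per neighbor state (the last field of each line), sorted by state.
neigh_state_summary() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example:
#   ip -4 neigh show | neigh_state_summary
```

A growing number of FAILED or INCOMPLETE entries suggests that ARP requests are going unanswered, which matches the symptom described at the top of this post.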
 