ARP problems in EVPN

In any L2 overlay network, ARP handling will always remain a big pain for network operators.

This post explains why you should always set the ARP timeout to less than 5 minutes in L3 EVPN, and why you should be cautious of potential ARP suppression issues in L2 EVPN.

Introduction

Flat L2 networks are simple. Any host can send a packet to another host, and the network will deliver it. If a host wants to discover the MAC address of a remote host, it sends an ARP request, and the network will handle it just like any other broadcast packet. The remote host will send an ARP reply and we’re good.

In practice, this, of course, doesn’t scale and leads to all kinds of outages due to L2 loops and broadcast storms. This is why everyone hates large L2 networks.

Overlays that extend L2 over L3 attempt to improve scalability and reliability by:

  1. getting rid of the L2 core
  2. reducing the amount of BUM flooding

But while (1) comes with pretty much every overlay, implementing (2) is really tricky. Flood-and-learn overlays like VPLS or basic VXLAN don’t even try to mess with ARP.

ARP suppression

Overlays with more advanced control planes try to reduce BUM flooding by implementing ARP suppression. The most notable example is EVPN, along with proprietary technologies like Cisco OTV or VMware NSX.

Fig. 1

Host A sends an ARP request to discover the MAC address of host B, but since SW1 already knows host B’s MAC address (from previously intercepted ARP requests), SW1 responds to host A on behalf of host B, without forwarding the ARP request over VXLAN.

Perfect idea! What could ever go wrong?

As it turns out, plenty of things. Since flat L2 networks were so basic and simple, application developers became over-reliant on the fact that the network will just deliver anything they put on the wire.

Many application clusters use ARP as a keepalive between their nodes. And instead of using unicast ARP or GARP (which should not be suppressed), some send regular broadcast ARP requests. Of course, ARP suppression completely breaks this functionality.

ARP probes are broadcast ARP requests with source IP 0.0.0.0, commonly used for duplicate IP address detection. A proper EVPN implementation should not suppress ARP probes; some earlier implementations did, causing issues.

All modern switches forward transit packets at line rate. BUM packets are not necessarily forwarded at line rate, but the replication capacity is usually still very high. ARP suppression, however, requires that all transit broadcast ARP packets be trapped to the CPU, which is also protected by control plane policing. A large volume of ARP traffic will cause some packets to be dropped and queries to go unanswered. Various bugs or TCAM misprogramming can sometimes cause all ARP packets to be dropped, or only those on a specific VLAN, and so on. Good luck troubleshooting that in a large network.

And then, of course, there are the vendors. On Cisco Nexus switches, ARP suppression is disabled by default and can be enabled with “suppress-arp” under the VNI. On Juniper QFX, ARP suppression is enabled by default and can be disabled with “no-arp-suppression” under vlans. On Arista, ARP suppression is enabled by default only on VLANs with an SVI configured, but you can disable it for specific subnets (e.g. those running clusters that use ARP as keepalives):

router l2-vpn
   arp proxy prefix-list NO_SUPPRESS
!
ip prefix-list NO_SUPPRESS
   seq 10 deny 10.0.0.0/24
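
For comparison, here is a rough sketch of the Nexus and QFX knobs mentioned above. The VNI and VLAN names are made up for illustration, and the exact syntax may vary by platform and release.

Cisco NX-OS (enable per VNI under the NVE interface):

interface nve1
  member vni 10100
    suppress-arp

Juniper QFX (disable per VLAN):

set vlans VLAN100 no-arp-suppression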

And I don’t even want to get started on Microsoft NLB with its ARP bindings to a multicast MAC (explicitly prohibited by RFC 1812). Microsoft is even worse than Juniper when it comes to violating standards.

Adding some routing

Consider a simple network with an L3 switch separating two subnets.

Fig. 2

Assuming host A already has an ARP entry for SW1’s IP, it sends a packet to host B, but SW1 doesn’t know host B’s MAC. So SW1 has to do the following:

  1. Trap the transit IP packet to the CPU
  2. Hold the packet in a queue and create an INCOMPLETE ARP entry while sending an ARP request for host B’s MAC
  3. Once the ARP reply is received, update the ARP entry to REACHABLE and forward the queued packet to host B

At least, this is how it works on Linux-based network OSes. The exact number of packets that can be queued depends on the implementation, but it’s around 256 256-byte packets (roughly 64 kB) per unresolved address. If the ARP can’t be resolved after 3 attempts (state FAILED), the switch will keep sending queries indefinitely, while discarding excess transit packets and sending ICMP destination unreachables to host A.
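
If you want to map this behaviour to concrete knobs, these are the Linux neighbour-table sysctls involved; the values below are stock kernel defaults, and a vendor NOS may tune them differently:

net.ipv4.neigh.default.unres_qlen_bytes = 65536   # bytes of transit packets queued per unresolved neighbour (~256 x 256-byte packets)
net.ipv4.neigh.default.mcast_solicit = 3          # broadcast ARP attempts before the entry goes to FAILED
net.ipv4.neigh.default.retrans_time_ms = 1000     # interval between solicitation attempts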

Other network OSes might behave slightly differently (e.g. Cisco IOS sends ARP requests every 2 seconds and drops transit packets destined to unresolved addresses), but the overall idea is similar. Also, since SW1 is an L3 switch with an SVI, it will also learn host B’s MAC upon hearing the ARP reply.

ARP in L3 EVPN

Now let’s see how this works when L2 is replaced with VXLAN-EVPN.

L3 EVPN comes in 2 flavours – asymmetric and symmetric IRB.

Asymmetric IRB refers to an L3 EVPN design in which all VTEPs have all SVIs; when a packet has to be routed between subnets, the ingress VTEP does the routing, and the egress VTEP does only bridging.

When a VTEP needs to learn the ARP entry for a remote host reachable over VXLAN, it sends an ARP request and the host responds with an ARP reply. Whether that reply actually reaches the querying VTEP doesn’t matter: at this point we want the remote VTEP to generate a MAC-IP route and advertise it over BGP. Both the ARP entry (L3) and the MAC entry (L2) will then be populated from the BGP-EVPN control plane. This is really important because in multihomed setups, the ARP reply can land on another switch.

Fig. 3

Figure 3 illustrates that even if the ARP reply lands on SW2, SW1 still learns the ARP entry for host B from BGP-EVPN. It also learns the MAC address of host B from EVPN.

Now, what can go wrong here?

Consider a TCP-based application that can be silent for more than 5 minutes while still keeping the TCP session alive. SW3 would age out the MAC address after 5 minutes and withdraw the MAC-IP route. In EVPN, there are no separate route types for MAC and ARP – both are covered by type 2 – so SW3 would withdraw both, forcing SW1/SW2 to unlearn the ARP entry. The next time host A wants to initiate a conversation, if the whole process of re-discovering host B’s MAC and advertising it via BGP takes longer than 1 second, SW1/SW2 will send ICMP unreachables back to host A, which will reset the TCP session. Depending on the implementation, this can fail even if everything takes less than 1 second.

The same problem can occur even if the application doesn’t go silent for 5 minutes, but return traffic is asymmetric and doesn’t come over EVPN.

In asymmetric IRB designs, blacklisted MACs can now cause not only duplicate traffic, but also TCP resets in the application.

Symmetric IRB is a more advanced EVPN design where the ingress VTEP routes into a transit VNI and the egress VTEP routes from it into the destination VNI. It allows for better scalability and also eliminates the problem described above.

Well, almost. Gaps of more than 5 minutes in application traffic will still trigger MAC timeouts and subsequent withdrawals of MAC-IP routes, which also causes churn of EVPN host routes.
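
For reference, this is roughly what the transit (L3) VNI piece of a symmetric IRB setup looks like on Arista EOS. The VRF, VLAN and VNI numbers are made up, the syntax varies slightly by EOS release, and this is only a sketch, not a complete fabric config; the last vxlan line maps the VRF to the transit VNI used for routed traffic between VTEPs:

vrf instance TENANT_A
ip routing vrf TENANT_A
!
interface Vlan100
   vrf TENANT_A
   ip address virtual 10.0.0.1/24
!
interface Vxlan1
   vxlan vlan 100 vni 10100
   vxlan vrf TENANT_A vni 50001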

Conclusion

Reducing the ARP timeout from the default 4 hours to less than 5 minutes seems to solve all EVPN IRB ARP problems. These problems can be really difficult to detect and isolate, so in my opinion, ARP timeout < MAC age timer should be best practice in all EVPN deployments.
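
As a concrete illustration, on an NX-OS-style box that means keeping the SVI ARP timeout below the MAC aging timer, something along these lines (the numbers are just examples; check the exact syntax and defaults for your platform):

mac address-table aging-time 1800
!
interface Vlan100
  ip arp timeout 1500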

ARP suppression, on the other hand… I think it was a bad idea in the first place. If an L2 network doesn’t scale, design a proper L3 network. But if people want to step on rakes, why discourage them?

12 thoughts on “ARP problems in EVPN”

  1. Although the standard allows ARP suppression, I don’t expect to see it turned on by default. For our tasks, we always disable this feature. Unfortunately, it’s not always easy or even possible. For example, let’s consider the MX series. First of all, ‘no-arp-suppression’ is a hidden knob here; I’ve asked colleagues from Juniper about it and they said this command is fully supported and won’t be deleted in the future. But it’s hidden anyway, which is suspicious… The second problem is that when you use this knob, it just doesn’t work – the device keeps suppressing ARPs. Somebody told me I needed to deactivate all units that were already in service after activating the knob and then activate them again. Not the most straightforward way to configure a box. I can imagine people who gave up trying to disable suppression and then ran into scaling problems or service interruptions.

    1. Thanks Igor, great comment. Vendor implementations can indeed be tricky, and not always properly documented. The worst thing about ARP suppression is that most of the time it just works, until it doesn’t. I personally have little experience with EVPN on Juniper, so it’s always good to learn about such quirks.

      1. Yeah, I saw that KB article as well. Interestingly enough, I am running some QFX5120 switches in a lab environment, and while those are running Junos 20.4R3.8 (far newer than 19.1R1), the “no-arp-suppression” command is still available and working. So it seems to me like they reversed course at some point and never updated that KB article for some reason (it is still a hidden command though, that remains the same).

    2. We have had mostly the same issues in our EVPN/VXLAN environment with two QFX5100 running 20.2R2-S3.5. We don’t use VXLAN routing, only L2. Configuring the hidden knob “no-arp-suppression” was not helpful, and after reading this article I don’t know anymore whether this command has any effect at all. We have configured the following timers:
      arp-aging-timer 10
      global-mac-table-aging-time 1200
      global-mac-ip-table-aging-time 600
      Can someone recommend optimal settings for these timers? ARP < 5 min. as Dmytro recommended? Most of the problems we have are on large VLANs with hundreds of IP addresses, such as Proxmox clusters. In the past we used version 17.4R1.16 successfully for a long time without any problems. The problems began after upgrading from 17.4 to 18.4R3.

      1. A behaviour change after an upgrade – this is something for JTAC to look at. Implementations of ARP suppression differ in some details across vendors, or even across platforms of the same vendor, believe it or not.

        Unfortunately, my experience with Juniper is mostly limited to interop with Arista and Cisco. Also, this is the reason I keep ranting about Juniper violating standards – in fairness, I think they have many awesome products and deserve some credit for that.

  2. Great article! Thank you!
    So if the switches perform ARP / ND probes for locally learned IPv4 and IPv6 addresses, the problem is solved as well, right?
    The content of the MAC address and ARP/Neighbor tables is decoupled from the application behavior this way.

    1. The problem is that MAC addresses by default age out after 5 minutes, whereas ARP/ND entries time out after 4 hours. Forcing switches to refresh ARP/ND more often is needed to keep the host MAC address in the MAC table.

  3. Hi Dmytro,

    Great blog as usual…
    The default MAC aging timer on NX-OS was changed to 1800 sec and the ARP aging timer to 1500 sec quite a while back, I guess for the very same reasons/problems you elaborated on.

  4. The silent TCP session reminds me of seeing the same problem when adding stateful firewalls in-path. It seems to be the older apps that don’t take into account that they can’t just stay completely silent for long periods of time and still expect a working session when they want it later. I honestly like the idea of these things breaking, as it forces apps that should have updated long ago to update. There’s a whole discussion that could be had on whether network engineering focuses too much effort on “just making it work” for apps that selected the wrong protocols and didn’t implement the correct, already existing protocol features to do the job. Of course I say this here, but in practice my answer is always “let me see what I can do” 🙂

  5. Thank you very much for this great contribution. I am new to this topic, but I need to implement EVPN/VXLAN to interconnect data centers with Arista equipment. I have come across a configuration I have many doubts about: VARP. I would like to know what VARP is for in an EVPN/VXLAN fabric. I should mention that I am not using MLAG or MC-LAG, so I want to know what benefit VARP gives to EVPN/VXLAN, or whether I am misapplying this configuration.

    1. VARP is designed to be used specifically with MLAG, without VXLAN-EVPN. It can work with EVPN, but there are some issues and it’s not a good idea to use it. Arista has “ip address virtual” which should be used instead of VARP in EVPN environments.

      My advice is to check the Arista design guides on their website, more specifically the EVPN deployment guide.
