In high-load cloud networks there is so much traffic that even 100G/400G port speeds do not suffice, so sharing the load over multiple links is the only feasible solution.
Introduction
ECMP stands for Equal Cost Multi-Path – when a route points to multiple nexthops so that the router can load balance traffic among them.
What if you have multiple links between 2 routers like on figure 1?
Fig. 1
In a small network it doesn’t matter whether to run ECMP or LAG (Port-channel) in this setup, but in a scaled network the correct solution is always LAG. First of all, ECMP would imply a separate routing protocol session per link. In link-state IGPs it is possible to reduce the amount of flooding by using dynamic flooding [draft-ietf-lsr-dynamic-flooding] or static flood suppression knobs provided by some vendors, but the sessions will still remain; furthermore, the IS-IS LSP or OSPF LSA will still list all links, and those will be used for SPF calculation by the entire routing domain. Therefore, ECMP is almost never used for multiple links between the same pair of routers.
ECMP routing is more common in topologies like on figure 2:
Fig. 2
Load sharing happens not just over multiple links, but over multiple routers – running LAG is not possible here [1].
Best path selection
In IS-IS and OSPF there is not much specific to ECMP. All routers synchronize their LSDB and calculate the best path – if there are multiple equal best paths, that’s fine. One caveat: if you write your own Dijkstra implementation and have to deal with a mix of ECMP paths over P2P and multipoint links, evaluate the multipoint links first [2] – a long time ago even some mainstream vendors had bugs because of that. I also encountered this problem while writing a node protection checker for TI-LFA.
BGP always needs to choose one best path that it will propagate further.
Fig. 3
On figure 3, R3 receives a prefix from two different ASes. Assume all attributes relevant for best path selection are the same and ECMP is enabled. The AS_PATH length is also the same, but the paths contain different ASes. The behaviour here depends on the implementation: for instance, Arista EOS will use both routes (although it is possible to disable this with “no bgp bestpath as-path multipath-relax”). Cisco IOS/IOS-XR by default don’t ECMP across multiple ASes and require “bgp bestpath as-path multipath-relax” – on IOS this command is hidden for some reason. On JUNOS this is called “multipath multiple-as” under the BGP peer group.
In either case, assume we configured it correctly and R3 installed an ECMP route. But it will still advertise only one path to R4. Depending on the implementation, it will either apply a tie-breaker to choose which path to advertise, or just advertise the first route that was installed in the FIB.
Output on EOS:
R3#sh ip bgp 192.168.0.0/24
BGP routing table information for VRF default
Router identifier 3.3.3.3, local AS number 300
BGP routing table entry for 192.168.0.0/24
 Paths: 2 available
  100 500
    10.5.5.5 from 10.5.5.5 (5.5.5.5)
      Origin IGP, metric 0, localpref 100, IGP metric 1, weight 0, received 00:00:19 ago, valid, external, ECMP head, ECMP, best, ECMP contributor
      Rx SAFI: Unicast
  200 500
    10.4.4.4 from 10.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 100, IGP metric 1, weight 0, received 00:00:16 ago, valid, external, ECMP, ECMP contributor
      Rx SAFI: Unicast
R3 uses both routes, but only one with AS_PATH 100 500 is advertised to R4.
R4#sh ip bgp 192.168.0.0/24
BGP routing table information for VRF default
Router identifier 4.4.4.4, local AS number 300
BGP routing table entry for 192.168.0.0/24
Paths: 1 available
100 500
10.5.5.5 from 10.2.2.3 (3.3.3.3)
Origin IGP, metric 0, localpref 100, IGP metric 0, weight 0, received 00:02:13 ago, valid, internal, best
Rx SAFI: Unicast
While in the scenario described above it doesn’t really matter which route is advertised to R4, it becomes important with MPLS. In https://routingcraft.net/segregated-routing/ I described a scenario with ECMP in MPLS L3VPN and EVPN; there it is solved by using a unique RD per PE. If the topology on figure 3 runs MPLS (without VPNs) and the prefix is propagated in the default table, ECMP will not work because R4 will resolve the route via an MPLS LSP (e.g. to R1) and forward all traffic there. So the only solution is to use BGP add-path, so that R3 advertises both routes.
R3(config-router-bgp)#bgp additional-paths send any

R4#sh ip bgp nei 10.2.2.3 | grep "Additional-paths recv capability" -A1
  Additional-paths recv capability:
    IPv4 Unicast: negotiated
Now R4 receives both routes, distinguished by path ID:
R4#sh ip bgp 192.168.0.0/24
BGP routing table information for VRF default
Router identifier 4.4.4.4, local AS number 300
BGP routing table entry for 192.168.0.0/24
 Paths: 2 available
  100 500
    10.5.5.5 from 10.2.2.3 (3.3.3.3)
      Origin IGP, metric 0, localpref 100, IGP metric 0, weight 0, received 00:02:22 ago, valid, internal, best
      Rx path id: 0x1
      Rx SAFI: Unicast
  200 500
    10.4.4.4 from 10.2.2.3 (3.3.3.3)
      Origin IGP, metric 0, localpref 100, IGP metric 0, weight 0, received 00:02:21 ago, valid, internal
      Rx path id: 0x2
      Rx SAFI: Unicast
Path ID is not a BGP attribute; it’s more like an extension to the NLRI that makes it unique (even if the prefix is the same). It is similar to a Route Distinguisher, but can be used in regular (not multiprotocol) BGP. See also [RFC7911].
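For illustration, here is a rough sketch of the add-path NLRI encoding from RFC 7911 – a 4-octet Path Identifier prepended to the usual (length, prefix) encoding. The helper below is just an illustration, not taken from any BGP implementation:

import struct
import ipaddress

def encode_addpath_nlri(path_id: int, prefix: str) -> bytes:
    """Encode one IPv4 unicast NLRI with an RFC 7911 Path Identifier.

    Wire format: 4-octet Path Identifier, then the usual 1-octet prefix
    length followed by the minimal number of prefix octets.
    """
    net = ipaddress.ip_network(prefix)
    plen = net.prefixlen
    noctets = (plen + 7) // 8            # only as many octets as the length needs
    prefix_bytes = net.network_address.packed[:noctets]
    return struct.pack("!IB", path_id, plen) + prefix_bytes

# Two copies of 192.168.0.0/24, distinguished only by path ID,
# as R3 would advertise them to R4 with add-path enabled.
print(encode_addpath_nlri(0x1, "192.168.0.0/24").hex())   # 0000000118c0a800
print(encode_addpath_nlri(0x2, "192.168.0.0/24").hex())   # 0000000218c0a800

The prefix part is identical in both NLRI; only the leading 4 octets differ, which is exactly what lets R4 keep both paths.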
Inter-AS VPN option B and ECMP
There is another problem related to BGP best path and ECMP, specific to MPLS Inter-AS VPN option B. This is the most commonly used flavour of inter-AS VPN since it doesn’t require the ASBRs to have VRFs (which is the case in option A) and is less complex than option C.
Consider the following topology:
Fig. 4
PE1 and PE2 run L3VPN with a unique RD per PE, so that other routers can do ECMP towards CE1 prefixes. While this works fine within the AS, when ASBR1 advertises those routes to AS 200, it changes the nexthop to self and reallocates the VPN labels.
ASBR1#sh bgp vpn-ipv4 192.168.101.0/24
BGP routing table information for VRF default
Router identifier 3.3.3.3, local AS number 100
BGP routing table entry for IPv4 prefix 192.168.101.0/24, Route Distinguisher: 1.1.1.1:1
 Paths: 1 available
  65001 (Received from a RR-client)
    1.1.1.1 from 1.1.1.1 (1.1.1.1)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community: Route-Target-AS:1:1
      Remote MPLS label: 116384
      Local MPLS label (allocated for received VPN routes): 116385
BGP routing table entry for IPv4 prefix 192.168.101.0/24, Route Distinguisher: 2.2.2.2:1
 Paths: 1 available
  65001 (Received from a RR-client)
    2.2.2.2 from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community: Route-Target-AS:1:1
      Remote MPLS label: 100000
      Local MPLS label (allocated for received VPN routes): 116386
ASBR1 installs the respective routes in the LFIB: swap 116385 with 116384 and forward to PE1; swap 116386 with 100000 and forward to PE2.
ASBR1#sh mpls lfib route
 B3   116385   [0]
                via IS-IS SR tunnel index 2, swap 116384
                payload ipv4, apply egress-acl
                 via 10.0.0.1, Ethernet1, label imp-null(3)
 B3   116386   [0]
                via IS-IS SR tunnel index 3, swap 100000
                payload ipv4, apply egress-acl
                 via 10.4.4.2, Ethernet2, label imp-null(3)
ASBR2 again changes the nexthop to self and allocates new MPLS labels (which in this example happen to be the same values ASBR1 allocated, purely by coincidence). PE3 learns two routes and installs them in the VRF routing table.
PE3#sh bgp vpn-ipv4 192.168.101.0/24
BGP routing table information for VRF default
Router identifier 5.5.5.5, local AS number 200
BGP routing table entry for IPv4 prefix 192.168.101.0/24, Route Distinguisher: 1.1.1.1:1
 Paths: 1 available
  100 65001
    4.4.4.4 from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community: Route-Target-AS:1:1
      Remote MPLS label: 116385
BGP routing table entry for IPv4 prefix 192.168.101.0/24, Route Distinguisher: 2.2.2.2:1
 Paths: 1 available
  100 65001
    4.4.4.4 from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric -, localpref 100, weight 0, valid, internal, best
      Extended Community: Route-Target-AS:1:1
      Remote MPLS label: 116386

PE3#sh ip ro vrf one 192.168.101.0/24
 B I      192.168.101.0/24 [200/0]
           via 4.4.4.4/32, IS-IS SR tunnel index 1, label 116385
              via 10.3.3.4, Ethernet1, label imp-null(3)
           via 4.4.4.4/32, IS-IS SR tunnel index 1, label 116386
              via 10.3.3.4, Ethernet1, label imp-null(3)
What is peculiar about those routes is that they both have the same nexthop. In this particular case, the router must distinguish ECMP routes by the set of [nexthop, label] rather than just nexthop.
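A minimal sketch of that data-structure point (hypothetical structures, not any vendor’s code): keying ECMP members on the nexthop alone collapses the two option B paths into one, while keying on (nexthop, label) keeps both.

# Hypothetical sketch: ECMP members keyed on nexthop only vs (nexthop, label).
paths = [
    {"nexthop": "4.4.4.4", "label": 116385},   # via RD 1.1.1.1:1
    {"nexthop": "4.4.4.4", "label": 116386},   # via RD 2.2.2.2:1
]

by_nexthop = {p["nexthop"] for p in paths}
by_nexthop_label = {(p["nexthop"], p["label"]) for p in paths}

print(len(by_nexthop))         # 1 -> a nexthop-only FIB sees a single path, no ECMP
print(len(by_nexthop_label))   # 2 -> keying on (nexthop, label) preserves both paths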
It happens to work on Arista EOS (because it is a fresh implementation of option B that takes care of this caveat), but most implementations can’t use ECMP in this scenario.
If PE3 runs Cisco IOS in the same topology, it can’t ECMP because it identifies a path only by nexthop, not by [nexthop, label]:
PE3#sh bgp vpnv4 unicast all 192.168.101.0/24
BGP routing table entry for 1.1.1.1:1:192.168.101.0/24, version 19
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 1
  100 65001
    4.4.4.4 (metric 10) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out nolabel/18
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 2.2.2.2:1:192.168.101.0/24, version 22
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 1
  100 65001
    4.4.4.4 (metric 10) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out nolabel/20
      rx pathid: 0, tx pathid: 0x0

PE3#sh ip ro vrf one | sec 192.168.101.0
B        192.168.101.0/24 [200/0] via 4.4.4.4, 00:02:19
A possible workaround for ECMP in older inter-AS option B implementations is to configure ASBR2 so that it advertises different routes with different nexthops (e.g. by creating multiple loopbacks and rewriting nexthops with route maps).
The same problem, by the way, would occur with the default VRF + add-path, or with other MPLS services. See also draft-mohanty-bess-mutipath-interas.
Load sharing in data plane
So we have multiple equal paths and start getting data traffic. How do we load balance that traffic over multiple paths? Per-packet round-robin is not a good idea: out-of-order delivery will destroy TCP performance. We have to make sure that all packets belonging to the same flow are always sent over the same link, while spreading different flows as evenly as possible over the available links. But how do we identify a flow? IPv6 has the Flow Label [RFC6437] – very convenient, but the vast majority of traffic today is still IPv4.
Stateful firewalls inspect transit traffic and identify sessions based on L4 or L7 data – the session ID can then be used for ECMP load sharing [3].
Linux routers with a kernel-based dataplane did ECMP on pre-3.6 kernels by using the route cache, which meant all packets to a given destination IP would be sent over the same link. After 3.6, ECMP was broken because the route cache was removed [4] and the kernel started doing per-packet load balancing, which never works properly. In 4.4, hash-based ECMP was introduced, so Linux now behaves more or less like all normal routers (see below). For more details, check out https://cumulusnetworks.com/blog/celebrating-ecmp-part-two/
What most routers and switches do is hash on packet headers. Common options are the 5-tuple (src IP, dst IP, IP protocol, TCP/UDP src port, TCP/UDP dst port) or the 7-tuple (the same plus src/dst MAC addresses). The router calculates a hash value from all that information, and that hash value determines the outgoing link.
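A minimal sketch of the idea – the link names, the field set and the hash function are made up; real ASICs use their own CRC/XOR hashes over configurable fields:

import hashlib

links = ["Ethernet1", "Ethernet2", "Ethernet3", "Ethernet4"]

def pick_link(src_ip, dst_ip, proto, sport, dport, seed=0):
    """Map a 5-tuple to one of the equal-cost links.

    MD5 is just a stand-in that gives a stable, well-mixed value;
    the seed will matter in the polarization section below.
    """
    key = f"{seed}|{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return links[h % len(links)]

# All packets of one flow always land on the same link...
print(pick_link("10.0.0.1", "192.168.0.10", 6, 51515, 443))
# ...while different flows spread across the links.
print(pick_link("10.0.0.2", "192.168.0.10", 6, 40000, 443))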
Overlay networks
Some encapsulations used for overlays have been designed with ECMP in mind. VXLAN [RFC7348] uses varied source UDP port numbers, so that underlay devices don’t have to look inside the VXLAN header to calculate ECMP hashes. GUE [draft-ietf-intarea-gue] is also UDP-based and follows the same approach as VXLAN. NVGRE [RFC7637] has the FlowID field for ECMP entropy. MPLS initially did not have any special functionality to support ECMP, which led to a lot of problems, described below.
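As a sketch of the VXLAN approach – RFC 7348 recommends deriving the outer UDP source port from a hash of the inner headers, within the ephemeral range; the hash below is just a stand-in:

import hashlib

def vxlan_source_port(inner_src_mac, inner_dst_mac, inner_5tuple):
    """Derive the outer UDP source port from the inner flow,
    restricted to the ephemeral range 49152-65535 per RFC 7348's recommendation.
    """
    key = f"{inner_src_mac}|{inner_dst_mac}|{inner_5tuple}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:2], "big")
    return 49152 + (h % (65535 - 49152 + 1))

print(vxlan_source_port("00:1c:73:aa:bb:01", "00:1c:73:aa:bb:02",
                        ("10.0.0.1", "10.0.0.2", 6, 33333, 80)))

Transit routers then hash on the plain outer IP/UDP headers and still get per-inner-flow entropy without parsing the VXLAN payload.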
By the way, strictly speaking, ECMP is not load balancing but load sharing (although both terms are commonly used). There is no guarantee of equal distribution; some links can be more utilized than others. Having said that, with a high enough number of flows, modern switches utilize all paths almost equally. If you see links over- or under-utilized, there is either something very specific about those flows or traffic types (there are plenty of caveats related to MPLS, see below).
Some old switches like the Cisco 6500 had limitations: if the number of ECMP paths was not a power of 2, load sharing happened in skewed proportions, with some links getting more traffic than others.
Traffic polarization
Calculating ECMP hash based only on 5- or 7-tuple will cause a problem when ECMP is done in 2 stages. Consider the following topology:
Fig. 5
A lot of traffic flows with different IPs and ports enter SuperSpine-1 from upstream. SuperSpine-1 sends approximately half of the flows via Spine-1, and the other half via Spine-2. For simplicity, assume that flows with an even hash value are sent over the blue path and flows with an odd hash value over the red path. Now think about this: Spine-1 receives only “even” flows and Spine-2 receives only “odd” flows. If all switches use exactly the same hash algorithm, the hash result on each Spine switch will be the same for all flows it receives, so each Spine will push all of its traffic onto a single uplink. This phenomenon is known as traffic polarization. It can happen in many different topologies that do ECMP in 2 or more stages. The solution is simple: when calculating the hash, also include some information unique to the router – the router ID, the incoming interface number, an arbitrary value (hash seed), etc. Using a different hash algorithm (if configurable) on different levels can also help.
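A toy simulation of the effect under simplified assumptions – two stages of 2-way ECMP, the same MD5-based hash everywhere as a stand-in for the real hardware hash:

import hashlib, random
from collections import Counter

def hash_flow(flow, seed=0):
    key = f"{seed}|{flow}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big")

random.seed(1)
flows = [(f"10.0.{random.randint(0, 255)}.{random.randint(1, 254)}",
          "192.168.0.10", 6, random.randint(1024, 65535), 443)
         for _ in range(10000)]

def two_stage(flows, spine_seed):
    """Stage 1 (SuperSpine) picks a spine, stage 2 (Spine) picks an uplink."""
    spine_uplinks = Counter()
    for f in flows:
        spine = hash_flow(f) % 2                      # SuperSpine decision
        uplink = hash_flow(f, seed=spine_seed) % 2    # Spine decision
        spine_uplinks[(spine, uplink)] += 1
    return spine_uplinks

# Same hash everywhere: only (0,0) and (1,1) appear -> each spine uses one uplink.
print(two_stage(flows, spine_seed=0))
# Per-router seed on the spines: all four (spine, uplink) pairs carry traffic again.
print(two_stage(flows, spine_seed=42))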
Fragmentation
Believe it or not, IPv4 fragmentation still exists and must be supported by routers. If L4 info is used for hash calculation, this becomes a problem, since all fragments but the first one carry no L4 headers, so fragments of the same flow can be sent over different links. A simple (but incomplete) solution is not to use L4 headers for hashing if a packet is fragmented (either the More Fragments flag is set or the Fragment Offset is non-zero). But the same flow can contain both fragmented and unfragmented packets – in that case, L4 headers will be used for hashing for some packets but not for others.
The Linux kernel by default does not use L4 headers for hashing. As for hardware-based routers, it depends, but in most cases L4 headers are used. If packets arrive already fragmented, the same problem will occur.
In either case, if you absolutely have to use fragmentation together with ECMP, the only 100% working solution is to completely exclude L4 headers from hashing.
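To make the field-selection logic concrete, here is a small sketch (the packet representation is made up):

def hash_fields(pkt, use_l4_for_unfragmented=True):
    """Choose which header fields feed the ECMP hash.

    pkt is a dict with keys: src, dst, proto, sport, dport,
    mf (More Fragments flag), frag_offset.
    """
    fields = [pkt["src"], pkt["dst"], pkt["proto"]]
    is_fragment = pkt["mf"] or pkt["frag_offset"] != 0
    if use_l4_for_unfragmented and not is_fragment:
        fields += [pkt["sport"], pkt["dport"]]
    return tuple(fields)

unfragmented = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 17,
                "sport": 5000, "dport": 53, "mf": False, "frag_offset": 0}
fragment = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 17,
            "sport": None, "dport": None, "mf": False, "frag_offset": 185}

# The naive rule hashes unfragmented packets on L4 but fragments on L3 only,
# so packets of the same flow can land on different links:
print(hash_fields(unfragmented), hash_fields(fragment))
# Excluding L4 entirely keeps every packet of the flow on one hash key:
print(hash_fields(unfragmented, use_l4_for_unfragmented=False),
      hash_fields(fragment, use_l4_for_unfragmented=False))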
This is, by the way, one of the cases where the layered networking model we were taught at school (OSI or TCP/IP) fails: even though transit routers don’t really care what application runs on top of TCP or UDP, they still use those headers sometimes. The other case is MPLS, reviewed below.
Ghetto load balancing
A load balancer is a sophisticated device that not only distributes load across multiple servers, but also monitors service availability and reroutes traffic in case of server failures; load balancers also provide various sorts of performance optimizations and basic DDoS protection.
Now consider the following topology:
Fig. 6
What some people do instead of buying expensive load balancers: set up BGP sessions between the servers and the switch and have every server advertise the same anycast prefix to the switch. This design relies heavily on ECMP being done properly, since any packet reordering would result in TCP sessions failing.
If ECMP routes are added or removed, an existing flow can be rehashed to a different link. That’s not a big deal for most ECMP deployments, but here it’s not just ECMP but also anycast – a rehashed flow lands on a different server, which breaks the TCP session. The solution is resilient ECMP: an algorithm that keeps existing flows on their old links even after the set of ECMP nexthops changes.
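A minimal sketch of one way resilient ECMP can be realized – a fixed-size bucket table where only the buckets of a failed nexthop are remapped. The bucket count and the remap policy are illustrative; real implementations differ:

NUM_BUCKETS = 64   # real switches use a few hundred to a few thousand buckets

def build_table(nexthops):
    # Plain ECMP: bucket i -> nexthops[i % len(nexthops)].
    # Removing a nexthop and rebuilding reshuffles most buckets.
    return [nexthops[i % len(nexthops)] for i in range(NUM_BUCKETS)]

def remove_nexthop(table, dead, survivors):
    # Resilient ECMP: only buckets of the failed server are remapped;
    # flows hashed to surviving servers keep their buckets (and TCP sessions).
    return [survivors[i % len(survivors)] if nh == dead else nh
            for i, nh in enumerate(table)]

servers = ["srv1", "srv2", "srv3", "srv4"]
table = build_table(servers)

rebuilt = build_table(["srv1", "srv2", "srv4"])                     # naive rebuild after srv3 fails
resilient = remove_nexthop(table, "srv3", ["srv1", "srv2", "srv4"])  # resilient remap

def moved(new):
    # How many buckets of *surviving* servers changed owner?
    return sum(1 for a, b in zip(table, new) if a != b and a != "srv3")

print(moved(rebuilt), moved(resilient))   # many buckets move vs. none of the survivors move

Flows are hashed into buckets rather than directly into nexthops, so as long as a bucket keeps its owner, the flow keeps its server.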
MPLS
MPLS and ECMP are known to not work well together. RSVP-TE does not consider ECMP when setting up LSPs, so in some cases the only way to utilize all paths is to set up many tunnels and share traffic among them. Segment Routing brought some improvements, but also new problems.
SR-TE
A major advantage of SR-TE over RSVP-TE is the ability to use ECMP. Consider the following topology:
Fig. 7
If an SR-TE policy on R1 is configured to steer traffic with the segment list [SID5, SID7], and R2 has an ECMP route towards R5, traffic will be load shared as shown on the diagram. There is no need to set up 2 SR-TE policies on R1, as would be the case with RSVP-TE.
FEC and EEDB optimizations
FEC is Forwarding Equivalence Class – a collection of traffic types treated by the MPLS network in the same way (e.g. forwarded via the same LSP). EEDB is the Egress Encapsulation Database – the label stacks used to encapsulate traffic forwarded via LSPs.
Consider the following topology running LDP:
Fig. 8
PE1 receives the route to PE2 from 4 different routers, so it can do ECMP. But P1–P4 all allocate LDP labels independently, and those will be different labels (they could match only by coincidence). For each ECMP path, PE1 has to install a separate label stack to push.
If in the same topology LDP is replaced with Segment Routing:
Fig. 9
If all routers use the same SRGB (which is best practice for all SR deployments), PE1 will push the same label to reach PE2 via any ECMP path. Therefore, it is possible to optimize FIB programming by installing fewer entries in EEDB.
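A toy illustration of why the shared SRGB enables this – the SRGB base, the SID index and the LDP label values are made up:

SRGB_BASE = 900000          # same SRGB configured on every router (best practice)
PE2_SID_INDEX = 22          # PE2's prefix-SID index, illustrative value

# With LDP, each upstream P router hands out its own label for PE2's FEC:
ldp_labels = {"P1": 24011, "P2": 24384, "P3": 24007, "P4": 24020}   # arbitrary, independent
ldp_eedb_entries = {(p, lbl) for p, lbl in ldp_labels.items()}

# With SR and a shared SRGB, the label is the same via every P router:
sr_label = SRGB_BASE + PE2_SID_INDEX
sr_eedb_entries = {sr_label}

print(len(ldp_eedb_entries))   # 4 distinct push entries, one per ECMP path
print(len(sr_eedb_entries))    # 1 entry shared by all ECMP paths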
As of today, Arista does this optimization by default on the 7280/7500 R series, and Cisco does something similar on the NCS 5500 series (both platforms are Jericho-based). There are probably other vendors/platforms doing the same.
How significant the EEDB optimization is depends on the platform’s hardware architecture and the network design (how many ECMP paths are used), but it is an important step for MPLS in the Data Center and other networks that rely heavily on ECMP.
4/6 MAC problem
This is a very well known one. Consider the following topology:
Fig. 10
PE1 and PE2 run L2 and L3 MPLS VPN services. To P1, a VPN label looks like any other label, so it cannot tell from the label stack whether the payload is IPv4, IPv6 or non-IP – yet to use L3/L4 headers for hashing it must parse that payload. What all MPLS routers do is take a guess: if the first nibble (half-byte) of the payload is 4, they parse it as IPv4; if the first nibble is 6, they parse it as IPv6; otherwise they assume it’s non-IP and don’t do ECMP. The obvious problem here is that ECMP load sharing for L2VPN traffic is not possible.
Other than that, it more or less worked – until the IEEE started allocating MAC addresses that start with a 4. Packets destined to those MAC addresses and forwarded over L2VPN were misinterpreted as IPv4, which caused packet reordering and all sorts of problems.
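Here is a toy illustration of that heuristic (plain Python, not any ASIC’s actual logic); the MAC address is made up – what matters is only that its first nibble is 4:

def guess_payload(payload: bytes):
    """The heuristic MPLS routers apply below the label stack:
    first nibble 4 -> treat as IPv4, 6 -> IPv6, anything else -> non-IP.
    """
    nibble = payload[0] >> 4
    if nibble == 4:
        return "ipv4"      # hash on the inner IP/L4 headers
    if nibble == 6:
        return "ipv6"
    return "non-ip"        # no ECMP, single path

print(guess_payload(bytes([0x45, 0x00])))             # a real IPv4 header -> "ipv4"
# An Ethernet frame whose destination MAC starts with 0x4 looks like IPv4 too:
print(guess_payload(bytes.fromhex("4a0000000001")))   # misclassified as "ipv4"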
The solution is to use the control word for L2VPN – another 4 bytes inserted between the VPN label and the payload, so that the first nibble beyond the VPN label is 0.
See also https://seclists.org/nanog/2016/Dec/29 and BCP128
The Flow-Aware Transport (FAT) label [RFC6391] enables ECMP for L2VPN traffic. It is just a dummy label allocated by the PE, inserted between the VPN label and the control word. P routers don’t have to understand it; what matters is that different flows get different FAT labels, so they can be hashed across different links. The label stack in L2VPN then looks like this:
Fig. 11
Entropy label
As I wrote in the previous section, the 4/6 nibble matching doesn’t work very well; besides, if the label stack is large, most ASICs can’t look deep enough to find the L3/L4 headers for ECMP hashing. So the IETF had to invent the Entropy Label [RFC6790], used for both L2 and L3 VPNs as well as other MPLS services.
While the FAT label is signaled between the PEs and all transit routers are completely oblivious to FAT functionality (although they use the label value for ECMP hashing), the EL must be supported by all routers on the path in order to work properly. It is preceded by the Entropy Label Indicator (ELI) – special label value 7. The [ELI + EL] pair is inserted after the transport label.
Fig. 12
The penultimate LSR (P2 on figure 12) considers EL for ECMP hashing (if applicable), but pops [ELI+EL] along with the transport label.
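A rough sketch of the label stack operations described above – the label values and the EL derivation are made up; the real EL is computed from a flow hash at the ingress PE:

ELI = 7   # Entropy Label Indicator, reserved label value (RFC 6790)

def push_with_entropy(transport_label, vpn_label, flow):
    """Build the stack the ingress PE imposes: transport, then ELI + EL, then the VPN label."""
    entropy_label = 16 + (hash(flow) % (2**20 - 16))   # stand-in for a real flow hash
    return [transport_label, ELI, entropy_label, vpn_label]

def penultimate_hop(stack):
    """The penultimate LSR hashes on the EL, then pops transport + ELI + EL."""
    transport, eli, el, *rest = stack
    assert eli == ELI
    chosen_link = el % 2            # stand-in for the real ECMP hash over the EL
    return rest, chosen_link

stack = push_with_entropy(transport_label=16004, vpn_label=116385,
                          flow=("10.0.0.1", "10.0.0.2", 6, 40000, 443))
print(stack)                  # [16004, 7, <EL>, 116385]
print(penultimate_hop(stack)) # ([116385], 0 or 1) -> only the VPN label reaches the egress PE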
With hierarchical LSPs (e.g. LDP over RSVP or BGP-LU over LDP), routers can insert multiple [ELI + EL] pairs. The label stack growth won’t break ECMP, since in either case the ELI sits just below the top transport label, so even an LSR that cannot look very deep inside the packet can still hash properly. However, MTU can become a problem with huge label stacks; also, besides ECMP, there is other functionality that relies on MPLS payload recognition (the 4/6 nibble) – for example ICMP error generation, which is required for things like PMTUD and traceroute.
If a transit LSR does not support the EL, it will just ignore it and fall back to the 4/6 nibble match for ECMP hashing.
Entropy label and SR-TE
Segment Routing Traffic Engineering (SR-TE) is stateless: the tunnel headend pushes a label stack to steer traffic via the desired path. This is different from the hierarchical LSPs I mentioned above (strictly speaking, there is no concept of an LSP in Segment Routing).
The ingress LSR can push a lot of labels to steer traffic over the desired path.
Fig. 13
In this topology, R1 pushes 5 transport labels to steer traffic with SR-TE (assume very imbalanced link metrics). The scenario is of course made up, but the problem does occur in scaled deployments. R2 needs to look 7 labels deep to find the entropy label and do proper ECMP hashing; the hardware might not support looking that deep. Attempting to solve the problem by inserting multiple [ELI + EL] pairs only aggravates the Maximum SID Depth (MSD) problem.
So they came up with the Entropy Readable Label Depth (ERLD) – SR routers signal how many labels deep they can look to find the entropy label. Based on that information, the controller may decide not to program the SR-TE headend to push too many labels.
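A sketch of the kind of check a controller could make; the function and the numbers are illustrative, not from any actual controller:

def el_usable(transport_labels: int, erld: int) -> bool:
    """Would a transit LSR with the given ERLD still see the EL if
    [ELI, EL] is inserted right below the transport label stack?

    The EL then sits at position transport_labels + 2 from the top.
    """
    return transport_labels + 2 <= erld

# R1 pushes 5 transport SIDs; say R2 advertises an ERLD of 6.
print(el_usable(transport_labels=5, erld=6))   # False -> shorten the SID list or skip the EL
print(el_usable(transport_labels=3, erld=6))   # True  -> EL will be seen and hashed on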
See also RFC8662 and draft-ietf-isis-mpls-elc
Conclusion
Generally speaking, for vendors with heavy presence in Data Center/Cloud networks, ECMP is their bread and butter so it works very well. It is common for modern DC switches to support 128-way ECMP – which is more than the number of ports a typical switch has!
MPLS used to be bad with ECMP in the past, but this has changed with the advent of FAT, Entropy Labels and Segment Routing.
Newer encapsulations are ECMP-aware from day one. IPv6 hosts can use Flow Label, VXLAN and GUE use varied source UDP ports, NVGRE has the FlowID field.
There is also Unequal Cost Multi-Path (UCMP) – a relatively rarely used feature where traffic is shared in proportions defined by routing policies. Guess how it is implemented in hardware: if I want to share traffic across 2 links in a 1:3 proportion, there will be 1 FIB entry pointing to the first link and 3 FIB entries pointing to the second link, and then ECMP across those 4 entries. With larger numbers that have no common divisor, this consumes even more FIB entries.
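A quick sketch of that replication trick (pure Python, just to show the proportions):

import hashlib
from collections import Counter

def build_ucmp_fib(weights):
    """Expand {link: weight} into replicated FIB entries, the way UCMP is
    commonly realized on top of plain ECMP hardware."""
    fib = []
    for link, weight in weights.items():
        fib.extend([link] * weight)
    return fib

def pick(fib, flow):
    h = int.from_bytes(hashlib.md5(repr(flow).encode()).digest()[:4], "big")
    return fib[h % len(fib)]

fib = build_ucmp_fib({"link1": 1, "link2": 3})   # 1:3 sharing -> 4 FIB entries
print(fib)                                        # ['link1', 'link2', 'link2', 'link2']

flows = [("10.0.0.%d" % i, "192.168.0.1", 6, 10000 + i, 443) for i in range(10000)]
print(Counter(pick(fib, f) for f in flows))       # roughly a 1:3 split across the two links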
References
- Dynamic Flooding on Dense Graphs https://tools.ietf.org/html/draft-ietf-lsr-dynamic-flooding-04
- Advertisement of Multiple Paths in BGP https://tools.ietf.org/html/rfc7911
- BGP Multipath in Inter-AS Option-B https://tools.ietf.org/html/draft-mohanty-bess-mutipath-interas-01
- IPv6 Flow Label Specification https://tools.ietf.org/html/rfc6437
- Celebrating ECMP in Linux — part two https://cumulusnetworks.com/blog/celebrating-ecmp-part-two/
- Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks https://tools.ietf.org/html/rfc7348
- Generic UDP Encapsulation https://tools.ietf.org/html/draft-ietf-intarea-gue-09
- NVGRE: Network Virtualization Using Generic Routing Encapsulation https://tools.ietf.org/html/rfc7637
- Forwarding issues related to MACs starting with a 4 or a 6 https://seclists.org/nanog/2016/Dec/29
- Avoiding Equal Cost Multipath Treatment in MPLS Networks https://tools.ietf.org/html/bcp128
- Flow-Aware Transport of Pseudowires over an MPLS Packet Switched Network https://tools.ietf.org/html/rfc6391
- The Use of Entropy Labels in MPLS Forwarding https://tools.ietf.org/html/rfc6790
- Entropy Label for Source Packet Routing in Networking (SPRING) Tunnels https://tools.ietf.org/html/rfc8662
- Signaling Entropy Label Capability and Entropy Readable Label Depth Using IS-IS https://tools.ietf.org/html/draft-ietf-isis-mpls-elc-11
- Segregated Routing https://routingcraft.net/segregated-routing/
Notes
- [1] Assuming all links are L3, of course. In L2 networks, there are MC-LAG technologies (like MLAG or vPC) to terminate a port-channel on multiple switches.
- [2] The Complete ISIS Routing Protocol (ISBN-13: 978-1852338220), chapter 10.2.3.
- [3] While load sharing based on session ID seems the most reliable, it sometimes yields non-obvious results. For example, when you run “ping”, from the host’s standpoint it’s one constant stream of packets, but a firewall might identify each pair of ICMP echo request/reply as a separate session, which can be sent via a different path.
- [4] Although there is a different kind of cache post-3.6: not for every single destination IP, but only for IPs that triggered PMTUD, IP redirects, etc.