Packet duplication wastes bandwidth and can lead to significant network performance degradation or even outages.
In multicast routing, packets are replicated by the network, so there is always a fundamental risk of duplicate traffic; special safeguards exist in multicast routing to prevent it.
In unicast routing there is normally no replication, so duplicates cannot occur. In L2 switching, unknown unicast frames are replicated by switches, and duplicates can occur during an L2 loop – but a loop is already a far more serious problem, and duplicate traffic is only one of its symptoms.
New overlay cloud designs bring duplicates to unicast routing! This is truly the innovation we deserved.
Unicast
If an Ethernet switch hasn’t learnt the destination MAC of an Ethernet frame, it floods that frame to all ports in the respective vlan. This is known as unknown unicast flooding. Sometimes, switches start flooding all unicast traffic like unknown unicast. This usually happens in case of (a) CAM overflow attacks; (b) hash collisions – used to be common with old cheap switches.
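For illustration, here is a toy model of that forwarding decision (a Python sketch, not how any switch ASIC actually works – the port names and MAC table are made up):

# A toy model of the L2 forwarding decision, assuming a per-VLAN MAC table.
def forward(mac_table, vlan_ports, dst_mac, in_port):
    out_port = mac_table.get(dst_mac)
    if out_port is not None:
        return [out_port]                               # known unicast: single port
    return [p for p in vlan_ports if p != in_port]      # unknown unicast: flood the vlan

print(forward({"50:13:00:1f:b7:44": "Eth3"}, ["Eth1", "Eth2", "Eth3"], "50:13:00:1f:b7:44", "Eth1"))
# ['Eth3']
print(forward({}, ["Eth1", "Eth2", "Eth3"], "50:13:00:1f:b7:44", "Eth1"))
# ['Eth2', 'Eth3'] - flooded, but still only one copy per port, so no duplicates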
Flooding all unicast traffic is bad, but it doesn’t cause duplicates (unless there is a loop). In the old days, L2 loops were catastrophic failures rendering the network inoperable. Since then, switch architecture has evolved a lot: storm control, loop detection, control-plane policing and much more powerful CPUs are found in modern switches. So a broadcast storm is of course still a problem that will lead to significant network performance degradation, but it is no longer a catastrophic failure. The switches will still be accessible, and some traffic will still flow. A lot of things will alert about the problem – but if you don’t read logs, turn off all monitoring and ignore user complaints (e.g. by outsourcing your helpdesk), everything will look fine.
EVPN
Ethernet VPN is the most hyped routing technology today. To be fair, it is being deployed for many valid reasons. It was originally designed for ISPs and used MPLS encapsulation to provide layer 2 VPN service, thereby solving some problems with VPLS like poor scalability or lack of active/active multihoming.
Since then, EVPN has been extended to use different encapsulations (the most popular of which is VXLAN), routing in the overlay, and more recently even multicast.
EVPN is sometimes called “MAC routing”, because MAC addresses are advertised in BGP, as opposed to the data-plane learning that happens in L2 switching. MAC routing and ARP suppression reduce the amount of BUM flooding across the overlay – but BUM flooding cannot be eliminated completely, as sometimes it is the only way to deliver packets to end hosts.
Note: I use Arista EOS for EVPN simulations. The problems described below apply to all implementations.
MPLS-EVPN dual-homed CE
Active/active multihoming1 is probably the biggest advantage of EVPN over VPLS. When a CE is connected to multiple PEs, those PEs elect a Designated Forwarder (DF) – the one router that will forward BUM traffic to the CE.
Fig. 1
PE3 and PE4 CE-facing interface config, and an excerpt from the BGP config:
interface Port-Channel10
   switchport trunk allowed vlan 10
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0050:1300:316f:b000:0000
      route-target import 50:13:00:31:6f:b0
   lacp system-id dead.beef.1337
!
router bgp 65003
   router-id 3.3.3.3
   !
   vlan 10
      rd 3.3.3.3:10
      route-target import 65001:10
      route-target export 65001:10
      redistribute learned
With the Ethernet Segment (ES) config, PE3 and PE4 advertise type 1 and type 4 routes – those are specifically used for multihoming. Type 1 (Auto-Discovery) is used for aliasing: the ingress PE sends traffic to the end host not towards the IP address of one egress PE, but towards the Ethernet Segment (identified by ESI), shared across multiple egress PEs. The type 4 (Ethernet Segment) route is used to elect the Designated Forwarder – the only router allowed to forward BUM traffic to the CE. If the PE routers connected to the same ES don’t agree on who the DF is, duplicate traffic will be delivered to the CE.
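RFC7432 section 8.5 defines the default DF election: every PE that advertised an ES route for the segment is placed into a list ordered by IP address, and the PE whose ordinal equals (VLAN mod N) becomes the DF for that VLAN. A minimal Python sketch of the idea (the addresses are the lab’s PE loopbacks; the point is that every PE must compute the same result from the same set of type 4 routes – if they see different sets, they disagree and duplicates follow):

from ipaddress import ip_address

def elect_df(es_peers, vlan):
    # All PEs that advertised an ES (type 4) route for the segment, ordered by
    # IP address; the PE at position (vlan mod N) is the DF for that VLAN.
    ordered = sorted(es_peers, key=ip_address)
    return ordered[vlan % len(ordered)]

# PE3 (3.3.3.3) and PE4 (4.4.4.4) share the Ethernet Segment, VLAN 10:
print(elect_df(["3.3.3.3", "4.4.4.4"], 10))   # 3.3.3.3 - PE3 is the DF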
In this example, PE3 and PE4 agreed that PE3 is the DF.
PE3#sh bgp evpn instance
EVPN instance: VLAN 10
  Route distinguisher: 3.3.3.3:10
  Service interface: VLAN-based
  Local IP address: 3.3.3.3
  Encapsulation type: MPLS
  Label allocation mode: per-instance
  MAC route label: 1040210
  IMET route label: 1042201
  AD route label: 1040210
  Local ethernet segment:
    ESI: 0050:1300:316f:b000:0000
      Interface: Port-Channel10
      Mode: all-active
      State: up
      ESI label: 1045878
      ES-Import RT: 50:13:00:31:6f:b0
      Designated forwarder: 3.3.3.3
      Non-Designated forwarder: 4.4.4.4
In order to identify BUM traffic, EVPN stacks a special MPLS label on it. This is a very useful property of MPLS: a label can mean anything, whatever is signaled by the control plane. Adding new functionality to MPLS means adding new functionality to the control plane – there is no need for new data-plane encapsulations.
PE1 received plenty of routes, but of interest to us are the AD per-ES routes carrying the ESI label.
PE1#sh bgp evpn route-type auto-discovery esi 0050:1300:316f:b000:0000 detail
---
 65101 65003
    3.3.3.3 from 5.5.5.5 (5.5.5.5)
      Origin IGP, metric -, localpref 100, weight 0, valid, external, ECMP, ECMP contributor
      Extended Community: Route-Target-AS:65001:10 TunnelEncap:tunnelTypeMpls EvpnEsiLabel:1045878
      MPLS label: 0
---
 65101 65004
    4.4.4.4 from 5.5.5.5 (5.5.5.5)
      Origin IGP, metric -, localpref 100, weight 0, valid, external, ECMP, ECMP contributor
      Extended Community: Route-Target-AS:65001:10 TunnelEncap:tunnelTypeMpls EvpnEsiLabel:1045878
      MPLS label: 0
Whenever PE1 does ingress replication of a BUM packet, it stacks ESI label2 on those packets, besides the usual service and transport labels.
PE1#show l2Rib output floodset
L2 RIB Output flood set:
Source: Local Dynamic, Local Static, BGP

Vlan     Address             Type      Destination
-------- ------------------- --------- ----------------------------------------
10       0000.0000.0000      All       MPLS BGP LU (3), TEP 3.3.3.3/32, 1042201
                                       MPLS BGP LU (5), TEP 4.4.4.4/32, 1042201
                                       MPLS BGP LU (6), TEP 2.2.2.2/32, 1042201
Note that even though CE1 is multihomed to PE1 and PE2, PE1 replicates BUM packets to all PE routers on the ES – even PE2, which will drop those packets due to the split horizon rule (which also relies on the ESI label). PE3 and PE4 will also receive the replicated packets: the DF (PE3) will forward them to CE2, the non-DF (PE4) will drop them.
Fig. 2
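To summarize the egress behaviour, here is a rough conceptual sketch (not any vendor’s pipeline – the label value is made up) of the two checks applied to a replicated BUM packet:

def deliver_bum_to_ce(received_esi_label, local_es_labels, is_df):
    # Split horizon: the packet carries the ESI label of a segment this PE is
    # itself attached to, so the CE on that segment already got the frame
    # directly from the ingress PE.
    if received_esi_label in local_es_labels:
        return False
    # DF filtering: only the Designated Forwarder delivers BUM to the CE.
    return is_df

print(deliver_bum_to_ce(299776, {299776}, is_df=True))   # False - dropped by split horizon (like PE2)
print(deliver_bum_to_ce(299776, set(), is_df=True))      # True  - delivered by the DF (like PE3)
print(deliver_bum_to_ce(299776, set(), is_df=False))     # False - dropped by the non-DF (like PE4)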
Once PE1 learns the MAC address of CE2 (advertised by PE3 or PE4 into EVPN), it becomes known unicast and no replication is needed anymore. Therefore, no ESI label is used.
All this is quite complicated, but works. Doing L2 active/active multihoming wasn’t that easy! Read RFC7432 for more details.
VXLAN-EVPN
This is the de facto standard design for DC networks right now.
Most VXLAN-EVPN deployments rely on vendor-proprietary MC-LAG technologies (such as Arista MLAG or Cisco vPC) for multihoming. Fortunately, there is not much vendor lock-in here: MLAG runs only between the pair of switches to which hosts are multihomed, and that pair connects to the rest of the network using standard protocols.
This is how a typical DC network looks:
Fig. 3
Leaf-1/2 and Leaf-3/4 are two pairs of switches running MLAG, and hosts are dual-homed to them. Each MLAG cluster advertises a shared anycast IP, so that VXLAN-encapsulated traffic can land on either of the two switches. This way, load sharing is achieved.
A leaf config will look like this:
interface Loopback0
   ip address 1.1.1.1/32
!
interface Loopback1
   description ANYCAST_IP
   ip address 2.2.2.1/32
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10010
!
router bgp 65001
   router-id 1.1.1.1
   neighbor 1.1.1.5 remote-as 65000
   neighbor 1.1.1.5 update-source Loopback0
   neighbor 1.1.1.5 send-community extended
   !
   address-family evpn
      neighbor SPINE_EVPN activate
   !
   vlan 10
      rd 1.1.1.1:10
      route-target import 65001:10
      route-target export 65001:10
      redistribute learned
BGP sessions are terminated on Lo0 (unique IP per switch), while Lo1 (anycast IP shared across a pair of switches) is used as VXLAN source IP.
This design is simple, works well and is the best current practice. Anycast routing also ensures faster convergence after failures: traffic can be rerouted to the other egress leaf switch by the spine before the remote ingress leaf even learns about the failure.
It is safe to say there will be no duplicate traffic, even if some MAC addresses are not propagated and some switches begin to flood traffic as unknown unicast. Nevertheless, there are schlimazels who manage to screw this up by advertising a unique IP from each MLAG switch instead of the anycast IP as the VXLAN source – then unknown unicast replication will result in duplicates. Anyway, this is a major config mistake that people make very rarely.
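A toy illustration of why the anycast VTEP matters for flooded traffic (hypothetical addresses; the assumption is that each flood-list entry receives one copy of an unknown-unicast frame and every switch serving the dual-homed host delivers its copy):

def copies_delivered(flood_list, vteps_serving_host):
    # one copy per flood-list entry; each serving switch hands its copy to the host
    return sum(1 for vtep in flood_list if vtep in vteps_serving_host)

anycast_pair = ["3.3.3.34"]             # Leaf-3/4 behind one shared anycast VTEP IP
unique_pair  = ["3.3.3.3", "4.4.4.4"]   # Leaf-3/4 each advertising its own VTEP IP

print(copies_delivered(anycast_pair, set(anycast_pair)))  # 1 - no duplicates
print(copies_delivered(unique_pair, set(unique_pair)))    # 2 - the host receives the frame twice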
However, there are disadvantages of this design:
- Vendor proprietary MLAG
- Suboptimal traffic for single-homed hosts, or after link failures
- Multihoming only to 2 switches (EVPN multihoming can work across more than 2 switches)
- Extra ports used for peer link where there is normally no traffic
Not a big deal for most deployments, but for some networks this is not satisfactory, so it is possible to do EVPN multihoming with type 1 and 4 routes, using VXLAN encapsulation.
EVPN multihoming with VXLAN
Fig. 4
Now, instead of MLAG we are using EVPN multihoming. Configs are similar to MPLS examples above, but VXLAN encapsulation is used instead.
interface Port-Channel10
   switchport trunk allowed vlan 10
   switchport mode trunk
   !
   evpn ethernet-segment
      identifier 0050:1300:316f:b000:0000
      route-target import 50:13:00:31:6f:b0
   lacp system-id dead.beef.1337
!
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 10 vni 10010
!
router bgp 65003
   router-id 3.3.3.3
   !
   vlan 10
      rd 3.3.3.3:10
      route-target import 65001:10
      route-target export 65001:10
      redistribute learned
DF election, flood list – all looks really similar to how it was in MPLS.
Leaf3#show bgp evpn instance
EVPN instance: VLAN 10
  Route distinguisher: 3.3.3.3:10
  Service interface: VLAN-based
  Local IP address: 3.3.3.3
  Encapsulation type: VXLAN
  Local ethernet segment:
    ESI: 0050:1300:316f:b000:0000
      Interface: Port-Channel10
      Mode: all-active
      State: up
      ESI label:
      ES-Import RT: 50:13:00:31:6f:b0
      Designated forwarder: 3.3.3.3
      Non-Designated forwarder: 4.4.4.4

Leaf1#show l2Rib output floodset
L2 RIB Output flood set:
Source: Local Dynamic, Local Static, BGP, VXLAN Static

Vlan       Address              Type       Destination
---------- -------------------- ---------- ------------
10         0000.0000.0000       All        VTEP 4.4.4.4
                                           VTEP 2.2.2.2
                                           VTEP 3.3.3.3
In the data plane, instead of the transport label we have the outer IP header, and instead of the service label – the VNI in the VXLAN header. But there is no analog of the ESI label in the VXLAN header. That would be the B bit in VXLAN-GPE [draft-ietf-nvo3-vxlan-gpe] – still a draft, and unlikely to be widely supported anytime soon.
Among BUM packets, broadcast and multicast are easy to detect – if the eighth bit of the destination MAC’s first octet (the group bit) is 1, the usual BUM forwarding rules (DF/non-DF) apply to them. For unknown unicast it’s not that simple, as the same packet can be known unicast to one switch and unknown to another. RFC8365#section-8.3.3 mentions this can lead to transient duplicate traffic during convergence.
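For reference, a small Python check of that group bit (the destination MACs below are just examples):

def is_broadcast_or_multicast(dst_mac: str) -> bool:
    # The group (I/G) bit is the least significant bit of the first octet,
    # i.e. the "eighth bit" when the octet is written MSB-first.
    return bool(int(dst_mac.split(":")[0], 16) & 0x01)

print(is_broadcast_or_multicast("ff:ff:ff:ff:ff:ff"))   # True  - broadcast
print(is_broadcast_or_multicast("01:00:5e:01:01:01"))   # True  - IP multicast
print(is_broadcast_or_multicast("50:13:00:1f:b7:44"))   # False - unicast, known or unknown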
Even with MLAG, it can happen that a MAC address is not learnt over EVPN and a switch begins to flood traffic as unknown unicast. This typically occurs because of MAC flaps3, problems with BGP route propagation or VRF import, misconfiguration, hardware scale limitations etc. From the platform standpoint, there is typically a separate pipeline for replication – which perhaps has lower performance than the unicast forwarding pipeline, but can still handle a lot of traffic.
With active/active multihoming, if the ingress PE begins to flood traffic as unknown unicast due to one of the above issues, the egress PEs will treat those as regular unicast packets, so both DF and non-DF will forward the traffic to the CE.
Take for example host flap:
Feb 22 15:34:19 Leaf-2 Bgp: %EVPN-3-BLACKLISTED_DUPLICATE_MAC: MAC address 50:13:00:1f:b7:44 on VLAN 10 has been blacklisted for moving 5 or more times within the past 180 seconds
Leaf-3 and Leaf-4 will flood traffic towards Host-1 as unknown unicast, but for Leaf-1 and Leaf-2 it is known unicast.
Fig. 5
Now there is persistent duplicate traffic.
[admin@Host-2 ~]$ ping 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=45.1 ms
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=47.5 ms (DUP!)
64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=29.7 ms
64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=32.2 ms (DUP!)
64 bytes from 192.168.0.1: icmp_seq=3 ttl=64 time=54.9 ms
64 bytes from 192.168.0.1: icmp_seq=3 ttl=64 time=57.3 ms (DUP!)
While the original design of RFC7432 was to leave flapped MACs blacklisted until human intervention (extra flooding was deemed less of a problem than BGP update churn), the best current practice seems to be shifting towards clearing blacklisted MACs automatically after a while. The following config clears blacklisted MACs 5 minutes after a MAC flap is detected:
event-handler MAC_FLAP
   action bash FastCli -p 15 -c "clear bgp evpn host-flap"
   delay 300
   !
   trigger on-logging
      regex EVPN-3-BLACKLISTED_DUPLICATE_MAC
This is not a universal recommendation yet, so it applies only to setups susceptible to this sort of problem (e.g. running server clusters that sometimes go split-brain). An alternative solution suggested by some vendors is to drop traffic to blacklisted MACs (see the EANTC interop test white paper 2020, page 7). I think this is a really bad idea for most DC networks, but it perhaps makes sense for IXPs.
It will perhaps take a few years until widespread adoption of VXLAN-EVPN multihoming; then it will become clear what to do. As of today, most deployments are MLAG-based.
Multicast
The basic PIM mechanisms related to duplicate traffic prevention are:
- RPF check – a packet is accepted only from the interface to which the RPF route (usually the unicast route) to the source points.
- DR election – on the receiver’s subnet, only one router per vlan is elected as PIM DR, so it is the only one sending PIM joins upstream and, therefore, forwarding traffic downstream to the receiver vlan.
- Assert mechanism – if a router receives a multicast packet on an interface which is in the OIL for that group, it triggers Assert, so that all routers on the subnet choose a single router that will send multicast for that group. In simple multicast designs the assert mechanism never triggers and remains just a safeguard; there is a good chance that if assert ever triggers, something is wrong. In some cases, though, assert is actually part of the design.
If there are multiple routers on the receiver’s subnet and they fail to establish PIM adjacency (e.g. interfaces configured as PIM passive, or a neighbour filter), they will neither elect a DR nor be able to use assert. Duplicates are guaranteed in this situation.
Assert
If there is a transit network with more than 2 PIM routers, in some cases assert is the only thing that can prevent duplicate traffic.
Fig. 6
The first hop router (R1) is connected to R2 and R3. R4 happens to have its RPF route to the source pointing to R2, while R5 has its RPF route pointing to R3. At some point, R2 and R3 will both send traffic to the LAN. Once they receive copies of each other’s packets, assert is triggered and only the winner (e.g. R2) keeps sending traffic to the LAN.
It’s important to understand this basic assert scenario, since the same problem reoccurs in more complex MVPN designs.
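A simplified sketch of how the assert winner is chosen (per RFC7761: lower administrative preference wins, then lower metric, then the higher IP address breaks the tie – the metrics below are invented for the figure 6 scenario):

from ipaddress import ip_address

def assert_winner(candidates):
    # candidates: (admin preference, route metric, router IP) per router.
    # Lower preference wins, then lower metric, then higher IP address.
    return min(candidates, key=lambda c: (c[0], c[1], -int(ip_address(c[2]))))

print(assert_winner([
    (110, 20, "10.0.0.2"),   # R2: e.g. OSPF route to the source, metric 20
    (110, 30, "10.0.0.3"),   # R3: e.g. OSPF route to the source, metric 30
]))                          # (110, 20, '10.0.0.2') - R2 keeps forwarding, R3 prunes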
MVPN
In poorly designed MVPN networks, duplicate packets flourish like COVID-19 at religious ceremonies.
Multicast VPN is a service providing multicast routing in L3VPN. PE routers establish PIM adjacencies with CEs in a VRF4, and then have to somehow exchange multicast control plane information between PEs. There are 4 ways of doing this:
- Static configuration. Nuff said5.
- In-band signaling (PIM messages are translated into mLDP); data flows over the same mLDP LSP.
- Emulate a LAN (a multipoint tunnel) per VRF and exchange PIM messages over it. The same tunnel can be reused to send data traffic.
- Translate PIM messages into BGP. A tunnel is then still needed to forward data traffic.
Options (3) and (4) are interesting because emulating a LAN (in the form of an mGRE tunnel or a multipoint MPLS LSP) can lead to the scenario depicted in figure 6.
RFC6513#section-9 proposes 3 ways of preventing duplicate traffic in MVPN:
- Discarding data packets received from the “wrong” PE
- Single Forwarder Selection
- Native PIM methods
There are 2 articles written by Luc De Ghein – covering PIM assert over Data MDT (native PIM methods), and strict RPF check (discarding data packets from the “wrong” PE)
Dual-Homed Source and Data MDT in mVPN https://www.cisco.com/c/en/us/support/docs/ip/multicast/118986-technote-mvpn-00.html
Strict RPF Check for mVPN https://www.cisco.com/c/en/us/support/docs/ip/multicast/118677-technote-mvpn-00.html
Not all of these methods are applicable to every MVPN profile, and there are further caveats associated with each method. With PIM signaling, it is sometimes possible to use Assert, like on any LAN. With BGP signaling there is no Assert, so another way of preventing duplicates is needed. SPT switchover also works differently with BGP signaling.
Note: I use Cisco IOS for MVPN simulations. The problems described below apply to all implementations.
PIM signaling and data MDT
One way to exchange PIM control plane information between PEs is to emulate a LAN in the overlay – by setting up a multipoint GRE or mLDP tunnel which acts as a LAN for the VRF. All PE routers see each other as PIM neighbours on this LAN, and all PIM signaling happens there. This LAN is called the default MDT, or Inclusive PMSI.
With only the default MDT, all PE routers receive all multicast data traffic, regardless of whether they have receivers for those groups or not. Therefore, data traffic is usually moved to a data MDT (or Selective PMSI) – a different multicast group or P2MP MPLS tunnel in the core, to which only the PEs interested in that traffic subscribe. The default MDT is then used only for control plane signaling, or perhaps for some groups with a very low amount of traffic.
The scenario described below applies to any MVPN profile with I-PMSI (default MDT) and PIM signaling. I use Rosen MVPN with GRE encapsulation for this simulation.
The topology below is really similar to figure 6, but instead of LAN there is default MDT:
Fig. 7
On the PE routers, the interfaces connected to CEs and receivers are in a VRF, and all PEs see each other as PIM neighbours on the Tunnel interface which runs over the default MDT. To avoid sending traffic to R-PEs not interested in it, the S-PE will move data traffic to a data MDT to which only the interested R-PEs subscribe.
Excerpt of PE1 config:
vrf definition one
 rd 1.1.1.1:1
 !
 address-family ipv4
  mdt default 239.1.1.1
  mdt data 239.10.10.0 0.0.0.255
  route-target export 1:1
  route-target import 1:1
 exit-address-family
PE3 and PE4 have RPF routes to the source in VRF pointing to different S-PE (PE1 and PE2 respectively).
When the source begins to send traffic to which receivers are subscribed, the following sequence of events occurs:
- SPT switchover on the R-PEs: PE3 and PE4 send (S,G) joins for the C-group on the default MDT to their respective RPF neighbors (PE1 and PE2).
- Both PE1 and PE2 begin to forward traffic onto the default MDT.
- PE1 and PE2 see each other’s traffic (because all routers are subscribed to the default MDT); this triggers Assert, which PE1 wins.
- PE1 sends an MDT join TLV, signaling that it will switch traffic for the given C-group to a data MDT in 3 seconds.
- The R-PEs interested in the traffic join the data MDT, and traffic is forwarded on the data MDT.
During steps 2-3 there is a brief moment of duplicate traffic, but after that everything works fine.
PE1 is the assert winner for C-group 233.1.1.1:
PE1#sh ip mro vrf one 233.1.1.1
(*, 233.1.1.1), 00:05:45/00:02:39, RP 7.7.7.7, flags: S
  Incoming interface: Ethernet1/0, RPF nbr 172.16.0.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:05:45/00:02:39

(172.16.4.8, 233.1.1.1), 00:00:51/00:02:08, flags: Ty
  Incoming interface: Ethernet1/0, RPF nbr 172.16.0.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:00:51/00:02:41, A
PE4, despite having PE2 as its RPF neighbour for this (S,G), will send PIM joins to PE1 (1.1.1.1). The actual RPF neighbour has changed because of assert.
PE4#sh ip mro vrf one 233.1.1.1
(*, 233.1.1.1), 00:06:04/stopped, RP 7.7.7.7, flags: SJC
  Incoming interface: Tunnel2, RPF nbr 2.2.2.2
  Outgoing interface list:
    Ethernet1/0, Forward/Sparse, 00:06:04/00:02:56

(172.16.4.8, 233.1.1.1), 00:01:10/00:01:49, flags: TY
  Incoming interface: Tunnel2, RPF nbr 1.1.1.1*, MDT:239.10.10.0/00:02:51
  Outgoing interface list:
    Ethernet1/0, Forward/Sparse, 00:01:10/00:02:56
If for some reason (e.g. a brief period of packet loss) the data MDT switchover happens before assert, this becomes a problem: each S-PE will start its own data MDT, which all R-PEs will join, thus leading to duplicate traffic.
Fig. 8
If step (3) from the list above was missed, PE3 joins the data MDT of PE1 and PE4 joins the data MDT of PE2, and now we get duplicate traffic.
PE1#sh ip mro vrf one 233.1.1.1
(*, 233.1.1.1), 00:01:36/00:02:53, RP 7.7.7.7, flags: S
  Incoming interface: Ethernet1/0, RPF nbr 172.16.0.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:01:36/00:02:53

(172.16.4.8, 233.1.1.1), 00:01:36/00:01:23, flags: Ty
  Incoming interface: Ethernet1/0, RPF nbr 172.16.0.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:01:36/00:02:53

PE2#sh ip mro vrf one 233.1.1.1
(*, 233.1.1.1), 00:01:43/00:02:46, RP 7.7.7.7, flags: S
  Incoming interface: Ethernet1/1, RPF nbr 172.16.1.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:01:42/00:02:46

(172.16.4.8, 233.1.1.1), 00:01:43/00:01:16, flags: Ty
  Incoming interface: Ethernet1/1, RPF nbr 172.16.1.7
  Outgoing interface list:
    Tunnel2, Forward/Sparse, 00:01:43/00:02:46
To prevent this from happening, PE routers join each other’s data MDTs, so that they are able to use assert on the data MDT instead of the default MDT. The rule is: if an S-PE has a C-mroute with valid RPF and the Tunnel in the OIL, and that mroute has been switched to a data MDT, then the S-PE will also join the data MDT of any other S-PE for the same mroute. In the example from figure 7 this didn’t happen, because thanks to assert on the default MDT only one S-PE sends traffic to the data MDT.
In the current state, both PE1 and PE2 have mroutes on data MDTs with the Tunnel in the OIL. Therefore, PE1 joins the data MDT of PE2 and PE2 joins the data MDT of PE1, and then they can run assert on the C-mroute.
In this example, PE2 wins assert. Since PE1 no longer satisfies the above condition (the Tunnel is no longer in the OIL after losing assert), it also prunes the data MDT of PE2.
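A sketch of that rule with made-up state objects (this is just the logic as described above, not an actual implementation):

def should_join_peer_data_mdt(c_mroute):
    # An S-PE joins another S-PE's data MDT for a C-mroute only if it is
    # itself actively forwarding that C-mroute onto a data MDT.
    return (c_mroute["rpf_valid"]
            and c_mroute["tunnel_in_oil"]
            and c_mroute["switched_to_data_mdt"])

pe1 = {"rpf_valid": True, "tunnel_in_oil": True, "switched_to_data_mdt": True}
print(should_join_peer_data_mdt(pe1))   # True - PE1 joins PE2's data MDT, assert can run there

pe1["tunnel_in_oil"] = False            # PE1 lost assert, Tunnel removed from the OIL
print(should_join_peer_data_mdt(pe1))   # False - PE1 also prunes PE2's data MDT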
PE1#sh ip mro 239.10.10.0
(*, 239.10.10.0), 00:02:44/stopped, RP 5.5.5.5, flags: SPF
  Incoming interface: Ethernet0/1, RPF nbr 10.0.0.5
  Outgoing interface list: Null

(2.2.2.2, 239.10.10.0), 00:00:14/00:02:45, flags: PJT
  Incoming interface: Ethernet0/1, RPF nbr 10.0.0.5
  Outgoing interface list: Null

(1.1.1.1, 239.10.10.0), 00:01:10/00:02:23, flags: FT
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    Ethernet0/1, Forward/Sparse, 00:01:10/00:03:20
After a while, the pruned MDT entry will disappear. If for some reason the S-PEs are not able to join each other’s data MDTs, this can lead to persistent duplicates. See also https://www.cisco.com/c/en/us/support/docs/ip/multicast/118986-technote-mvpn-00.html
BGP signaling and no asserts on default MDT
The default MDT is effectively a LAN with all PE routers, so the number of PIM neighbourships grows as O(n^2); besides, PIM is a soft-state protocol which sends Join/Prune messages every minute even when nothing changes – therefore PIM signaling doesn’t scale well in large MVPN networks. BGP signaling improves scalability: BGP updates are sent only when something changes, and the number of sessions can be reduced by using route reflectors. The information from PIM messages is translated into BGP MVPN updates (the MCAST-VPN address family) and advertised to remote PEs.
There is no one-to-one mapping between PIM message types and MVPN route types. For instance, there is no analog of the (S,G,RPT) Join/Prune and no Assert. On the other hand, there are Source Active routes, so BGP MVPN replaces not only PIM but also MSDP. This means asserts won’t work, and a different duplicate prevention mechanism is needed.
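A back-of-the-envelope comparison, assuming a full mesh of PIM neighbours on the default MDT versus one BGP session per PE towards a pair of route reflectors:

def pim_adjacencies(n_pe):
    # full mesh of PIM neighbours on the default MDT, refreshed every minute
    return n_pe * (n_pe - 1) // 2

def bgp_sessions(n_pe, n_rr=2):
    # one session per PE towards each route reflector, updates only on change
    return n_pe * n_rr

print(pim_adjacencies(100))   # 4950
print(bgp_sessions(100))      # 200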
What is described below applies to any PMSI type when using BGP signaling. Topology and configs are similar to the previous example, but BGP signaling is used instead of PIM.
Sample config:
vrf definition one
 rd 1.1.1.1:1
 !
 address-family ipv4
  mdt auto-discovery pim
  mdt default 239.1.1.1
  mdt data 239.10.10.0 0.0.0.255
  mdt overlay use-bgp
  route-target export 1:1
  route-target import 1:1
 exit-address-family
SA routes
Fig. 9
In normal PIM ASM, the Last Hop Router (LHR) first signals a shared tree towards the RP, and after receiving traffic it can decide whether to switch to the source tree, depending on configuration. In MVPN with BGP signaling, if the source and the RP are connected to different PEs (as in figure 9), the LHR (or the R-PE connected to the LHR) doesn’t have to signal a shared tree. Once the FHR (PE2) registers the traffic on the C-RP (which is not MVPN-aware), it generates Source Active (type 5) routes. BGP distributes those routes to all other PEs, so that the R-PEs (PE3 and PE4) can join the SPT right away.
Now, what if PE3 cannot switch over to the SPT and traffic stays on the default MDT?
PE3(config)#ip pim vrf one spt-threshold infinity
PE1 and PE2:
PE1(config-vrf-af)#mdt data threshold 5
PE3 sends a (*,G) join (MVPN route type 6) towards PE1, and PE4 sends an (S,G) join (MVPN route type 7) towards PE2, as shown in figure 10.
Fig. 10
PE3#sh bgp ipv4 mvpn all route-type 6 1.1.1.1:1 100 7.7.7.7 233.1.1.1
BGP routing table entry for [6][1.1.1.1:1][100][7.7.7.7/32][233.1.1.1/32]/22, version 221
Paths: (1 available, best #1, table MVPNv4-BGP-Table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (3.3.3.3)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: RT:1.1.1.1:1
      rx pathid: 1, tx pathid: 0x0

PE4#sh bgp ipv4 mvpn all route-type 7 2.2.2.2:1 100 172.16.4.8 233.1.1.1
BGP routing table entry for [7][2.2.2.2:1][100][172.16.4.8/32][233.1.1.1/32]/22, version 232
Paths: (1 available, best #1, table MVPNv4-BGP-Table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (4.4.4.4)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: RT:2.2.2.2:1
      rx pathid: 1, tx pathid: 0x0
All traffic forwarding happens on the default MDT, but assert is not possible. Duplicates here are prevented by SA routes: if an S-PE with a (*,G) entry (PE1 in this example) receives an SA route for this group, it removes the tunnel from the OIL (and if the OIL becomes empty, it also sends an (S,G,RPT) prune upstream).
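The same logic as a small sketch (hypothetical state, just to make the rule explicit):

def star_g_oil_after_sa(oil, tunnel_if, sa_received):
    # An S-PE forwarding on the shared tree stops as soon as a Source Active
    # (type 5) route for the (S,G) arrives from another PE.
    if sa_received and tunnel_if in oil:
        oil = [i for i in oil if i != tunnel_if]
    return oil

print(star_g_oil_after_sa(["Tunnel2"], "Tunnel2", sa_received=True))    # [] - also prune (S,G,RPT) upstream
print(star_g_oil_after_sa(["Tunnel2"], "Tunnel2", sa_received=False))   # ['Tunnel2'] - keeps forwarding, duplicates possible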
Debugs on PE1:
*Feb 15 23:05:38.392: PIM(1): Prune-list: (172.16.4.8/32, 233.1.1.1) RPT-bit set BGP C-Route
*Feb 15 23:05:38.392: PIM(1): Prune Tunnel2/239.1.1.1 from (172.16.4.8/32, 233.1.1.1)
*Feb 15 23:05:38.392: PIM(1): Insert (172.16.4.8,233.1.1.1) sgr prune in nbr 172.16.0.7's queue - deleted
*Feb 15 23:05:38.392: PIM(1): Building Join/Prune packet for nbr 172.16.0.7
*Feb 15 23:05:38.392: PIM(1): Adding v2 (172.16.4.8/32, 233.1.1.1), RPT-bit, S-bit Prune
*Feb 15 23:05:38.392: PIM(1): Send v2 join/prune to 172.16.0.7 (Ethernet1/0)
Upstream multicast hop selection
What happens if the S-CE is dual-homed, like in the scenarios reviewed earlier?
In a scenario like the one in figure 7, with PE3 choosing PE1 and PE4 choosing PE2 as their RPF neighbours, there will be persistent duplicates because no assert is possible. One solution is to configure Upstream Multicast Hop (UMH) selection to force all R-PEs to choose the same single S-PE as the RPF neighbour.
Config added to PE3 and PE4:
router bgp 100
 address-family ipv4 vrf one
  mvpn single-forwarder-selection highest-ip-address
Now for C-mroutes in VRF one they will choose the S-PE with highest IP as RPF neighbour, regardless of RPF route metrics. In this case they prefer PE2.
Fig. 11
Even though PE3 prefers the route to source IP 172.16.4.8 via PE1 (1.1.1.1), the RPF neighbour will be PE2 (2.2.2.2)
PE3#sh ip ro vrf one | in 172.16.4
B        172.16.4.0/24 [200/0] via 1.1.1.1, 00:29:47

PE3#sh ip rpf vrf one 172.16.4.8
RPF information for ? (172.16.4.8)
  RPF interface: Tunnel2
  RPF neighbor: ? (2.2.2.2)
  RPF route/mask: 172.16.4.0/24
  RPF type: multicast (bgp 100)
  Doing distance-preferred lookups across tables
  BGP originator: 1.1.1.1
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base
PE1 in this topology will not forward any traffic. RFC6513#section-5.1 describes several other UMH selection procedures that enable load balancing across S-PEs. One option is hashing on (S,G): if all R-PEs use the same algorithm, they will select the same UMH for each given (S,G), so there will be no duplicates. The last option is to trust the metrics and let each R-PE select its UMH independently – this can result in persistent duplicate traffic in some topologies (e.g. figure 11) unless another precaution is taken, such as the strict RPF check reviewed below.
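A rough sketch of the first two options (the hash below is an arbitrary illustrative one, not the exact procedure from the RFC – what matters is that all R-PEs run the same deterministic function over the same candidate list):

import hashlib
from ipaddress import ip_address

def umh_highest_ip(candidate_pes):
    return max(candidate_pes, key=ip_address)

def umh_hash(candidate_pes, c_source, c_group):
    ordered = sorted(candidate_pes, key=ip_address)
    digest = hashlib.sha256(f"{c_source}/{c_group}".encode()).digest()
    return ordered[digest[0] % len(ordered)]            # deterministic per (S,G)

pes = ["1.1.1.1", "2.2.2.2"]
print(umh_highest_ip(pes))                       # 2.2.2.2 on every R-PE
print(umh_hash(pes, "172.16.4.8", "233.1.1.1"))  # the same PE on every R-PE, whichever it is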
Partitioned MDT
A few sections above I wrote that Selective PMSI means the same thing as data MDT – an optimization to offload traffic from the default MDT so that only the interested PEs receive it. There are also MVPN profiles which use only S-PMSI, called Partitioned MDT. Sometimes it is called Multidirectional Selective PMSI, because in some cases traffic can be injected into the MVPN by a PE other than the MDT root.
Fig. 12
This is the same as all topologies above, just PE config is different.
vrf definition one
 rd 1.1.1.1:1
 !
 address-family ipv4
  mdt auto-discovery mldp
  mdt partitioned mldp p2mp
  mdt overlay use-bgp
  route-target export 1:1
  route-target import 1:1
 exit-address-family
Instead of a single default MDT, there is now a separate partitioned MDT per S-PE (built on demand, when an R-PE wants to receive traffic from that S-PE). If PE3 prefers PE1 and PE4 prefers PE2, there will be 2 separate S-PMSIs, with separate LSPs signaled by mLDP. This will work fine.
A problem can occur if PE3 also joins the partitioned MDT of PE2, or PE4 joins the partitioned MDT of PE1. This can happen in multiple situations, such as:
- There are receivers for several (S,G) whose RPF routes point to different S-PEs, and the S-PEs are not configured to move traffic to a data MDT
- The RPF routes for (*,G) and (S,G) point to different S-PEs
- An anycast source is used, with servers connected to different S-PEs
For example, suppose PE2 no longer advertises the route to the RP (7.7.7.7). Now the RPF route for (*,G) on PE4 points to PE1, while the RPF route for (S,G) still points to PE2.
PE2(config)#no ip route vrf one 7.7.7.7 255.255.255.255 172.16.1.7

PE4#sh ip mro vrf one 233.1.1.1
(*, 233.1.1.1), 01:48:24/stopped, RP 7.7.7.7, flags: SJCg
  Incoming interface: Lspvif0, RPF nbr 1.1.1.1, Mbgp
  Outgoing interface list:
    Ethernet1/0, Forward/Sparse, 01:48:24/00:02:36

(172.16.4.8, 233.1.1.1), 01:45:23/00:01:50, flags: JTgQ
  Incoming interface: Lspvif0, RPF nbr 2.2.2.2, Mbgp
  Outgoing interface list:
    Ethernet1/0, Forward/Sparse, 01:45:23/00:02:36
As I wrote above, there is no SPT switchover or (S,G,RPT) prune in BGP-based MVPN. R-PEs advertise both (*,G) and (S,G) join routes, and it is up to the S-PE to generate SA (type 5) routes so that the RP-PE stops forwarding traffic, letting the S-PE do that. In this case, PE1 becomes the RP-PE for PE4 and the S-PE for PE3, so it will keep forwarding traffic; and since PE4 now joins the partitioned MDT of PE1, PE4 will get duplicates.
Fig. 13
This is, by the way, not unique to partitioned MDT; the same problem can occur with the default MDT. In this particular scenario even UMH selection won’t help: if I configure PE4 to prefer the S-PE with the highest IP, it will still send (*,G) joins to PE1 (1.1.1.1), as it is the only available RPF neighbor for (*,G), and (S,G) joins to 2.2.2.2, as it has the higher IP.
PE4#sh ip rpf vrf one 7.7.7.7 233.1.1.1 | in RPF neigh|inter
  RPF interface: Lspvif0
  RPF neighbor: ? (1.1.1.1)

PE4#sh ip rpf vrf one 172.16.4.8 233.1.1.1 | in RPF neigh|inter
  RPF interface: Lspvif0
  RPF neighbor: ? (2.2.2.2)
Strict RPF check
The usual RPF check in PIM verifies that the packet has been received on the correct interface, but not from which neighbour it was received. In MVPN, it is possible to check that by looking at the MPLS label.
In the example above, traffic for (*,G) and (S,G) is received on the same interface (Lspvif0), but the neighbours are different. Traffic from different S-PEs can be identified in the data plane by the incoming MPLS label.
PE4#sh mpls mldp database opaque_type gid 65536 | in FEC Root|Local Label|Lspvif
  FEC Root           : 1.1.1.1
    Local Label (D)  : 29
      Next Hop       : 10.3.3.6
      Interface      : Lspvif0
  FEC Root           : 2.2.2.2
    Local Label (D)  : 27
      Next Hop       : 10.3.3.6
      Interface      : Lspvif0

PE4#show mpls for | in 27|29
27    [T]  No Label   [V]  385312   aggregate/one
29    [T]  No Label   [V]  32448    aggregate/one
To avoid duplicates in this scenario, I will configure strict RPF check on PE4.
PE4(config-vrf-af)#mdt strict-rpf interface
PE4(config-vrf-af)#do clear ip bgp *
Now only traffic incoming from Lspvif2 will be accepted.
PE4#sh ip rpf vrf one 172.16.4.8 233.1.1.1 | in RPF neigh|inter
  RPF interface: Lspvif0
  Strict-RPF interface: Lspvif2
  RPF neighbor: ? (2.2.2.2)
And Lspvif2 is identified by incoming MPLS label.
PE4#sh mpls mldp database opaque_type gid 65536 | in FEC Root|Local Label|Lspvif
  FEC Root           : 1.1.1.1
    Local Label (D)  : 26
      Next Hop       : 10.3.3.6
      Interface      : Lspvif1
  FEC Root           : 2.2.2.2
    Local Label (D)  : 27
      Next Hop       : 10.3.3.6
      Interface      : Lspvif2
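Conceptually, strict RPF changes the check from "was this received on the right tunnel interface" to "was this received from the selected upstream PE", keyed by the incoming label. A sketch (the label-to-root mapping is taken from the output above):

LABEL_TO_ROOT = {26: "1.1.1.1", 27: "2.2.2.2"}   # from the mLDP database above

def rpf_accept(incoming_label, selected_umh, strict=True):
    rooted_at = LABEL_TO_ROOT.get(incoming_label)
    if not strict:
        return rooted_at is not None       # plain RPF: any LSP terminating here will do
    return rooted_at == selected_umh       # strict RPF: must be the tree rooted at the UMH

print(rpf_accept(27, "2.2.2.2"))   # True  - the copy from PE2 is accepted
print(rpf_accept(26, "2.2.2.2"))   # False - the copy rooted at PE1 is dropped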
Strict RPF check is the least optimal duplicate prevention method: it does not prevent duplicate traffic in the MVPN, it only protects receivers from getting it – which is still something, but the extra bandwidth utilization is not good. So strict RPF is more of a safeguard.
For strict RPF usage with anycast source, see https://www.cisco.com/c/en/us/support/docs/ip/multicast/118677-technote-mvpn-00.html
EVPN multicast
Pretty much everything that applies to duplicates in L2 unicast also applies to L2 multicast. Unless there is an L2 loop, there will be no duplicates in usual L2 networks – there is no need for things like Assert in PIM. In EVPN, if the PEs to which a CE is multihomed can’t agree on who the DF is (e.g. if they don’t receive each other’s type 4 routes), they will both forward the same stream to the CE – just like with unicast. The unknown unicast problem in VXLAN described above does not apply to multicast, as each PE can identify multicast frames by the 8th bit in the destination MAC being set to 16.
Having said that, one problem specific to L2 multicast is that only the PIM DR forwards traffic to the receiver – if L2 is extended over EVPN, this leads to very suboptimal traffic flows, as I wrote in Barbaric Balancing. So Optimized Inter-Subnet Multicast (OISM) was introduced, whereby SMET routes are sent not per vlan but on a Supplementary Broadcast Domain (SBD) – one route per (*,G) or (S,G) for all vlans – and then the S-PE sends IGMP reports to the mrouter port, based on the data in the received SMET routes. But what if the S-PE has mrouter ports on multiple vlans, like in figure 14?
Fig. 14
Per draft-ietf-bess-evpn-irb-mcast-04#section-6.1.1, in such situations the S-PE MUST NOT reconstruct IGMP reports from SMET routes. This also means no traffic will be forwarded in such scenarios. As of this writing, this is still a very new technology, so maybe something better will come up in the future.
Impact
Multicast mostly uses UDP, so in the end it’s up to the application to deal with duplicate packets. For instance, RTP has sequence numbers, so it can detect duplicates if there are any. However, for some applications duplicate traffic can lead to a critical failure.
As for unicast applications (which mostly use TCP), sequencing saves them from unexpected duplicates, but TCP performance can be severely impacted by duplicate ACKs. TCP normally sends dup ACKs to inform the other side that some segments appear to have been lost and need to be retransmitted. Upon receiving those, the TCP sender not only retransmits the requested segments, but also slows down.
Fig. 15
Picture stolen from TCP/IP Illustrated Vol. 1. In this example, the receiver receives a segment of 1024 bytes with SEQ 1 and sends ACK 1025, but then receives a segment with SEQ 2049 (instead of the expected 1025). The receiver assumes the segment with SEQ 1025 was lost and sends a duplicate ACK 1025, asking the sender to resend the segment starting at SEQ 1025. In practice, this also tells the sender to slow down.
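A toy model of the receiver side makes it obvious how duplicated segments manufacture dup acks (1024-byte segments assumed, sequence numbers as in the figure):

def acks_for(received_seqs, seg_len=1024):
    next_expected, acks = 1, []
    for seq in received_seqs:
        if seq == next_expected:
            next_expected += seg_len
        acks.append(next_expected)          # cumulative ACK, repeated for anything out of order
    return acks

print(acks_for([1, 1025, 2049]))                    # [1025, 2049, 3073] - normal
print(acks_for([1, 1, 1025, 1025, 2049, 2049]))     # [1025, 1025, 2049, 2049, 3073, 3073] - every ACK duplicated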
It’s not difficult to imagine how fake dup acks impact network performance.
Take for example, iperf without dup acks (it’s a virtual lab simulation, hence very low throughput).
Without dup acks
[admin@Host-1 ~]$ iperf -s

[admin@Host-2 ~]$ iperf -c 192.168.0.1
------------------------------------------------------------
Client connecting to 192.168.0.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.2 port 49366 connected with 192.168.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 3]  0.0-10.6 sec   1.75 MBytes  1.38 Mbits/sec
Now the same network, but with duplicates:
[admin@Host-1 ~]$ iperf -s

[admin@Host-2 ~]$ iperf -c 192.168.0.1
------------------------------------------------------------
Client connecting to 192.168.0.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.2 port 49346 connected with 192.168.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 3]  0.0-12.5 sec   768 KBytes   504 Kbits/sec
Conclusion
When designing a network that involves packet replication, there is always a fundamental risk of duplicate traffic arriving at the receiver. It is important to understand how these problems can occur and to take the necessary precautions to detect and mitigate duplicate traffic.
The approaches to duplicate prevention for unicast and multicast are very different. Unknown unicast replication is seen as a problem in its own right: even if it doesn’t cause duplicate traffic, it can degrade switch performance. For multicast, replication is the normal way of handling traffic, but special techniques like PIM assert and MVPN UMH selection are used to prevent duplicates by choosing one forwarder on a transit LAN; in some cases duplicates in the network are inevitable, but a strict RPF check can protect the receivers from getting them.
Besides, L2 loops are still common, and perhaps even more common in modern SDN/NFV networks than in old L2 networks based on spanning tree. This is because bad design can prevail over any benefits of a new technology.
References
- BGP MPLS-Based Ethernet VPN https://tools.ietf.org/html/rfc7432
- A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN) https://tools.ietf.org/html/rfc8365
- Generic Protocol Extension for VXLAN https://tools.ietf.org/html/draft-ietf-nvo3-vxlan-gpe-09
- Arista EVPN Deployment Guide https://www.arista.com/en/solutions/design-guides (Registration required)
- EANTC Multi-Vendor Interoperability Test 2020 http://www.eantc.de/fileadmin/eantc/downloads/events/MPLS2020/EANTC-MPLSSDNNFV2020-WhitePaper.pdf
- Cisco Systems’ Solution for Multicast in BGP/MPLS IP VPNs https://tools.ietf.org/html/rfc6037
- Multicast in MPLS/BGP IP VPNs https://tools.ietf.org/html/rfc6513
- Dual-Homed Source and Data MDT in mVPN https://www.cisco.com/c/en/us/support/docs/ip/multicast/118986-technote-mvpn-00.html
- Strict RPF Check for mVPN https://www.cisco.com/c/en/us/support/docs/ip/multicast/118677-technote-mvpn-00.html
- EVPN Optimized Inter-Subnet Multicast (OISM) Forwarding https://tools.ietf.org/html/draft-ietf-bess-evpn-irb-mcast-04
- Barbaric Balancing – routingcraft.net https://routingcraft.net/barbaric-balancing/
Notes
- ^Not just dual-homing – a CE can be multihomed to more than 2 PE!
- ^In this example, the label values advertised by PE3 and PE4 happen to be the same, but they could differ. What matters is that PE3 and PE4 are on the same ESI.
- ^Per RFC7432#section-15.1, if a MAC moves between different PEs 5 times within 3 minutes, switches should blacklist that MAC address (stop advertising and processing routes for it) until human intervention. This can occur because of transient loops, or some server clusters going split-brain.
- ^PE routers can also have directly connected sources or receivers in VRF.
- ^“SDN” solutions, like Tree SID in SR, also fall into this category.
- ^although the capability of looking inside the VXLAN header depends on hardware.
This is all fine and dandy, but it leaves out the issue of fragmentation in the face of packet duplication. Consider the scenario where a system is connected to redundant CEs in an active-active configuration with a less than optimal configuration of its L2 redundancy signalling. The issue I have a faint recollection of involved some hypervisor and virtual host which may have added to the mismatch, but those details escape me.
The symptom was a hung IP stack in the vhost. Errors hinted at it running out of buffer space. It worked quite well for a while, then boom and only a reboot could solve the problem. It was somewhat predictable, so I could SSH to it and tcpdump when it was about time. What the capture told us was that among the last packets getting through were some duplicated, large and fragmented UDP packets (Kerberos-related, if I recall correctly) where reassembly completed when the first copy of the final fragment arrived. The second copy of that fragment never found a header fragment to reassemble with, so it stayed in its buffer forever. This likely repeated until eventually all the available buffers were gone: voila, hang!
I think that when the link aggregation was changed so it matched between the switch and the system, the duplication (and thus the hangs) went away, but I really don’t remember. The team responsible for virtualization wasn’t too eager to cooperate with us in the network team; in their view we should just deliver the stupid packets and shut up. That they were responsible for one significant part of the network (their vswitches), whose configuration we had legitimate reasons to opine about, took quite a few high-severity incidents whose RCAs told similar stories of unsuitable vswitch configurations to sink in. This problem was exacerbated by them needing their network connections configured one way when installing their hypervisors, and another when deploying VMs in them.
Thanks for the comment! That’s a great combination: (1) EVPN A-A multihoming (therefore a potential for duplicate packets); (2) IP fragmentation (which really everyone hates); (3) UDP (therefore no timeouts on a socket). It would be interesting to see a detailed RCA for this.