Recently I stumbled upon the IETF draft about PIM Designated Router Load Balancing (DRLB) and it reminded me of something absolutely barbaric.
Introduction
In L3 multicast, load balancing is simple. If the RPF route is ECMP, the router can choose a different RPF neighbor per mroute, thereby achieving load balancing. In L2 multicast, one does not simply load balance traffic.
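For illustration, here is a minimal Python sketch of the idea: with an ECMP RPF route, the router can hash each (S,G) onto one of the equal-cost RPF neighbors. The hash is hypothetical; real implementations (e.g. Cisco's "ip multicast multipath") use their own hash functions.

import hashlib

def pick_rpf_neighbor(source: str, group: str, ecmp_neighbors: list[str]) -> str:
    # Hash an (S,G) pair onto one of the equal-cost RPF neighbors.
    # Hypothetical hash, for illustration only.
    digest = hashlib.md5(f"{source},{group}".encode()).digest()
    return ecmp_neighbors[digest[0] % len(ecmp_neighbors)]

# Different (S,G) entries land on different upstream neighbors,
# spreading the load across the ECMP paths.
neighbors = ["10.1.1.3", "10.2.2.3"]
print(pick_rpf_neighbor("172.16.0.1", "239.1.1.1", neighbors))
print(pick_rpf_neighbor("172.16.0.2", "239.1.1.2", neighbors))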
Consider the network on figure 1:
Fig. 1
Receiver is on the same subnet as R1 and R2. The source sends a lot of streams with different (S,G), and that traffic is load balanced across R1 and R2. It’s been working fine for years, but there is one small problem: if either R1 or R2 fails and the other router takes over traffic forwarding, and then the failed router recovers, all traffic still flows via one router. Issuing “clear ip mroute *” on the router forwarding all traffic restores load balancing.
The reader familiar with multicast might wonder: what load balancing am I talking about? If there are multiple routers on the receiver’s LAN (known as Last Hop Routers or LHR), they elect one as PIM DR and it will forward all traffic. It is not possible to do any kind of load balancing in this topology. This was also my initial reaction, but as it turned out, it actually worked.
The 2 PIM mechanisms relevant to this scenario are:
- Designated Router (DR) election: the router with the highest priority, or the highest IP address if priorities tie, becomes the DR; it will be the only router sending PIM Joins upstream and, consequently, forwarding all traffic (a minimal sketch of this tie-break follows the list).
- Assert mechanism: if a router receives multicast traffic on an interface which is in the OIL for that mroute, it sends a PIM Assert message and then compares parameters like AD and metric in the Asserts received from other routers, so that they agree on which router will forward traffic on the LAN. Assert is normally used in scenarios with multiple PIM routers on a transit LAN (as opposed to the receiver’s LAN); in this topology, Assert is just a safeguard which should never trigger.
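A minimal Python sketch of the DR election tie-break (per RFC 7761, assuming every neighbor advertises the DR Priority option; the neighbor list format is mine):

import ipaddress

def elect_dr(neighbors: list[tuple[int, str]]) -> str:
    # Elect the PIM DR among (dr_priority, ip_address) tuples:
    # highest priority wins, ties are broken by highest IP address.
    return max(neighbors, key=lambda n: (n[0], ipaddress.ip_address(n[1])))[1]

# R1 (10.0.0.1) and R2 (10.0.0.2) with equal priority: the higher IP wins.
print(elect_dr([(1, "10.0.0.1"), (1, "10.0.0.2")]))  # -> 10.0.0.2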
Both these mechanisms require the routers on the LAN to be able to communicate with each other over PIM on 224.0.0.13. If that communication is somehow broken (e.g. an access list, a PIM neighbor filter, or the routers being configured as stub), neither DR election nor Assert will work. In that case, the receiver will get duplicate traffic, which is extremely bad and can break the multicast application.
Broken DR, working assert
Now imagine a scenario where PIM DR election is broken, but the Assert mechanism still works. In reality, I saw this happen because of a software bug, but it is possible to intentionally break it in the lab by filtering some PIM message types or packet lengths.
The Assert winner is determined as follows:
1. (S,G) wins over (*,G)
2. Lower administrative distance/route preference wins
3. Lower metric wins
4. Higher IP address wins
The values compared in steps (2) and (3) are taken from the RPF route for the respective mroute entry.
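As an illustration, here is a minimal Python sketch of this comparison; the tuple ordering mirrors the four rules above (the data structure and field names are mine):

import ipaddress
from dataclasses import dataclass

@dataclass
class AssertMetric:
    # Parameters carried in a PIM Assert message (field names are illustrative).
    rpt_bit: int     # 0 for (S,G) asserts, 1 for (*,G) asserts
    preference: int  # administrative distance of the RPF route
    metric: int      # metric of the RPF route
    ip: str          # sender's address on the LAN

def assert_winner(a: AssertMetric, b: AssertMetric) -> AssertMetric:
    # (S,G) beats (*,G), then lower AD wins, then lower metric wins,
    # then the higher IP address breaks the tie.
    def key(m: AssertMetric):
        return (m.rpt_bit, m.preference, m.metric,
                -int(ipaddress.ip_address(m.ip)))
    return min((a, b), key=key)

# R1's static mroute [1/0] beats R2's IS-IS route [115/20]:
r1 = AssertMetric(rpt_bit=0, preference=1, metric=0, ip="10.0.0.1")
r2 = AssertMetric(rpt_bit=0, preference=115, metric=20, ip="10.0.0.2")
print(assert_winner(r1, r2).ip)  # -> 10.0.0.1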
Now take the topology from figure 1: the source is sending 2 streams, (172.16.0.1, 239.1.1.1) and (172.16.0.2, 239.1.1.2), and the receiver is subscribed to both. DR election between R1 and R2 is broken, but they can still do PIM Assert. R1 has a better RPF route for 172.16.0.1, R2 has a better RPF route for 172.16.0.2:
R1(config)#ip mroute 172.16.0.1 255.255.255.255 10.1.1.3
R2(config)#ip mroute 172.16.0.2 255.255.255.255 10.2.2.3
Both R1 and R2 start sending multicast traffic for both groups, but upon seeing each other’s traffic on the receiver’s LAN, they start the Assert procedure.
Debugs on R1:
*Jan 12 17:27:07.394: PIM(0): Received v2 Assert on Ethernet0/0 from 10.0.0.2
*Jan 12 17:27:07.394: PIM(0): Assert metric to source 172.16.0.1 is [115/20]
*Jan 12 17:27:07.394: PIM(0): We win, our metric [1/0]
*Jan 12 17:27:07.394: PIM(0): (172.16.0.1/32, 239.1.1.1) oif Ethernet0/0 in Forward state
*Jan 12 17:27:22.829: PIM(0): Received v2 Assert on Ethernet0/0 from 10.0.0.2
*Jan 12 17:27:22.829: PIM(0): Assert metric to source 172.16.0.2 is [1/0]
*Jan 12 17:27:22.829: PIM(0): We lose, our metric [115/20]
*Jan 12 17:27:22.829: PIM(0): Prune Ethernet0/0/239.1.1.2 from (172.16.0.2/32, 239.1.1.2)
As a result, R1 forwards traffic for (172.16.0.1, 239.1.1.1), R2 forwards traffic for (172.16.0.2, 239.1.1.2).
R1#sh ip mroute
(*, 239.1.1.1), 00:00:28/stopped, RP 3.3.3.3, flags: SC
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:28/00:02:31, A

(172.16.0.1, 239.1.1.1), 00:00:19/00:02:40, flags: T
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3, Mroute
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:19/00:02:40, A

(*, 239.1.1.2), 00:00:29/stopped, RP 3.3.3.3, flags: SC
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:28/00:02:31, A

(172.16.0.2, 239.1.1.2), 00:00:13/00:02:46, flags: PT
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list: Null

R2#sh ip mroute
(*, 239.1.1.1), 00:01:48/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:01:48/00:02:18

(172.16.0.1, 239.1.1.1), 00:00:39/00:02:20, flags: PJT
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list: Null

(*, 239.1.1.2), 00:01:40/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:01:40/00:02:17

(172.16.0.2, 239.1.1.2), 00:00:24/00:02:35, flags: JT
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3, Mroute
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:24/00:02:35, A
Traffic flows illustrated:
Fig. 2
Now if R1 fails, R2 takes over all forwarding.
R2#sh ip mroute
(*, 239.1.1.1), 00:00:10/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:10/00:02:49

(172.16.0.1, 239.1.1.1), 00:00:05/00:02:54, flags: JT
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:05/00:02:54

(*, 239.1.1.2), 00:00:12/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:12/00:02:47

(172.16.0.2, 239.1.1.2), 00:00:11/00:02:48, flags: JT
  Incoming interface: Ethernet0/2, RPF nbr 10.2.2.3, Mroute
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:11/00:02:48
What happens when R1 comes back depends on the sequence of events, but if Assert triggers before SPT switchover, R1 will always lose because (*,G) always loses to (S,G). Therefore, all traffic will keep flowing through R2.
Absolutely barbaric! But it works.
PIM DRLB
draft-ietf-pim-drlb attempts to make this load balancing more civilized. The PIM DR will be different per flow (either (S,G) or (*,G)), selected using a hash calculated from the S, G or RP addresses. As of the time of this writing, it doesn’t seem to be widely implemented; only Cisco IOS has DRLB functionality, which is undocumented and probably not officially supported, but works as a proof of concept.
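The selection can be sketched roughly as follows: hash the (masked) flow addresses and use the result to pick one of the GDR (Group Designated Router) candidates. This is an illustration only, assuming a simple modulo hash; the exact function, masks and candidate ordering are defined in the draft.

import ipaddress

def select_gdr(group: str, grp_mask: str, candidates: list[str]) -> str:
    # Pick a GDR candidate per flow by hashing the masked group address.
    # Toy hash: the real one is specified in draft-ietf-pim-drlb.
    masked = int(ipaddress.ip_address(group)) & int(ipaddress.ip_address(grp_mask))
    ordered = sorted(candidates, key=lambda c: int(ipaddress.ip_address(c)))
    return ordered[masked % len(ordered)]

# With grp-mask 0.0.0.255, consecutive groups alternate between the two routers:
for g in ("239.1.1.1", "239.1.1.2"):
    print(g, "->", select_gdr(g, "0.0.0.255", ["10.0.0.1", "10.0.0.2"]))

This toy hash happens to reproduce the output below: 239.1.1.1 maps to 10.0.0.2 and 239.1.1.2 to 10.0.0.1.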
On receiver-facing interfaces:
R1(config-if)#ip pim drlb grp-mask 0.0.0.255

R1#sh ip pim drlb gdr-selected et0/0
Under the 'GDR Selected' field * denotes GDR address on the selected interface
Source, Group          GDR Selected      Interface
*, 239.1.1.1           10.0.0.2          Ethernet0/0
*, 239.1.1.2           * 10.0.0.1        Ethernet0/0
Therefore, traffic will be load balanced across R1 and R2.
R1#sh ip mro
(*, 239.1.1.1), 00:06:16/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:01:47/00:01:12

(10.3.3.4, 239.1.1.1), 00:00:18/00:02:41, flags: PJT
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list: Null

(*, 239.1.1.2), 00:00:40/stopped, RP 3.3.3.3, flags: SJC
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:40/00:02:42

(10.3.3.4, 239.1.1.2), 00:00:07/00:02:52, flags: JT
  Incoming interface: Ethernet0/1, RPF nbr 10.1.1.3
  Outgoing interface list:
    Ethernet0/0, Forward/Sparse, 00:00:07/00:02:52
Multicast in MLAG
Multi-Chassis Link Aggregation is a set of mostly vendor-proprietary technologies that allow a port-channel to be terminated on 2 switches. Examples of such technologies are Arista MLAG and Cisco vPC. In order for this to work, the switches establish a communication channel over which they synchronize some L2 information. Of relevance to multicast, the IGMP snooping table is synchronized, but everything L3 (including DR election and Asserts) works as usual. This means the DR will forward all traffic to a receiver connected with an MLAG port-channel.
Fig. 3
In figure 3, the receiver is dual-homed to SW1 and SW2 with an MLAG port-channel. SW2 happens to be the DR on the receiver’s vlan. When the receiver sends IGMP reports, they can be sent on either LAG member link (depending on hashing) – it doesn’t really matter. Only the DR will process those IGMP reports, send PIM joins towards the RP/source, and forward traffic.
The non-DR router shows the following (Arista EOS in this example):
SW1#sh ip mro det
---
239.1.1.1
  0.0.0.0, 0:01:25, RP 3.3.3.3, flags: W
  Incoming interface: Null
  Interfaces not in OIL:
    Vlan100: Not DR
It is possible to speed up convergence by letting the non-DR router signal the PIM tree and receive traffic, but then just drop it. If it detects a DR failure (BFD should be enabled for that), it will assume the DR role and begin to forward traffic.
Fig. 4
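A simplified sketch of that failover logic (a hypothetical model, just to show the state transitions):

def should_forward(is_dr: bool, non_dr_oifs_installed: bool, dr_alive: bool) -> bool:
    # Simplified model of non-DR OIF installation: the non-DR keeps the
    # tree signaled (and receives traffic), but only forwards once BFD
    # reports the DR as down.
    if is_dr:
        return True
    return non_dr_oifs_installed and not dr_alive

# Steady state: the non-DR drops traffic; after BFD detects DR failure,
# it starts forwarding immediately, without waiting to rebuild the tree.
print(should_forward(is_dr=False, non_dr_oifs_installed=True, dr_alive=True))   # False
print(should_forward(is_dr=False, non_dr_oifs_installed=True, dr_alive=False))  # True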
Configuration on EOS (currently this works only for SSM):
SW1(config-if-Vl100)#pim ipv4 non-dr install-oifs
See also https://eos.arista.com/eos-4-22-1f/pim-ssm-ipv4-non-dr-oif-installation-for-fast-failover/ (registration required).
Note this is different from Multicast-only Fast Reroute (MoFRR), where the router builds a backup RPF tree and receives 2 streams of traffic. When MoFRR is enabled on the LHR, still only the DR will send upstream joins (either for the primary or the backup tree). The disadvantage of both these technologies is double bandwidth consumption. In practice, network designs that are okay with that usually just have two redundant streams using different (S,G), so that all network elements are redundant, including the multicast sources. This is especially common in financial networks.
Multicast in EVPN
Ethernet VPN is a relatively new L2VPN technology. One of its goals is to reduce the amount of signaling in the overlay: for unicast that is ARP/ND, for multicast – IGMP/MLD. Instead, all control plane information is signaled by BGP (AFI/SAFI 25/70). Initially EVPN was designed only for unicast, so multicast over EVPN was flooded like over any LAN. EVPN also features multihoming: if a CE is connected to multiple PE, they elect a Designated Forwarder (DF) – one PE responsible for forwarding all BUM traffic to the CE; the non-DF PE will drop BUM traffic. This is needed to avoid duplicates.
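For reference, the default DF election in RFC 7432 is a simple modulo-based "service carving"; a minimal sketch (the loopback addresses are hypothetical):

import ipaddress

def elect_df(vlan: int, es_peers: list[str]) -> str:
    # Default EVPN DF election (RFC 7432 service carving): order the PEs
    # on the Ethernet Segment by IP address, then pick the one at position
    # (vlan mod number_of_PEs).
    ordered = sorted(es_peers, key=lambda p: int(ipaddress.ip_address(p)))
    return ordered[vlan % len(ordered)]

# Two PEs on the same ES: for vlan 100, the first PE in the ordered
# list becomes the DF.
print(elect_df(100, ["192.0.2.3", "192.0.2.4"]))  # -> 192.0.2.3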
Fig. 5
In figure 5, the source behind CE1 sends multicast traffic. Assuming CE1 has mroute state (either from receiving IGMP reports or from static configuration), it sends the traffic to PE1, which replicates it to all remote PE, regardless of whether they want to receive that stream or not. PE3 and PE4 are connected to the same Ethernet Segment (ES), so they elect a DF (in this example, PE3), and it will be the only router forwarding multicast traffic to CE3.
IGMP proxy for EVPN
draft-ietf-bess-evpn-igmp-mld-proxy adds new BGP EVPN route types 6/7/8 to emulate IGMP or MLD proxy functionality. The Type 6 (Selective Multicast Ethernet Tag, or SMET) route is used by R-PE to signal their interest in receiving a certain multicast stream; types 7 and 8 (IGMP Join Sync and Leave Sync) are used by PE on the same ES to synchronize the IGMP join/leave state on that ES. Therefore, PE that are not interested in a multicast stream will not receive it, but in either case only the DF will forward the multicast traffic to a multihomed CE. It is not possible to load balance multicast traffic across multiple PE routers.
Fig. 6
In figure 6, receiver CE3 sends an IGMP report which happens to be hashed to PE4 (non-DF). PE4 advertises an EVPN type 7 route so that the DF, PE3, installs IGMP join state and generates a SMET route, thus signaling to PE1 that PE3 is interested in receiving the multicast stream. IGMP reports from an R-CE can be hashed to any R-PE, DF or not – in either case they all synchronize the state so that they know what to do upon receiving IGMP leaves – but only the DF will be forwarding traffic.
Note: BGP routes are propagated to all PE routers – the diagram shows only selected route advertisements for brevity.
Upon receiving the SMET route, PE1 realises that it has an mrouter port facing CE1 (either because it heard PIM hellos from CE1 or because of static configuration), so it generates an IGMP report using the information from the SMET route and forwards it to CE1. Then traffic begins to flow.
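Putting the sequence together, here is a toy model of the proxy behaviour (simplified from the draft; message contents and function names are mine):

def handle_igmp_report(receiving_pe: str, is_df: bool,
                       source: str, group: str) -> list[str]:
    # Toy model of draft-ietf-bess-evpn-igmp-mld-proxy: sync the join on
    # the ES if a non-DF heard it, let the DF originate the SMET route,
    # and let the source-side PE proxy an IGMP report to its mrouter port.
    actions = []
    if not is_df:
        actions.append(f"{receiving_pe}: advertise Type 7 (IGMP Join Sync) "
                       f"for ({source},{group}) on the ES")
    actions.append(f"DF: advertise Type 6 (SMET) for ({source},{group})")
    actions.append(f"source-side PE: proxy an IGMP report for "
                   f"({source},{group}) towards the mrouter port")
    return actions

# Figure 6: the report from CE3 hashes to PE4, the non-DF.
for action in handle_igmp_report("PE4", is_df=False, source="*", group="239.1.1.1"):
    print(action)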
Multicast routing in EVPN
If the source and receiver are on different subnets, we have to do multicast routing. Normally, only the PIM DR on the receiver’s subnet would generate PIM joins and forward traffic. Since L2 is now not just a link between the LHR and receiver (as in figures 1, 2, 3), but is extended over EVPN, this becomes a problem leading to very suboptimal traffic flows.
Fig. 7
In figure 7, the source behind CE1 is sending multicast traffic on vlan 200, and CE2 and CE3 want to receive it on vlan 100. PE4 is elected as PIM DR on vlan 100, so according to the usual PIM rules, it will send PIM joins upstream, receive traffic on vlan 200, route it to vlan 100 and then send it to the interested PE routers on vlan 100 (relying on the SMET routes advertised by other PE, as in figure 6). Doesn’t look good!
Fig. 8
To optimize this behaviour, the draft authors came up with a Supplementary Broadcast Domain (SBD) – sort of a transit LAN. In figure 8, the R-PE (PE2 and PE3) send SMET routes not for vlan 100 but for the SBD. They receive traffic on the SBD and route it to vlan 100 locally, so each PE behaves like a PIM DR on vlan 100. For multi-homed CE, however, DF forwarding rules still apply.
draft-ietf-bess-evpn-irb-mcast describes how it works in detail. It also covers EVPN interworking with MVPN.
On all diagrams, I illustrated traffic flows as ingress replication; in fact, replication can happen in the core as well, either using PIM signaling and VXLAN encapsulation, or MPLS multipoint LSP. A selective PMSI can be created for different groups (draft-ietf-bess-evpn-bum-procedure-updates) – a capability which, by the way, has existed in VPLS for a while (RFC 7117).
Either way, EVPN multicast and MVPN are huge topics worth a separate series of posts. The scope of this post was to review how last hop multicast (between the LHR and receiver) works and its implications for traffic flows and load balancing.
Conclusion
For most deployments, load balancing of multicast traffic across multiple LHR doesn’t seem to be needed that much, which explains the limited interest in PIM DRLB.
However, when the L2 network between the LHR and receiver is not one link but rather a large overlay network with EVPN, the PIM DR becomes a bottleneck. Therefore, OISM (Optimized Inter-Subnet Multicast) is used to optimize traffic flows and allow each R-PE to forward multicast traffic.
References
- Protocol Independent Multicast – Sparse Mode (PIM-SM): Protocol Specification (Revised) https://tools.ietf.org/html/rfc7761
- PIM Designated Router Load Balancing https://tools.ietf.org/html/draft-ietf-pim-drlb-15
- PIM SSM IPv4 Non-DR OIF Installation for Fast Failover https://eos.arista.com/eos-4-22-1f/pim-ssm-ipv4-non-dr-oif-installation-for-fast-failover/
- IGMP and MLD Proxy for EVPN https://tools.ietf.org/html/draft-ietf-bess-evpn-igmp-mld-proxy-04
- EVPN Optimized Inter-Subnet Multicast (OISM) Forwarding https://tools.ietf.org/html/draft-ietf-bess-evpn-irb-mcast-04
- Updates on EVPN BUM Procedures https://tools.ietf.org/html/draft-ietf-bess-evpn-bum-procedure-updates-08
- Multicast in Virtual Private LAN Service (VPLS) https://tools.ietf.org/html/rfc7117
- Plutarch – The Life of Camillus (inspiring the title of this post)