MPLS or Anycast Routing – for a long time, you had to choose one. Segment Routing allows you to have both.
Introduction
It’s hard to overstate how important anycast routing is. DNS root servers and CDNs rely on it to make the Internet fast and reliable. VXLAN designs in data center networks use things like anycast VTEP or anycast gateway, which allow the network to scale while letting VMs migrate to different hosts without changing their IP addresses. In some cases, it’s even possible to replace expensive load balancers with anycast routing and ECMP. Multicast designs use techniques like anycast RP, anycast source and root node redundancy in MVPN.
Yet MPLS, until recently, was deprived of anycast routing. This is because MPLS is not a pure packet switching technology, but has a control plane based on virtual circuit switching. There is always a Label Switched Path (LSP) with one destination [1]. Even if the MPLS control plane just follows IGP metrics without any traffic engineering, anycast routing is still not possible.
Segment Routing
SR reinvents the MPLS control plane, making it entirely based on packet switching. Since there are no LSPs, and global labels are mapped to IP prefixes, nothing prevents us from assigning the same prefix and the same global label (segment) to multiple routers. Great success.
Fig. 1
Consider figure 1 – R1 and R2 are the ASBRs, both advertising routes from the Internet to AS100, which has a BGP-free core. All routers in AS100 run IS-IS with SR extensions. In traditional MPLS designs, R1 and R2 would advertise routes with their respective loopback IPs as the nexthop, and other routers in AS100 would pick the best exit point based on IGP metrics. This causes multiple problems:
- Slow convergence if R1 or R2 fails – RR has to withdraw each BGP prefix and advertise it with the new nexthop.
- Suboptimal routing, because the best IGP path for RR is not always the same as for other routers in the AS. One way to solve this is BGP ORR [RFC9107].
- Attempting to use BGP add-path to solve these problems will increase memory usage on the RR.
With Segment Routing, it is possible to assign an anycast SID to R1 and R2 and use it as a nexthop when advertising routes throughout the AS. This way, routing always follows IGP metrics, without having to use BGP add-path or ORR. Convergence speed is prefix-independent and depends on IGP timers rather than BGP withdrawal and advertisement. If necessary, convergence speed can be further improved by enabling TI-LFA.
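To make the "routing always follows IGP metrics" point concrete, here is a minimal Python sketch. The topology, metrics and helper names are my own assumptions for illustration (they are not taken from figure 1): every router runs SPF and resolves the anycast nexthop to whichever owner is closest, and after one ASBR fails the same SPF run moves all traffic to the other one, with no per-prefix BGP updates involved.

```python
import heapq

def build_graph(links):
    """Turn (a, b, metric) tuples into a symmetric adjacency map."""
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph

def spf(graph, source):
    """Dijkstra: shortest IGP distance from source to every reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        for neigh, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(queue, (nd, neigh))
    return dist

def nearest_owner(graph, source, owners):
    """Anycast owner with the lowest IGP distance (name breaks ties)."""
    dist = spf(graph, source)
    reachable = [o for o in owners if o in dist]
    return min(reachable, key=lambda o: (dist[o], o)) if reachable else None

def without_node(graph, failed):
    """Copy of the graph with one router removed, to model a node failure."""
    return {n: {k: m for k, m in nbrs.items() if k != failed}
            for n, nbrs in graph.items() if n != failed}

# Hypothetical topology: R1 and R2 own the anycast SID.
LINKS = [("R1", "R3", 10), ("R2", "R4", 10), ("R3", "R4", 10), ("RR", "R3", 10)]
ANYCAST_OWNERS = {"R1", "R2"}
topology = build_graph(LINKS)

routers = ("R3", "R4", "RR")
before = {r: nearest_owner(topology, r, ANYCAST_OWNERS) for r in routers}
after = {r: nearest_owner(without_node(topology, "R1"), r, ANYCAST_OWNERS)
         for r in routers}
print(before)
print(after)
```

In this sketch R3 and R4 pick different exit points before the failure, which is exactly what add-path or ORR would otherwise be needed for.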
Anycast SID is just a prefix SID without the N flag (in IS-IS or OSPF). The same SID can be assigned to multiple routers.
Note the separate BGP session between R1 and R2 in figure 1. If the eBGP session between R1 and its external peer fails, R1 will keep receiving traffic sent towards the anycast SID, so it must route that traffic towards R2. The anycast SID must not be used as the nexthop for this BGP session. Suboptimal routing in this case is a drawback of the anycast design.
An anycast SID doesn’t have to be the tail-end of the LSP; it can just as well be in the middle, allowing SR-TE policies to steer traffic via it, like in the example below.
Large Scale Interconnect
Large Scale Interconnect [RFC8604] is a very scalable SDN design with Segment Routing, an alternative to Seamless MPLS for very large networks. In theory, it can scale almost infinitely, because if a million SIDs are not enough to address all routers, the same SIDs can be reused in different leaf domains. It requires a controller that sees the entire topology (e.g. using BGP-LS or streaming telemetry) and programs SR-TE policies on routers.
There are multiple flavours of this design, but usually they involve multiple IGP domains, e.g. multi-level IS-IS with blocked L1->L2 route propagation or multi-instance IS-IS [RFC8202]. The IGP domains can be completely isolated, or a few selected core (L2) prefixes can be redistributed into leaf (L1) domains. Anycast routing also comes in handy here.
Fig. 2
In the example on figure 2, the SDN controller programmed an SR-TE policy on PE1, so that it pushes the following label stack:
- Anycast SID of the local ABR
- Anycast SID of the remote ABR
- Prefix or node SID of the remote PE
If we redistribute core SIDs into the leaf IGP domains, the label stack that the PE has to push can be reduced from 3 to 2 labels, but the routing table usage and the amount of IGP flooding in leaf domains will increase. So there is always a tradeoff in such designs.
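The two label stacks can be sketched in a few lines of Python. The SRGB base, the SID indices and the segment names here are hypothetical values I picked for illustration; the only thing taken from the text above is the order of segments and the 3-versus-2 label tradeoff.

```python
# Hypothetical values: one common SRGB and made-up SID indices.
SRGB_BASE = 900000
SRGB_SIZE = 65536

SID_INDEX = {
    "local-abr-anycast": 1001,
    "remote-abr-anycast": 1002,
    "remote-pe": 12,
}

def sid_to_label(index, base=SRGB_BASE, size=SRGB_SIZE):
    """Label value = SRGB base + SID index, if the index fits in the block."""
    if not 0 <= index < size:
        raise ValueError(f"SID index {index} is outside the SRGB")
    return base + index

def label_stack(segments):
    """Top-of-stack first, in the order the segments must be visited."""
    return [sid_to_label(SID_INDEX[name]) for name in segments]

# Isolated leaf domains: PE1 only knows its local ABRs, so it needs 3 labels.
isolated = label_stack(["local-abr-anycast", "remote-abr-anycast", "remote-pe"])

# Core SIDs redistributed into the leaf: the remote ABR anycast SID is
# reachable directly, so the local ABR segment can be dropped.
with_core_sids = label_stack(["remote-abr-anycast", "remote-pe"])

print(isolated)
print(with_core_sids)
```

The saved label is paid for by every leaf router carrying the redistributed core prefixes, which is the tradeoff described above.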
IS-IS area proxy
[draft-ietf-lsr-isis-area-proxy] is an alternative solution to the IGP scale problem in large networks. Instead of splitting the network into multiple IGP domains and interconnecting them with BGP or SR-TE policies, area proxy allows creating L1 islands inside L2 and representing each of them to the rest of the L2 network as just one node. This way, it is possible to reduce the number of nodes in SPF calculations and the amount of required LSP (link-state PDU) flooding (which can be further reduced by [draft-ietf-lsr-dynamic-flooding]).
Fig. 3
Consider the topology on figure 3. Routers R1-R5 support area proxy and are all configured as L1L2 routers. Sample config (Arista EOS):
router isis 1
   net 49.0001.0001.0001.0001.00
   !
   address-family ipv4 unicast
   !
   area proxy
      net 49.0101.0101.0101.0101.00
      is-hostname AR101
      area segment 101.101.101.101/32 index 101
      router-id ipv4 101.101.101.101
      no shutdown
   !
   segment-routing mpls
      router-id 1.1.1.1
      no shutdown

Links on R1-R4 that are connected to routers outside the proxy area are configured as L2-only circuits and proxy boundary:

interface Ethernet1
   no switchport
   ip address 10.0.0.1/30
   isis enable 1
   isis circuit-type level-2
   isis area proxy boundary
   isis network point-to-point
PE1-PE3 are oblivious of area proxy and don’t have to support it. R1-R5 elect the area leader which generates a proxy LSP. Routers outside the proxy area are not aware of area 49.0001 and see the entire proxy area as one big router in area 49.0101.
The main benefit of area proxy is reducing the number of LSPs in the area, and the number of routers that are included in SPF runs. The tradeoff is lack of visibility of the entire network topology that link-state IGPs promise – hence difficulty in traffic engineering. Therefore, it makes sense to put routers from one POP or city in the same proxy area, and configure expensive long-haul links as L2 circuits, where traffic engineering is more likely to be required.
Anycast is not strictly required in area proxy designs; in the example in figure 3, traffic can travel from PE1 to PE2 with just the respective node SID as the only transport label. But for traffic engineering purposes it is possible to allocate an anycast area SID, owned by all routers in the proxy area.
TI-LFA and anycast SID
Just like with protecting any SID, TI-LFA attempts to steer traffic on the post-convergence path. Depending on topology, it can be the other router sharing the anycast SID.
Fig. 4
On figure 4, R2 and R5 share anycast SID 25. If the R1-R2 link fails, R1 will reroute traffic to that anycast SID to R5, because this is the new post-convergence path.
A bit less obvious example:
Fig. 5
On figure 5, node protection is enabled on R1, and a new link between R2 and R6 with a lower metric is added. In this case, using anycast SID 25 on the backup path can result in R6 sending traffic to R2 before R6 detects the failure and converges. Therefore, R1 must use the node SID of R5 instead of the anycast SID for the backup path.
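The reasoning in this example can be written down as a small check. This is only a rough sketch of the idea, not how any TI-LFA implementation actually computes backup paths, and the metrics and function names are my assumptions: R1 looks at the pre-failure view of the repair node (R6), and if R6's closest owner of the anycast SID is the node being protected, the anycast SID is not safe to use on the backup path.

```python
import heapq

def build_graph(links):
    """Turn (a, b, metric) tuples into a symmetric adjacency map."""
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph

def spf(graph, source):
    """Dijkstra: shortest IGP distance from source to every reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        for neigh, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(queue, (nd, neigh))
    return dist

def pick_backup_sid(graph, repair_node, owners, protected_node, surviving_owner):
    """Choose the SID for the backup path, seen from the protecting router.

    The graph is the PRE-failure topology, because the repair node has not
    converged yet and still forwards according to its old SPF result.
    Assumes at least one owner is reachable from the repair node.
    """
    dist = spf(graph, repair_node)
    nearest = min((o for o in owners if o in dist), key=lambda o: (dist[o], o))
    if nearest == protected_node:
        # Repair node would send anycast traffic into the failed router.
        return f"node SID of {surviving_owner}"
    return "anycast SID"

# Hypothetical metrics; R2 and R5 share the anycast SID, R1 protects node R2.
BASE_LINKS = [("R1", "R2", 10), ("R1", "R6", 10), ("R6", "R5", 10)]
OWNERS = {"R2", "R5"}

simple_case = build_graph(BASE_LINKS)
with_r2_r6_link = build_graph(BASE_LINKS + [("R2", "R6", 1)])

print(pick_backup_sid(simple_case, "R6", OWNERS, "R2", "R5"))
print(pick_backup_sid(with_r2_r6_link, "R6", OWNERS, "R2", "R5"))
```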
Check also Topology Dependent LFA where I have written more about not-so-obvious TI-LFA scenarios.
SRGB
So far it’s all been great, but there is an elephant in the room, and that is the Segment Routing Global Block (SRGB). Best practice is that all routers in the SR domain use exactly the same SRGB. But there is no standard that mandates what the SRGB range should be, so in practice every vendor uses a different SRGB by default, leaving the operator to decide what to use in a multivendor network.
Segment Routing architecture allows for different SRGBs to be used on different routers. When propagating SR extensions in IS-IS, OSPF or BGP, a router advertises its SRGB base + segment identifiers, so that other routers can calculate label values. But in practice, using different SRGBs causes a lot of issues, such as:
- Increased operational and troubleshooting complexity without any added benefit
- No anycast routing
- No TCAM optimizations
Yes, no anycast routing at all. [draft-ietf-spring-mpls-anycast-segments] proposes some solutions for anycast with different SRGBs (one of these solutions is a designated SRGB for anycast SIDs, which should be the same on all routers). But it doesn’t even mention TI-LFA for anycast SIDs. I don’t know whether any vendor has implemented this draft.
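A small Python sketch of why mismatched SRGBs and anycast don't mix, as I understand the problem the draft is trying to solve. The SRGB bases and the SID index are made-up values: the label for the segment that follows the anycast SID in the stack must be valid on whichever anycast owner ends up receiving the packet, and with different SRGBs there is no single such value.

```python
def label_at(srgb_base, router, index):
    """Label that a given router expects for a SID index: its own base + index."""
    return srgb_base[router] + index

def labels_expected_by_owners(srgb_base, owners, index):
    """Label each anycast owner would expect for the NEXT segment in the stack."""
    return {owner: label_at(srgb_base, owner, index) for owner in owners}

def single_label_works(srgb_base, owners, index):
    """True only if every owner interprets the same label value the same way."""
    expected = labels_expected_by_owners(srgb_base, owners, index)
    return len(set(expected.values())) == 1

# Hypothetical SRGB bases: a multivendor pair versus an aligned pair.
MIXED_SRGB = {"R2": 16000, "R5": 900000}
ALIGNED_SRGB = {"R2": 900000, "R5": 900000}

OWNERS = ["R2", "R5"]
NEXT_SEGMENT_INDEX = 7  # made-up index of the segment after the anycast SID

mixed = labels_expected_by_owners(MIXED_SRGB, OWNERS, NEXT_SEGMENT_INDEX)
aligned = labels_expected_by_owners(ALIGNED_SRGB, OWNERS, NEXT_SEGMENT_INDEX)

print(mixed, single_label_works(MIXED_SRGB, OWNERS, NEXT_SEGMENT_INDEX))
print(aligned, single_label_works(ALIGNED_SRGB, OWNERS, NEXT_SEGMENT_INDEX))
```

The ingress router cannot know in advance which owner the IGP will deliver the packet to, so it can only build a correct stack in the aligned case.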
MPLS services
Sadly, so far there are no extensions to MPLS services to support anycast. This means that even if we use Segment Routing as the MPLS transport, L3 VPNs and VPLS are still not going to be anycast-aware and will keep working the old way, over an end-to-end LSP.
Conclusion
Anycast routing proves once again that Segment Routing is superior to legacy MPLS control planes. But there are caveats to anycast designs, like SRGB issues and sometimes suboptimal routing after failures.
References
- Segment Routing, Part I – Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, ISBN-13: 978-1542369121
- Segment Routing Architecture https://datatracker.ietf.org/doc/html/rfc8402
- BGP Optimal Route Reflection (BGP ORR) https://datatracker.ietf.org/doc/html/rfc9107
- Interconnecting Millions of Endpoints with Segment Routing https://datatracker.ietf.org/doc/html/rfc8604
- IS-IS Multi-Instance https://datatracker.ietf.org/doc/html/rfc8202
- Area Proxy for IS-IS https://datatracker.ietf.org/doc/html/draft-ietf-lsr-isis-area-proxy-06
- Dynamic Flooding on Dense Graphs https://datatracker.ietf.org/doc/html/draft-ietf-lsr-dynamic-flooding-09
- Anycast Segments in MPLS based Segment Routing https://datatracker.ietf.org/doc/html/draft-ietf-spring-mpls-anycast-segments-03
- Equal Routes https://routingcraft.net/equal-routes/
- Topology Dependent LFA https://routingcraft.net/topology-dependent-lfa/
Notes
- [1] A multipoint LSP can have multiple destinations, but not ANY destination, like in anycast
Hello! Many thanks for a really good post!
I was interested in the first scenario and I have a couple of words about it. First of all, you can throw a stone to my side I’m not a big fan of Segment Routing (of any flavor). If I were a marketing engineer…
Although I don’t see any problem with LDP or BGP-LU and anycast for your scenario, I want to clarify other points here.
“Slow convergence if R1 or R2 fails – RR has to withdraw each BGP prefix and advertise it with the new nexthop”. This is not always the case. Probably your network can notice that the loopback of R1 or R2 is gone due to IGP LSP/LSA (which is faster than BGP reaction) and invalidate all routes that use this loopback as a next-hop. If you use some kind of BGP FRR (PIC) or multipath here it will also save time. But I agree it’s still possible to rely on BGP only in some cases (i.e. hierarchical design).
“Attempting to use BGP add-path to solve these problems will increase memory usage on the RR”. I believe that RR in most cases has all routes whenever it is configured with Add-paths or not. But it depends on hierarchy, yes. The memory consumption will increase on RR clients due to several paths that RR propagates into a network. But maybe there are some stories with Adj-RIB-out or something else on a RR which I’ve missed. Anyway, it would be really helpful if you clarified your statement.
The real problem with prefix-dependent convergence is when a PE-CE link fails instead of a router itself. This way we always will rely on BGP convergence which might be slow, even too slow (a long network diameter (count of BGP hops/speakers), BGP path hunting, sub-optimal prefixes processing, and so on) if the router changes next-hops. Anycast can help us in this scenario but I see a better way. I prefer to save an original next-hop for prefixes and propagate a PE-CE prefix into your BGP. It’s even possible to schedule such prefixes for a BGP send process and for RIB/FIB installation. It also opens the way for EPE in the case of BGP-LU. And yes, without a fancy SR.
So, today I was proven wrong by Ivan Pepelnjak on this twitter thread:
https://twitter.com/ioshints/status/1460891199116558338
Legacy MPLS indeed can do anycast routing, in some cases, even though it wasn’t designed for it. LDP will throw errors at you (but it will sort of work, at least in the lab), and BGP-LU is no different than BGP-SR as far as anycast routing is concerned.
Now regarding my BGP scenario on figure 1.
1. If R1 or R2 fails, IGP will invalidate nexthop indeed, but then RR will have to withdraw each BGP prefix from R3/R4, and advertise the new best prefix. So the convergence will not be prefix-independent. BGP PIC core helps with P router failures, but not PE.
2. You’re right, RR will receive all routes anyway. RR clients (e.g. PE routers) will need to store extra routes in case of BGP add path. My bad.
3. Preserving eBGP peer’s nexthop when advertising to iBGP peers (default BGP behaviour) is best for convergence in non-MPLS networks. MPLS doesn’t like non-/32 FEC – perhaps, in some cases (Cisco) it can work, but not with every vendor. Also, in MPLS VPN there is no place for anycast, but fast convergence for PE-CE link failures can be achieved by BGP PIC edge (which brings its own problems with label allocation modes, but that’s another story).
Thanks for sharing this!
1. Yes, RR will have to withdraw each BGP prefix. But it won’t influence a convergence process if any router in the same IGP domain has a way to converge (extra path provided by add-path). If a router doesn’t have such a way I’m afraid it’s strange to discuss convergence itself (at least in modern SP networks). If a router has a backup route, there is a question of how fast it completes the BGP BPS but it doesn’t depend on the RR and anycast won’t improve this aspect. BGP PIC (edge, not core) will provide a backup FIB record for the route and select it before BGP BPS is done.
3. It works in MPLS networks as well. There are two options. We might advertise a PE-CE link as a vanilla BGP prefix (without a label) with loopback as a next-hop and customer’s prefixes with an unchanged next-hop. So for a remote router, there will be recursion a BGP next-hop over a BGP next-hop. All that we need is to withdraw the PE-CE prefix to increase the convergence speed. The second option is to advertise the PE-CE prefix with a label. For example, I want to skip egress IP lookup for some reason. No problem to advertise a prefix with any mask length in BGP-LU. At least Juniper, Nokia, and Huawei do it.
Yeah, I got it. I just recall there was something problematic with resolving MPLS LSP over non-/32 routes (even BGP-LU), but can’t remember what exactly. Maybe that’s specific to Cisco. Thanks for the valuable comment.
Regarding MPLS Services, JUNOS offers this “context-identifier” (virtual IGP next-hop instead of pinning VPN next-hops to loopback addresses) option mostly for a redundancy point of view. Can it also be used with anycast services?
Do you mean draft-minto-2547-egress-node-fast-protection? I had it on my mind when writing this. I read about this feature in MPLS in the SDN Era but never had a chance to play with it.
My understanding from reading the book and the draft is that this context ID is supposed to work as a temporary forwarding entry, until normal convergence occurs. Not “true anycast” like for example anycast VTEP in VXLAN-EVPN + MLAG. I wonder if anybody uses this context ID, seems like an interesting tech.