Segment Routing was supposed to make MPLS easier and give more power to network operators. Sadly, vendors decided to make it harder by selling weird protocols and over-engineered controller bloatware.
MPLS is actually great
Despite some anti-MPLS marketing from SD-WAN vendors and the like, as a transport technology there is no real alternative to MPLS1. It provides:
- Low-overhead underlay for services, so only provider edge routers (PE) need to carry full Internet tables and VRFs for L3VPN
- Traffic engineering allows you to evenly load-balance traffic in any topology, provide low-latency paths for services that need them, disjoint paths for A/B feeds etc
- Fast reroute enables sub-50ms convergence after link/node failures
However, over the course of its evolution, MPLS has become unnecessarily complex. Let’s try to understand why.
Traditional MPLS stack
Normally the MPLS network would run IS-IS or OSPF for IP routing, LDP or RSVP for label distribution, and BGP with its many address families for MPLS services. Something like this:
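To make this concrete, here is a rough sketch of the control-plane stack on a single PE, in IOS-XR-flavoured syntax (all names, addresses and AS numbers below are invented):

router isis CORE                        ! IGP: IP reachability for loopbacks
 net 49.0001.0000.0000.0001.00
 address-family ipv4 unicast
!
mpls ldp                                ! label distribution for IGP prefixes
 interface GigabitEthernet0/0/0/0
!
router bgp 65000                        ! services: L3VPN on top of the MPLS underlay
 neighbor 10.0.0.100
  remote-as 65000
  update-source Loopback0
  address-family vpnv4 unicast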
There are 3 problems with the traditional MPLS stack: operational complexity, protocol limitations and interoperability.
Operational complexity
LDP is fairly simple to configure and operate, but it doesn't support Traffic Engineering and has only limited support for fast reroute using remote LFA.
RSVP supports Traffic Engineering but requires a mesh of p2p tunnels which limits scalability.
So in practice, “LDP or RSVP” often becomes “LDP and RSVP”: LDP within each PoP and RSVP on the WAN links connecting different PoPs, with targeted LDP sessions over the RSVP tunnels. This further increases complexity, as now you have to operate 2 label distribution protocols.
Something like this:
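A hedged IOS-XR-flavoured sketch of that combination (interfaces and addresses are invented): plain LDP on the intra-PoP link, RSVP-TE on the WAN link, and a targeted LDP session riding over the tunnel:

mpls ldp
 interface GigabitEthernet0/0/0/0      ! intra-PoP link: plain LDP
 address-family ipv4
  neighbor 10.0.0.7 targeted           ! targeted LDP to the remote PoP
!
rsvp
 interface GigabitEthernet0/0/0/1      ! WAN link towards the other PoP
!
interface tunnel-te1                   ! RSVP-TE tunnel across the WAN
 ipv4 unnumbered Loopback0
 destination 10.0.0.7
 path-option 10 dynamic
 autoroute announce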
Protocol limitations
These are fundamental and apply to all implementations:
- Poor ECMP support. RSVP doesn’t support ECMP at all; LDP supports it but can run into limitations ([1], [2]). ECMP and Leaf/Spine designs are very common in modern networks.
- LDP-IGP sync. While its purpose is to prevent traffic blackholing during reconvergence, it's easy to shoot yourself in the foot and create a situation where IGP sessions get stuck and never come up [3].
- Traffic Engineering with RSVP is non-deterministic. Since every router signals its LSP independently of others, the ultimate state of routing depends on the sequence of events and can be different every time the network reconverges.
- Ironically, independent LSP signaling doesn't mean better scalability. On the contrary, every router must maintain state for all transit LSPs going through it, so RSVP doesn't scale well.
Interoperability issues
Since we’re dealing with a stack of at least 3 protocols and many extensions to them, building an MPLS network using hardware from different vendors becomes a big pain:
- Not all implementations support all features and extensions. With the traditional MPLS architecture this often means that if one router doesn't support a certain feature, you can't use it at all.
- Different vendors sometimes interpret RFCs differently, so both claim to support a certain standard, but when you connect their boxes together, it just doesn't work.
Back in TAC I spent many days troubleshooting things like RSVP refresh reduction or fast reroute between 2 big mainstream vendors, or even 2 different OSes from the same vendor! Imagine what happens when you introduce lesser-known implementations.
What all of this means in practice is that the only viable way to build a traditional MPLS network is to buy all routers from one of the big vendors and follow their validated design. Which results in vendor lock-in and makes MPLS inaccessible for a lot of smaller networks that cannot afford to spend fortunes on network hardware.
Segment Routing basics
Segment Routing throws away all the label distribution garbage and adds a few extensions to IS-IS and OSPF to advertise Segment IDs (SID) along with links and prefixes. There is no more “label switching”2; the ingress LER just pushes the SID of the egress LER onto the packet, and the transit LSRs don't change that label.
Since SR is IP-based and has no circuit-switching roots like RSVP, it natively supports ECMP and anycast. With good network design, SR scales very well.
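The baseline configuration is correspondingly small. A minimal sketch in IOS-XR-flavoured syntax (the process name and SID value are invented):

router isis CORE
 address-family ipv4 unicast
  segment-routing mpls                 ! enable SR-MPLS and the SR TLVs
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16001           ! node SID advertised for this router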
Traffic Engineering with SR
So shortest path forwarding works great; now what about traffic engineering? Unlike RSVP, SR doesn't signal LSPs, it just adds multiple SIDs to forward the packet along the traffic-engineered path.
In this topology, in order to forward a packet from R1 to R6 via the blue links, R1 pushes 3 labels (see the policy sketch after this list):
- Node SID of R3 -> forward packet to R3 using shortest path
- Adj SID of R3 towards R5 -> use that specific link
- Node SID of R6 -> deliver packet to the actual destination (R6)
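Expressed as an SR-TE policy with an explicit segment list, in IOS-XR-flavoured syntax (all label values, names and addresses are invented):

segment-routing
 traffic-eng
  segment-list SL-BLUE
   index 10 mpls label 16003           ! node SID of R3: shortest path to R3
   index 20 mpls label 24035           ! adj SID of R3 towards R5: that specific link
   index 30 mpls label 16006           ! node SID of R6: the final destination
  policy BLUE
   color 100 end-point ipv4 10.0.0.6
   candidate-paths
    preference 100
     explicit segment-list SL-BLUE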
Here is the catch: since SR is stateless, proper traffic engineering requires a controller. Without a controller it's impossible to use bandwidth reservations; and while you can do CSPF with just affinities or explicit paths, that requires extra functionality on the routers, making the implementation more complex.
Lowering barrier to entry into MPLS
Since we’re outsourcing SR-TE policy calculation to the controller anyway, this makes SR implementation on the router very simple:
1. IGP extensions to advertise SRGB, SRLB, prefix and adjacency SIDs (just a few new TLVs)
2. BGP-LS to advertise the link-state topology to the SR controller
3. BGP SR-TE or PCEP to receive SR-TE policies from the controller
Actually (2) and (3) are optional, and so are other features like TI-LFA, Flex Algo etc. The very minimal implementation of SR is just adding a couple of new IS-IS TLVs, and that's it. Then you can use one router from another vendor to export the IGP topology to BGP-LS; as for receiving policies, good old BGP-LU serves as a nice workaround for routers that don't support BGP SR-TE or PCEP.
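For example, exporting the IGP topology to the controller takes only a few lines. An IOS-XR-flavoured sketch (the AS number and controller address are invented):

router isis CORE
 distribute link-state                 ! feed the IS-IS LSDB into BGP-LS
!
router bgp 65000
 address-family link-state link-state
 neighbor 192.0.2.10                   ! the SR-TE controller
  remote-as 65000
  address-family link-state link-state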
Unlike RSVP with its many extensions, which took vendors many years to implement and refine, basic SR routers can be built by small companies or open-source projects. So network operators have a much wider choice of available hardware, or can build something on their own using open-source software like FRR and VPP.
Easier interoperability
With SR there is a much lower risk of interoperability issues compared to the traditional MPLS stack. Since there is no RSVP signaling and no LDP-IGP sync, interop problems can happen pretty much only at the IGP level (e.g. different timers or TLV formats), but those happen in LDP/RSVP setups as well.
Perhaps the only annoying thing is SRGB: every vendor decided to use a different default range, but at least you can change it when required.
SR also makes a hybrid approach viable: hardware from big vendors for the network core, but something cheaper for aggregation and access.
What went wrong with Segment Routing
While SR has its roots in Cisco, the technology is an open standard, so everyone can implement it. And as I explained above, implementing basic SR is actually very easy, and there are lots of implementations supporting it to some extent.
Selling a simple network design that everyone can replicate was unacceptable for big vendors. So they came up with “best practice” designs using Traffic Engineering with PCEP (with its many extensions) and over-engineered controllers that lock you in with their SR implementation.
An SR-TE controller is just a router with some extra functionality like processing BGP-LS and calculating policies with CSPF. It should be even easier than your average MPLS-TE implementation, since there is no need for LSP signaling.
Yet what you see in actual controller implementations is some disgusting bloatware that needs a supercomputer to run and does all kinds of things like network monitoring, automation, netflow collecting and OSS/BSS functionality. Which is great but who asked for any of those on a routing platform?
Self-defeating paradigm
Of course, the real business reasons for this over-engineering are:
- Vendor lock-in (as explained above). Since implementing SR is much easier than implementing RSVP with its many extensions, vendors moved proprietary magic to the controller.
- Selling services. If the controller is so difficult to deploy and operate, network operators will have to buy expertise from vendors.
The second point is very ironic. If the product is deliberately made too complicated for network operators to deploy and operate, it will also be too complicated for the vendor's own engineers.
Recently I was reading Introduction to SRv6 from Juniper, and the chapter about Traffic Engineering has a beautiful sentence:
“As we don’t have a controller in our setup, we will not demonstrate the PCEP provisioning from the controller.”
Juniper sells their own SR-TE controller! Yet it is so complex that even Juniper engineers writing a book about SR-TE couldn’t set it up in their lab. This is not a complaint about the book (which is actually good and I recommend reading it), just an illustration of the point I made above. Also the same applies to other SR controller vendors, not just Juniper.
Dumbest network design ideas
Quote from the book “Segment Routing, Part 1”:
“…the always-on RSVP-TE full-mesh model is way too complex because it creates continuous pain for no gain as, in most of the cases, the network is just fine routing along the IGP shortest-path. A tactical TE approach is more appealing. Remember the analogy of the raincoat. One should only need to wear the raincoat when it actually rains.”
The engineers who invented Segment Routing based their research on decades of industry experience, and the feedback they collected from many MPLS network operators, so I think we should listen to their advice on network design.
What they recommend is to use IGP shortest path routing whenever possible and then deploy some SR-TE policies for traffic that needs to be forwarded via a different path. This might not always be possible, but in either case regular IGP routing (with SR extensions) should be the baseline the network can always fall back to should the SR-TE controller fail.
In spite of this common sense advice, some people have come up with designs that make even basic end-to-end connectivity depend on the SR-TE controller, since this somehow makes the network “software-defined” and “programmable”.3 I haven’t been able to understand how it’s more “programmable” than the normal design where the controller is deployed on top of basic IGP routing. The only real difference is that now controller failure leads to a catastrophic network outage, which should not be the case in a good SR design.
The real-world consequence of those designs is making company execs scared of any network automation and SDN, as in their minds it now equates to fragility.
Building a user-friendly SR-TE controller
Considering all of the above, a good SR-TE controller should be:
- A purely routing platform. Collect routing information, calculate policies, that’s it. Provide API for automation and CLI for troubleshooting but don’t attempt to combine all network management platforms in one.
- Easy to deploy, configure and operate. Industry-standard CLI has been working great for routers and switches, there is no reason it shouldn’t work for an SR controller. Some people really love to hate CLI, but in practice a networking product without a good CLI is unusable.
The SR-TE controller belongs strictly to the control plane. Turning the controller into a management platform was a mistake.
- Supporting basic routers. It’s great to support a lot of features, but the controller should work with the minimal SR implementation that doesn’t support PCEP, ODN and other complex stuff. Just SR extensions for IGP and BGP-LU to install policies – pretty much everyone supports that.
- Lightweight in basic setup. It's very handy to just deploy the controller as a Docker container running on a router, so there is no need to maintain an extra server in a remote data center, set up redundant connections, etc.
- Natively supporting SR designs. Using ECMP and anycast SIDs in policies, Egress Peer Engineering, Null endpoint etc. It's not enough to take CSPF algorithms from traditional MPLS-TE and replace RSVP with SR; a proper SR controller must natively use SR capabilities.
To illustrate the last point, consider a typical Leaf-Spine topology.
Normally traffic from L1 to L6 will ECMP via all spines. As I pointed out earlier in this article, SR already gives multiple advantages over LDP in this topology, especially as we try to scale it ([1], [2]).
Now, what if for Traffic Engineering reasons we want traffic from L1 to L6 to go strictly via S1 and S2? There are 2 ways this can be done:
- Use link affinities (also known as admin groups or colors)
- Use explicit path
If we configure a loopback with an anycast IP on S1 and S2 and use that IP as a loose explicit-path hop, the controller should resolve it via both routers and use ECMP. This was not possible with RSVP.
Now, what should the segment list in this policy be? Using 2 segment lists, <S1, L6> and <S2, L6>, will consume more FEC entries in hardware. If S1 and S2 share an anycast SID, the controller should figure that out and use the anycast SID in the segment list4.
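A sketch of the policy the controller could then program, assuming an invented anycast SID 16100 shared by S1 and S2 and node SID 16006 on L6 (IOS-XR-flavoured syntax, all values invented):

segment-routing
 traffic-eng
  segment-list SL-SPINES
   index 10 mpls label 16100           ! anycast SID of S1/S2: ECMP via both spines
   index 20 mpls label 16006           ! node SID of L6
  policy L1-TO-L6
   color 100 end-point ipv4 10.0.0.6
   candidate-paths
    preference 100
     explicit segment-list SL-SPINES

One segment list instead of two, with ECMP preserved and fewer FEC entries consumed.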
Egress Peer Engineering
EPE is a way for an ingress router to forward traffic to a specific egress peer of a specific egress router. The egress router allocates an MPLS label (BGP Peer SID or BGP-LU) per egress peer and advertises it to the controller, so the controller can program a policy instructing the ingress router how to forward the traffic.
It’s easy to integrate SR with EPE as we can just add an EPE label to the policy.
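On the egress router this is typically a one-line knob per peer. An IOS-XR-flavoured sketch (addresses and AS numbers are invented); the allocated Peer SID is then exported to the controller via BGP-LS:

router bgp 65000
 neighbor 203.0.113.1                  ! the egress eBGP peer
  remote-as 64512
  egress-engineering                   ! allocate a BGP Peer SID for this peer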
Bandwidth and affinity constraints for EPE
There is no RFC for advertising TE attributes along with the BGP Peer SID, similar to what RFC 3630 / RFC 5305 / RFC 5329 do for IGPs. So we just configure the constraints on the controller, which correlates them with the BGP Peer SIDs it receives from the egress routers.
Null Endpoint
It is possible to configure SR-TE policies with a Null endpoint (0.0.0.0 or ::). This is perfect when we want to send traffic to the closest egress peer matching the constraints. In network design this is also known as hot-potato routing.
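In RFC 9256 terms, such a policy is identified by its color alone. A purely hypothetical pseudo-config sketch of the intent (the syntax, names and values are invented, not any vendor's CLI):

policy HOT-POTATO
 color 200 endpoint 0.0.0.0            ! Null endpoint: any egress peer can match
 constraint affinity include-any PEERING  ! invented constraint: only links tagged PEERING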
Variable endpoint
An SR-TE policy can also change from a regular (node) endpoint to an egress peer endpoint. Consider a topology:
Site 1 and site 2 both advertise their prefixes to the Internet, but the preferred method of communication is over the dark fibre links. If one of the links fails and the remaining link doesn't have enough bandwidth to accommodate all the traffic, the SR-TE policy can be rerouted to the Null endpoint – i.e. to the egress peer. In other words, cold-potato routing changes to hot-potato routing.
Poor man’s Automated Steering
Automated Steering (AS) is a powerful way to map service routes (IP or VPN prefixes) to SR-TE policies: a route with a color extcommunity matching the SR-TE policy color is mapped to that policy.
What if the router doesn’t support BGP SR-TE or PCEP? Earlier I made a point that a good controller should work even with the most basic SR implementation.
Sure, we can use BGP-LU to advertise policies from the controller, but then there is seemingly no way to map different services to different policies. Actually, there is:
- Configure a separate loopback that is NOT advertised into IGP
- Advertise this loopback in BGP-LU with a low LOCAL_PREF
- Change nexthop of the service routes to this loopback
- When SR-TE controller uses BGP-LU to send SR-TE policies, it advertises this “service-loopback” rather than the actual policy endpoint
It works almost as well as Automated Steering! It just needs a bit more configuration, but that's the price to pay for using cheap routers.
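Putting the recipe together on the egress router, an IOS-XR-flavoured sketch (the addresses, names and preference value are invented):

interface Loopback1                    ! service loopback, deliberately NOT in IGP
 ipv4 address 192.0.2.99 255.255.255.255
!
route-policy SERVICE-LOOPBACK
 set local-preference 50               ! low LOCAL_PREF: the controller's BGP-LU route wins
end-policy
!
route-policy SET-SERVICE-NH
 set next-hop 192.0.2.99               ! rewrite nexthop of service routes to the loopback
end-policy
!
router bgp 65000
 address-family ipv4 unicast
  network 192.0.2.99/32 route-policy SERVICE-LOOPBACK
  allocate-label all                   ! advertise the loopback in BGP-LU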
Introducing Traffic Dictator
I developed Traffic Dictator as a minimalistic, user-friendly SR-TE controller. It is a pure routing platform, with configuration resembling a router's, so every network engineer familiar with SR and BGP can intuitively figure out how to use it.
root@TD1:/# tdcli

### Welcome to the Traffic Dictator CLI! ###

TD1#conf
TD1(config)#traffic-eng policies
TD1(config-traffic-eng-policies)#policy ?
  <POLICY_NAME>    Traffic-eng policy name
It can run in a Docker container (even on a router that can run containers). It is SR-native, so it supports ECMP, anycast SIDs, mixed IPv4/IPv6 SIDs, policies across IS-IS/OSPF/BGP domains, EPE, Null endpoint and so on. Although BGP SR-TE with Automated Steering is preferred, the controller can also work with very basic SR implementations, using BGP-LU with a service-loopback to install policies.
You can download Traffic Dictator from the Vegvisir website and follow the documentation to install it. Also check out the white paper.
Or try it in a pre-configured Containerlab setup with Cisco XRd or Arista cEOS: https://vegvisir.ie/2024/06/11/traffic-dictator-quick-start-with-containerlab/
I will be posting more technical articles about Segment Routing, Egress Peer Engineering, network design and automation. Whether you agree or disagree with my ideas, have any suggestions or want to defend the big vendor approach, please leave a comment or write me an email.
Notes
- ^Some networks use VXLAN even for pure L3 connectivity, but that's just because at a given moment they found cheaper switches that could do VXLAN
- ^At least in theory; practical implementations still just swap the label with the same label
- ^The inspiration for those designs is [RFC8604], which is just a hypothetical concept illustrating that SR allows building a network with more routers than the 20-bit MPLS label space can address; this never made any sense in actual ISP network design with a few hundred or thousand routers.
- ^Yes, I know, in this simple topology the Spine SIDs wouldn't actually be pushed on the wire; they would just be used to resolve the nexthop. In more complex topologies anycast SIDs become more relevant.