Segment Routing simplifies MPLS for the network operator – but not for the developer.
How Traffic Engineering works
Consider the topology:
I want to steer traffic from R1 to R7 using only blue links. R1 (or controller) runs Constrained Shortest Path First (CSPF) algorithm by excluding all other (non-blue) links and gets the following path: [R1-R2, R2-R4, R4-R6, R6-R8, R8-R7]. In MPLS-TE, this is called Explicit Route Object (ERO) and is expressed with interface IP addresses. Then R1 uses RSVP to signal an LSP via this path, and once the LSP has been signaled, it can be used to forward traffic.
Segment Routing
There are multiple problems with RSVP-TE, one of them being statefullness which leads to poor scalability and interoperability issues.
Segment Routing is stateless: there is no LSP signaling, the headend pushes a list of segments (MPLS labels or SRv6 headers) to steer traffic via the desired path.
Naive implementation
The most dumb and obvious way of doing that is just use the adjacency segments: for each link, one adj SID:
This of course is a terrible idea because it will lead to absurd label stack, leading to MTU issues and hardware limitations.
Improvement: prefix SID
A better approach is using a prefix SID:
Now instead of pushing 4 labels, we push just 2: first steer traffic to R8 along the shortest path and then from R8 to R7, along the shortest path.
In some cases, a mix of prefix and adj SID is required:
Now there are 2 links between R7 and R8 so after reaching R8, we use adj SID to steer traffic over the blue link (because the shortest path would be ECMP across both links).
Basic algorithm
Having figured out the basics of SR-TE operations, let’s think of an algorithm to generate these segment lists.
- Run CSPF to get the ERO
- Starting from the endpoint node, run regular SPF from the previous node to the current (endpoint) node and compare results with CSPF. Also keep the “SID_required” flag – always true for endpoint, also set to true is the most recently added SID was an adj SID
- If SPF == CSPF and SID_required is true: add prefix SID
- If SPF == CSPF and SID_required is false: don’t add SID, move to the previous node
- If SPF is different – add adj SID
- Each subsequent SPF is run from previous node to the node whose SID we added most recently – let’s call it “last_anchored_node”
Simplified flowchart:
Problems with basic algorithm
This works for some topologies, but there are problems with the basic algorithm. Consider the topology:
The link between R3 and R5 has a higher metric. I want to steer traffic from R1 to R7 using only blue links.
- From R5, SPF to R7 is the same as CSPF -> add node SID of R7 (because SID_required is true)
- From R6, SPF to R7 is NOT the same as CSPF -> add adj SID to R5
- From R4 , SPF to R6 is the same as CSPF, SID_required is true – add node SID of R6
This will work but is suboptimal, because [Node R5, Node R7] is a sufficient label stack to achieve the same result.
Improved algorithm
Let’s add the following modification:
- If SPF differs from CSPF, instead of just adding adj SID, run another SPF from previous node to current node (instead of last anchored node)
- If the second SPF differs from CSPF as well, add adj SID
- Else add prefix SID
Flowchart:
This works much better and is more or less what most vendors do – at least judging from lab results, the actual algorithms they use might be different, I haven’t seen their code! Sadly open source implementations (e.g. FRR) don’t generate segment lists and just support static configuration.
However, this is not good enough for a proper Segment Routing implementation! At least for a controller that specialises in Traffic Engineering.
Elephant in the room: ECMP
Unlike RSVP-TE, Segment Routing is an IP-native technology that supports ECMP and anycast.
Therefore, reusing CSPF algorithms from RSVP-TE to build SR-TE paths is not a good idea.
Instead of expressing the ERO as Vector and having the algorithm loop through it, it’s a better idea to use HashMap where a node can point to one or more links which points to the next node; and then use a queue to process this HashMap. Something like this:
"0005.0005.0005.00": { "[E][L2][I101][N[c65002][b0][s0005.0005.0005.00]][R[c65002][b0][s0008.0008.0008.00]][L[i10.100.9.5][n10.100.9.8][i2002:100:9::5][n2002:100:9::8]]": "0008.0008.0008.00", }, "0001.0001.0001.00": { "[E][L2][I101][N[c65002][b0][s0001.0001.0001.00]][R[c65002][b0][s0005.0005.0005.00]][L[i10.100.3.1][n10.100.3.5][i2002:100:3::1][n2002:100:3::5]]": "0005.0005.0005.00", },
Those long strings are links as they are represented by BGP-LS.
Multiple segment lists
With the new ERO structure, the algorithm now can use ECMP and generate multiple segment lists. Consider the topology:
In order to steer traffic from R11 to R1 using blue or orange links, we can use 2 segment lists: [Node R2, Node R1] and [Node R5, Node R1].
This is allowed by SR architecture but there are a couple of problems:
- Multiple segment lists will consume more FIB entries in hardware
- If a controller is used to calculate SR-TE policies, there might be a problem with advertising multiple segment lists:
- BGP SR-TE supports multiple segment lists
- PCEP has a draft for multipath but it’s not widely supported as of now
- BGP-LU can use add-path but that increases complexity
Anycast SID
In some cases, multiple segment lists can be reduced to one by the use of anycast SID.
In the same topology as above, if we configure an anycast SID on R2 and R5, the same traffic engineering objective can be achieved with just one segment list:
This solves the issues with policy advertisement from the controller and optimizes hardware resource utilization on the router.
Anycast SID – how to use
This is the tricky part which took me some time to get right when working on the algorithm for Traffic Dictator.
The idea is that when processing ERO and generating segment lists, we don’t write the actual SID right away, just some metadata of what type of SID is required, from which node or link, from which node to use SRGB (if it’s different) etc. This is the struct I use:
pub struct SlStructureElement { pub af: String, pub lan_segment: bool, pub link_route_key: Option<String>, pub next_node: String, pub next_sid: String, pub sid_type: String, pub suggested_anycast_sid: Option<PrefixSidSubObject> }
If ECMP is used, the algorithm checks ECMP nodes for anycast SID eligibility:
- Does anycast SID exist on all ECMP nodes (and only on them)?
- After ECMP that – does the path converge on one node, or is there more ECMP (check recursively)
- If there is more ECMP – do all nodes have the same SRGB?
There are a few more checks but this is the idea. If all conditions are satisfied, we can replace prefix SID with anycast SID. The last step is to deduplicate the structure, which hopefully will result in just one segment list.
Anycast SID is especially useful in multi-domain topologies:
In multi-domain designs (e.g. Seamless MPLS), anycast SID provides load balancing and resiliency and is one of the reasons to prefer SR over LDP or RSVP.
Vendor implementations of ECMP and anycast SID
From my lab testing results, vendors can generate some basic segment lists which are mostly ok in basic topologies, but break when I introduce ECMP. In slightly non-conventional topologies, the generated segment lists are wrong and don’t satisfy the specified constraints. As I haven’t seen the code of Cisco or Juniper, but presume they reused the logic from RSVP-TE and that doesn’t work well for SR-TE.
As for anycast, I tried to enable anycast-sid-inclusion on IOS-XR but it didn’t work in the topology where anycast can be obviously used. My assumption is that it was developed for a specific use case in a specific topology and does not work in all topologies.
Conclusion
It seems that Traffic Dictator is currently the only SR-TE implementation that is fully ECMP and anycast aware. In a basic topology, probably any router/controller would generate the same segment list, but it’s better to have a controller that guarantees the path that will satisfy the given constraints in every topology.
What can happen with other implementations, especially in a complex topology with a lot of ECMP, is that you configure an SR-TE policy with some constraints, but the router/controller generates a segment list that doesn’t entirely satisfy that constraint, and part of the traffic goes to different links, which can lead to suboptimal routing and congestions.
Whichever SR-TE implementation you are using, it’s a good idea to test different policy types in your network topology and understand the limitations, so there will be no surprises when traffic suddenly takes a path different than the configured constraints.