Close to the Edge

PE-CE routing in MPLS L3VPN is an important topic that confuses a lot of people. Thanks to EVPN, it is now used not only in ISP networks but also in DC networks.

Fundamentals of PE-CE routing

Usually either static routing or eBGP is used. If the customer has their own ASN, it can be used, otherwise the ISP assigns a private ASN to the customer and configures a BGP session in VRF on PE routers towards CE using that AS number. Something like this:

Fig. 1

Of course, the ISP will not assign a separate private ASN [1] to each customer site; rather, it will use one AS per customer. Since AS numbers in BGP are used for loop prevention (routers don’t accept updates with their own AS in AS_PATH), the topology in figure 1 will not work with default BGP configs. You have to configure either “allowas-in” on CE to accept updates with its own AS, or “as-override” on PE to replace the customer’s ASN with the ISP’s ASN. Another option (if private ASNs are used) is “remove-private-as” in BGP updates sent to CE (although this feature is more commonly used to prevent private ASNs from being advertised to the Internet). All three disable the main BGP loop prevention mechanism, which will likely result in routing loops unless we do something about it.
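To make the loop-prevention check and two of the workarounds concrete, here is a minimal Python sketch. The function names and the private ASN 65001 are my own illustration, not taken from any BGP implementation; real routers apply these rules inside their update processing.

```python
def accepts(local_as, as_path, allowas_in=0):
    """Receiver-side check: reject the update if our own AS appears in
    AS_PATH more often than allowas_in permits (0 is the default)."""
    return as_path.count(local_as) <= allowas_in


def as_override(as_path, peer_as, provider_as):
    """Sender-side rewrite on the PE: replace the CE's AS with the provider's AS."""
    return [provider_as if asn == peer_as else asn for asn in as_path]


PROVIDER_AS, CUSTOMER_AS = 100, 65001  # hypothetical values

# A route from site 1 as the PE would send it to the CE at site 2.
path = [PROVIDER_AS, CUSTOMER_AS]

print(accepts(CUSTOMER_AS, path))                # False: default behaviour
print(accepts(CUSTOMER_AS, path, allowas_in=1))  # True: allowas-in on CE
overridden = as_override(path, CUSTOMER_AS, PROVIDER_AS)
print(overridden)                                # [100, 100]
print(accepts(CUSTOMER_AS, overridden))          # True: as-override on PE
```

Either way the CE ends up accepting a route that carries (or carried) its own AS, which is exactly why something else has to prevent loops.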

Fig. 2

If CE1 is multihomed to PE1 and PE2 and one of the three aforementioned techniques of disabling loop prevention is used, PE2 can advertise routes received from CE back to it. In such a primitive topology this is not a big deal, but if there are multiple CE routers on site, this can lead to suboptimal routing whereby intra-site traffic is routed via L3VPN, or in some cases routing loops might form.

The solution is to use the Site of Origin BGP extended community. When PE receives a route from CE, it attaches this community with a site ID, and no other PE router will advertise this route to a CE with the same site ID.
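The Site of Origin logic boils down to a tag on ingress and a comparison on egress. A rough Python sketch follows; the dictionary layout and function names are assumptions made for illustration, and the SoO value 100:1 is simply the one that shows up in the outputs later in this article.

```python
def receive_from_ce(route, site_soo):
    """Ingress PE: attach the site's SoO to a route learnt from the CE."""
    tagged = dict(route)
    tagged["soo"] = site_soo
    return tagged


def may_advertise_to_ce(route, ce_site_soo):
    """Egress PE: never send a route back to the site it came from."""
    return route.get("soo") != ce_site_soo


route = receive_from_ce({"prefix": "192.168.100.25/32"}, "100:1")

print(may_advertise_to_ce(route, "100:1"))  # False: same site, suppressed
print(may_advertise_to_ce(route, "100:2"))  # True: different site
```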

This was PE-CE routing 101. For most deployments, you don’t need to know more. If you want to secure your job by designing a solution no one else will be able to support (or someone already did that before you), keep reading.

Other routing protocols

iBGP

Yes, internal BGP as a PE-CE routing protocol. Why would anyone do that?

Fig. 3

If the customer has its own AS and wants to advertise routes to the Internet, those routes propagated via L3VPN will look like they originated in the provider’s AS, which doesn’t look good and upstream ISP filters might not like it (especially now with RPKI). Even if we do allowas-in on CE instead of as-override on PE, the provider’s AS will still be in AS_PATH. To prevent this from happening, we can use iBGP for PE-CE routing.

There are multiple problems here. By default, routes learnt from iBGP peers are not advertised to other iBGP peers. Therefore, we must configure CE routers as route-reflector clients of PE routers, and also enable next-hop-self. Since customer and provider are in fact in different AS, on PE we must configure “local-as” towards CE, using customer’s ASN.

There are plenty of attributes in BGP that are propagated within AS – like LOCAL_PREF, MED, CLUSTER_LIST, ORIGINATOR_ID etc. What we don’t want to happen is to let customer’s attributes “leak” into provider’s AS and the other way around. If that happened, customer’s and provider’s routing policies could conflict, leading to suboptimal routing, or duplicate router IDs in CLUSTER_LIST could block route propagation.

Therefore, [RFC6368] introduces a new BGP attribute ATTR_SET. PE packs BGP attributes from customer’s routes into ATTR_SET, propagates the route through provider’s AS using provider’s attributes while ATTR_SET is left intact, then the other PE unpacks customer’s attributes from ATTR_SET, and advertises routes with those attributes to CE. ATTR_SET also preserves the original customer’s ASN, so that if CE has to advertise routes propagated through L3VPN to the Internet, provider’s ASN won’t be in the AS_PATH.
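The pack/unpack idea can be sketched in a few lines of Python. This is a toy model under my own assumptions (routes as dictionaries, hypothetical function names), it covers only the case where the receiving CE is in the same AS as the originator, and the attribute values are borrowed from the output shown further below.

```python
# Attributes that belong to the customer and must not leak into the provider AS.
CUSTOMER_ATTRS = ("origin", "as_path", "med", "local_pref",
                  "cluster_list", "originator_id")


def ingress_pe(ce_route, customer_as):
    """Pack the customer's attributes into ATTR_SET on the ingress PE."""
    attr_set = {"origin_as": customer_as}
    for name in CUSTOMER_ATTRS:
        if name in ce_route:
            attr_set[name] = ce_route[name]
    # The VPN route itself gets fresh provider-side attributes.
    return {"prefix": ce_route["prefix"], "local_pref": 100,
            "cluster_list": [], "attr_set": attr_set}


def route_reflector(vpn_route, cluster_id):
    """A provider RR touches only the outer attributes, never ATTR_SET."""
    reflected = dict(vpn_route)
    reflected["cluster_list"] = [cluster_id] + vpn_route["cluster_list"]
    return reflected


def egress_pe(vpn_route, ce_as):
    """Unpack ATTR_SET towards a CE in the same AS as the originator."""
    attr_set = vpn_route["attr_set"]
    if attr_set["origin_as"] != ce_as:
        raise NotImplementedError("different-AS case is not modelled here")
    restored = {k: v for k, v in attr_set.items() if k != "origin_as"}
    restored["prefix"] = vpn_route["prefix"]
    return restored


# Customer-side view of the route on the ingress PE (simplified).
ce_route = {"prefix": "6.6.6.6/32", "origin": "IGP", "as_path": [],
            "med": 0, "local_pref": 100,
            "cluster_list": ["3.3.3.3"], "originator_id": "6.6.6.6"}

vpn = route_reflector(ingress_pe(ce_route, customer_as=200), "4.4.4.4")
print(vpn["cluster_list"])                    # ['4.4.4.4']
print(egress_pe(vpn, ce_as=200) == ce_route)  # True
```

The provider's route reflector shows up only in the outer CLUSTER_LIST, while the customer gets back exactly what it sent.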

Fig. 4

After getting propagated via ISP’s network, the route will look like this:

PE1# sh bgp vpnv4 unicast all 6.6.6.6/32
BGP routing table entry for 1.1.1.1:1:6.6.6.6/32, version 14
Paths: (1 available, best #1, table one)
  Advertised to update-groups:
     4         
  Refresh Epoch 2
  Local, imported path from 3.3.3.3:1:6.6.6.6/32 (global)
    3.3.3.3 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Originator AS(ibgp-pece): 200
      Originator: 6.6.6.6, Cluster list: 3.3.3.3
      mpls labels in/out nolabel/23
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 3.3.3.3:1:6.6.6.6/32, version 13
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 3
  Local
    3.3.3.3 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, localpref 100, valid, internal, best
      Extended Community: RT:1:1
      Originator: 6.6.6.6, Cluster list: 4.4.4.4, 3.3.3.3
      ATTR_SET Attribute:
        Originator AS 200
        Origin IGP
        Aspath
        Med 0
        LocalPref 100
        Cluster list
        3.3.3.3,
        Originator 6.6.6.6
      mpls labels in/out nolabel/23
      rx pathid: 0, tx pathid: 0x0

Note that the customer’s CLUSTER_LIST is different from the provider’s CLUSTER_LIST. In this example 4.4.4.4 is a route reflector in the ISP network, and the customer does not need to know about it.

The ISP can use route reflectors in their network without impacting customer’s CLUSTER_LIST. Also, provider’s routing policies can set custom LOCAL_PREF, MED and other attributes and that won’t impact customer’s routing policies. For dual-homed CE, the site of origin community is no longer needed for loop prevention (route reflection takes care of that), but it might still be needed to prevent suboptimal routing.

For more details on how iBGP PE-CE with ATTR_SET works, check out this article by Luc De Ghein:

https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/117567-technote-ibgp-00.html

OSPF

OSPF PE-CE is a perfect example of overengineering and perhaps for this reason it is such a popular topic in vendor certifications.

OSPF routers know all topology information within the same area, but between areas it’s more like distance-vector routing. All areas must connect to area 0. What they came up with for PE-CE is the Superbackbone, sort of an area that stands above area 0. Routes are redistributed from OSPF into BGP on one PE, then from BGP into OSPF on the other PE, where they are marked as inter-area (type 3 LSA) routes, as if they were advertised from another area. External routes are propagated as usual.

Fig. 5

In figure 5, single-area OSPF is configured between PE and CE routers. Sample PE config:

router ospf 1 vrf one
 router-id 1.1.1.1
 redistribute bgp 100 subnets
 network 172.16.0.1 0.0.0.0 area 0
!
router bgp 100
 !
 address-family ipv4 vrf one
  redistribute ospf 1 match internal external 1 external 2

Take for example 6.6.6.6/32: CE2 advertises it in a type 1 LSA, PE3 redistributes it from OSPF into BGP, and this is what arrives on PE1:

PE1#sh bgp vpnv4 unicast rd 1.1.1.1:1 6.6.6.6/32
BGP routing table entry for 1.1.1.1:1:6.6.6.6/32, version 8784
Paths: (1 available, best #1, table one)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from 3.3.3.3:1:6.6.6.6/32 (global)
    3.3.3.3 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000010200
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:3.3.3.3:0
      Originator: 3.3.3.3, Cluster list: 4.4.4.4
      mpls labels in/out nolabel/21
      rx pathid: 0, tx pathid: 0x0

There are plenty of extcommunities here. OSPF domain ID is equal to process ID (unless explicitly configured to another value). If the domain ID doesn’t match on the receiving PE, all OSPF routes will be treated as external. OSPF RT carries three fields. The first, 0.0.0.0, is the area (area 0 here; meaningful only for internal routes). The second is the route type: 2 means the route comes from a type 1 or 2 LSA; other possible values are 3, 5 and 7, which refer to the respective LSA types, and 129 for sham links (covered below). The last field is meaningful only for type 5/7 routes: 0 means metric type 1, 1 means metric type 2. Router ID refers to the OSPF router ID of PE3 (in this example it is the same as the BGP router ID). Also, the OSPF metric is mapped to BGP MED.

But that’s not all. Now PE1 and PE2 redistribute 6.6.6.6/32 from BGP into OSPF; since the route type is internal and the domain ID matches, it will be advertised as an internal route.

PE1#sh ip ospf 1 database summary 6.6.6.6

            OSPF Router with ID (1.1.1.1) (Process ID 1)

        Summary Net Link States (Area 0)

  LS age: 1721
  Options: (No TOS-capability, DC, Downward)
  LS Type: Summary Links(Network)
  Link State ID: 6.6.6.6 (summary Network Number)
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000001
  Checksum: 0x3D58
  Length: 28
  Network Mask: /32
    MTID: 0     Metric: 11

  LS age: 996
  Options: (No TOS-capability, DC, Downward)
  LS Type: Summary Links(Network)
  Link State ID: 6.6.6.6 (summary Network Number)
  Advertising Router: 2.2.2.2
  LS Seq Number: 80000002
  Checksum: 0x1D73
  Length: 28
  Network Mask: /32
    MTID: 0     Metric: 11

Note the Down bit. It is set by PE when redistributing routes from BGP to OSPF in VRF. No other router running VRF will install the OSPF route with down bit. This is done to prevent routing loops and suboptimal routing.

If a CE runs VRFs, it will not install routes with the Down bit by default. Some vendors have special knobs to change this behaviour (e.g. capability vrf-lite on IOS), but beware of routing loops with it!
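The two decisions described so far (which LSA type the remote PE originates, and whether a VRF-aware router uses an LSA) fit into a short Python sketch. The function names are mine, the sham-link route type 129 is left out, and the NSSA case (where the external LSA would be type 7) is ignored, so treat it as a simplification rather than the full rule set.

```python
INTERNAL_ROUTE_TYPES = {1, 2, 3}   # learnt from type 1, 2 or 3 LSAs
EXTERNAL_ROUTE_TYPES = {5, 7}      # learnt from type 5 or 7 LSAs


def lsa_type_on_remote_pe(route_type, domain_id_matches):
    """LSA type a PE originates when redistributing a VPN route into OSPF."""
    if route_type not in INTERNAL_ROUTE_TYPES | EXTERNAL_ROUTE_TYPES:
        raise ValueError(f"route type {route_type} is not modelled here")
    if domain_id_matches and route_type in INTERNAL_ROUTE_TYPES:
        return 3   # summary LSA, looks inter-area to the CE
    return 5       # AS external LSA


def vrf_router_uses_lsa(down_bit_set):
    """A router running OSPF in a VRF ignores LSAs with the Down bit set."""
    return not down_bit_set


print(lsa_type_on_remote_pe(2, domain_id_matches=True))   # 3
print(lsa_type_on_remote_pe(2, domain_id_matches=False))  # 5
print(vrf_router_uses_lsa(down_bit_set=True))             # False
```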

For external routes, there is also VPN route tag:

PE1#sh ip ospf database ext 192.168.200.0

            OSPF Router with ID (1.1.1.1) (Process ID 1)

        Type-5 AS External Link States

  LS age: 240
  Options: (No TOS-capability, DC, Downward)
  LS Type: AS External Link
  Link State ID: 192.168.200.0 (External Network Number )
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000002
  Checksum: 0x397A
  Length: 36
  Network Mask: /32
    Metric Type: 2 (Larger than any link state path)
    MTID: 0
    Metric: 20
    Forward Address: 0.0.0.0
    External Route Tag: 3489661028

This is not something specific to L3VPN; it was defined in RFC1745 for the generic case of BGP<->OSPF redistribution. In this example the tag is computed per the RFC, but it can also be configured manually. The point of the VPN tag is that if one PE redistributes a route from BGP into OSPF, the other PE (with the same tag) will not consider this route for best path selection. In PE-CE routing there is the Down bit, so the VPN tag is not really needed.
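The tag value in the output above, 3489661028, is not random. As far as I can tell it follows the automatic tag layout: the top four bits are 1101, the next twelve bits are zero, and the low 16 bits carry the AS number of the VPN backbone (100 in this lab). A small Python check, where the function name is my own:

```python
def vpn_route_tag(backbone_as):
    """Automatic VPN route tag: 0xD000 in the high 16 bits, AS in the low 16."""
    if not 0 < backbone_as <= 0xFFFF:
        raise ValueError("this layout only has room for a 2-byte AS number")
    return (0xD000 << 16) | backbone_as


print(vpn_route_tag(100))       # 3489661028
print(hex(vpn_route_tag(100)))  # 0xd0000064
```

Since the tag depends only on the backbone AS, all PEs in the same provider network compute the same value without any extra configuration.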

Also, external routes propagated through L3VPN are re-originated by PE that redistributes them from BGP – therefore, there is no need for PE to generate type 4 LSA (which are normally generated by ABR, so that routers can resolve type 5/7 LSA coming from another area).

Sham Links

It gets even worse. What if there is another link between CE1 and CE2?

Fig. 6

Assuming CE routers are in the same area, the route via direct link will be intra-area – hence always preferred over inter-area routes propagated through L3VPN. This is unfortunate since in many designs L3VPN is the main link and the backdoor link is a lower-bandwidth connection (e.g. VPN over Internet) to be used only in case of primary link failure.

The solution is to configure what is known as sham link between PE routers – sort of a virtual OSPF link over which it is possible to flood LSA, so that the path over L3VPN will be also considered intra-area.

Sample config:

interface Loopback100
 vrf forwarding one
 ip address 172.16.255.1 255.255.255.255
!
router ospf 1 vrf one
 area 0 sham-link 172.16.255.1 172.16.255.3
!
router bgp 100
 address-family ipv4 vrf one
  network 172.16.255.1 mask 255.255.255.255

Lo100 must be advertised in BGP, but not in OSPF. Now there are virtual OSPF sessions between PE, and all LSA can be flooded through those sessions:

PE1#sh ip ospf 1 nei

Neighbor ID     Pri   State           Dead Time   Address         Interface
3.3.3.3           0   FULL/  -           -        172.16.255.3    OSPF_SL1
5.5.5.5           0   FULL/  -        00:00:33    172.16.0.5      Ethernet1/0

PE1#sh ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

        Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         554         0x80000005 0x002C1B 3
2.2.2.2         2.2.2.2         654         0x80000004 0x002816 3
3.3.3.3         3.3.3.3         555         0x80000004 0x00020D 4
5.5.5.5         5.5.5.5         1188        0x80000009 0x009761 7
6.6.6.6         6.6.6.6         1145        0x80000007 0x002778 5

Now the remote CE is seen in type 1 LSA! Even PE1’s route towards CE2 is via OSPF:

PE1#sh ip ro vrf one 6.6.6.6

Routing Table: one
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 12, type intra area
  Redistributing via bgp 100
  Advertised by bgp 100 match internal external 1 & 2
  Last update from 3.3.3.3 00:42:41 ago
  Routing Descriptor Blocks:
  * 3.3.3.3 (default), from 6.6.6.6, 00:42:41 ago
      Route metric is 12, traffic share count is 1
      MPLS label: 21
      MPLS Flags: MPLS Required

Since those routes are intra-area, they can be preferred over the routes via backdoor link (depending on costs).

IS-IS

There are no IS-IS extensions for PE-CE routing. Nevertheless, you can use it for this purpose (but please don’t).

The config is similar to OSPF: IS-IS process in VRF on PE routers, and mutual redistribution between IS-IS and BGP. Sample config on IOS:

router isis one
 vrf one
 net 49.0001.0001.0001.00
 is-type level-2-only
 metric-style wide
 redistribute bgp 100
!
router bgp 100
 address-family ipv4 vrf one
  redistribute isis one level-2

Now it sort of works. Redistributed routes will be advertised in the LSP of the PE routers, similar to how IS-IS routes are advertised between levels; but things like the attached bit or down bit will not be preserved, as there are no special BGP communities to carry them (unlike in OSPF). Furthermore, with a site dual-homed to two PEs you can run into suboptimal routing and redistribution loops.

I wrote a bit about loops with 2 redistribution points in Seamless Suffering; exactly the same applies here. What OSPF takes care of with the down bit, in IS-IS we have to prevent with route tags. Consider a topology like the one in figure 5, but with IS-IS instead of OSPF. Add the following config to PE1 and PE2:

route-map BGP_TO_ISIS permit 10
 set tag 666
!
route-map ISIS_DIST deny 10
 match tag 666
route-map ISIS_DIST permit 20
!
router isis one
 redistribute bgp 100 route-map BGP_TO_ISIS
 distribute-list route-map ISIS_DIST in

This achieves the same result as the Down bit in OSPF: routes redistributed from BGP into IS-IS by one PE will not be installed as IS-IS routes by the other PE. Without this config, a site dual-homed to two PEs with IS-IS as the PE-CE protocol will suffer from permanent route flapping and loops.

IS-IS metric is not preserved when routes are propagated by BGP. Most implementations copy the IGP metric into BGP MED when doing IGP->BGP redistribution (either by default or as an option), but during BGP->IS-IS redistribution BGP MED is not mapped to IS-IS cost.

Therefore, while it is possible to implement the critical part of PE-CE functionality (loop prevention) by RIB filtering, IS-IS has no analogs to other OSPF extensions for PE-CE routing. At least sham links are not needed since IS-IS doesn’t separate internal and external routes.

EIGRP

EIGRP remained Cisco proprietary for too long, so no other vendor implemented it, which means almost nobody uses it in ISP networks. But the way PE-CE routing was designed in EIGRP is quite neat. All EIGRP metrics are packed into a special BGP cost community, which has a thing called Point of Insertion (POI). POI 128 means the cost community can override the BGP best path selection process, which in this case is used to compare the metrics of BGP and EIGRP routes. This is quite a significant deviation from how best path selection normally works (usually comparing routes from different protocols doesn’t make sense, and AD is used to prefer one protocol over another).

Consider a topology like the one in figure 6, but with EIGRP enabled instead of OSPF. Sample PE config:

router eigrp 100
 !
 address-family ipv4 vrf one autonomous-system 100
  redistribute bgp 100 metric 10000 100 255 1 1500
  network 172.16.0.1 0.0.0.0
 exit-address-family
!
router bgp 100
 !
 address-family ipv4 vrf one
  redistribute eigrp 100
 exit-address-family

PE1 received an EIGRP route propagated through L3VPN from PE3:

PE1#sh bgp vpnv4 unicast rd 3.3.3.3:1 6.6.6.6
BGP routing table entry for 3.3.3.3:1:6.6.6.6/32, version 4694
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    3.3.3.3 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin incomplete, metric 409600, localpref 100, valid, internal, best
      Extended Community: RT:1:1 Cost:pre-bestpath:128:409600 0x8800:32768:0
        0x8801:100:153600 0x8802:65281:256000 0x8803:65281:1500
        0x8806:0:101058054
      Originator: 3.3.3.3, Cluster list: 4.4.4.4
      mpls labels in/out nolabel/21
      rx pathid: 0, tx pathid: 0x0

PE1#sh ip eigrp vrf one topology 6.6.6.6/32  
EIGRP-IPv4 Topology Entry for AS(100)/ID(172.16.0.1) VRF(one)
EIGRP-IPv4(100): Topology base(0) entry for 6.6.6.6/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 409600
  Descriptor Blocks:
  3.3.3.3, from VPNv4 Sourced, Send flag is 0x0
      Composite metric is (409600/0), route is Internal (VPNv4 Sourced)
      Vector metric:
        Minimum bandwidth is 10000 Kbit
        Total delay is 6000 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
        Originating router is 6.6.6.6
  172.16.0.5 (Ethernet1/0), from 172.16.0.5, Send flag is 0x0
      Composite metric is (435200/409600), route is Internal
      Vector metric:
        Minimum bandwidth is 10000 Kbit
        Total delay is 7000 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 6.6.6.6

EIGRP metrics from the cost community are translated, and the route that was sent through L3VPN can be compared to the EIGRP route propagated through the backdoor link between CE1 and CE2. It appears that the route via L3VPN has better metrics. AD does not play a role here (otherwise EIGRP would win over iBGP).
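The numbers in the topology table can be reproduced by hand. Assuming the classic composite metric with default K values (only bandwidth and delay count), the following Python sketch recomputes both paths from the vector metrics shown above; the function name is mine.

```python
def eigrp_classic_metric(min_bw_kbps, total_delay_usec):
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1)."""
    scaled_bw = 256 * (10**7 // min_bw_kbps)       # inverse bandwidth
    scaled_delay = 256 * (total_delay_usec // 10)  # delay in tens of usec
    return scaled_bw + scaled_delay


via_l3vpn = eigrp_classic_metric(10000, 6000)
via_backdoor = eigrp_classic_metric(10000, 7000)

print(via_l3vpn)                 # 409600
print(via_backdoor)              # 435200
print(via_l3vpn < via_backdoor)  # True: the L3VPN path wins
```

The two scaled components for the L3VPN path, 256000 and 153600, are the same values that appear in the 0x8802 and 0x8801 communities of the BGP output, and their sum is the 409600 carried in the cost community.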

When I worked with EIGRP and had to decode cost communities, I always used a nice picture, the origins of which I never knew. After short googling, I found the source article by Costi Serban:

https://costiser.ro/2014/03/27/pre-bestpath-cost-community/#.XoouD3JJlhF

Read this for more details of cost communities in EIGRP, the author did a really good write-up on the subject.

Also, cost communities in BGP are not strictly specific to EIGRP, but I haven’t seen them being used anywhere else. There is also an effort to standardize them: draft-ietf-idr-custom-decision

RIP

RIP in 2020? No way. But yes, it is supported as PE-CE protocol. Sample PE configuration:

ip prefix-list RIP_IN seq 5 deny 6.0.0.0/8
ip prefix-list RIP_IN seq 10 permit 0.0.0.0/0 le 32
!
router rip
 version 2
 !
 address-family ipv4 vrf one
  redistribute bgp 100 metric transparent
  network 172.16.0.0
  distribute-list prefix RIP_IN in Ethernet1/1
  no auto-summary
 exit-address-family
!
router bgp 100
 address-family ipv4 vrf one
  redistribute rip

Fig. 7

As with other IGPs, RIP metric is copied into BGP MED, but to copy BGP MED back into RIP metric on the remote PE, I use the “metric transparent” statement. So it sort of works. Also for multihomed CE, I have to configure a prefix list on PE1 and PE2 with remote site prefixes and block them from being learnt via CE. Not doing so might result in suboptimal routing or loops.

Mixing PE-CE protocols

Totally possible, why not? Different PE-CE links can run a mix of all the aforementioned protocols plus static routing. In OSPF and EIGRP, redistributed routes will appear as external, originated by the redistributing PE. Of course, solving all the problems related to best path selection, route flaps and loops is up to the operator in that case. All the PE-CE extensions reviewed above exist to simplify the operator’s job by carrying as much of the customer’s routing information as possible across the MPLS VPN.

Carrier’s carrier

Carrier’s Carrier (CsC) is a design where an ISP provides VPN services to another ISP. If the customer ISP (C-ISP) wants to advertise a lot of external routes (e.g. BGP full view), the CSC-ISP doesn’t have to learn all those routes. Instead it learns only the C-ISP’s loopbacks plus MPLS labels; C-ISP routers exchange their routes without CSC-ISP participation.

Fig. 8

From a technology standpoint, the main functionality here is MPLS support in VRF on CSC-PE. The label protocol between PE and CSC-PE doesn’t have to be BGP-LU, it can also be LDP + whatever IGP C-ISP runs. The same design can be used if C-ISP provides VPN services – again, without the need for CSC-ISP to learn any customer’s VPN routes.

For example:

Fig. 9

CSC-PE learns the outgoing label in VRF from the BGP-LU advertisement it received from the ISP PE router (e.g. label 281 for CSC-PE1 on figure 9). When it generates a CSC-VPNv4 route for it, it must allocate a locally unique label for it, in order to install a “swap” LFIB entry.

Sample outputs:

CSC-PE1#sh bgp vpnv4 unicast rd 1.1.1.1:1 7.7.7.7/32
BGP routing table entry for 1.1.1.1:1:7.7.7.7/32, version 5502
Paths: (1 available, best #1, table one)
Multipath: eBGP
  Additional-path-install
  Advertised to update-groups:
     1
  Refresh Epoch 1
  200
    172.16.0.5 (via vrf one) from 172.16.0.5 (5.5.5.5)
      Origin incomplete, metric 10, localpref 200, valid, external, best
      Extended Community: SoO:100:1 RT:1:1 , recursive-via-connected
      mpls labels in/out 288/281
      rx pathid: 0, tx pathid: 0x0

The LFIB entry points to VRF, but has an outgoing label:

CSC-PE1#sh mpls for label 288
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
288        281        7.7.7.7/32[V]    8628          Et1/0      172.16.0.5

Traceroute CE2 -> CE1:

CE2#traceroute 9.9.9.9
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.1.8 0 msec 1 msec 0 msec
  2 172.16.4.6 [MPLS: Labels 25/24 Exp 0] 0 msec 0 msec 1 msec
  3 172.16.2.3 [MPLS: Labels 294/24 Exp 0] 0 msec 1 msec 0 msec
  4 10.2.2.4 [MPLS: Labels 18/288/24 Exp 0] 1 msec 1 msec 0 msec
  5 172.16.0.1 [MPLS: Labels 288/24 Exp 0] 1 msec 0 msec 1 msec
  6 172.16.0.5 [MPLS: Labels 281/24 Exp 0] 0 msec 0 msec 5 msec
  7 192.168.0.7 [AS 200] [MPLS: Label 24 Exp 0] 7 msec 0 msec 6 msec
  8 192.168.0.9 [AS 200] 1 msec 0 msec 5 msec

Label allocation modes

In L3VPN, when a PE router advertises prefixes in VRF to other PE, it allocates MPLS labels – commonly known as VPN or service labels. When a data packet arrives with the service label exposed, PE knows to which VRF that packet belongs.

There are 3 modes of allocating VPN labels:

  1. Per-VRF: all prefixes from the same VRF get the same label
  2. Per-prefix: every prefix gets a separate label
  3. Per-CE: VRF prefixes with the same nexthop get the same label
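The three modes differ only in what the label is keyed on. Here is a minimal Python sketch of an allocator; the route tuples, prefixes and next hops are made up for illustration, and real implementations obviously manage label ranges in a more elaborate way.

```python
from itertools import count

# What each mode uses as the key for "needs its own label".
KEY_BY_MODE = {
    "per-vrf": lambda vrf, prefix, nexthop: vrf,
    "per-prefix": lambda vrf, prefix, nexthop: (vrf, prefix),
    "per-ce": lambda vrf, prefix, nexthop: (vrf, nexthop),
}


def allocate(routes, mode, first_label=16):
    """Return {(vrf, prefix): label} for a list of (vrf, prefix, nexthop)."""
    key_of = KEY_BY_MODE[mode]
    next_label = count(first_label)  # labels 0-15 are reserved
    label_by_key = {}
    result = {}
    for vrf, prefix, nexthop in routes:
        key = key_of(vrf, prefix, nexthop)
        if key not in label_by_key:
            label_by_key[key] = next(next_label)
        result[(vrf, prefix)] = label_by_key[key]
    return result


# Hypothetical VRF with three prefixes behind two CE next hops.
routes = [
    ("one", "192.168.1.0/24", "172.16.0.5"),
    ("one", "192.168.2.0/24", "172.16.0.5"),
    ("one", "192.168.3.0/24", "172.16.0.9"),
]

for mode in ("per-vrf", "per-prefix", "per-ce"):
    labels = allocate(routes, mode)
    print(mode, sorted(set(labels.values())))
# per-vrf [16]
# per-prefix [16, 17, 18]
# per-ce [16, 17]
```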

There is no industry consensus on which mode is better. By default, Arista does per-VRF, Cisco does per-prefix, Juniper does per-CE. While there is no strict requirement for all PEs to run the same mode, it is better to make it the same in a multivendor network.

Anyway, what are the differences between those modes?

Per-prefix and per-CE modes require a single lookup on arriving packets: the LFIB entry points to the nexthop with egress interface and L2 rewrite info. In per-VRF mode, the LFIB entry points to the VRF, and then the router has to do another lookup on the IP packet to determine the nexthop. There are folk legends that per-prefix results in faster forwarding; probably those are spread by the same people who claim that MPLS switching is faster than IP. No scientific confirmation of those claims has been found [2].

Things that matter are route scale, ECMP load sharing, ICMP tunneling, BGP PIC and of course your favourite vendor’s limitations and bugs.

Route scale

The obvious drawback of per-prefix mode is that if there is a large number of prefixes (e.g. BGP full view in VRF), it leads to wasting labels and possibly hitting a platform limitation, or even exhausting the entire label space (there can be a maximum of 1,048,576 MPLS labels). The same by the way applies to BGP-LU, even without VRFs. As the full view grows, people will be hitting those limits on older deployments that have been working fine for years – see for example this thread: https://puck.nether.net/pipermail/cisco-nsp/2020-January/107252.html. Therefore, with a large number of prefixes you should use per-VRF or per-CE modes.

ECMP/LAG load sharing

When doing ECMP or LAG load sharing, MPLS routers attempt to use L3/L4 headers for the hash, to achieve better traffic distribution [3], but they can look only a few labels deep. If there are more labels than the router can look through, the only information that contributes to hashing is taken from MPLS label values, and if those labels are the same, all traffic will be forwarded over one link. Per-prefix label allocation improves load sharing since different labels are allocated for different prefixes. Anyway, this is a legacy thing; modern deployments use Entropy labels [RFC6790].
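A toy model shows the effect. The hash below (sum of labels modulo the number of links) is deliberately simplistic and is not what any real router uses; the label values are just sample numbers. The point is only that identical label stacks always land on the same link.

```python
def pick_link(label_stack, n_links):
    """Toy hash over the label stack only; real hash functions are vendor-specific."""
    return sum(label_stack) % n_links


TRANSPORT_LABEL = 18
N_LINKS = 2

# Per-VRF: four flows to four prefixes, all carrying the same VPN label.
per_vrf = [pick_link([TRANSPORT_LABEL, 134], N_LINKS) for _ in range(4)]

# Per-prefix: the same four flows, each prefix with its own VPN label.
per_prefix = [pick_link([TRANSPORT_LABEL, vpn_label], N_LINKS)
              for vpn_label in (280, 281, 282, 283)]

print(per_vrf)     # [0, 0, 0, 0]
print(per_prefix)  # [0, 1, 0, 1]
```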

Directly connected prefixes

In per-VRF mode, there is no difference between handling directly connected or remote prefixes: first the label is stripped and then an IP lookup is done.

In per-prefix or per-CE modes, what to do with directly connected prefixes? Cisco’s answer to that is to allocate an aggregate label for those prefixes, like in per-VRF mode. Juniper does the same for host routes (e.g. loopback) but not for multipoint interfaces – those just don’t get advertised into MPLS VPN.

ICMP tunneling

Contrary to the superstitious beliefs of some security admins, ICMP is critically important for normal functioning of TCP/IP networks. Apart from everyday troubleshooting tools like ping and traceroute, things like Path MTU discovery can lead to application-level problems when broken. In IPv6, link-local ICMP is used for neighbour discovery.

Some ICMP messages (e.g. TTL exceeded, host unreachable, packet too big) are originated by the transit host when it drops packets. In MPLS networks, this means sometimes P routers have to originate those ICMP packets and send them to hosts connected to CE routers. Since P routers don’t know VRF routes, they do ICMP tunneling: the ICMP packet is encapsulated with the same label stack as the dropped packet, forwarded along the same LSP, and then the egress PE router does a forwarding lookup and sends the packet to the proper destination.

Fig. 10

When I run traceroute from CE1 to CE2, at TTL 2 P drops the probe, but in order to deliver the ICMP TTL exceeded message back, it tunnels it via the same LSP the probe went through, and PE2 routes it back to CE1. At least this is what happens in per-VRF label allocation mode.

CE1#traceroute 6.6.6.6 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
  1 172.16.0.1 0 msec 0 msec 0 msec
  2 10.0.0.4 [MPLS: Labels 18/134 Exp 0] 1 msec 0 msec 0 msec
  3 172.16.2.3 [AS 100] 1 msec 0 msec 0 msec
  4 172.16.2.6 [AS 100] 0 msec 0 msec 1 msec

If PE2 runs per-prefix or per-CE label allocation mode, the label it allocates for 6.6.6.6/32 will have an LFIB entry “forward to CE2” – so even the tunneled ICMP messages will be sent to CE2:

Fig. 11

CE1#traceroute 6.6.6.6 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
  1 172.16.0.1 0 msec 0 msec 0 msec
  2 10.0.0.4 [MPLS: Labels 18/280 Exp 0] 1 msec 1 msec 0 msec
  3 172.16.2.3 [AS 100] [MPLS: Label 280 Exp 0] 0 msec 0 msec 1 msec
  4 172.16.2.6 [AS 100] 0 msec 0 msec 0 msec

Fig. 12

CE2 receives the ICMP TTL exceeded message generated by P, which carries the original CE1->CE2 UDP probe as well as the label stack P received that probe with.

ACL, firewalls, uRPF and other things can break this fragile functionality. This is one of the reasons your traceroutes through MPLS VPN show stars so often. While traceroute is not a big deal, PMTUD gets broken in the same way, leading to MTU blackholes. Therefore, per-VRF label allocation mode seems to be the lesser evil for ICMP.

BGP PIC

In BGP, it is common to have a lot of routes. If the nexthop fails, reprogramming all routes with the new nexthop can take a lot of time – in worst cases, traffic can be blackholed for several minutes after failure detection.

To optimize convergence time, modern routers have a hierarchical FIB: all BGP routes with the same nexthop point to a logical entry, which points to the nexthop. If a backup nexthop exists, it is possible to pre-install it in TCAM, so that upon primary nexthop failure there is no need to reprogram all routes; it is enough to change one entry. BGP Prefix Independent Convergence (PIC) takes care of this.
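The indirection is easy to picture in code. Below is a minimal Python sketch of a hierarchical FIB; the class names are mine, lookups are exact-match rather than longest-prefix, and the next hops are hypothetical, so it illustrates the data structure and nothing more.

```python
class PathList:
    """Shared logical entry that many prefixes point to."""

    def __init__(self, primary, backup=None):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail_primary(self):
        """Switch to the pre-installed backup: one write, any number of prefixes."""
        if self.backup is None:
            raise RuntimeError("no backup path pre-installed")
        self.active = self.backup


class Fib:
    def __init__(self):
        self.routes = {}

    def install(self, prefix, path_list):
        self.routes[prefix] = path_list

    def lookup(self, prefix):
        return self.routes[prefix].active


fib = Fib()
shared = PathList(primary="172.16.0.5", backup="2.2.2.2")
for i in range(1000):
    fib.install(f"10.{i // 256}.{i % 256}.0/24", shared)

print(fib.lookup("10.0.7.0/24"))  # 172.16.0.5
shared.fail_primary()             # a single update
print(fib.lookup("10.0.7.0/24"))  # 2.2.2.2
print(fib.lookup("10.3.231.0/24"))  # 2.2.2.2
```

With a flat FIB the failover would have been a loop over all 1000 entries; here the convergence time does not depend on the number of prefixes.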

Fig. 13

How does BGP PIC Edge fit into MPLS VPN PE-CE routing? Before even thinking about FIB optimizations, we must ensure that the BGP control plane advertises correct backup paths.

Firstly, a unique RD per PE must be used for multihomed CE, otherwise nothing will work if route reflectors are used. Best current practice is to use type 1 RD x.x.x.x:yy where x.x.x.x = router ID, yy – some value identifying VRF.
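For reference, a type 1 RD is an 8-byte value: a 2-byte type field set to 1, a 4-byte IPv4 address and a 2-byte assigned number. The Python sketch below encodes the RDs used in this lab (the helper name is my own) and shows why a route reflector sees the same customer prefix from PE1 and PE2 as two different VPNv4 routes.

```python
import socket
import struct


def encode_rd_type1(router_id, assigned_number):
    """Type 1 RD: 2-byte type (1), 4-byte IPv4 address, 2-byte number."""
    return struct.pack("!H4sH", 1, socket.inet_aton(router_id), assigned_number)


rd_pe1 = encode_rd_type1("1.1.1.1", 1)
rd_pe2 = encode_rd_type1("2.2.2.2", 1)

print(rd_pe1.hex())      # 0001010101010001
print(rd_pe2.hex())      # 0001020202020001
print(rd_pe1 != rd_pe2)  # True: same prefix, two distinct VPNv4 routes
```

With a shared RD the reflector would run best path selection between the two and pass on only one of them, so the backup path would never reach the other PEs.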

Secondly, if BGP policies are configured in a way that one PE is a preferred exit point for the entire AS, the other PE must be configured with “advertise-best-external” – so that it advertises the route received from CE.

Fig. 14

PE1 is the primary exit point, even on PE2 the route towards CE1 prefixes is via PE1. But since PE2 is configured with “advertise-best-external”, it advertises not the best route (which is iBGP), but the route via CE1.

PE2 config:

router bgp 100
 address-family vpnv4
  bgp additional-paths select best-external
  bgp additional-paths install
  neighbor 4.4.4.4 activate
  neighbor 4.4.4.4 send-community extended
  neighbor 4.4.4.4 advertise best-external

PE2#sh bgp vpnv4 unicast rd 2.2.2.2:1 192.168.100.25/32
BGP routing table entry for 2.2.2.2:1:192.168.100.25/32, version 5449
Paths: (2 available, best #1, table one)
  Advertise-best-external
  Advertised to update-groups:
     1
  Refresh Epoch 1
  200, imported path from 1.1.1.1:1:192.168.100.25/32 (global)
    1.1.1.1 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Extended Community: SoO:100:1 RT:1:1
      Originator: 1.1.1.1, Cluster list: 4.4.4.4 , recursive-via-host
      mpls labels in/out 825/300
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  200
    172.16.1.5 (via vrf one) from 172.16.1.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, external, backup/repair, advertise-best-external
      Extended Community: SoO:100:1 RT:1:1 , recursive-via-connected
      mpls labels in/out 825/nolabel
      rx pathid: 0, tx pathid: 0

This looks like a hack which breaks the fundamental concept of distance vector routing protocols: each router advertises the route it currently uses. And yes, this can lead to routing loops and blackholes.

Per-prefix mode

With per-prefix mode this works fine. PE1 installs primary routes via CE1, and backup via PE2, using PE2’s label.

PE1:

mpls label mode vrf one protocol bgp-vpnv4 per-prefix
!
router bgp 100
 address-family vpnv4
  bgp additional-paths install

PE1#sh bgp vpnv4 unicast rd 1.1.1.1:1 192.168.100.25/32
BGP routing table entry for 1.1.1.1:1:192.168.100.25/32, version 11912
Paths: (2 available, best #2, table one)
  Additional-path-install
  Advertised to update-groups:
     1
  Refresh Epoch 2
  200, imported path from 2.2.2.2:1:192.168.100.25/32 (global)
    2.2.2.2 (metric 20) (via default) from 4.4.4.4 (4.4.4.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Extended Community: SoO:100:1 RT:1:1
      Originator: 2.2.2.2, Cluster list: 4.4.4.4 , recursive-via-host
      mpls labels in/out 51/825
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 4
  200
    172.16.0.5 (via vrf one) from 172.16.0.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 200, valid, external, best
      Extended Community: SoO:100:1 RT:1:1 , recursive-via-connected
      mpls labels in/out 51/nolabel
      rx pathid: 0, tx pathid: 0x0


PE1#sh ip cef vrf one 192.168.100.25/32 de
192.168.100.25/32, epoch 0, flags [rib defined all labels]
  dflt local label info: other/51 [0x2]
  recursive via 172.16.0.5
    attached to Ethernet1/0
  recursive via 2.2.2.2 label 825-(local:51), repair
    nexthop 10.0.0.4 Ethernet0/0 label 17-(local:17)

Fig. 15

Normally, PE3 pushes the label stack [16, 51] (transport and VPN labels), and PE1 receives packets with the VPN label. If PE1 detects a PE1-CE1 link failure, it immediately reroutes traffic via PE2, using PE2’s VPN label for the given prefix. Even before PE2 has learnt about the failure, it will receive traffic with its own per-prefix label and forward it to CE1 without an IP lookup, even though PE2’s best path at this point still points to PE1.
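
The failover can be sketched in a few lines of Python. This is an illustrative model only: the dictionaries mirror the CEF output above (label 825, next hops 172.16.0.5 and 2.2.2.2), while the CE-facing next hop on PE2 is taken from the earlier PE2 BGP output, and the function names are hypothetical.

```python
# PE1's FIB entry for the prefix, mirroring the CEF output above.
PE1_FIB = {
    "192.168.100.25/32": {
        "primary": {"next_hop": "172.16.0.5"},                 # straight to CE1
        "repair": {"next_hop": "2.2.2.2", "vpn_label": 825},   # via PE2
    }
}

# PE2's LFIB: a per-prefix label is bound directly to a CE next hop.
PE2_LFIB = {825: {"next_hop": "172.16.1.5"}}


def pe1_forward(prefix, ce_link_up):
    # PIC edge: the repair path is pre-installed, so switching to it
    # does not wait for BGP to reconverge.
    entry = PE1_FIB[prefix]
    return entry["primary"] if ce_link_up else entry["repair"]


def pe2_label_switch(label):
    # Per-prefix label: pure label lookup, no IP lookup in the VRF,
    # so PE2's BGP best path (which still points to PE1) is never consulted.
    return PE2_LFIB[label]["next_hop"]


def deliver(prefix, ce_link_up):
    # Return the list of hops a packet takes from PE1 towards CE1.
    hop = pe1_forward(prefix, ce_link_up)
    if "vpn_label" not in hop:
        return ["PE1", hop["next_hop"]]
    return ["PE1", "PE2", pe2_label_switch(hop["vpn_label"])]


print(deliver("192.168.100.25/32", ce_link_up=True))
print(deliver("192.168.100.25/32", ce_link_up=False))
```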

Per-VRF mode

In per-VRF mode, BGP PIC in this topology will lead to a routing loop until global convergence happens.

Config on PE1 and PE2:

PE1(config)#mpls label mode vrf one protocol bgp-vpnv4 per-vrf

BGP control plane remains the same, PE1 still has a backup route via PE2:

PE1#sh ip cef vrf one 192.168.100.25/32 de
192.168.100.25/32, epoch 0, flags [rib defined all labels]
  recursive via 172.16.0.5
    attached to Ethernet1/0
  recursive via 2.2.2.2 label 219, repair
    nexthop 10.0.0.4 Ethernet0/0 label 17-(local:17)

But unlike the previous example, label 219 is an aggregate (per-VRF) label allocated by PE2, so if the PE1-CE1 link fails and PE1 activates the repair path, PE2 will remove the VPN label from incoming packets and do an extra IP lookup in the VRF. But the best path on PE2 points back to PE1:

PE2#sh ip cef vrf one     192.168.100.25/32 de
192.168.100.25/32, epoch 0, flags [rib defined all labels]
  recursive via 1.1.1.1 label 283
    nexthop 10.1.1.4 Ethernet0/1 label 16-(local:17)
  recursive via 172.16.1.5, repair
    attached to Ethernet1/1

Because of that, even though BGP PIC in per-VRF label mode can work in theory, if the PE1-CE1 link fails in this topology, traffic will loop between PE1 and PE2 until global convergence happens.
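
A toy simulation of the loop, as a rough sketch: it assumes the PE1-CE1 link is already down, that both PEs use aggregate labels (219 on PE2, 283 on PE1, as in the CEF outputs above), and that every PE hop decrements the TTL by one. The function name and the pe2_converged flag are invented for illustration.

```python
def simulate_per_vrf_failover(ttl=8, pe2_converged=False):
    # Follow one packet that arrives at PE1 after the PE1-CE1 link failed.
    node = "PE1"
    trace = []
    while ttl > 0:
        trace.append(node)
        if node == "PE1":
            # Local repair: push PE2's aggregate label 219 and send to PE2.
            node = "PE2"
        else:
            # PE2 pops label 219 and does an IP lookup in the VRF.
            if pe2_converged:
                trace.append("CE1")  # best path is now the directly connected CE
                return trace, "delivered"
            # Before convergence the best path is still via PE1 (label 283).
            node = "PE1"
        ttl -= 1
    return trace, "dropped (TTL expired)"


print(simulate_per_vrf_failover(ttl=4))
print(simulate_per_vrf_failover(ttl=4, pe2_converged=True))
```

With pe2_converged=False the packet bounces between the two PEs until the TTL runs out; once PE2 has processed the withdrawal, the same lookup sends it to CE1.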

Fig. 16

But in topologies without BGP best-external (e.g. PE1 and PE2 advertising equal routes so that PE3 could ECMP among them), PIC works fine even in per-VRF mode.

Per-CE mode

In per-CE mode, if a link to CE goes down, the label allocated to it becomes useless.

PE1#sh mpls for label 277
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
277        No Label   nh-id(3)         0             Et1/0      172.16.0.5

Some implementations solve this with a resilient per-CE label: if the CE goes down, the label can still be used, but it is then forwarded like a per-VRF label (pop the label and do an IP lookup in the VRF).

E.g. on IOS-XR:

RP/0/RP0/CPU0:PE1#sh mpls for labels 24007
Wed Mar 18 14:07:09.684 UTC
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ------------
24007  Unlabelled  No ID              Gi0/0/0/2    172.16.3.7      0
       Aggregate   No ID              one                          0            (!)

The backup route points to PE2:

RP/0/RP0/CPU0:PE1#sh cef vrf one 192.168.100.25/32
Wed Mar 18 13:56:22.027 UTC
192.168.100.25/32, version 1557, internal 0x1000001 0x0 (ptr 0xe1ba83c) [1], 0x0 (0xe3801a8), 0xa08 (0xef81988)
 Updated Mar 18 13:54:54.484
 Prefix Len 32, traffic index 0, precedence n/a, priority 3
   via 2.2.2.2/32, 4 dependencies, recursive, backup [flags 0x6100]
    path-idx 0 NHID 0x0 [0xd79c738 0x0]
    recursion-via-/32
    next hop VRF - 'default', table - 0xe0000000
    next hop 2.2.2.2/32 via 24001/0/21
     next hop 10.3.3.4/32 Gi0/0/0/3    labels imposed {100000 116384}
   via 172.16.3.7/32, 6 dependencies, recursive, bgp-ext [flags 0x6020]
    path-idx 1 NHID 0x0 [0xe1bcafc 0x0]
    next hop 172.16.3.7/32 via 172.16.3.7/32
     next hop 172.16.3.7/32 Gi0/0/0/2    labels imposed {None}

So on PE1, forwarding logic during failover is similar to per-VRF case, but PE2 (if it also does per-CE) will actually forward packets to CE1 (like in per-prefix mode), even while the best path points to PE1.
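
The two-step behaviour of a resilient per-CE label can be modelled roughly as follows. This is a sketch of the idea, not of any vendor's data plane: the entry fields are copied from the IOS-XR output above, and the function and dictionary names are hypothetical.

```python
# LFIB entry for the per-CE label, based on the IOS-XR output above.
LABEL_24007 = {
    "interface": "Gi0/0/0/2",
    "next_hop": "172.16.3.7",   # primary action: send unlabelled to the CE
    "vrf": "one",               # fallback action: aggregate lookup in this VRF
}

# VRF table as it looks once the CE path is unusable (assumption:
# only the backup path via PE2 remains).
VRF_ONE_FIB = {"192.168.100.25/32": "2.2.2.2"}


def forward_per_ce_label(entry, ce_up, vrf_fib, dst_prefix):
    # Return (action, next hop) for a packet received with this label.
    if ce_up:
        # Normal per-CE behaviour: no IP lookup, forward straight to the CE.
        return ("to-ce", entry["next_hop"])
    # Resilient behaviour: treat the label like a per-VRF label,
    # i.e. pop it and look the destination up in the VRF.
    return ("vrf-lookup", vrf_fib[dst_prefix])


print(forward_per_ce_label(LABEL_24007, True, VRF_ONE_FIB, "192.168.100.25/32"))
print(forward_per_ce_label(LABEL_24007, False, VRF_ONE_FIB, "192.168.100.25/32"))
```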

Fig. 17

Different label modes on PE

Generally, it is okay to run different label allocation modes on different PE routers. But apart from increasing troubleshooting complexity, this can trigger bugs, hardware limitations and weird design issues. Take for example PIC from figure 16 where PE1 runs per-VRF and PE2 runs per-prefix mode – that would work totally fine, but per-prefix on PE1 and per-VRF on PE2 would result in a routing loop during convergence.
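
The asymmetry can be captured in a small table generator. The sketch below assumes the best-external topology discussed earlier (PE2's best path points to PE1, and PE2's own link to the CE is up); the function name and mode strings are made up for the example.

```python
from itertools import product

MODES = ("per-prefix", "per-ce", "per-vrf")


def loops_during_failover(pe1_mode, pe2_mode):
    # True if traffic loops after a PE1-CE1 link failure, before convergence.
    # PE1's own mode does not change the outcome: once in repair state it
    # pushes PE2's label and sends the packet to PE2 either way.
    if pe2_mode in ("per-prefix", "per-ce"):
        # PE2's label maps straight to the CE next hop, no IP lookup.
        return False
    # per-vrf: PE2 pops the label, looks up the VRF and follows its
    # best path, which still points back to PE1.
    return True


for pe1_mode, pe2_mode in product(MODES, repeat=2):
    outcome = "loop" if loops_during_failover(pe1_mode, pe2_mode) else "ok"
    print(f"PE1 {pe1_mode:<10} PE2 {pe2_mode:<10} -> {outcome}")
```

In this model only the label mode of the backup PE decides whether the repair path is safe.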

Other MPLS services and caveats

Whatever applies to L3VPN also applies to 6PE, even though there are no VRFs involved. The “per-VRF” label in 6PE is IPv6 explicit null (label value 2).

While in pseudowires and VPLS the only available option was to allocate a label per instance, EVPN [RFC7432#section-9.2.1] has in theory 3 ways of label allocation: per MAC-VRF, per <MAC-VRF, Ethernet tag> and per MAC – those correspond to per VRF, per CE and per prefix in L3VPN respectively. So far most (all?) vendors have implemented only per MAC-VRF.

A VRF can run both IPv4 and IPv6 routing. In per-VRF mode, Cisco and Arista advertise different labels for IPv4 and IPv6, while Juniper advertises a single label for both. In this case, after stripping the label, hardware has to somehow guess whether the packet is IPv4 or IPv6 – this is usually done by checking the first nibble – and this approach has created a lot of problems in ECMP load sharing at some point, when L2 packets were misidentified as IPv4 or IPv6. But in L3VPN it is safe to assume that any packet arriving with VPN label is always L3, so one label for both AF should be okay.
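
The first-nibble heuristic is simple enough to show in full. The function below is an illustrative sketch, not any vendor's implementation, and the sample byte strings (including the MAC address) are invented for the example.

```python
def guess_payload(payload: bytes) -> str:
    # Guess what follows the MPLS label stack from the first nibble only.
    if not payload:
        return "non-IP"
    first_nibble = payload[0] >> 4
    if first_nibble == 4:
        return "IPv4"
    if first_nibble == 6:
        return "IPv6"
    return "non-IP"


ipv4_packet = bytes.fromhex("45000054")        # version 4, header length 5
ipv6_packet = bytes.fromhex("60000000")        # version 6
ethernet_frame = bytes.fromhex("4c5e0c000001") # destination MAC starting with 0x4c

print(guess_payload(ipv4_packet))
print(guess_payload(ipv6_packet))
print(guess_payload(ethernet_frame))  # an L2 frame mistaken for IPv4
```

The last call shows why the heuristic is only safe when the payload is known to be L3, as it is behind an L3VPN label.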

Label allocation modes are not applicable to CSC since CSC-PE are dealing with received labeled routes, and in MPLS there is no concept of route aggregation.

Nor are they applicable to MPLS-based MVPN, where there is no distinction between service and transport labels as the whole thing works in a completely different way.

Conclusion

Despite the SD-WAN hype, MPLS L3VPN is not going away anytime soon. Moreover, L3VPNs (either RFC4364 or EVPN) are used more and more often in DC/Cloud networks, so it’s not a pure ISP technology anymore. They can be integrated with L2VPN (e.g. EVPN) and use different encapsulations (e.g. VXLAN or SRv6), but the PE-CE routing concepts still apply to them.

While for simple L3VPN deployments it is enough to use eBGP with as-override or allowas-in, with any label allocation mode (whatever is enabled by default), sometimes the requirements for scale, convergence and specific features lead to more complex designs.

It is possible to create custom inter-VRF routing topologies by modifying RT import policies and leaking routes across VRFs – I will include some examples in the next post.

References

  1. BGP/MPLS IP Virtual Private Networks (VPNs) https://tools.ietf.org/html/rfc4364
  2. Internal BGP as the Provider/Customer Edge Protocol for BGP/MPLS IP Virtual Private Networks (VPNs) https://tools.ietf.org/html/rfc6368
  3. IOS Implementation of the iBGP PE-CE Feature https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/117567-technote-ibgp-00.html
  4. OSPF as the Provider/Customer Edge Protocol for BGP/MPLS IP Virtual Private Networks (VPNs) https://tools.ietf.org/html/rfc4577
  5. BGP4/IDRP for IP—OSPF Interaction https://tools.ietf.org/html/rfc1745
  6. Pre-bestpath Cost Community – What is it? https://costiser.ro/2014/03/27/pre-bestpath-cost-community
  7. BGP Custom Decision Process https://tools.ietf.org/html/draft-ietf-idr-custom-decision-08
  8. [c-nsp] BGP maximum-prefix on ASR9000s https://puck.nether.net/pipermail/cisco-nsp/2020-January/107252.html
  9. The Use of Entropy Labels in MPLS Forwarding https://tools.ietf.org/html/rfc6790
  10. Advertisement of the best external route in BGP https://tools.ietf.org/html/draft-ietf-idr-best-external-05
  11. BGP MPLS-Based Ethernet VPN https://tools.ietf.org/html/rfc7432

Notes

  1. ^Which is, if using 2-byte ASNs, a scarce resource – there are only 1023 of them! By the way, some blogs and even vendor docs claim that AS65535 is a private AS, which it is not (the private range is 64512 to 65534), and BGP features specific to private AS will not treat it as such.
  2. ^On some old routers like Cisco 7600 per-VRF label allocation meant you had to enable recirculation which halved forwarding performance. This is not a problem for any modern routers.
  3. ^Since there is no field like “next header” in MPLS, routers try to “guess” the payload: if the first nibble is 4, it is IPv4, if 6 – it is IPv6, else non-IP. This caused a lot of problems when IEEE began to allocate MAC addresses starting from 4. The same method by the way is used when an MPLS router has to originate an ICMP error for a dropped transit packet – see the section about ICMP tunneling. Since this article is focused on L3, I assume this part works fine.

6 thoughts on “Close to the Edge”

  1. Well explained! With the best-external case, it’s also possible to automatically allocate per-next-hop label for such routes, e.g. Nokia SR OS goes this way, it uses the per-vrf label allocation by default but allocates per-next-hop label for prefixes affected by the best-external option.
    About the first nibble parsing, there is a tricky point. It’s a very old and well-known problem, so some vendors have an option to refine the check. They go further than just looking at the nibble: they read the field at the offset where an IPv4 or IPv6 header (depending on the value of the nibble) holds the payload size, then compare it with the actual packet size (after the MPLS stack). Obviously, if the actual payload size is equal to the value at this offset, you have an IP packet with a high probability. You can find this behavior in Juniper, Huawei, Brocade (I’ve heard Juniper can even perform fragmentation of IP packets inside a pseudowire on LSR, but never checked it).
    Waiting for your next posts!

    1. Hi Igor,

      Thanks for your comment. I never worked with Nokia but what you say makes sense. Looks like every vendor has their own default approach to VPN label allocation.

      4/6 problem unfortunately goes far beyond wrong load sharing – some switches can even drop those packets, there are various combinations of software bugs and ASIC limitations. For example this thread – https://seclists.org/nanog/2016/Dec/29. I also briefly mentioned the 4/6 problem in https://routingcraft.net/equal-routes/. Having seen weird things while working for vendors, I recommend to just use control word in all cases.

      EoMPLS fragmentation – I know it is theoretically possible on LER, using B/E bits of the control word (RFC4623), but never seen it actually working. Fragmentation on an LSR (I presume IP fragmentation, for IPv4 packets inside pseudowire)? Never heard of that, and quick googling doesn’t show any results. I’d appreciate more info on this topic.

      1. I fully agree and also usually recommend using a control word but it isn’t always possible especially in multivendor environments.
        I’ve found the blog post; unfortunately it is in Russian, but I’m pretty sure the screenshots are self-explanatory. I must say it again: I’m not the author of this post (but I’ve had some discussion with him about this scenario), and I’ve never checked this behavior. Hope you’ll find it interesting 🙂

      2. >>Fragmentation on an LSR (I presume IP fragmentation, for IPv4 packets inside pseudowire)? Never heard of that, and quick googling doesn’t show any results. I’d appreciate more info on this topic.

        I have tested this on Huawei routers, it works fine at the line rate, but not recommended by the vendor. From my perspective, it’s better to have MPLS MTU discovery and fragment traffic on a LER or use jumbo frames on the MPLS interfaces.

        For 4/6 and control word issue Juniper and other vendors have so-called “zero-control-word” https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-mpls-traffic.html#id-mpls-encapsulated-payload-loadbalancing-overview with 4 bytes filled with zeros in the beginning.

        “On seeing this control word, which is four bytes having a numerical value of all zeros, the hash generator assumes the start of an Ethernet frame at the end of the control word in an MPLS ether-pseudowire packet.”
