How to troubleshoot routing protocols session flaps – part 1

Instability of routing protocol sessions – or, in network engineers’ slang, flaps – is by far the most common and most basic routing problem there is.

Shortly after beginning to write this post, I realized it will be too long. So I will split it into multiple parts, and this will be part 1.

Introduction

Session flaps lead to unnecessary reconvergence, routing loops and traffic blackholing. Every network engineer has seen them many times. Yet it keeps surprising me how difficult many people find it to troubleshoot flaps. Even very senior engineers often find themselves staring at the terminal, issuing the same “show” command over and over again, while praying that this ritual will somehow make the problem go away. A separate topic is root-causing flaps that occur intermittently (e.g. once every few months), where the most common strategy is just hoping the problem won’t occur again.

Why is that? Perhaps for most people, troubleshooting is a boring topic in general. Ask a group of network engineers how to design a scalable ISP or Data Center network, and suddenly everybody is an expert. Now ask “why is my BGP session flapping?” – and maybe you will hear something about checking for packet drops and replacing the cable. And check the MTU, it’s always the MTU.

I hope to shed some light on why routing protocols flap. I will explain why a BGP session flapping every 3 minutes is hard evidence of an MTU blackhole, how instability of an IGP can trigger flaps even in the absence of any transport issues, why setting very low keepalive timers is a bad idea even with modern CPUs, and how the “turn it off and on again” approach can mislead you.

Basics

When two routers establish a neighbourship to exchange routing information, they also exchange periodic messages to check that the neighbour is still alive. Those messages are typically called Hello or Keepalive. If a router does not receive a Hello/Keepalive from its neighbour within the configured interval (typically called the Dead/Hold/Inactivity timer), the session is declared down.
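The hold-timer mechanism described above can be sketched in a few lines (a minimal illustration of the logic – class and method names are made up, not any vendor’s API):

```python
import time

class Session:
    """Minimal hold-timer bookkeeping for a routing protocol session."""

    def __init__(self, hold_time):
        self.hold_time = hold_time            # e.g. 180 s for default BGP
        self.last_heard = time.monotonic()

    def on_message(self):
        # Any Hello/Keepalive (and, for BGP, any Update) resets the timer
        self.last_heard = time.monotonic()

    def is_expired(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_heard > self.hold_time
```
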

Since hold timers are typically configured in the range of 15–180 seconds, networks that need faster failure detection use BFD – a lightweight keepalive protocol that can work with detection timers as low as 150 ms. When the BFD session goes down, BFD signals the routing protocol to bring its session down as well.
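The BFD detection time is simply the negotiated receive interval multiplied by the detect multiplier – the commonly used 50 ms interval with a multiplier of 3 gives the 150 ms mentioned above (a sketch of the RFC 5880 arithmetic):

```python
def bfd_detection_time_ms(rx_interval_ms, detect_multiplier):
    # BFD declares the session down after `detect_multiplier` consecutive
    # receive intervals pass without a packet from the neighbour.
    return rx_interval_ms * detect_multiplier

assert bfd_detection_time_ms(50, 3) == 150   # 150 ms failure detection
```
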

In addition to that, if the routing protocol session is bound to a physical link, and that link goes down, the session is brought down as well.

Oct 28 16:49:46 R1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet1, changed state to down
Oct 28 16:49:46 R1 Rib: %BGP-3-NOTIFICATION: sent to neighbor 10.0.0.2 (AS 100) 6/6 (Cease/other configuration change <Hard Reset>) 0 bytes
Oct 28 16:49:46 R1 Rib: %BGP-5-ADJCHANGE: peer 10.0.0.2 (AS 100) old state Established event Stop new state Idle

This case is the most common, and can be easily identified by correlating protocol flaps with link flaps in syslog. If this is the case, proceed with L1 troubleshooting – transceiver, fiber, optic signals, DWDM etc.

When the links seem to be stable, but the routing protocol sessions are not, it becomes more interesting. It is important to understand which L4 transport the protocol in question uses, and how that transport reacts to packet loss – be it the loss of random packets or of some specific packets. For this purpose, I will split all routing protocols into 3 categories:

  1. Protocols that run over TCP (e.g. BGP, MSDP, LDP)
  2. Protocols with their own transport mechanism (e.g. OSPF, IS-IS, EIGRP)
  3. Soft-state protocols (e.g. PIM, RSVP)

These 3 categories can show very different symptoms when encountering the same type of transport issue.

TCP

TCP is by far the most common transport-level protocol in networking. It is the usual choice for any application developer who wants to transmit something over the network. It is used by some routing protocols (BGP, MSDP, LDP), various SDN protocols (OVSDB, OpenFlow, PCEP, any proprietary controller), all kinds of APIs, SSL VPNs and even general Internet browsing.

TCP ensures reliable transmission by marking transmitted segments with sequence numbers. If the TCP sender does not receive acknowledgements for the data it sent, it keeps resending that data. Only after receiving ACKs can the sender advance its send window and transmit the following segments. Lost segments and retransmissions also have an impact on TCP performance, as they slow the sender down.

BGP

BGP is the #1 routing protocol in the industry, used for virtually everything, so I will cover it first. Take for example a BGP session between 2 routers running with default settings. The routers send Keepalives every minute, and also send Updates when there are routing changes.

Figure 1 illustrates BGP messages sent only by R1:

Fig. 1

The update at t1 is sent when there is a routing change – let’s say it happened 30 seconds after the previous keepalive. This means the next keepalive will be sent 60 seconds after the update. Yes, updates also reset the hold timer on R2!

This is why it is normal to see very asymmetric Keepalive counters in BGP statistics:

                         Sent      Rcvd
    Opens:                  1         1
    Notifications:          0         0
    Updates:             2903   2590039
    Keepalives:         13040        73
    Route-Refresh:          0         0
    Total messages:     15944   2590113

Or just a very low number of keepalives (along with a lot of updates) for a long-lived session:

                         Sent      Rcvd
    Opens:              23992      2510
    Notifications:          3         2
    Updates:         35744763   2487620
    Keepalives:          2598      2609
    Route-Refresh:          0         0
    Total messages:  35771356   2492741

Such statistics are perfectly fine and don’t mean there are packet drops. A router used for Internet peering, or a route reflector in an ISP network, can be constantly processing new updates. Frequent updates/withdrawals of the same prefix can be evidence of unstable routing, but that’s a totally different issue, outside the scope of this article.
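The asymmetry is a direct consequence of the scheduling rule above: any transmitted message postpones the next keepalive. A toy model (my own simplification, not real BGP code) shows that a busy sender may emit almost no keepalives:

```python
def count_keepalives(update_times, duration, interval=60):
    """Count keepalives a BGP speaker sends over `duration` seconds:
    a keepalive goes out only when `interval` seconds pass with no
    other message, and every update postpones the next keepalive."""
    sent = 0
    last_tx = 0.0
    updates = sorted(update_times)
    i = 0
    while True:
        next_ka = last_tx + interval
        if i < len(updates) and updates[i] <= next_ka:
            last_tx = updates[i]          # an update went out instead
            i += 1
        elif next_ka <= duration:
            sent += 1                     # quiet for a full interval
            last_tx = next_ka
        else:
            break
    return sent

# An idle session sends one keepalive per minute...
assert count_keepalives([], duration=300) == 5
# ...while a session updating every 30 s sends no keepalives at all:
assert count_keepalives(list(range(30, 301, 30)), duration=300) == 0
```
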

Receiving neither an Update nor a Keepalive within the Hold interval (3 minutes by default) triggers Hold timer expiration, which brings the BGP session down. Because of TCP sequencing and flow control, an update that cannot get through (e.g. due to MTU issues) will block all subsequent packets, including keepalives, from being sent.

Fig. 2

On figure 2, R1 sends a keepalive, R2 ACKs it, then R1 tries to transmit an update, but it is dropped on the way. Upon retransmission timeout (RTO) expiry, R1 retransmits the update, but it is dropped again. At some point R2 sends a duplicate ACK, asking R1 to retransmit the missing segment, which R1 does – but it fails again! If this goes on for 3 minutes after the last keepalive received by R2, the hold timer on R2 will expire and it will drop the BGP session. Meanwhile, R2 keeps sending its keepalives normally, so the hold timer on R1 keeps being refreshed.
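The receiver-side effect can be modelled in a few lines: TCP delivers the byte stream in order, so one undeliverable update segment stops every later message from reaching BGP (an illustrative model, not a real TCP stack):

```python
def deliverable(segments, lost_seqs):
    """Return the messages BGP actually receives: delivery stops at the
    first gap in the sequence space."""
    delivered = []
    for seq, kind in segments:              # segments in send order
        if seq in lost_seqs:
            break                           # gap: nothing after it is delivered
        delivered.append(kind)
    return delivered

segs = [(1, "keepalive"), (2, "large-update"), (3, "keepalive"), (4, "keepalive")]
# The large update exceeds the path MTU and never arrives, so the later
# keepalives cannot be delivered and R2's hold timer keeps running:
assert deliverable(segs, lost_seqs={2}) == ["keepalive"]
```
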

Now let’s see how this happens in real life.

Path MTU discovery

BGP does not have any built-in mechanism for checking the underlying network MTU. Instead, it relies on TCP path MTU discovery (PMTUD). It works by setting the DF bit in all TCP packets and hoping that if any transit router drops a big packet due to low MTU, it will send an ICMP error back to the sender. Based on that, the sender creates a routing cache entry with a reduced outgoing MTU for that particular destination IP. (A more robust, ICMP-independent variant – packetization layer PMTUD – is described in [RFC4821].) In IPv6, there is no fragmentation on transit routers, so PMTUD is a must-have.
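The discovery loop itself is tiny. Here is a toy model of it (the path representation and function names are invented for illustration, this is not a socket API):

```python
def send_probe(path_mtus, size):
    """One DF-marked packet: the first hop whose MTU is too small drops
    it and returns an ICMP 'fragmentation needed' carrying its MTU."""
    for mtu in path_mtus:
        if size > mtu:
            return mtu           # ICMP error reports this next-hop MTU
    return None                  # delivered end-to-end

def discover_path_mtu(path_mtus, start=1500):
    size = start
    while True:
        reported = send_probe(path_mtus, size)
        if reported is None:
            return size          # this value gets cached per destination
        size = reported          # shrink to the MTU from the ICMP error

assert discover_path_mtu([1500, 1400, 1500]) == 1400
```
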

Consider the network on figure 3:

Fig. 3

R1 attempts to send a 1500-byte packet with DF bit to R4, R2 drops it due to low outgoing MTU and sends the ICMP error back to R1:

Fig. 4

Fig. 5

This message indicates that R2’s outgoing MTU is 1400. Upon receiving it, R1 installs a cached route entry for 4.4.4.4 and reduces the TCP MSS.

[admin@R1 ~]$ ip route get 4.4.4.4
4.4.4.4 via 10.0.0.2 dev et1 src 10.0.0.1
    cache expires 420sec mtu 1400

[admin@R1 ~]$ ss --tcp --info dst 4.4.4.4
State       Recv-Q Send-Q                                          Local Address:Port                                                           Peer Address:Port
ESTAB       0      0                                                     1.1.1.1:40299                                                               4.4.4.4:bgp
    cubic wscale:7,7 rto:224 rtt:23.454/7.406 ato:40 mss:1348 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:2311 bytes_received:213 segs_out:21 segs_in:20 data_segs_out:13 data_segs_in:8 send 4.6Mbps lastsnd:10060 lastrcv:31324 lastack:10036 pacing_rate 9.2Mbps delivery_rate 886.5Kbps retrans:0/2 rcv_space:29200 minrtt:12.164

Note the actual MSS is lower than the initial MSS R1 advertised.

The cached route will also affect non-TCP traffic to this destination – applications should respect it and not generate packets larger than that [RFC8899]. At least, this is how it works on Linux. With the cached entry in place, I can ping with an IP packet size of 1400 (ICMP payload 1372) and the DF bit set, but a packet just one byte bigger will be dropped locally due to low MTU:

[admin@R1 ~]$ ping 4.4.4.4 -s 1372 -M do
PING 4.4.4.4 (4.4.4.4) 1372(1400) bytes of data.
1380 bytes from 4.4.4.4: icmp_seq=1 ttl=62 time=24.0 ms
1380 bytes from 4.4.4.4: icmp_seq=2 ttl=62 time=16.5 ms


[admin@R1 ~]$ ping 4.4.4.4 -s 1373 -M do
PING 4.4.4.4 (4.4.4.4) 1373(1401) bytes of data.
ping: local error: Message too long, mtu=1400
ping: local error: Message too long, mtu=1400

If I clear the cache and ping with a 1401-byte packet, the first one will be sent out and trigger the same PMTUD process as shown above (but now with ICMP instead of TCP):

[admin@R1 ~]$ sudo ip route flush cache
[admin@R1 ~]$ ping 4.4.4.4 -s 1373 -M do
PING 4.4.4.4 (4.4.4.4) 1373(1401) bytes of data.
From 10.0.0.2 icmp_seq=1 Frag needed and DF set (mtu = 1400)
ping: local error: Message too long, mtu=1400

Note the first packet is actually sent, and after the error is received from 10.0.0.2, the cache entry gets installed and the second packet is dropped locally on R1.

This is how PMTUD is supposed to work in theory. In reality, it fails more often than it works. ICMP messages originated by transit routers can be dropped because of firewalls, uRPF, missing routes, CoPP and other reasons. Section 8 of the recently released [RFC8900] lists some of those, although the most common reason why PMTUD gets broken is not mentioned there – L2 MTU blackholes.

In IPv4, all transit routers must be able to fragment packets. PMTUD disables fragmentation by setting the DF bit, so if PMTUD fails due to blocked ICMP messages, disabling PMTUD (i.e. allowing fragmentation by unsetting the DF bit) should let large packets go through.

In practice, this doesn’t help against L2 MTU blackholes, because L2 switches cannot fragment packets.

MTU blackholes

Consider the topology on figure 6.

Fig. 6

There are 2 L2 switches with a low MTU in the middle. Being L2 devices, the switches can neither fragment packets nor send ICMP errors when a packet exceeds their MTU.

In this example, R1 is not able to detect low MTU on the path and will keep sending large packets which will be dropped.
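The difference from the previous case can be captured in one function: a router whose MTU is exceeded replies with ICMP, while an L2 switch drops the frame silently (again a toy model with invented names):

```python
def send_with_df(path, size):
    """Each hop is (mtu, is_router): a router that must drop the packet
    sends an ICMP error back, an L2 switch just drops it silently."""
    for mtu, is_router in path:
        if size > mtu:
            return "icmp-error" if is_router else "silent-drop"
    return "delivered"

# Figure 6: routers with MTU 1500 around L2 switches with MTU 1400
path = [(1500, True), (1400, False), (1400, False), (1500, True)]
assert send_with_df(path, 1500) == "silent-drop"   # R1 never learns why
assert send_with_df(path, 1300) == "delivered"
```
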

Fig. 7

What can we see on this pcap:

  1. R1 sent a large TCP segment (#898) and a slightly smaller one (#899)
  2. R2 sends a TCP DUP ACK with ack=96 – this means that R2 received the second packet, but not the first one (seq=96)
  3. R1 keeps retransmitting the packet with seq=96, never getting an ACK from R2

Let’s look at the TCP socket stats on R1 while it keeps retransmitting the large segment:

[admin@R1 ~]$ ss --tcp --info dst 4.4.4.4
State      Recv-Q Send-Q                                           Local Address:Port                                                            Peer Address:Port
ESTAB      0      2139                                                   1.1.1.1:43149                                                                4.4.4.4:bgp
    cubic wscale:7,7 rto:58368 backoff:8 rtt:25.948/13.349 ato:40 mss:1448 rcvmss:536 advmss:1448 cwnd:1 ssthresh:7 bytes_acked:96 bytes_received:137 segs_out:20 segs_in:9 data_segs_out:14 data_segs_in:4 send 446.4Kbps lastsnd:20652 lastrcv:27000 lastack:27000 pacing_rate 1.8Mbps delivery_rate 684.7Kbps app_limited unacked:2 retrans:1/9 lost:1 sacked:1 fackets:2 rcv_space:29200 notsent:42 minrtt:16.919

First of all, Send-Q is not empty. A non-empty Send-Q is not necessarily a problem on its own, but if it is stuck, it certainly indicates one.

The retransmit counter is growing – not necessarily a problem either, but together with a non-empty Send-Q it should raise suspicions.

The RTO is very big compared to the measured RTT – this is bad, as it signals that the TCP session is stuck.

The congestion window is reduced to 1 – TCP just keeps attempting to retransmit that one poor segment!

Some of those stats are present in the “show ip bgp neighbor” output.

R1#sh ip bgp neighbors 4.4.4.4 | grep -A2 Send-Q
  Send-Q: 2139/32768
  Outgoing Maximum Segment Size (MSS): 1448
  Total Number of TCP retransmissions: 9

But when working with a Linux-based network OS, I always prefer ss because it shows more information relevant to the TCP socket. How exactly and when TCP retransmissions occur depends on the TCP stack implementation, but with cubic TCP and default Linux settings, there will be 3 fast retransmits in the same second, then the retransmit interval will gradually back off, and after 15 unsuccessful retransmits the session will drop. The sysctl parameters controlling that:

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

In the case of BGP, the 3-minute hold timer on R2 will probably expire before R1 completes 15 unsuccessful retransmits. But with other TCP applications (e.g. SSH), the session can stay frozen for much longer before it breaks.
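You can estimate how long 15 backed-off retransmits take, assuming Linux defaults of a 200 ms minimum RTO, doubling per attempt and capped at 120 s (the constants are kernel defaults; the model ignores ACK-triggered RTO resets):

```python
TCP_RTO_MIN = 0.2     # Linux initial-RTO floor, seconds (kernel default)
TCP_RTO_MAX = 120.0   # Linux caps the backed-off RTO at 120 s

def time_until_giveup(retries=15, rto=TCP_RTO_MIN):
    """Total time spent retransmitting before the socket gives up,
    assuming the worst case: no ACKs ever arrive and the RTO doubles
    on every attempt (a rough model of Linux behaviour)."""
    elapsed = 0.0
    for _ in range(retries):
        elapsed += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return elapsed

# tcp_retries2 = 15 gives roughly 13 minutes -- far longer than the
# 180 s BGP hold timer, so the hold timer fires first:
assert time_until_giveup(15) > 180
```
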

In the logs you will see BGP flapping every 3 minutes:

R1#sh logg | grep BGP
Oct 27 15:42:45 R1 Bgp: %BGP-3-NOTIFICATION: received from neighbor 4.4.4.4 (VRF default AS 100) 4/0 (Hold Timer Expired Error/None) 0 bytes
Oct 27 15:42:45 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state Established event Stop new state Idle
Oct 27 15:42:46 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state OpenConfirm event Established new state Established
Oct 27 15:45:46 R1 Bgp: %BGP-3-NOTIFICATION: received from neighbor 4.4.4.4 (VRF default AS 100) 4/0 (Hold Timer Expired Error/None) 0 bytes
Oct 27 15:45:46 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state Established event Stop new state Idle
Oct 27 15:45:47 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state OpenConfirm event Established new state Established
Oct 27 15:48:47 R1 Bgp: %BGP-3-NOTIFICATION: received from neighbor 4.4.4.4 (VRF default AS 100) 4/0 (Hold Timer Expired Error/None) 0 bytes
Oct 27 15:48:47 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state Established event Stop new state Idle
Oct 27 15:48:49 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state OpenConfirm event Established new state Established
Oct 27 15:51:49 R1 Bgp: %BGP-3-NOTIFICATION: received from neighbor 4.4.4.4 (VRF default AS 100) 4/0 (Hold Timer Expired Error/None) 0 bytes
Oct 27 15:51:49 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state Established event Stop new state Idle
Oct 27 15:51:50 R1 Bgp: %BGP-5-ADJCHANGE: peer 4.4.4.4 (VRF default AS 100) old state OpenConfirm event Established new state Established

If this is not an MTU blackhole, what else can it be?1

MTU blackholes are very common with all kinds of overlays – VXLAN, VPLS etc. They also impact all other TCP applications: SSH, FTP, HTTP and so on. A very typical symptom of this sort of problem is a session that establishes and sort of works, but then freezes when you try to transfer a slightly larger chunk of data.

How to fix it

The only correct fix is to manually configure the MTU on all interfaces to make sure this sort of problem does not occur. For VPNs working over the Internet, the common workaround is MSS clamping – the VPN-terminating router adjusts the MSS of all transit TCP sessions so that they never send packets larger than the path MTU.
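The clamping arithmetic is simple: the MSS option in transit SYNs is rewritten so that payload plus headers fit the tunnel MTU (a sketch of the arithmetic, not router code; 40 bytes assumes IPv4 + TCP headers without options):

```python
def clamp_mss(advertised_mss, tunnel_mtu, overhead=40):
    """Rewrite the MSS option in a transit SYN so payloads fit the
    tunnel MTU; 40 bytes covers IPv4 + TCP headers without options
    (use 60 for IPv6)."""
    return min(advertised_mss, tunnel_mtu - overhead)

# A host advertising MSS 1460 behind a 1400-byte tunnel gets clamped:
assert clamp_mss(1460, 1400) == 1360
assert clamp_mss(1200, 1400) == 1200   # already small enough: untouched
```
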

Linux has MTU blackhole detection:

[admin@R1 ~]$ sudo sysctl -w net.ipv4.tcp_mtu_probing=1

Fig. 8

It begins like the capture on figure 7, but this time, after R1 fails to get ACKs for the large packets, it concludes there is an MTU blackhole and decreases its MSS to net.ipv4.tcp_base_mss (1024 bytes by default).

[admin@R1 ~]$ ss --tcp --info dst 4.4.4.4
State      Recv-Q Send-Q                                           Local Address:Port                                                            Peer Address:Port
ESTAB      0      0                                                      1.1.1.1:36887                                                                4.4.4.4:bgp
    cubic wscale:7,7 rto:228 rtt:26.582/11.696 ato:40 mss:1024 rcvmss:536 advmss:1448 cwnd:3 ssthresh:7 bytes_acked:2235 bytes_received:137 segs_out:19 segs_in:13 data_segs_out:13 data_segs_in:4 send 924.5Kbps lastsnd:32840 lastrcv:34572 lastack:32816 pacing_rate 1.1Mbps delivery_rate 408.9Kbps retrans:0/6 rcv_space:29200 minrtt:16

Some BGP implementations (e.g. Cisco IOS-XR) have similar mechanisms enabled by default. The drawback is that false positives can cause MSS to decrease even when there is no MTU blackhole.

Another possibility is to configure a very low MSS – as low as 536 bytes – so that the session never hits MTU blackholes, even if they are present. The drawback is worse TCP performance.

But it has been working for a long time!

These words are heard very often, and they are totally useless. A lot of things in networking appear to “work” even when they are in fact broken.

Imagine a BGP session set up in a topology like the one on figure 6. R1 advertises just a few hundred prefixes, and the network is very stable. The operator adds a few more networks every now and then; they trigger small BGP updates that pass through perfectly. It can work for years like this.

Then something triggers a large BGP update – a routing change, a route refresh request, a session flap (for whatever reason) or a router upgrade during a maintenance window. The BGP session will come up and try to send all its routes, generating a large update – and then begin to flap because of the MTU blackhole.

Have you tried turning it off and on again?

The most common reaction of any IT person faced with a problem they cannot understand is to reboot the damn thing. It often does help. But we are dealing with a configuration issue here – so a reboot wouldn’t help at all, right? Wrong! In some cases it can help, for a short time.

In the situation described above, the setup is fairly stable as long as only small updates are advertised, and the MTU blackhole is triggered when the router suddenly advertises a lot of routes. Now imagine the following:

  1. A BGP EVPN session is flapping due to an MTU blackhole
  2. Clueless network admin doesn’t want to troubleshoot and reboots the switch
  3. When the switch comes up, the MAC table is empty, and it begins to gradually learn some MACs, and advertise them – BGP updates are small at this point
  4. The session can remain stable for a while, maybe a few hours or even a few days, depending on traffic and MAC learning patterns
  5. At some point, the switch has to send a big update, which exceeds the MTU and the BGP session begins to flap every 3 minutes again

This can happen with any other BGP address family, EVPN is just the most likely to be affected because updates are triggered by dynamic MAC learning.

I like this property of EVPN in particular because it punishes people who don’t understand networking basics and don’t listen to specialists’ advice.

Two Minutes Hate about Juniper and MTU

Juniper does a lot of things differently from all other vendors. Sometimes their way is better, sometimes not, but I do respect creative thinking instead of just repeating what everyone else does. Yet in some cases what they do goes against common sense and there can be no excuse for that.

If you have an L2 interface and do routing on an SVI (IRB in Juniper terminology), guess what the default L3 MTU of the IRB interface will be? 1500 bytes, like on any other router in the industry? No way – it will be dynamically calculated (!!!) by subtracting 14 bytes from the lowest L2 MTU among the L2 interfaces that carry the respective VLAN. For example, say I want to increase the L2 MTU a bit to allow MPLS- or VXLAN-encapsulated traffic through the interface, but among others, there is an IRB that terminates BGP peerings.

admin# set interfaces ge-0/0/1 mtu 1600
[edit]
admin# commit
commit complete
[edit]
admin# run show interfaces irb.100 | match MTU
    Protocol inet, MTU: 1586
    Protocol multiservice, MTU: 1586

This doesn’t make any sense whatsoever. Juniper, WHY????

Recursive routing

BGP sessions are often set up as multihop – i.e. the peer IP is reachable via a route provided by another protocol. What happens if the route to the peer disappears?

An unreachable nexthop will trigger routing convergence, but the BGP session will remain alive and die only after hold timer expiration. There is usually no drawback to that, but some implementations let you force the BGP session down immediately if the route to the peer disappears.

Take for instance the topology from figure 3, but R4 is now Cisco IOS:

R4(config-router)#neighbor 1.1.1.1 fall-over

Now R1’s Lo0 becomes unreachable, and R4 learns about that change via OSPF. Debugs on R4:

*Oct 28 17:54:06.510: RT: del 1.1.1.1 via 10.2.2.3, ospf metric [110/31]
*Oct 28 17:54:06.511: RT: delete subnet route to 1.1.1.1/32
*Oct 28 17:54:06.537: %BGP-5-NBR_RESET: Neighbor 1.1.1.1 reset (Route to peer lost)

If your BGP implementation works like this and a multihop session flaps, check the stability of the underlying routing. In this example, the route to 1.1.1.1 changed 5 seconds ago:

R4#sh ip ro | in 1.1.1.1
O        1.1.1.1 [110/31] via 10.2.2.3, 00:00:05, GigabitEthernet2

Most other BGP implementations don’t behave like this and will not bring the session down when the route to the peer disappears. This is not a problem in most cases, as routing convergence will happen anyway. If the underlay routing is static, then in order for BGP to converge faster, you will need multihop BFD.
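The fall-over behaviour boils down to a simple rule – here is a sketch of the logic, with invented names:

```python
def live_sessions(peer_ips, reachable_peers, fall_over):
    """Sessions still up right after a RIB change (illustrative model):
    with fall-over, losing the route to a peer tears the session down
    immediately; without it, the session survives until the hold timer
    expires."""
    if fall_over:
        return {p for p in peer_ips if p in reachable_peers}
    return set(peer_ips)

peers = {"1.1.1.1", "2.2.2.2"}
# The route to 1.1.1.1 disappears from the RIB:
assert live_sessions(peers, {"2.2.2.2"}, fall_over=True) == {"2.2.2.2"}
assert live_sessions(peers, {"2.2.2.2"}, fall_over=False) == peers
```
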

Aggressive timers

The default BGP timers of 60/180 mean it can take up to 3 minutes to detect a failure. Other protocols are a bit faster (OSPF – 40 seconds, IS-IS – 30 seconds), but this is still ages for modern networks. Every design or config guide recommends using BFD for fast convergence, but many people still think: “we have a cool modern router with an 8-core Xeon CPU, can’t it hold a few hundred BGP sessions with 1/3 timers?”.

The problem here is not the average CPU load. No popular network OS is a real-time OS. In order not to let one process steal all the CPU time, they use prioritization – nice values in Linux – so a lower-priority process gives up CPU time when a higher-priority process needs to run. This works nicely in most cases – but again, no real-time execution is guaranteed.

Another problem is how the routing code is written – and this is specific to each implementation. Most routing protocol implementations run in a single thread, and even the multi-threaded ones are designed to process a large number of routes rather than generate a lot of keepalives because the network admin didn’t feel like enabling BFD. What can happen is that the process is busy processing a routing update or responding to a CLI command / API request, and doesn’t have time to generate keepalives at a very aggressive rate. To make it even more annoying, it can work perfectly 99.999% of the time and then occasionally fail once every few months, leading to a session flap. Such problems are almost impossible to troubleshoot, and the best thing to do is to change the config to follow vendor best practices – that is, relax the routing protocol timers and configure BFD.
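The failure mode is easy to reason about: in the worst case the process stalls right after sending a keepalive, so the peer hears nothing for the whole stall. A one-line model (deliberately simplified):

```python
def peer_keeps_session(hold_time, stall):
    """Worst case: the routing process stalls for `stall` seconds right
    after sending a keepalive, so the peer hears nothing at all until
    the stall ends."""
    return stall <= hold_time

# A 5-second hiccup (route churn, a heavy CLI/API request) is harmless
# with default 60/180 timers, but fatal with aggressive 1/3 timers:
assert peer_keeps_session(hold_time=180, stall=5)
assert not peer_keeps_session(hold_time=3, stall=5)
```
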

Control plane policy

Most modern routers and switches are built in such a way that they have separate elements for control plane and data plane.

Control plane refers to functionality related to all kinds of signaling, routing protocols, network management and monitoring.2 It is performed by an ordinary computer with a general-purpose CPU, running Linux or another Unix-based OS. Data plane means packet processing, performed by a dedicated network ASIC, FPGA or NPU specially designed for forwarding large volumes of traffic.

All packets that enter the router first enter the data plane3, which can decide to send them to the control plane for further processing. This typically happens with packets destined to the router itself, or exception packets that can’t be processed by the data plane (e.g. TTL=1, packets with IP options or packets requiring fragmentation). Because control plane processing power is very low compared to the data plane, most routers implement what is known as Control plane policy (CoPP) – a QoS policy applied in the data plane that rate-limits traffic sent from the data plane to the control plane. CoPP prevents the router CPU from getting overwhelmed by traffic, and ensures that important traffic (e.g. routing protocols or management) is prioritized over tasks like responding to ping or processing exception packets and generating ICMP errors.
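The rate limiting itself is typically a token-bucket policer per class. A minimal software model of one CoPP class (illustrative only – real CoPP runs in the forwarding ASIC):

```python
class TokenBucket:
    """Single-rate policer sketch: packets toward the CPU consume
    tokens; when the bucket is empty they are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, size, now):
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                    # out of tokens: packet dropped

bucket = TokenBucket(rate_bps=250_000, burst_bytes=1500)
assert bucket.allow(1500, now=0.0)      # the burst admits one full packet
assert not bucket.allow(1500, now=0.0)  # bucket empty: dropped
assert bucket.allow(1500, now=1.0)      # 1 s at 250 kbps refills 31250 B
```
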

There are several criteria CoPP can use to distinguish the routing traffic:

  1. All routing traffic is marked with IP precedence 6 (DSCP CS6 or 0xC0)
  2. Some protocols use well-known ports, e.g. BGP uses TCP port 179, LDP uses UDP and TCP port 646
  3. Some protocols have their own IP protocol number (e.g. OSPF – 89, PIM – 103) and use special link local multicast groups
  4. IS-IS runs directly over ethernet and uses special dst MAC addresses (0180.c200.0014, 0180.c200.0015, 0900.2b00.0004, 0900.2b00.0005)
  5. While all of the above imply dst IP = self, or a link-local multicast group, RSVP is an exception – it sends transit packets with the Router Alert IP option

A CoPP implementation can use different criteria to identify routing traffic and map it to the relevant QoS class. Some routers have CoPP enabled by default, others leave it to the network admin to configure CoPP from scratch.
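As a toy illustration of such classification (the field names are invented; a real CoPP matches these criteria in hardware TCAM):

```python
ROUTING_TCP_PORTS = {179, 646}   # BGP, LDP
ROUTING_IP_PROTOS = {89, 103}    # OSPF, PIM

def copp_class(pkt):
    """Map a CPU-bound packet to a CoPP class using the criteria listed
    above. `pkt` is a plain dict with illustrative field names."""
    if pkt.get("dscp") == 48:                        # CS6 / precedence 6
        return "routing"
    if pkt.get("proto") in ROUTING_IP_PROTOS:
        return "routing"
    if pkt.get("proto") == 6 and (pkt.get("sport") in ROUTING_TCP_PORTS
                                  or pkt.get("dport") in ROUTING_TCP_PORTS):
        return "routing"
    return "default"

assert copp_class({"proto": 6, "dport": 179}) == "routing"   # BGP
assert copp_class({"proto": 89}) == "routing"                # OSPF
assert copp_class({"proto": 17, "dport": 53}) == "default"   # plain DNS
```
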

Example on Arista EOS:

R1#show policy-map type control-plane | grep -A2 -iE "bgp|ospf"
  Class-map: copp-system-bgp (match-any)
       shape : 2500 kbps
       bandwidth : 250 kbps
--
  Class-map: copp-system-OspfIsis (match-any)
       shape : 2500 kbps
       bandwidth : 250 kbps

Note that OSPF and IS-IS are mapped to the same class.

It is critically important to understand how CoPP works on the router you use, and how to troubleshoot and tweak it. The more feature-rich and complex the CoPP implementation is, the easier it is to shoot yourself in the foot with it – e.g. LPTS on Cisco IOS-XR. Check your vendor documentation for details and best practices.

Therefore, if you ever suspect that the router drops routing protocol packets, CoPP is the first thing to check, namely:

  1. Are those packets mapped to the correct CoPP class? You don’t want routing protocol traffic to be in the low-priority queue for regular traffic.
  2. Are there drops in the respective CoPP class?
  3. If yes – does CoPP show the port where the dropped packets arrive? E.g. on EOS:
R1#show cpu counters queue | grep -E "CoPP Class|CoppSystemBgp"
CoPP Class                 Queue                    Pkts             Octets           DropPkts         DropOctets
CoppSystemBgp              Et1                     78650            6343584                  0                  0
CoppSystemBgp              Et2                         0                  0                  0                  0
CoppSystemBgp              Et3                         0                  0                  0                  0

At this stage it becomes platform-specific and requires understanding at least the basics of the platform architecture, port-to-ASIC mappings and congestion management (e.g. VoQ); for modular platforms it’s even more complex. But incrementing packet drops in CoPP that correlate with session flaps are a good sign that you’re close to isolating the problem.

External peerings and IXP

External Internet peerings usually run with default BGP timers and without BFD. This means that when transport failures occur, Internet traffic is usually blackholed for up to 3 minutes before the network begins to converge.

Imagine an IXP – which is essentially a large L2 network – doing maintenance. Sure, they can shut down the BGP sessions from their route servers, but many ISPs also run private peerings over the IXP network. [RFC8327] recommends a simple and elegant solution – before doing disruptive changes, block traffic on TCP port 179 for several minutes, so that BGP sessions go down while data traffic can still flow.

For the ISP operator, the experience in that case is BGP sessions going down for no obvious reason – this is expected and for your own good!

Conclusion

This article is getting too long, so I have decided to stop here and continue in the next one, where I will cover BFD, IGPs and maybe multicast.

[RFC8900] section 6.5 says:

Operators MUST ensure proper PMTUD operation in their network,
including making sure the network generates PTB packets when dropping
packets too large compared to outgoing interface MTU.

This is nice in theory, but it forgets that L2VPNs exist. In my opinion, while PMTUD is nice to have, no network operator should ever rely on it.

Fortunately, the symptoms of MTU blackholes are very clear and obvious if you understand how TCP works. Troubleshooting is also very simple – ping with the DF bit set, lowering the packet size until it goes through.

Aggressive timers – never use them. Enable BFD for fast convergence.

Be conscious of the control plane policy – while getting into the depths of platform architecture is not so easy, understanding basics and following vendor’s recommended practices will save you a lot of time.

References

  1. TCP/IP Illustrated, Volume 1: The Protocols (2nd Edition) – Kevin R. Fall, W. Richard Stevens, ISBN-13: 978-0321336316
  2. Packetization Layer Path MTU Discovery https://tools.ietf.org/html/rfc4821
  3. Packetization Layer Path MTU Discovery for Datagram Transports https://tools.ietf.org/html/rfc8899
  4. TCP Problems with Path MTU Discovery https://tools.ietf.org/html/rfc2923
  5. IP Fragmentation Considered Fragile https://tools.ietf.org/html/rfc8900
  6. Mitigating the Negative Impact of Maintenance through BGP Session Culling https://tools.ietf.org/html/rfc8327

Notes

  1. Once in my life, I saw a QoS policy configured with a committed burst (Bc) value of just 1000 bytes. It had the same effect as an MTU blackhole – dropping all packets larger than 1000 bytes. This is why all QoS best practices say to configure Bc higher than the MTU; in practical terms, at least 10 times higher.
  2. Sometimes this is called “management plane”, but for the purpose of this article it doesn’t make any difference.
  3. Some routers have dedicated management ports, connected directly to the control plane unit.

6 thoughts on “How to troubleshoot routing protocols session flaps – part 1”

  1. Juniper is using a MTU with L2 headers, which should be known by people using them…
    IOS-XR does the same, so I’m a bit curious why you’re just blaming Juniper?

    1. I don’t blame Juniper for adding 14 bytes for MTU on Ethernet interfaces (Cisco IOS-XR, Nokia do the same). I blame Juniper for not using the industry standard default MTU setting of 1500 bytes on IRB, but instead coming up with an arbitrary number which is guaranteed to cause problems if you run routing over it.

  2. Greetings Mr. Shypovalov.

    First of all, thank you for all the great technical content you’ve posted so far on this site.

    In fact, I am a bit disappointed that AFAIK this site has not gained the popularity it deserves, despite a recommendation by Ivan Pepelnjak himself months ago.

    I really hope that the second part of this article will be published one day, in the not-too-distant future, and maybe some other stuff.

    Wish you the best,
    Scassonio

    1. Thanks for your feedback Scassonio, appreciate it!

      I was busy with plenty of private stuff over the last few months, so didn’t quite have time to write anything for the blog, but I promise I will continue!
