ECN, ECMP and anycast: a cocktail of broken connections
This is a writeup of an issue I discovered back in 2019 with my internet provider at the time. I meant to write it up back then, but somehow never got around to it. Since I just got a reminder of this, I figured better late than never. For a TL;DR, skip to the summary at the end.
The story of a failed connection
This story started with me noticing that connections were randomly failing to certain web pages. After scratching my head some, I realised that the thing those web pages had in common was that they all used Cloudflare as their CDN. So the failure had to be related to this. And indeed, trying to connect directly to a Cloudflare site gave me the same error, but only over IPv4. Curious.
Trying various things, before long I realised that the error was related to ECN being enabled[1] on my machine. Turning off ECN made the problem go away!
Down the ECN rabbit hole
Okay, so the ECN thing was a hint, and something to investigate further. Maybe Cloudflare was doing something odd with ECN, or there was simply a bug on their side related to this? I sent an email to the Bufferbloat mailing list to ask if anyone else had noticed anything similar.
Now, the nice thing about technical mailing lists like the bloat list is that (a) it has a lot of very knowledgeable people on it, and (b) some of those people are also quite well-connected. So, in addition to getting a bunch of replies where other people were telling me they did not have the same problem, before long, I was also in touch with someone at Cloudflare! That person told me that they were not doing anything special with ECN (other than enabling it on their Linux servers), which pointed the arrow back to the path between me and the Cloudflare servers.
Having done my best to rule out any problems with my own network, at this point I reported an issue to my ISP, making sure to extract a promise that they would kick it up to someone knowledgeable to investigate. I left it at that, and contented myself with just disabling ECN in the meantime.
An ISP asks for help
A couple of months passed, and I had all but forgotten about this issue, when suddenly an email appeared on another of the bufferbloat.net mailing lists, this one dedicated to ECN. The email referenced a thread on the NANOG mailing list for network operators, where someone was asking how to prove to a customer that their network was not at fault for an ECN issue they were looking at. And I happened to recognise the name of the person asking as the technical person at my ISP!
So, back on the case! I immediately shared my suspicion that I was the customer in question, which kicked off another thread of people chiming in with suggestions for things to look at. One of those was using the tcptraceroute utility to try to look at the path between me and Cloudflare. Which finally gave a clue to the real root cause of the problem!
Digging into traceroute
Looking at the traceroutes, it became clear that I had two paths to Cloudflare; which of the two was taken appeared to be based on a hash of the packet header, which could be seen by varying the source port:
$ traceroute -q 1 --sport=10000 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.357 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 4.707 ms
3 customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46) 1.283 ms
4 te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49) 1.667 ms
5 netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246) 1.406 ms
6 104.24.125.13 (104.24.125.13) 1.322 ms
$ traceroute -q 1 --sport=10001 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.293 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 3.430 ms
3 customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38) 1.194 ms
4 10ge1-2.core1.cph1.he.net (216.66.83.101) 1.297 ms
5 be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237) 6.805 ms
6 149.6.142.130 (149.6.142.130) 6.925 ms
7 104.24.125.13 (104.24.125.13) 1.501 ms
Looking further at those traces, they even go through two completely different upstreams from my ISP (hop 4), which meant that they probably ended up in two different data centres, as is also hinted at by the DNS labels of the intermediate hops (ham and cph most likely refer to Hamburg and Copenhagen, respectively).
Anycast networking
So how does this happen? It’s the same destination IP, so shouldn’t it end up at the same machine? Well, not necessarily. Cloudflare is known to make extensive use of anycast routing, as they explain on their web site. This means that the same IP can end up at different physical servers depending on where it comes from.
This is fine in itself, and pretty standard practice these days. There are people who argue (as was indeed also the case during that mailing list thread) that using anycast with TCP is likely to break things. But in practice, this way of using anycast is so widely used that ISPs tend to structure their network around it; most commonly by making sure any load balancing decisions inside their networks are flow-based, so that a single flow always ends up in the same place, thus keeping TCP from getting confused.
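As an aside, here is a minimal sketch of what "flow-based" means in this context (a hypothetical hash function for illustration, not any particular vendor's implementation): the router hashes the 5-tuple of each packet and uses the result to pick among the available next hops, so all packets of one TCP connection follow the same path.
import hashlib

def flow_based_ecmp(src_ip, dst_ip, proto, sport, dport, next_hops):
    # Hash only the 5-tuple: every packet of the same flow produces the
    # same digest, so the whole connection sticks to one next hop.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

paths = ["via-cogent", "via-he"]  # stand-ins for the two upstreams seen above
print(flow_based_ecmp("10.42.3.130", "104.24.125.13", 6, 34420, 80, paths))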
Broken flows
So why was I seeing broken connections? Well, looking closer at a tcpdump of such a broken flow provided a hint:
12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq
3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack
3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1,
ack 1, win 502, length 0
12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length
117)
10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq
1:78, ack 1, win 502, length 77: HTTP, length: 77
GET / HTTP/1.1
Host: 104.24.125.13
User-Agent: curl/7.66.0
Accept: */*
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win
0, length 0
The handshake completes (first three packets) and negotiates ECN usage (the [E] flag in the SYN and SYN-ACK). The SYN-ACK comes back with a TTL of 58. Then, when sending the actual GET request (second-to-last packet), my machine marks the packet with the ECT(0) codepoint in the IP header; notice the different TOS value. This triggers a TCP reset of the connection (the [R] flag) in the last packet.
But this last reset has a TTL of 60, which is different from the SYN-ACK's TTL of 58. So it must have taken a different path! Which, given the traceroutes we saw above, means it came from a totally different machine. That second machine simply did not have any flow state matching the incoming packet, and so it reset the connection, as mandated by the TCP spec.
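If you want to spot this kind of mismatch yourself, a small sketch using scapy can pull the TTLs out of a capture (assuming a capture saved as broken-flow.pcap; the filename and server address are just placeholders):
from scapy.all import IP, TCP, rdpcap

SERVER = "104.24.125.13"  # placeholder: the address of the remote end

# Print the TTL and TCP flags of every packet coming from the server; a
# jump in TTL within a single connection suggests the replies are no
# longer arriving via the same path (or from the same machine).
for pkt in rdpcap("broken-flow.pcap"):
    if IP in pkt and TCP in pkt and pkt[IP].src == SERVER:
        print(pkt[IP].ttl, pkt[TCP].sprintf("%TCP.flags%"))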
ECN and ECMP
Armed with the knowledge above, it became pretty straightforward to investigate whether the ECN bits did indeed influence the path. Traceroute allows you to set this (-t sets the TOS byte; since the ECN field occupies the two least significant bits of that byte, -t 1 ends up as ECT(1) on the wire and -t 2 as ECT(0)):
$ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.336 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 6.964 ms
3 customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46) 1.056 ms
4 te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49) 1.512 ms
5 netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246) 1.313 ms
6 104.24.125.13 (104.24.125.13) 1.210 ms
$ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.339 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 2.565 ms
3 customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38) 1.301 ms
4 10ge1-2.core1.cph1.he.net (216.66.83.101) 1.339 ms
5 be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237) 6.570 ms
6 149.6.142.130 (149.6.142.130) 6.888 ms
7 104.24.125.13 (104.24.125.13) 1.785 ms
And presto! Varying only the ECN bits, we get a different path.
The most likely cause of this is that a router in the path performed a hashing operation on the packet header to select which path to take, and that this hash included the ECN bits, which made the hash change over the duration of the connection, leading to the failure I saw. The ISP confirmed that they were using ECMP, but since they were in the process of moving away from the ZTE M6000-S routers they were using at the time, the bug was never reported to the vendor. However, they agreed to turn ECMP off, which fixed the problem.
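To make the failure mode concrete, here is the same hypothetical ECMP sketch as above, but with the full TOS byte (ECN bits included) mixed into the hash key; the handshake (TOS 0x00) and the later data packets (TOS 0x02, i.e. ECT(0)) can then land on different paths:
import hashlib

def ecmp_index(src_ip, dst_ip, proto, sport, dport, tos, n_paths):
    # Buggy variant: the full TOS byte, ECN bits included, is part of the
    # hash key, so re-marking packets mid-flow can move the flow.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}|{tos}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

flow = ("10.42.3.130", "104.24.125.13", 6, 34420, 80)
print(ecmp_index(*flow, tos=0x00, n_paths=2))  # handshake: Not-ECT
print(ecmp_index(*flow, tos=0x02, n_paths=2))  # data: ECT(0) -- may pick the other path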
Summary
So, what was happening here? To summarise:
The ECMP algorithm used by the router in the path of my (now former) ISP was using the ECN bits in the IP header as part of the hash that selected the path.
When a TCP connection negotiates ECN, the initial SYN is sent without any ECN bits set in the IP header; after negotiation succeeds, the data packets are marked as ECT(0).
Because those different ECN values become part of the ECMP hash, the data packets will take a different path than the handshake.
Since the destination is anycasted, that means they will also end up at a different endpoint.
The second endpoint won't recognise the connection, and so replies with a TCP RST, leading to broken connections.
The fix is to stop hashing on the ECN bits when doing ECMP. Hashing on the diffserv part of the TOS field is OK, but just excluding the TOS field entirely from the hash may be simpler.
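Continuing the hypothetical sketch from above, the fix amounts to masking off the two ECN bits before the TOS value enters the hash key, so ECN negotiation can no longer move a flow between paths:
DSCP_MASK = 0xFC  # upper six bits are diffserv; the lower two are ECN

def tos_for_hash(tos):
    # Keep the diffserv bits (if you want to hash on them at all) and
    # drop the ECN bits, which can legitimately change within a flow.
    return tos & DSCP_MASK

print(hex(tos_for_hash(0x00)))  # SYN, Not-ECT -> 0x0
print(hex(tos_for_hash(0x02)))  # data, ECT(0) -> 0x0 (same hash input)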
In my case, the ISP ended up simply disabling ECMP entirely on the link, which fixed the problem. I have since moved ISPs and have not had this problem resurface. But hopefully this writeup will help others that end up in a similar situation in the future.
[1] The net.ipv4.tcp_ecn sysctl in Linux governs the usage of ECN, and defaults to ‘2’, which means “accept ECN on incoming connections, but don’t request it on outgoing connections”. I have it set to ‘1’ (always enable) on my laptop.
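For completeness, a quick sketch to check and interpret the current value on a Linux box (reading the proc file is equivalent to running sysctl net.ipv4.tcp_ecn):
# Meanings of net.ipv4.tcp_ecn, per the kernel's ip-sysctl documentation.
MEANINGS = {
    "0": "never request or accept ECN",
    "1": "request ECN on outgoing connections and accept it on incoming ones",
    "2": "accept ECN on incoming connections, but don't request it (default)",
}

with open("/proc/sys/net/ipv4/tcp_ecn") as f:
    value = f.read().strip()
print(f"net.ipv4.tcp_ecn = {value}: {MEANINGS.get(value, 'unknown value')}")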