Fixing the WiFi performance anomaly on ath9k
The WiFi performance anomaly was first described in the literature in 2003 for 802.11b. Basically, it is an unintended consequence of the way the 802.11 protocol is designed. One would have thought that it would be fixed by now, but it turns out it is alive and well in modern-day Linux. In this post I will describe what exactly the performance anomaly is, and show the results of my efforts to fix it.
If you are already familiar with what the performance anomaly is, feel free to skip to the results.
A quick primer on the 802.11 protocol
The Distributed Coordination Function (DCF) is the part of the 802.11 protocol that governs how stations behave when they try to grab the medium. It consists of a “listen before send” routine, where the station listens for anyone else transmitting, and if the medium is busy, backs off for a random period of time before trying again. Because every station does this in the same way, every station also has the same probability of grabbing the medium.
While this mechanism is inefficient compared to what you can do with more centralised control (such as using a TDMA scheme as cell towers do), it has two nice properties: It is simple to implement, and it is (as the name implies) distributed.
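The equal-probability property is easy to convince oneself of with a toy simulation. This is a deliberately idealised model of my own (uniform backoff slots, ties counted as collisions, no retransmission logic), not a faithful DCF implementation:

```python
import random

# Idealised DCF-style backoff: each station draws a uniform backoff
# slot and the smallest draw transmits; a tie counts as a collision
# and nobody wins that round. Over many rounds each station wins
# roughly equally often.

def contend(n_stations, rounds, cw=16, seed=0):
    rng = random.Random(seed)
    wins = [0] * n_stations
    for _ in range(rounds):
        backoffs = [rng.randrange(cw) for _ in range(n_stations)]
        lowest = min(backoffs)
        if backoffs.count(lowest) == 1:  # unique winner, no collision
            wins[backoffs.index(lowest)] += 1
    return wins

print(contend(3, 30000))  # three roughly equal win counts
```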
The basic property that every station has the same probability of grabbing the medium ensures a basic fairness between the participating devices. Whenever a station gets a chance to transmit, it will transmit one data packet1. Assuming all packets are of equal size (the maximum Ethernet packet size, for instance), this ensures throughput fairness. That is, each station will, over time, transmit the same amount of data.
The performance anomaly
The problem occurs when the stations involved operate at different data rates. If one station transmits at 1 Mbps, and another transmits at 10 Mbps, the first station is going to take ten times as long to transmit the same amount of data. And since only one station can transmit at a time, this slows down the whole network, both lowering the aggregate throughput (because most of the time is taken up by the slow station) and impacting latency for everyone (because a lot of time is spent waiting for an opportunity to transmit). This is what is known as the performance anomaly first described in 2003 in the paper I linked above.
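To put numbers on this (the rates below are illustrative, not measurements): with throughput fairness, every station sends the same number of bits, so the aggregate rate collapses towards the slowest station's rate — specifically, to the harmonic mean of the rates:

```python
# Illustrative calculation of the performance anomaly: with per-packet
# (throughput) fairness every station sends the same number of bits,
# so the aggregate rate is the harmonic mean of the stations' rates.

def aggregate_throughput_mbps(rates_mbps, packet_bits=12000):
    """Aggregate throughput when stations alternate sending one
    equal-sized packet each (ignoring protocol overhead).
    Note: a rate in Mbps equals bits per microsecond."""
    total_bits = packet_bits * len(rates_mbps)
    total_time_us = sum(packet_bits / r for r in rates_mbps)
    return total_bits / total_time_us

print(f"two fast stations: {aggregate_throughput_mbps([10, 10]):.2f} Mbps")
print(f"adding a slow one: {aggregate_throughput_mbps([10, 10, 1]):.2f} Mbps")
```

With two 10 Mbps stations the network moves 10 Mbps in aggregate; adding a single 1 Mbps station drags the aggregate down to 2.5 Mbps, even though two of the three stations could go ten times faster.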
In practice, this is typically seen on a network with several stations, where just one slow station far away from the access point can slow the whole network to a crawl. Usually this is not what you want: The slow station should be able to get a bit of data through, but it shouldn’t hog all the airtime, hurting performance for everyone else.
Basically, what we want to do is introduce another notion of fairness: We want fairness between the amount of time each station spends transmitting. This ensures that the total slice of the available resource (transmission time) each station has available to it is dependent only on the total number of stations, not the rate each of these stations transmits at. This is referred to as airtime fairness and turns out to be equivalent to proportional fairness for the network, which gives a nice tradeoff between maximising throughput of the whole network and giving everyone some chance to transmit.
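Under airtime fairness the arithmetic is simpler (again with illustrative rates): each of N stations gets a 1/N share of the transmission time, so its throughput is its own rate divided by N, independent of everyone else's rates:

```python
# Under airtime fairness each of the N stations gets a 1/N share of
# the transmission time, so station i's throughput is rate_i / N --
# a slow station no longer drags the fast ones down.

def throughputs_airtime_fair(rates_mbps):
    n = len(rates_mbps)
    return [r / n for r in rates_mbps]

shares = throughputs_airtime_fair([10, 10, 1])
print(shares)       # the fast stations each get 10/3 Mbps, the slow one 1/3
print(sum(shares))  # about 7 Mbps aggregate
```

Compare this to what the throughput-fair calculation gives for the same rates (2.5 Mbps aggregate): the slow station loses a little, but the network as a whole gains a lot — which is exactly the proportional-fairness tradeoff.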
Fixing the anomaly
Lots of research has been done on the performance anomaly since it was first described. Basically, the proposed solutions fall into three categories:
Varying the parameters of the DCF (example). By scaling the back-off period with the rate, the probability of grabbing the medium can be made proportional to the rate, ensuring airtime fairness. The drawback of this approach is that the medium will be idle for a larger fraction of the time, since everyone will be backing off more (if done naively; otherwise global state about the whole network is needed, which is hard to come by), and it requires modifying all clients.
Changing the transmission size (example). As noted above, the DCF achieves throughput fairness under the assumption that all clients transmit the same amount of data every time they get a chance. However, with 802.11n and later, each transmission can be an aggregate of several packets, giving a large range within which to vary the actual transmission size. For instance, one could calculate the number of bytes required to take up a fixed amount of time at the current rate and transmit that much when afforded the chance. The drawback of this approach is that it is hard to get the calculation right: The rate can vary over very small time intervals, and there may be retransmissions to contend with.
Scheduling for airtime fairness (example). When transmitting to several different stations, it is possible to keep track of how much time is actually spent transmitting to each, and make scheduling decisions based on this information, to make sure the time spent on each station is evened out over time. This can be done by using a token bucket, by using deficit scheduling, or really any type of fairness queueing you would normally use to ensure byte-based fairness between flows; just account airtime instead of bytes. The drawback of this approach is that it is hard to do distributed, because all stations do not necessarily see all transmissions to each other. However, in an access point setup it can work well.
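The third approach can be sketched in a few lines. This is a toy model of deficit-based airtime scheduling of my own devising (the names, quantum and units are invented here, not taken from the ath9k code): each station accumulates an airtime quantum per scheduling round and may transmit while its deficit is positive, so a slow station burns through its deficit faster and ends up sending fewer bytes:

```python
from collections import deque

QUANTUM_US = 1000  # airtime credit added per scheduling round (microseconds)

class Station:
    def __init__(self, name, rate_mbps):
        self.name = name
        self.rate_mbps = rate_mbps  # a rate in Mbps is bits per microsecond
        self.deficit = 0
        self.queue = deque()        # pending packet sizes, in bytes

    def airtime_us(self, nbytes):
        """Time needed to transmit nbytes at this station's rate."""
        return nbytes * 8 / self.rate_mbps

def schedule(stations, rounds):
    """Deficit round-robin over stations, accounting airtime instead of
    bytes. Returns the airtime consumed per station."""
    used = {s.name: 0.0 for s in stations}
    active = deque(s for s in stations if s.queue)
    for _ in range(rounds):
        if not active:
            break
        st = active.popleft()
        st.deficit += QUANTUM_US
        while st.queue and st.deficit > 0:
            cost = st.airtime_us(st.queue.popleft())
            st.deficit -= cost
            used[st.name] += cost
        if st.queue:
            active.append(st)  # still backlogged: stays in the rotation
        else:
            st.deficit = 0     # queue drained: reset and leave the rotation
    return used
```

With one backlogged station at 10 Mbps and one at 1 Mbps, the airtime consumed by each comes out roughly equal, while the fast station moves about ten times as many bytes — airtime fairness, not throughput fairness.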
My approach is based on the third category: I have implemented an airtime fairness scheduler in the ath9k driver that is based on the FQ-CoDel packet scheduler already in the Linux kernel. This adds deficit-based airtime fairness, based on the time spent transmitting to and (optionally) receiving from each associated station. In addition, FQ-CoDel’s sparse flow optimisation translates nicely into an optimisation for stations that transmit very little, giving them temporary priority and hence a nice low latency. The patch has been accepted into mainline Linux (along with Felix Fietkau’s follow-up fix for a power save-related crash bug) and will be part of Linux 4.11. It is also shipped in the LEDE project firmware. The next section shows the performance evaluation I have performed thus far.
Updated: The above link to the patch now points to the Linux git tree instead of the mailing list where it was posted.
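The sparse-station optimisation mentioned above can be illustrated with a toy model (again my own sketch, not the driver code): like FQ-CoDel's new/old flow lists, a station that becomes active after being idle goes on a priority list that is always served before the backlogged bulk stations:

```python
from collections import deque

# Toy model of the sparse-station optimisation: stations with a
# standing backlog live on the "old" list and are round-robined; a
# station that becomes active after being idle is put on the "new"
# list, which is always served first -- so a station sending the odd
# ping jumps ahead of the bulk transfers and sees low latency.

def schedule_one(new_list, old_list):
    """Return the next station to serve, preferring newly active ones."""
    if new_list:
        station = new_list.popleft()
        old_list.append(station)  # if still backlogged, continue as "old"
        return station
    station = old_list.popleft()
    old_list.append(station)      # round-robin among backlogged stations
    return station

old = deque(["bulk-1", "bulk-2"])  # long-running transfers
new = deque()
served = [schedule_one(new, old) for _ in range(2)]
new.append("ping-station")         # a sparse station wakes up
served.append(schedule_one(new, old))
print(served)  # ['bulk-1', 'bulk-2', 'ping-station']
```

In the real scheduler a station that drains its queue leaves the rotation entirely; this sketch keeps everything in the old list for brevity.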
The results
All results in this section are obtained from the same basic setup: There’s one device acting as an access point, and three others acting as clients. Two clients are close to the access point (achieving high rates from the rate controller), and one is further away (achieving a low rate). In addition, one of the fast clients has an additional virtual station attached which only receives ping traffic. The stations are running a stock Ubuntu 16.04, while the access point is running a 4.6-based kernel with the appropriate patches applied.
The three versions of the kernel being compared are the following:
mainline: This is the stock 4.6 kernel, patched to measure airtime consumed by each station (i.e. a subset of the airtime fairness scheduler patch is applied).
softq-fq_codel: As above, but with the patch set adding FQ-CoDel-based intermediate queues to mac80211 also applied.
airtime scheduler: As above but also including my airtime fairness scheduler patch.
Basic validation: UDP traffic
Testing with UDP traffic is a simple way to verify that the scheduler actually produces something like the intended effect in the simplest case. In this test, there is one bulk UDP flow to each station, along with a simple ping to measure latency. That means that only the ping replies generate traffic back from the stations; everything else is sent from the access point. The UDP flows are generated by a traffic generator configured to send at 200 Mbps to each station (i.e. significantly higher than the actual available throughput).
The basic validation is seen in Figure 1 below: On mainline kernels, the performance anomaly is clearly visible, with the slow station using significantly more airtime than the two others combined. Switching to the mac80211-based intermediate queues already improves things a lot, most likely by ensuring that there are enough packets queued to each station to build full-size aggregates.2 Adding the airtime fairness scheduler results in almost perfect airtime fairness between the three stations, which is what we were aiming for. So far so good.
However, airtime fairness by itself may satisfy one’s inner notion of fairness; but if we don’t see any of the associated performance gains, it’s not really that useful in the real world. This is where Figure 2 comes in: This shows the aggregate throughput for all three stations, as well as the average ping latency to all of them during the test. The mainline kernel only achieves a median throughput of 20 Mbps to all three stations. Switching to FQ-CoDel’ed intermediate queues improves this to just shy of 70 Mbps, while adding the airtime fairness scheduler further increases aggregate throughput to just over 100 Mbps. At the same time, latency decreases from around 300 ms (with a lot of variance) to around 10 ms. Most of this is due to the FQ-CoDel’ed intermediate queues, but the airtime fairness scheduler nudges it slightly further down.
Turning to TCP
Looking at TCP flows in the same setup, figures 3, 4 and 5 show the aggregate throughput for a downstream-only test (traffic going to the stations), an upstream-only test and a bidirectional test, respectively. Several interesting points emerge from these graphs:
Switching to the mac80211 intermediate queues gives about the same improvement in throughput, and an even bigger improvement in latency, for one-way TCP flows as for UDP flows. The additional gain from the airtime fairness scheduler is not quite as big, but still present.
The airtime fairness scheduler also improves the upstream one-directional TCP throughput. Presumably this is because those TCP flows are limited by the rate at which they receive ACKs in the other direction (which is the traffic direction under the scheduler’s control).
Latency for the fourth station
One important feature (at least on paper) of the airtime fairness scheduler is that it incorporates the “sparse flow optimisation” from FQ-CoDel into the per-station scheduling. This should give stations that only have a little bit of traffic a boost in the scheduling, giving them lower latency. Figures 6 and 7 below show these values for two of the tests: the UDP test and the TCP upload test.
The graphs compare the softq-fq_codel patch set to the airtime scheduler (the stock kernel is omitted because it would be far off the right side of the graphs). In the UDP case, the optimisation works as expected, giving the fourth station a nice latency benefit. For TCP traffic, however, it does not; instead, the airtime scheduler gives the fourth station slightly higher latency than the softq-fq_codel patch set. I don’t have a good explanation for this right now, but it is definitely something that requires further investigation.
Conclusion
The current state of the airtime fairness scheduler looks quite promising. However, a more thorough performance evaluation is needed, adding more traffic types and possibly other scenarios (e.g. more stations). In particular, the “sparse station” optimisation needs to be examined closely to figure out whether it actually provides a benefit, or whether it can be removed (thus simplifying the code). The patch is also still pending code review on the Linux wireless list.
Update: The patch has now been accepted into mainline Linux (see link above), and the more thorough performance analysis is written up as a paper pending publication. Will post another update with a link when it gets published.
Update: A pre-print of the paper (still under submission) is available on arxiv.org.
1. Or one aggregate of multiple packets; but I’m ignoring aggregation for the purposes of this explanation and will come back to it later.
2. The standard mandates a form of the second category of airtime fairness fix mentioned above, in that one aggregate should not take up more than four milliseconds of airtime. The ath9k driver takes this into account when building aggregates, but it requires there to be enough packets queued to a station to actually fill this space.