An interesting feature that I implemented about a year ago (it first appeared in the 5.18 kernel) is the ability to make the BPF_PROG_RUN command of the bpf() syscall run in a “live packet mode” for XDP. One reason this is interesting is that it makes it possible to implement a programmable packet generator that leverages the programmability and performance of XDP to achieve very high packet rates, without the hassle of setting up a DPDK-based stack. This feature has, however, not been that well publicised; this post is an attempt to remedy that.

First, a demo

The latest release of xdp-tools contains the xdp-trafficgen utility, which uses the facility described in this post. It supports generating UDP traffic and has rudimentary TCP support. It is quite simple to use; for instance, to generate UDP traffic, one simply invokes it as:

# xdp-trafficgen udp --dst-mac aa:bb:cc:dd:ee:ff --dst-addr fe80::1 --dst-port 10000 ens1f0
Transmitting on ens1f0 (ifindex 3)
lo->ens1f0                      0 err/s         8,037,760 xmit/s
lo->ens1f0                      0 err/s         8,064,528 xmit/s
lo->ens1f0                      0 err/s         8,093,072 xmit/s
lo->ens1f0                      0 err/s         8,078,808 xmit/s
lo->ens1f0                      0 err/s         8,089,553 xmit/s
lo->ens1f0                      0 err/s         8,056,951 xmit/s
lo->ens1f0                      0 err/s         8,061,210 xmit/s
^C

This shows the traffic generator transmitting 8 million packets per second, on a single CPU. The only thing this requires from the hardware is that the target device supports XDP redirect (as a target).

The performance scales pretty linearly when sending from more threads (and thus utilising multiple CPU cores):

# xdp-trafficgen udp --dst-mac aa:bb:cc:dd:ee:ff --dst-addr fe80::1 --dst-port 10000 --threads 5 ens1f0
Transmitting on ens1f0 (ifindex 3)
lo->ens1f0                      0 err/s        39,992,128 xmit/s
lo->ens1f0                      0 err/s        40,311,421 xmit/s
lo->ens1f0                      0 err/s        40,304,242 xmit/s
lo->ens1f0                      0 err/s        40,356,115 xmit/s
lo->ens1f0                      0 err/s        40,310,905 xmit/s
lo->ens1f0                      0 err/s        40,236,664 xmit/s
lo->ens1f0                      0 err/s        40,275,489 xmit/s
^C

There’s also rudimentary support for generating TCP traffic:

# xdp-trafficgen tcp -p 10000 -i ens1f1 fe80::1
Connected to fe80::1 port 10000 from fe80::ee0d:9aff:fedb:1135 port 49660
lo->ens1f1                      0 err/s         5,958,853 xmit/s
lo->ens1f1                      0 err/s         5,980,502 xmit/s
lo->ens1f1                      0 err/s         5,960,844 xmit/s
lo->ens1f1                      0 err/s         5,972,524 xmit/s
lo->ens1f1                      0 err/s         5,970,290 xmit/s
^C

Note that the packet rate is lower because these are full-size packets, whereas the UDP test generates small (64-byte) packets. The rate above corresponds to just over 70 Gbit/s. However, while this can produce quite high packet rates, there is no congestion control, and the retransmission logic is quite naive. So I wouldn’t recommend running this against a target on the internet; for stress-testing a local TCP implementation under controlled conditions, however, it can be useful. More about how it works below.
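As a quick sanity check of that figure, the conversion from packet rate to bit rate can be sketched in a few lines of Python. The 1500-byte size is an assumption (the usual Ethernet MTU); the post only says “full-size”, and the function name is made up for this example.

```python
def throughput_gbps(packets_per_sec: float, packet_bytes: int) -> float:
    """Convert a packet rate to Gbit/s for a given per-packet size."""
    return packets_per_sec * packet_bytes * 8 / 1e9


# One sample from the TCP run above, ASSUMING 1500-byte packets.
tcp_rate = throughput_gbps(5_970_290, 1500)
# One sample from the single-threaded UDP run, with 64-byte packets.
udp_rate = throughput_gbps(8_064_528, 64)

print(f"TCP: {tcp_rate:.1f} Gbit/s, UDP: {udp_rate:.1f} Gbit/s")
# prints "TCP: 71.6 Gbit/s, UDP: 4.1 Gbit/s"
```

So even though the UDP test pushes more packets per second, the TCP test moves far more bytes.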

How does this all work?

As mentioned above, the traffic generation works by using the BPF_PROG_RUN command of the bpf() syscall. Originally, this was called BPF_PROG_TEST_RUN, and was meant for unit testing BPF programs. It allows a userspace application to execute a BPF program inside the kernel, supplying it with a fake context structure from userspace (and, for XDP and other networking hooks, also with the packet data), and then read out the program return code afterwards. This can be used to check that a BPF program behaves correctly for a given input.

However, it soon became apparent that this was useful for more than testing: since BPF programs can have side effects, loading and running such a program can be a way to instruct the kernel to do something. Indeed, there’s a whole BPF program type that can only be executed this way (the “syscall” type), which is used to load other programs in the “light skeletons”.

The new kernel feature used by the XDP traffic generator extends this mode of running with side effects to XDP programs. By passing a new flag, BPF_F_TEST_XDP_LIVE_FRAMES, to BPF_PROG_RUN when running an XDP program, userspace can request that the kernel switch execution mode for the program being passed in. When running in this live packet mode, the kernel will also react to the program return code and, for instance, perform a redirect of the packet if the program returns XDP_REDIRECT.

The redirect facility is exactly what the traffic generator uses; the userspace application will prepare a buffer with the packet data, filling in the source and destination address according to the command line parameters, computing the checksum, etc. Then, it will pass that buffer to the kernel, and run an XDP program that redirects the packet to the chosen interface, causing it to be transmitted.
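To make the userspace side a bit more concrete, here is a minimal Python sketch of preparing such a buffer for UDP over IPv6. This is not the xdp-trafficgen code (which is written in C); the function names, the hop limit of 64 and the example addresses in the usage note are assumptions made for illustration.

```python
import ipaddress
import struct

ETH_P_IPV6 = 0x86DD
IPPROTO_UDP = 17


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def mac_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def build_udp6_frame(src_mac, dst_mac, src_addr, dst_addr,
                     src_port, dst_port, payload=b""):
    """Build an Ethernet + IPv6 + UDP frame with a valid UDP checksum."""
    src = ipaddress.IPv6Address(src_addr).packed
    dst = ipaddress.IPv6Address(dst_addr).packed
    udp_len = 8 + len(payload)

    # IPv6 pseudo-header: addresses, upper-layer length, next header
    pseudo = src + dst + struct.pack("!I3xB", udp_len, IPPROTO_UDP)
    udp_no_csum = struct.pack("!HHHH", src_port, dst_port, udp_len, 0) + payload
    csum = ~ones_complement_sum(pseudo + udp_no_csum) & 0xFFFF
    csum = csum or 0xFFFF  # a computed zero is sent as all ones
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, csum) + payload

    # Version 6, traffic class 0, flow label 0; hop limit 64 (an assumption)
    ip6 = struct.pack("!IHBB", 6 << 28, udp_len, IPPROTO_UDP, 64) + src + dst
    eth = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", ETH_P_IPV6)
    return eth + ip6 + udp
```

Calling `build_udp6_frame("02:00:00:00:00:01", "aa:bb:cc:dd:ee:ff", "fe80::2", "fe80::1", 12345, 10000)` yields a 62-byte frame; a buffer like this is what gets handed to the kernel as the packet data.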

In the default (UDP) operating mode, the XDP program won’t even touch the packet data; it’ll just immediately return XDP_REDIRECT. In the “dynamic port” mode, on the other hand, the XDP program will keep a counter of the next port to send to and dynamically rewrite the packet data before transmitting the packet, thus generating traffic with dynamic destination ports without having to go back to userspace for each packet.
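The rewriting step can be modelled in Python as follows. This is only a sketch of the idea, not the actual BPF program: the class and its parameters are hypothetical, the offsets assume an Ethernet + IPv6 + UDP frame without extension headers, and the incremental checksum update follows RFC 1624. Note that it reads the current port out of the frame rather than assuming the original value, since the packet buffer is recycled between repetitions.

```python
import struct

UDP_DPORT_OFF = 14 + 40 + 2  # Ethernet + IPv6 headers, then UDP dest port
UDP_CSUM_OFF = 14 + 40 + 6   # UDP checksum field


def csum_replace2(csum: int, old: int, new: int) -> int:
    """RFC 1624 incremental update: HC' = ~(~HC + ~m + m')."""
    total = (~csum & 0xFFFF) + (~old & 0xFFFF) + new
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class DynamicPortRewriter:
    """Hypothetical stand-in for the XDP program plus its counter state."""

    def __init__(self, first_port: int, num_ports: int):
        self.first_port = first_port
        self.num_ports = num_ports
        self.next_idx = 0  # in BPF this would live in a map or global

    def run(self, frame: bytearray) -> str:
        # Parse the values currently in the frame: earlier runs may have
        # modified them already.
        old, = struct.unpack_from("!H", frame, UDP_DPORT_OFF)
        csum, = struct.unpack_from("!H", frame, UDP_CSUM_OFF)

        new = self.first_port + self.next_idx
        self.next_idx = (self.next_idx + 1) % self.num_ports

        struct.pack_into("!H", frame, UDP_DPORT_OFF, new)
        struct.pack_into("!H", frame, UDP_CSUM_OFF,
                         csum_replace2(csum, old, new))
        return "XDP_REDIRECT"
```

Updating the checksum incrementally means only the two changed 16-bit words are touched, so the cost per packet does not depend on the payload size.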

To achieve high performance, there’s a repeat parameter to the syscall, which will make the kernel run the XDP program multiple times in a loop before returning to userspace. Because the kernel also sets up a page_pool structure to efficiently recycle the memory used for packet data, this gives the packet generator comparable performance to that which can be achieved when redirecting packets from one interface to another.¹

Dealing with TCP

Generating packets and sending them out is quite straightforward using this mechanism, and the programmability makes it very flexible. However, to implement a TCP-based packet generator, it is not enough to just blast packets at a target: the receiver will just consider these invalid and ignore them. Instead, a TCP traffic generator will need to first perform a handshake to open the connection, and then produce valid TCP packets that fit into the connection window afterwards. On the face of it, this is by no means straightforward to do with this BPF_PROG_RUN facility, so how does the XDP traffic generator implement this? Well, it turns out there are a couple of tricks we can combine to get to a working TCP mode.

First off is the handshake. There’s a whole state machine dance to this, and lots of corner cases to deal with. Fortunately, the Linux kernel has a mature and battle-tested TCP implementation, so we can just piggyback on that! Specifically, the XDP traffic generator will just perform a regular connect() call, and let the kernel deal with the handshake. Then, once the connection has been established, it will take over the connection with the XDP-based handling.

Taking over the connection is done by installing an XDP program on the interface the connection is going over. Since this program sees the packets before the kernel stack does, it can parse those packets and extract the information it needs (specifically, the sequence numbers and window information), and then return XDP_DROP, thus ensuring the kernel never sees those packets. From the point of view of the kernel, the TCP connection just looks idle, while the XDP program handles the rest of the connection.
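In Python terms, the receive-side logic looks roughly like the sketch below. It is a simplified model rather than the real XDP program: the `conn` and `state` dictionaries (and their keys) are hypothetical stand-ins for BPF maps, only the ports are matched (a real program would also check addresses), and IPv6 extension headers are ignored.

```python
import struct

ETH_HLEN, IP6_HLEN = 14, 40
TCP_OFF = ETH_HLEN + IP6_HLEN
ETH_P_IPV6 = 0x86DD
IPPROTO_TCP = 6
TCP_FLAG_ACK = 0x10


def rx_program(frame: bytes, conn: dict, state: dict) -> str:
    """Record ACK number and window for our connection, then drop."""
    if len(frame) < TCP_OFF + 20:
        return "XDP_PASS"
    ethertype, = struct.unpack_from("!H", frame, 12)
    next_hdr = frame[ETH_HLEN + 6]
    if ethertype != ETH_P_IPV6 or next_hdr != IPPROTO_TCP:
        return "XDP_PASS"  # not TCP directly over IPv6

    sport, dport, _seq, ack, off_flags, window = struct.unpack_from(
        "!HHIIHH", frame, TCP_OFF)
    if (sport, dport) != (conn["remote_port"], conn["local_port"]):
        return "XDP_PASS"  # some other flow: let the kernel handle it

    if off_flags & TCP_FLAG_ACK:
        state["snd_una"] = ack  # lowest unacked sequence number
        # The window field is scaled by the factor negotiated in the handshake
        state["snd_wnd"] = window << conn["wscale"]
    return "XDP_DROP"  # hide the packet from the kernel stack
```

Anything that is not part of the hijacked connection is passed up the stack untouched, so other traffic on the interface keeps working.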

Generating the bulk traffic is done in the same way as in UDP mode: an initial packet is generated in a buffer in userspace and passed to the kernel, where the XDP program will update the sequence numbers to match those expected by the other end of the connection. Synchronisation with the second XDP program is done by way of a shared map, and the generating program will simply return XDP_DROP if it runs up against the farthest end of the window (in effect busy polling until there is space to send more packets).
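The window check in the generating program can be sketched like this. Again, this is a Python model under assumptions: the `state` dictionary stands in for the shared map, its key names are made up, and the TCP checksum update that a real program needs after changing the sequence number is left out.

```python
import struct

SEQ_OFF = 14 + 40 + 4  # Ethernet + IPv6 headers, then TCP sequence number
MASK32 = 0xFFFFFFFF


def seq_after(a: int, b: int) -> bool:
    """True if sequence number a is later than b, modulo 2**32."""
    diff = (a - b) & MASK32
    return 0 < diff < 0x80000000


def tx_program(frame: bytearray, state: dict, payload_len: int) -> str:
    """Stamp the next sequence number, or back off if the window is full."""
    window_end = (state["snd_una"] + state["snd_wnd"]) & MASK32
    segment_end = (state["snd_nxt"] + payload_len) & MASK32
    if seq_after(segment_end, window_end):
        return "XDP_DROP"  # no room: the next repetition simply tries again

    struct.pack_into("!I", frame, SEQ_OFF, state["snd_nxt"])
    # A real program would also have to fix up the TCP checksum here.
    state["snd_nxt"] = segment_end
    return "XDP_REDIRECT"
```

The comparison is done modulo 2**32 because sequence numbers wrap around; with a plain integer comparison the generator would stall (or overrun the window) every time the sequence space wraps.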

Retransmissions are handled in the simplest way possible: whenever a duplicate ACK is detected on the receiver side, the sequence number sent out is simply reset to the start of the window (i.e., to the lowest unacked sequence number), meaning that the whole window is in effect transmitted again. The same thing happens on a (fixed) timer if no progress is made (otherwise things tend to get stuck when a packet is lost but the dupack is not received for whatever reason). This can definitely be improved to only resend what is needed, but that is significantly more complex as that also involves parsing and dealing with SACKs. And the way it’s done currently is enough to show that it’s possible to implement a TCP traffic generator in XDP, which was what I set out to do when implementing this.²
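This “go back to the start of the window” behaviour is simple enough to model in a few lines of Python. The class below is a hypothetical illustration, not the actual implementation: the names, the 0.2 second timeout and the injectable clock are all assumptions for the sake of the example.

```python
import time

MASK32 = 0xFFFFFFFF


def seq_after(a: int, b: int) -> bool:
    """True if sequence number a is later than b, modulo 2**32."""
    diff = (a - b) & MASK32
    return 0 < diff < 0x80000000


class RetransmitState:
    def __init__(self, snd_una: int, timeout: float = 0.2,
                 clock=time.monotonic):
        self.snd_una = snd_una    # lowest unacked sequence number
        self.snd_nxt = snd_una    # next sequence number to send
        self.timeout = timeout    # fixed timer; the value is an assumption
        self.clock = clock
        self.last_progress = clock()

    def on_ack(self, ack: int) -> None:
        if seq_after(ack, self.snd_una):
            self.snd_una = ack                 # new data acked: progress
            self.last_progress = self.clock()
        elif ack == self.snd_una:
            self.snd_nxt = self.snd_una        # duplicate ACK: resend window
        # ACKs older than snd_una are ignored

    def on_tick(self) -> None:
        if self.clock() - self.last_progress >= self.timeout:
            self.snd_nxt = self.snd_una        # stalled: resend window
            self.last_progress = self.clock()
```

Since there is no SACK handling, a single lost packet costs a retransmission of everything sent after it, which is wasteful but keeps the state down to two sequence numbers and a timestamp.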

In summary

The XDP traffic generator is a neat little utility that ships with xdp-tools and can be used to generate synthetic packet data at high rates, requiring only XDP support on the egress interface. The UDP generation is quite usable today already, and while the TCP support is somewhat rudimentary, it does work to stress-test a TCP receiver.

Because of the programmable nature of the XDP-based generator, it is also quite extensible. Suggestions for features to support are very welcome!

¹ One thing to be aware of when using this is that because of the recycling, if the XDP program modifies the packet before it is sent out, subsequent repetitions (in the same loop) will see the modified version instead of a pristine copy as passed by userspace. So the XDP program needs to be prepared to deal with this (for instance by parsing the packet before modifying it).
² Another thing that is missing, and which is a bit harder to do, is supporting sending real data over TCP (right now, the data is all zeroes).