An interesting feature that I implemented about a year ago (it first appeared in the 5.18 kernel) is the ability to make the BPF_PROG_RUN command of the bpf() syscall run in a “live packet mode” for XDP. One reason this is interesting is that it makes it possible to implement a programmable packet generator that leverages the programmability and performance of XDP to achieve very high packet rates, without the hassle of setting up a DPDK-based stack. This feature has, however, not been that well publicised; this post is an attempt to remedy that.
First, a demo
The latest release of xdp-tools contains the xdp-trafficgen utility, which uses the facility described in this post. It supports generating UDP traffic, and has rudimentary support for TCP as well. It is quite simple to use; for instance, to generate UDP traffic one simply invokes it as:
# xdp-trafficgen udp --dst-mac aa:bb:cc:dd:ee:ff --dst-addr fe80::1 --dst-port 10000 ens1f0
Transmitting on ens1f0 (ifindex 3)
lo->ens1f0 0 err/s 8,037,760 xmit/s
lo->ens1f0 0 err/s 8,064,528 xmit/s
lo->ens1f0 0 err/s 8,093,072 xmit/s
lo->ens1f0 0 err/s 8,078,808 xmit/s
lo->ens1f0 0 err/s 8,089,553 xmit/s
lo->ens1f0 0 err/s 8,056,951 xmit/s
lo->ens1f0 0 err/s 8,061,210 xmit/s
^C
This shows the traffic generator transmitting 8 million packets per second on a single CPU. The only thing this requires from the hardware is that the target device supports being the target of an XDP redirect.
The performance scales pretty linearly when sending from more threads (and thus utilising multiple CPU cores):
# xdp-trafficgen udp --dst-mac aa:bb:cc:dd:ee:ff --dst-addr fe80::1 --dst-port 10000 --threads 5 ens1f0
Transmitting on ens1f0 (ifindex 3)
lo->ens1f0 0 err/s 39,992,128 xmit/s
lo->ens1f0 0 err/s 40,311,421 xmit/s
lo->ens1f0 0 err/s 40,304,242 xmit/s
lo->ens1f0 0 err/s 40,356,115 xmit/s
lo->ens1f0 0 err/s 40,310,905 xmit/s
lo->ens1f0 0 err/s 40,236,664 xmit/s
lo->ens1f0 0 err/s 40,275,489 xmit/s
^C
There’s also rudimentary support for generating TCP traffic:
# xdp-trafficgen tcp -p 10000 -i ens1f1 fe80::1
Connected to fe80::1 port 10000 from fe80::ee0d:9aff:fedb:1135 port 49660
lo->ens1f1 0 err/s 5,958,853 xmit/s
lo->ens1f1 0 err/s 5,980,502 xmit/s
lo->ens1f1 0 err/s 5,960,844 xmit/s
lo->ens1f1 0 err/s 5,972,524 xmit/s
lo->ens1f1 0 err/s 5,970,290 xmit/s
^C
Note that the packet rate is lower because these are full-size packets, whereas the UDP test generates small (64-byte) packets. The rate above corresponds to just over 70 Gbit/s (roughly 6 million 1,514-byte frames per second). However, while this can produce quite high packet rates, there is no congestion control, and the retransmission logic is quite naive. So I wouldn’t recommend running this against a target on the internet; however, for stress-testing a local TCP implementation in controlled conditions, it can be useful. More about how it works below.
How does this all work?
As mentioned above, the traffic generation works by using the BPF_PROG_RUN command of the bpf() syscall. Originally, this command was called BPF_PROG_TEST_RUN, and was meant for unit testing BPF programs. It allows a userspace application to execute a BPF program inside the kernel, supplying it with a fake context structure from userspace (and, for XDP and other networking hooks, also with the packet data), and then read out the program return code afterwards. This can be used to check that a BPF program behaves correctly for a given input.
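As an illustration, running an XDP program against a dummy packet and checking its return code might look something like the sketch below, using libbpf’s bpf_prog_test_run_opts() wrapper (error handling and the actual packet contents are elided; this is not code from xdp-tools itself):

#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Run the XDP program referred to by prog_fd once, on a dummy packet,
 * and check the verdict it returns. */
int test_run_once(int prog_fd)
{
        __u8 pkt[64] = {};      /* a hand-crafted test packet would go here */
        DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
                            .data_in = pkt,
                            .data_size_in = sizeof(pkt));
        int err;

        err = bpf_prog_test_run_opts(prog_fd, &opts);
        if (err)
                return err;

        /* opts.retval holds the XDP verdict (XDP_PASS, XDP_DROP, ...)
         * the program returned for this packet */
        return opts.retval == XDP_PASS ? 0 : -1;
}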
However, it soon became apparent that this was useful for more than testing: since BPF programs can have side effects, loading and running such a program can be a way to instruct the kernel to do something. Indeed, there’s a whole BPF program type that can only be executed this way (the “syscall” type), which is used to load other programs in the “light skeletons”.
The new kernel feature used by the XDP traffic generator extends this mode of running with side effects to XDP programs. By way of a new flag, BPF_F_TEST_XDP_LIVE_FRAMES, passed to the XDP execution in BPF_PROG_RUN, userspace can request the kernel to switch execution mode for the program being passed in. When running in this live packet mode, the kernel will also react to the program return code and, for instance, perform a redirect of the packet if the program returns XDP_REDIRECT.
The redirect facility is exactly what the traffic generator uses; the userspace application will prepare a buffer with the packet data, filling in the source and destination address according to the command line parameters, computing the checksum, etc. Then, it will pass that buffer to the kernel, and run an XDP program that redirects the packet to the chosen interface, causing it to be transmitted.
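The XDP program needed for this can be almost trivially simple. The sketch below shows the basic idea; the real program shipped with xdp-trafficgen is a bit more elaborate, and the ifindex_out constant is just a stand-in for however the output interface gets configured:

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Placeholder: the ifindex of the egress device, set by userspace
 * (e.g., by rewriting this constant before loading the program) */
volatile const int ifindex_out = 3;

SEC("xdp")
int xdp_redirect_out(struct xdp_md *ctx)
{
        /* Send the frame handed in via BPF_PROG_RUN out of the target device */
        return bpf_redirect(ifindex_out, 0);
}

char _license[] SEC("license") = "GPL";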
In the default (UDP) operating mode, the XDP program won’t even touch the packet data; it’ll just immediately return XDP_REDIRECT. In the “dynamic port” mode, by contrast, the XDP program will keep a counter of the next port to send to and dynamically rewrite the packet data before transmitting the packet, thus generating traffic with dynamic destination ports without having to go back to userspace for each packet.
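A sketch of what such a rewrite can look like is shown below. The header layout, configuration values and counter handling are illustrative assumptions, not the exact code used by xdp-trafficgen:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ipv6.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

volatile const int ifindex_out = 3;             /* output interface (placeholder) */
volatile const __u16 port_start = 10000;        /* first destination port */
volatile const __u16 port_range = 100;          /* number of ports to cycle through */

__u16 next_port;        /* counter kept in the program's .bss between runs */

SEC("xdp")
int xdp_dyn_port(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct udphdr *udp = data + sizeof(struct ethhdr) + sizeof(struct ipv6hdr);

        /* Assumes the packet built by userspace is Ethernet + IPv6 + UDP */
        if ((void *)(udp + 1) > data_end)
                return XDP_ABORTED;

        udp->dest = bpf_htons(port_start + next_port);
        next_port = (next_port + 1) % port_range;

        /* A real implementation would also fix up the UDP checksum here */
        return bpf_redirect(ifindex_out, 0);
}

char _license[] SEC("license") = "GPL";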
To achieve high performance, there’s a repeat parameter to the syscall, which will make the kernel run the XDP program multiple times in a loop before returning to userspace. Because the kernel also sets up a page_pool structure to efficiently recycle the memory used for packet data, this gives the packet generator performance comparable to what can be achieved when redirecting packets from one interface to another.1
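Putting the userspace side together, enabling live packet mode with a large repeat count looks roughly like the sketch below (assuming the packet buffer and the redirecting XDP program have already been prepared as described above):

#include <linux/bpf.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Transmit the packet in 'pkt' repeatedly by running prog_fd in live mode */
int run_live(int prog_fd, void *pkt, __u32 pkt_len)
{
        DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
                            .data_in = pkt,
                            .data_size_in = pkt_len,
                            .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
                            .repeat = 1 << 20);

        /* Each syscall runs the program (and thus sends a packet) up to
         * 'repeat' times before returning to userspace */
        return bpf_prog_test_run_opts(prog_fd, &opts);
}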
Dealing with TCP
Generating packets and sending them out is quite straightforward using this mechanism, and the programmability makes it very flexible. However, to implement a TCP-based packet generator, it is not enough to just blast packets at a target: the receiver will just consider these invalid and ignore them. Instead, a TCP traffic generator will need to first perform a handshake to open the connection, and then produce valid TCP packets that fit into the connection window. On the face of it, this is by no means straightforward to do with the BPF_PROG_RUN facility, so how does the XDP traffic generator implement this? Well, it turns out there are a couple of tricks we can combine to get to a working TCP mode.
First off is the handshake. There’s a whole state machine dance to this, and lots of corner cases to deal with. Fortunately, the Linux kernel has a mature and battle-tested TCP implementation, so we can just piggyback on that! Specifically, the XDP traffic generator will just perform a regular connect() call, and let the kernel deal with the handshake. Then, once the connection has been established, it will take over the connection with the XDP-based handling.
Taking over the connection is done by installing an XDP program on the interface the connection is going over. Since this program sees the packets before the kernel stack does, it can parse those packets and extract the information it needs (specifically, the sequence numbers and window information), and then return XDP_DROP, thus ensuring the kernel never sees those packets. From the point of view of the kernel, the TCP connection just looks idle, while the XDP program handles the rest of the connection.
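The sketch below shows the shape of such a receive-side program; the map layout and the exact state being tracked are illustrative assumptions, not the precise scheme used by xdp-trafficgen:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct tcp_flow_state {
        __u32 ack_seq;  /* highest sequence number ACKed by the peer */
        __u32 window;   /* receive window advertised by the peer */
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct tcp_flow_state);
} flow_state SEC(".maps");

SEC("xdp")
int xdp_tcp_ack_track(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct tcphdr *tcp = data + sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
        struct tcp_flow_state *state;
        __u32 key = 0;

        if ((void *)(tcp + 1) > data_end)
                return XDP_PASS;

        /* A real program would first check that this packet belongs to the
         * generator's flow (addresses and ports) and let everything else
         * through to the stack unmodified */
        state = bpf_map_lookup_elem(&flow_state, &key);
        if (!state)
                return XDP_PASS;

        state->ack_seq = bpf_ntohl(tcp->ack_seq);
        state->window = bpf_ntohs(tcp->window); /* ignoring window scaling */

        return XDP_DROP; /* hide the packet from the kernel TCP stack */
}

char _license[] SEC("license") = "GPL";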
Generating the bulk traffic is done in the same way as in UDP mode: an initial packet is generated in a buffer in userspace and passed to the kernel, where the XDP program will update the sequence numbers to match those expected by the other end of the connection. Synchronisation with the second XDP program is done by way of a shared map, and the generating program will simply return XDP_DROP if it runs up against the end of the window (in effect busy-polling until there is space to send more packets).
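Continuing the illustrative sketch from above (reusing its includes, the flow_state map and an ifindex_out constant), the window check in the transmit-side program might look roughly like this:

__u32 next_seq;         /* next sequence number to send, kept in .bss */

SEC("xdp")
int xdp_tcp_send(struct xdp_md *ctx)
{
        struct tcp_flow_state *state;
        __u32 key = 0;

        state = bpf_map_lookup_elem(&flow_state, &key);
        if (!state)
                return XDP_ABORTED;

        /* Window full: drop this packet, so the next repeat iteration in
         * effect busy-polls until the receiver opens the window again */
        if (next_seq - state->ack_seq >= state->window)
                return XDP_DROP;

        /* ... rewrite the TCP sequence number in the packet to next_seq,
         * advance next_seq by the payload length, fix up the checksum ... */
        return bpf_redirect(ifindex_out, 0);
}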
Retransmissions are handled in the simplest way possible: whenever a duplicate ACK is detected on the receiver side, the sequence number sent out is simply reset to the start of the window (i.e., to the lowest unacked sequence number), meaning that the whole window is in effect transmitted again. The same thing happens on a (fixed) timer if no progress is made (otherwise things tend to get stuck when a packet is lost but the dupack is not received for whatever reason). This could definitely be improved to only resend what is needed, but that is significantly more complex, as it also involves parsing and dealing with SACKs. And the way it’s done currently is enough to show that it’s possible to implement a TCP traffic generator in XDP, which was what I set out to do when implementing this.2
In summary
The XDP traffic generator is a neat little utility that ships with xdp-tools and can be used to generate synthetic packet data at high rates, requiring only XDP support on the egress interface. The UDP generation is quite usable today already, and while the TCP support is somewhat rudimentary, it does work to stress-test a TCP receiver.
Because of the programmable nature of the XDP-based generator, it is also quite extensible. Suggestions for features to support are very welcome!
1. One thing to be aware of when using this is that because of the recycling, if the XDP program modifies the packet before it is sent out, subsequent repetitions (in the same loop) will see the modified version instead of a pristine copy as passed in by userspace. So the XDP program needs to be prepared to deal with this (for instance by parsing the packet before modifying it).
2. Another thing that is missing, and which is a bit harder to do, is supporting sending real data over TCP (right now, the data is all zeroes).