TCP zero-copy is a feature of the Linux kernel that makes it possible to send and receive data without incurring an extra copy between kernel memory and the memory buffer that holds the final data (in userspace, or even in the memory of a different device on the system).

Copying data adds overhead, so avoiding it is appealing. The kernel features that enable this are quite new, and figuring out exactly how they work under the hood is not trivial. So in this post I’ll try to summarise what is actually going on when these features are used.

There are actually a couple of (interrelated) features in play here, and also several different kernel APIs that can be used for zero-copy. I’ll try to cover them in roughly the order they were added to the kernel, starting with the simple TX-side zero-copy mode.

Send side zero-copy of userspace buffers

The sender-side zero-copy operation is the oldest, added all the way back in August 2017 with commit f214f915e7db (“tcp: enable MSG_ZEROCOPY”). This makes it possible to specify the MSG_ZEROCOPY flag when sending data on a TCP socket using the sendmsg() syscall. This syscall takes an iovec structure with pointer(s) to the data buffer(s), instead of taking the data buffer directly as an argument the way send() does.
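As a concrete sketch (based on the kernel's msg_zerocopy documentation, not a definitive implementation), a minimal zero-copy send enables the SO_ZEROCOPY socket option once, then passes MSG_ZEROCOPY to sendmsg(). The helper names here are my own; the fallback constant definitions are for older libc headers that predate these flags:

```c
// Sketch: zero-copy send with sendmsg() + MSG_ZEROCOPY.
// Helper names are illustrative, not a kernel or libc API.
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60        /* fallback for older libc headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt in to zero-copy on the socket; required before MSG_ZEROCOPY
 * sends are honoured (the flag is otherwise ignored for
 * backwards compatibility). */
static int enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Send one buffer without copying it into the kernel. The caller must
 * keep buf valid and unmodified until the completion notification for
 * this send arrives on the socket error queue. */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    return sendmsg(fd, &msg, MSG_ZEROCOPY);
}
```

Note that on loopback (and on hardware without scatter-gather support) the kernel will quietly fall back to copying, so this code is safe to run anywhere; the fallback is only visible in the completion notification.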

When using this flag, the kernel won’t copy the data from userspace into the kernel. Instead, it will build an skb structure that references the userspace data buffers directly, and pass that down through the stack. The kernel TCP stack will generate the headers for the TCP packet(s) that carry the data like it normally does, but instead of prepending them to the same buffer that holds the data, the headers will be placed in a separate (kernel memory) buffer. This also means that if the network device does not support scatter-gather DMA operations (where different data chunks come from different places), zero-copy sending won’t work, and the kernel will end up copying the data instead. This, and more details about how to use MSG_ZEROCOPY, are documented in the kernel docs.

Since the memory is transferred directly from userspace to the network device, the userspace application has to keep it around, unmodified, until the kernel has finished sending it. The sendmsg() syscall itself is asynchronous, and will return without waiting for this. Instead, once the memory buffers are no longer needed by the stack, the kernel will post a notification to userspace that the buffers can be reused. This is done via the socket error queue (recvmsg() with the MSG_ERRQUEUE flag set), which the userspace application has to poll to know when it can free or reuse the memory. The notification will also carry a flag if zero-copy was not successful and the kernel had to copy the data for whatever reason (such as missing hardware support).

Using io_uring for zero-copy transfer

The astute reader probably noticed that the API outlined above is really an asynchronous data transfer API. The kernel already has a general purpose async API for lots of different operations, in the form of io_uring. And indeed, io_uring gained support for zero-copy TCP in 2022.

Using the zero-copy send operation with io_uring is relatively straightforward. There’s a new io_uring_prep_send_zc() operation that can be used in place of the regular send operation, and which works in a similar way. There’s also a new notification type for when the memory buffers are no longer required, similar to the one put on the error queue in the sendmsg() API. The submission linked above has more details.

Receive side zero-copy and memory registration

While send-side zero-copy is relatively straightforward, the receive side is a bit more complicated. This is because the send side can rely on scatter-gather DMA to assemble the packet headers and the data payload from different memory buffers. But on receive, the reverse has to be done: the kernel needs to process the TCP headers, while the payload should go directly to the destination buffer.

To achieve this, hardware assistance is needed. Specifically, the NIC needs to support a couple of features:

  • TCP header split. This is the ability for the hardware to parse the packet headers, identify the boundary between TCP headers and data, and send the headers and data to two different memory locations.

  • Receive queue page_pool memory binding operations. This is a driver feature where the driver supports replacing the underlying memory provider used to supply the hardware with memory buffers, on a per-queue basis. It relies on the page_pool kernel abstraction for network packet memory, which has seen increasing adoption in network drivers in recent years.

Once these prerequisites are in place, an application can register a memory region to be used for zero-copy receive on a given NIC receive queue. There’s an io_uring operation for this, io_uring_register_ifq(), and a netlink command, NETDEV_CMD_BIND_RX, although the latter can’t actually be used with userspace memory buffers (see the section on device memory below).

Under the hood, they work similarly: In both cases, the kernel will allocate a scatter-gather table for the memory region, split the memory up into page-sized chunks, and bind a custom memory provider to the page_pool instance used by the netdevice RXQ. Registration will fail if the network device does not have TCP data split enabled.

The page_pool structure is used by the NIC driver to allocate pages that are used to populate the NIC receive ring. When a custom backing provider is used, these pages will be allocated from the region of memory registered by the userspace application, which is what allows packet data to be copied directly into the right memory region.

The memory provider config ensures that each RXQ is only bound once, but multiple RXQs can share the same binding (and memory provider).

One consequence of this design is that once the memory buffer has been registered with the queue, all traffic arriving on that queue will end up somewhere in the bound memory. This means that, usually, flow steering is needed to make sure that only the traffic intended to go into the zero-copy buffers hits that receive queue. The NICs that support this usually have pretty extensive flow steering facilities built in, but there is nothing in the kernel ensuring this is done correctly; the application using zero-copy has to make sure everything is set up correctly.

Another consequence is that the application has no way to steer where in the registered memory the packet data ends up. The memory buffer is registered as a whole, at which point the kernel chops it up into page-sized chunks as described above. Once data arrives, the application will be notified of available data buffers as they arrive, with no guarantees on ordering, meaning that the application has to be able to handle data being spread out over arbitrary memory chunks (analogous to how the hardware has to support scatter-gather data on the send side).

Using zero-copy with device memory

The final wrinkle to the TCP zero-copy picture is the support for using it with memory that is not in userspace, but resides on a different device in the system altogether (such as a storage device, or GPU memory). Support for using device memory with TCP like this was added in September 2024. Interestingly, this was before the io_uring zero-copy receive support described above was merged in February of 2025.

For the receive side, using device memory works similarly to the zero-copy to userspace memory operation: A data buffer is registered with the NIC receive queue, and that is used for incoming packets. The only difference is that the memory being supplied with the registration command is a dma-buf file descriptor referring to a chunk of device memory, instead of a userspace memory buffer. Such a device memory buffer can be registered either through io_uring, or through the netlink API mentioned above. If using the netlink API, notifications of incoming data can be received with recvmsg(), by passing the MSG_SOCK_DEVMEM flag to that system call.

On the transmit side, zero-copy from device memory works similarly to zero-copy from userspace memory, but an extra step is needed: a binding has to be created for the memory using the NETDEV_CMD_BIND_TX genl command, which binds the memory region to a network device transmission queue. When creating the binding, the kernel builds the same scatter-gather table as in the RX case, and in addition populates a tx_vec table that maps offsets in the bound memory area to the iovec structures referring to the dmabuf binding. Once this registration is performed, zero-copy transmit can be done the same way as described above, with the difference that the memory addresses supplied to sendmsg() are interpreted as offsets into the bound memory instead of absolute addresses.

The TX side of device memory zero-copy was added in May 2025 with commit bd61848900bf (“net: devmem: Implement TX path”), and still has somewhat limited device driver support. It is also not supported through io_uring yet. Support for io_uring is likely to materialise at some point, though, as is support in more drivers.

Closing remarks

As should be apparent from the preceding sections, the support for zero-copy TCP has quite a few components, and requires some relatively complicated setup to get working. So is it worth it?

The cover letter to the io_uring patch series cites a 30-40% improvement in throughput for a single flow on a single CPU, which is roughly on the order of the improvement I’ve seen in my own tests. Note, however, that this is for high-speed NICs in tests that transfer lots of data, where the setup costs can be amortised over a large transfer. In that sense the zero-copy support is a bulk throughput optimisation, not a low-latency one. You’re unlikely to get much benefit on your laptop, but in the data centre there may be some benefit to be had for certain applications. In particular, I’m hopeful that this can serve as an alternative to specialised transports and fabrics such as RDMA and InfiniBand, bringing very high data rate transfers to more environments that don’t have the specialised hardware needed for these more expensive technologies.