Hi everyone, I wanted to start a discussion about integrating uring seamlessly into tokio. I have a running (WIP) branch here: master...RedKinda:tokio:uring_refactor, and would like to hear some feedback.
## General
Generally, uring will perform best when you tune it to your specific application and its needs. However, there are significant speed improvements to be gained by using it in tokio even if untuned. Ideally, it would be a feature you can simply turn on with a boolean, and your app becomes magically faster, without any extra effort. Based on some early benchmarks, this is achievable.
## Uring per thread
Currently, the unstable uring implementation uses a global uring instance behind a `Mutex`. This is clearly a bottleneck for high-performance apps, especially ones interested in using uring. The general consensus seems to be that you should have one uring instance per thread; in tokio terms, there should be a thread-local uring instance per tokio worker thread. I did not benchmark these two implementations against each other, since this scenario is a bit difficult to benchmark.
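For concreteness, here is a minimal sketch of per-worker rings using the io-uring crate; the 256-entry queue depth and the eager initialization are assumptions for illustration, not necessarily what the branch does:

```rust
use std::cell::RefCell;

use io_uring::IoUring;

thread_local! {
    // One ring per worker thread. The 256-entry queue depth is a placeholder;
    // a real integration would want this to be tunable.
    static WORKER_RING: RefCell<IoUring> =
        RefCell::new(IoUring::new(256).expect("failed to create io_uring"));
}

/// Run `f` against the ring of whichever worker thread we are currently on.
fn with_ring<R>(f: impl FnOnce(&mut IoUring) -> R) -> R {
    WORKER_RING.with(|ring| f(&mut ring.borrow_mut()))
}
```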
I do implement uring-per-worker in the linked branch, but it has some issues at the moment. When a uring op is first polled, it submits the SQ entry to the uring of whatever worker thread it is running on, and then receives the result through a `oneshot::Sender`. The CQ is polled any time the worker is about to park, and again after unparking. The uring fd is also added to the epoll selector.

The first issue with this implementation is that if thread A is waiting on the epoll and an event arrives for thread B's CQ, thread A wakes up from the epoll and then has to wake up all the other worker threads, because each of them needs to check and process its own CQ. This causes some extra overhead, but only in scenarios where worker threads park a lot; the overhead goes away when the workers don't park.
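To make the submit-and-complete flow concrete, here is a hedged sketch using the io-uring crate; the function names and the `Box`-through-`user_data` trick are illustrative assumptions, not necessarily how the branch implements it:

```rust
use std::os::fd::RawFd;

use io_uring::{opcode, types, IoUring};
use tokio::sync::oneshot;

/// Submit a write SQE; the op's future then awaits the returned receiver.
/// Caveat: `buf` must stay alive until the corresponding CQE has arrived.
fn submit_write(
    ring: &mut IoUring,
    fd: RawFd,
    buf: &[u8],
) -> std::io::Result<oneshot::Receiver<i32>> {
    let (tx, rx) = oneshot::channel();
    // Smuggle the sender through user_data so the CQE handler can find it.
    let user_data = Box::into_raw(Box::new(tx)) as u64;
    let sqe = opcode::Write::new(types::Fd(fd), buf.as_ptr(), buf.len() as _)
        .build()
        .user_data(user_data);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit()?; // one io_uring_enter(2); see the batching idea below
    Ok(rx)
}

/// Drain completions; runs right before the worker parks and after it unparks.
fn drain_completions(ring: &mut IoUring) {
    while let Some(cqe) = ring.completion().next() {
        // Recover the sender and hand the result back to the waiting future.
        let tx = unsafe { Box::from_raw(cqe.user_data() as *mut oneshot::Sender<i32>) };
        let _ = tx.send(cqe.result());
    }
}
```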
The second issue: in my current implementation, the future that runs directly in `rt.block_on` does not drive the CQ correctly and gets stuck. This seems like a pure implementation bug, and I would appreciate guidance on how to resolve it. Futures running inside `tokio::spawn()` work correctly and don't get stuck.

## Registered buffers and zerocopy
Uring provides the option to pre-register buffers to be used with uring. From the uring documentation:

> Registered buffers is an optimization that is useful in conjunction with O_DIRECT reads and writes, where it maps the specified range into the kernel once when the buffer is registered rather than doing a map and unmap for each IO every time IO is performed to that region.

In addition to this (I might be wrong here), registered buffers should provide other slight performance benefits, like cache locality and some kernel-side optimizations. The obvious issue is: how do we know how many buffers, and how large, tokio should pre-register? Tokio could expose a way to configure this, but it should also have a sensible default.
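For reference, the registration step itself looks roughly like this with the io-uring and libc crates (the pool shape is a placeholder):

```rust
use io_uring::IoUring;

/// Register a fixed pool of buffers once; ops can then use WriteFixed /
/// ReadFixed with a buffer index instead of mapping memory on every IO.
fn register_pool(ring: &IoUring, pool: &mut [Vec<u8>]) -> std::io::Result<()> {
    let iovecs: Vec<libc::iovec> = pool
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr().cast(),
            iov_len: buf.len(),
        })
        .collect();
    // SAFETY: the kernel keeps references to these buffers; they must stay
    // alive and must not move until they are unregistered.
    unsafe { ring.submitter().register_buffers(&iovecs) }
}
```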
This is a bit of a segue into the SendZc uring operation. From https://lore.kernel.org/io-uring/fef75ea0-11b4-4815-8c66-7b19555b279d@kernel.dk/?s=09:

> MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well.

This would be good to verify, but the ZC variant seems to start being faster at around 3000-byte packets. Hence, it would make sense for tokio to register a bunch of 3000-byte buffers, and then when a write is executed, use either a registered buffer or SendZc. If the write is <3k and no registered buffer is free, we can simply fall back to a regular send. The performance of this scheme would need to be validated. Additionally, having "a bunch of" 3k buffers per worker thread adds up in memory consumption (say 8192 buffers of 3000 bytes per worker, times 16 workers, comes to ~375 MiB), so this should probably be opt-in as well.
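A hedged sketch of that size-based dispatch; the 3000-byte threshold is taken from the quoted post and would need benchmarking, and the registered-buffer path is elided:

```rust
use io_uring::{opcode, squeue, types};

// Crossover point quoted above; an assumption to validate, not a constant
// from tokio or io_uring.
const ZC_THRESHOLD: usize = 3000;

/// Build the SQE for a send; `buf` must stay alive until completion.
fn build_send_sqe(fd: types::Fd, buf: &[u8]) -> squeue::Entry {
    if buf.len() >= ZC_THRESHOLD {
        // Large payload: zero-copy send is expected to win past the crossover.
        // Note that SendZc also produces an extra notification CQE that the
        // completion handler has to account for.
        opcode::SendZc::new(fd, buf.as_ptr(), buf.len() as _).build()
    } else {
        // Small payload (or no registered buffer free): plain send.
        opcode::Send::new(fd, buf.as_ptr(), buf.len() as _).build()
    }
}
```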
## Other potential speedups

One of the big strengths of uring is that you can batch a bunch of operations without doing any syscalls. Currently tokio takes no advantage of this fact, and calls `enter()` after every operation is submitted. Submissions could be batched instead, for example by calling `enter()` in `maintenance()` and right before a thread is parked. This could improve throughput, but would likely increase the latency of individual operations, so it is probably not worth doing for all tokio users.
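A sketch of that deferral, again with hypothetical function names:

```rust
use io_uring::{squeue, IoUring};

/// Queue an op without any syscall; just push onto the in-memory SQ.
fn queue_op(ring: &mut IoUring, sqe: squeue::Entry) {
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    // Deliberately no ring.submit() here.
}

/// Flush everything queued since the last flush in a single io_uring_enter(2).
/// This would run in maintenance() and right before the worker parks.
fn flush_submissions(ring: &IoUring) -> std::io::Result<usize> {
    ring.submit()
}
```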
## Integration into Tokio

Currently in unstable tokio, `fs::write` and `OpenOptions::open` use uring if it is enabled. In my branch so far, I have implemented uring in `impl AsyncWrite for File` as a start, with a fallback to a regular write if uring is unavailable. This was mostly seamless and resulted in massive speedups in my benchmarks (with a latency tradeoff); see the benchmark details below. I plan on implementing AsyncRead/AsyncSeek in a similar way next, and thereby make `File` use uring in most of its operations.
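To illustrate the fallback pattern (not the branch's actual code; `uring_enabled` and the two poll helpers are hypothetical stand-ins for the real machinery):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::io::AsyncWrite;

struct File; // stand-in for tokio::fs::File's internals

fn uring_enabled() -> bool {
    // Placeholder gate; the real check is whether the runtime's uring
    // support is compiled in and available on this kernel.
    cfg!(target_os = "linux")
}

impl File {
    fn poll_uring_write(&mut self, _cx: &mut Context<'_>, buf: &[u8]) -> Poll<io::Result<usize>> {
        // Real code: push an SQE onto the current worker's ring, stash the
        // oneshot receiver, and return Pending until the CQE delivers a result.
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_fallback_write(&mut self, _cx: &mut Context<'_>, buf: &[u8]) -> Poll<io::Result<usize>> {
        // Real code: the existing blocking-pool write path.
        Poll::Ready(Ok(buf.len()))
    }
}

impl AsyncWrite for File {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        if uring_enabled() {
            self.poll_uring_write(cx, buf)
        } else {
            self.poll_fallback_write(cx, buf)
        }
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```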
## (Early) Benchmarks

Benchmarks are very specialized and should not be taken too seriously, but they are useful as a rough reference point for where the implementation stands. With that said, I implemented two benchmarks and ran them on my Intel i7-10875H (16) @ 5.100GHz twice, once without the uring feature and once with it. The differences quoted below are the change with uring compared to regular tokio.

For some context, the `write_one` benchmark performs a single write per iteration from a single task (sketched below).
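Roughly, assuming a small payload and the same `/dev/null` target as `write_a_lot` (a hedged reconstruction, not the exact benchmark code):

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

/// Hypothetical reconstruction: one task doing one small write per iteration,
/// so the benchmark effectively measures per-write latency.
async fn write_one(iters: u64) -> std::io::Result<()> {
    let mut file = File::create("/dev/null").await?;
    for _ in 0..iters {
        file.write_all(b"some bytes").await?;
    }
    Ok(())
}
```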
The `write_a_lot` benchmark spawns 32 tasks that each write 32 times to `/dev/null`. The results suggest that my current implementation is ~18% slower when you do a single write on repeat from a single task, i.e. the latency is worse. However, at high throughput, uring outperforms the current implementation by 4x.

## Next steps
Personally, I would love to get uring into tokio and would love to work on it. My next steps would be implementing `AsyncRead` and `AsyncSeek` for `File`, and then later moving on to `TcpStream`. I appreciate any and all feedback on what I wrote and on my current implementation, especially if there are any glaring issues that need to be resolved before I bring uring to other places. I am also not sure how to structure this into different PRs; it would probably make sense to have a separate PR for implementing uring-per-thread, and then individual PRs for `File`, `TcpStream`, etc. Any guidance here would also be appreciated :)