Hi everyone, I wanted to start a discussion about integrating uring seamlessly into tokio. I have a running (WIP) branch here: master...RedKinda:tokio:uring_refactor, and would like to hear some feedback.
## General
Generally, uring will perform best when you tune it to your specific application and its needs. However, there are significant speed improvements to be gained by using it in tokio even if untuned. Ideally, it would be a feature you can simply turn on with a boolean, and your app becomes magically faster, without any extra effort. Based on some early benchmarks, this is achievable.
## Uring per thread
Currently, the unstable uring implementation uses a global uring instance behind a `Mutex`. This is clearly a bottleneck for high-performance apps, especially ones interested in using uring. The general consensus seems to be that you should have one uring instance per thread; in tokio terms, there should be a thread-local uring instance per tokio worker thread. I did not benchmark these two implementations against each other, since this scenario is a bit difficult to benchmark.
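For concreteness, here is a minimal sketch of per-worker rings using the io-uring crate; the 256-entry queue depth and the eager initialization are assumptions for illustration, not necessarily what the branch does:

```rust
use std::cell::RefCell;

use io_uring::IoUring;

thread_local! {
    // One ring per worker thread. The 256-entry queue depth is a placeholder;
    // a real integration would want this to be tunable.
    static WORKER_RING: RefCell<IoUring> =
        RefCell::new(IoUring::new(256).expect("failed to create io_uring"));
}

/// Run `f` against the ring of whichever worker thread we are currently on.
fn with_ring<R>(f: impl FnOnce(&mut IoUring) -> R) -> R {
    WORKER_RING.with(|ring| f(&mut ring.borrow_mut()))
}
```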
I do implement uring-per-worker in the linked branch, but it has some issues at the moment. When a uring op is first polled, it submits the SQ entry to the uring of whatever worker thread it is running on, and then receives the result through a `oneshot::Sender`. The CQ is polled any time the worker is about to park, and again after unparking. The uring fd is also added to the epoll selector.

The first issue with this implementation is that if thread A is waiting on the epoll and an event arrives for thread B's CQ, thread A wakes up from the epoll and then has to wake up all the other worker threads, because each of them needs to check and process its own CQ. This causes some extra overhead, but only in scenarios where worker threads park a lot; the overhead goes away when the workers don't park.
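To make the submit-and-complete flow concrete, here is a hedged sketch using the io-uring crate; the function names and the `Box`-through-`user_data` trick are illustrative assumptions, not necessarily how the branch implements it:

```rust
use std::os::fd::RawFd;

use io_uring::{opcode, types, IoUring};
use tokio::sync::oneshot;

/// Submit a write SQE; the op's future then awaits the returned receiver.
/// Caveat: `buf` must stay alive until the corresponding CQE has arrived.
fn submit_write(
    ring: &mut IoUring,
    fd: RawFd,
    buf: &[u8],
) -> std::io::Result<oneshot::Receiver<i32>> {
    let (tx, rx) = oneshot::channel();
    // Smuggle the sender through user_data so the CQE handler can find it.
    let user_data = Box::into_raw(Box::new(tx)) as u64;
    let sqe = opcode::Write::new(types::Fd(fd), buf.as_ptr(), buf.len() as _)
        .build()
        .user_data(user_data);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit()?; // one io_uring_enter(2); see the batching idea below
    Ok(rx)
}

/// Drain completions; runs right before the worker parks and after it unparks.
fn drain_completions(ring: &mut IoUring) {
    while let Some(cqe) = ring.completion().next() {
        // Recover the sender and hand the result back to the waiting future.
        let tx = unsafe { Box::from_raw(cqe.user_data() as *mut oneshot::Sender<i32>) };
        let _ = tx.send(cqe.result());
    }
}
```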
The second issue: in my current implementation, the future that runs directly in `rt.block_on` does not drive the CQ correctly and gets stuck. This seems like a pure implementation bug, and I would appreciate guidance on how to resolve it. Futures running inside `tokio::spawn()` work correctly and don't get stuck.

## Registered buffers and zerocopy
Uring provides the option to pre-register buffers to be used with uring. From the uring documentation:

> Registered buffers is an optimization that is useful in conjunction with O_DIRECT reads and writes, where it maps the specified range into the kernel once when the buffer is registered rather than doing a map and unmap for each IO every time IO is performed to that region.

In addition to this (I might be wrong here), registered buffers should provide other slight performance benefits, like cache locality and some kernel-side optimizations. The obvious issue is: how do we know how many buffers, and how large, tokio should pre-register? Tokio could expose a way to configure this, but it should also have a sensible default.
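For reference, the registration step itself looks roughly like this with the io-uring and libc crates (the pool shape is a placeholder):

```rust
use io_uring::IoUring;

/// Register a fixed pool of buffers once; ops can then use WriteFixed /
/// ReadFixed with a buffer index instead of mapping memory on every IO.
fn register_pool(ring: &IoUring, pool: &mut [Vec<u8>]) -> std::io::Result<()> {
    let iovecs: Vec<libc::iovec> = pool
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr().cast(),
            iov_len: buf.len(),
        })
        .collect();
    // SAFETY: the kernel keeps references to these buffers; they must stay
    // alive and must not move until they are unregistered.
    unsafe { ring.submitter().register_buffers(&iovecs) }
}
```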
This is a bit of a segue into the SendZc uring operation. From https://lore.kernel.org/io-uring/fef75ea0-11b4-4815-8c66-7b19555b279d@kernel.dk/?s=09:

> MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well.

This would be good to verify, but the ZC variant seems to start being faster at around 3000-byte packets. Hence, it would make sense for tokio to register a bunch of 3000-byte buffers, and then when a write is executed, use either a registered buffer or SendZc. If the write is <3k and no registered buffer is free, we can simply fall back to a regular send. The performance of this scheme would need to be validated. Additionally, having "a bunch of" 3k buffers per worker thread adds up in memory consumption (say 8192 buffers of 3000 bytes per worker, times 16 workers, comes to ~375 MiB), so this should probably be opt-in as well.
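A hedged sketch of that size-based dispatch; the 3000-byte threshold is taken from the quoted post and would need benchmarking, and the registered-buffer path is elided:

```rust
use io_uring::{opcode, squeue, types};

// Crossover point quoted above; an assumption to validate, not a constant
// from tokio or io_uring.
const ZC_THRESHOLD: usize = 3000;

/// Build the SQE for a send; `buf` must stay alive until completion.
fn build_send_sqe(fd: types::Fd, buf: &[u8]) -> squeue::Entry {
    if buf.len() >= ZC_THRESHOLD {
        // Large payload: zero-copy send is expected to win past the crossover.
        // Note that SendZc also produces an extra notification CQE that the
        // completion handler has to account for.
        opcode::SendZc::new(fd, buf.as_ptr(), buf.len() as _).build()
    } else {
        // Small payload (or no registered buffer free): plain send.
        opcode::Send::new(fd, buf.as_ptr(), buf.len() as _).build()
    }
}
```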
## Other potential speedups

One of the big strengths of uring is that you can batch a bunch of operations without doing any syscalls. Currently tokio takes no advantage of this fact, and calls `enter()` after every operation is submitted. Submissions could be batched instead, for example by calling `enter()` in `maintenance()` and right before a thread is parked. This could improve throughput, but would likely increase the latency of individual operations, so it is probably not worth doing for all tokio users.
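A sketch of that deferral, again with hypothetical function names:

```rust
use io_uring::{squeue, IoUring};

/// Queue an op without any syscall; just push onto the in-memory SQ.
fn queue_op(ring: &mut IoUring, sqe: squeue::Entry) {
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    // Deliberately no ring.submit() here.
}

/// Flush everything queued since the last flush in a single io_uring_enter(2).
/// This would run in maintenance() and right before the worker parks.
fn flush_submissions(ring: &IoUring) -> std::io::Result<usize> {
    ring.submit()
}
```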
## Integration into Tokio

Currently in unstable tokio, `fs::write` and `OpenOptions::open` use uring if it is enabled. In my branch so far, I have implemented uring in `impl AsyncWrite for File` as a start, with a fallback to a regular write if uring is unavailable. This was mostly seamless and resulted in massive speedups in my benchmarks (with a latency tradeoff); see the benchmark details below. I plan on implementing AsyncRead/AsyncSeek in a similar way next, and thereby make `File` use uring in most of its operations.
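To illustrate the fallback pattern (not the branch's actual code; `uring_enabled` and the two poll helpers are hypothetical stand-ins for the real machinery):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::io::AsyncWrite;

struct File; // stand-in for tokio::fs::File's internals

fn uring_enabled() -> bool {
    // Placeholder gate; the real check is whether the runtime's uring
    // support is compiled in and available on this kernel.
    cfg!(target_os = "linux")
}

impl File {
    fn poll_uring_write(&mut self, _cx: &mut Context<'_>, buf: &[u8]) -> Poll<io::Result<usize>> {
        // Real code: push an SQE onto the current worker's ring, stash the
        // oneshot receiver, and return Pending until the CQE delivers a result.
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_fallback_write(&mut self, _cx: &mut Context<'_>, buf: &[u8]) -> Poll<io::Result<usize>> {
        // Real code: the existing blocking-pool write path.
        Poll::Ready(Ok(buf.len()))
    }
}

impl AsyncWrite for File {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        if uring_enabled() {
            self.poll_uring_write(cx, buf)
        } else {
            self.poll_fallback_write(cx, buf)
        }
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```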
## (Early) Benchmarks

Benchmarks are very specialized and should not be taken too seriously, but they are useful as a rough reference point for where the implementation stands. With that said, I implemented two benchmarks and ran them on my Intel i7-10875H (16) @ 5.100GHz twice, once without the uring feature and once with it. The differences quoted below are the change with uring compared to regular tokio.

For some context, the `write_one` benchmark performs a single write per iteration from a single task (sketched below).
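Roughly, assuming a small payload and the same `/dev/null` target as `write_a_lot` (a hedged reconstruction, not the exact benchmark code):

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

/// Hypothetical reconstruction: one task doing one small write per iteration,
/// so the benchmark effectively measures per-write latency.
async fn write_one(iters: u64) -> std::io::Result<()> {
    let mut file = File::create("/dev/null").await?;
    for _ in 0..iters {
        file.write_all(b"some bytes").await?;
    }
    Ok(())
}
```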
The `write_a_lot` benchmark spawns 32 tasks that each write 32 times to `/dev/null`. The results suggest that my current implementation is ~18% slower when you do a single write on repeat from a single task, i.e. the latency is worse. However, at high throughput, uring outperforms the current implementation by 4x.

## Next steps
Personally, I would love to get uring into tokio and would love to work on it. My next steps would be implementing `AsyncRead` and `AsyncSeek` for `File`, and then later moving on to `TcpStream`. I appreciate any and all feedback on what I wrote and on my current implementation, especially if there are any glaring issues that need to be resolved before I bring uring to other places. I am also not sure how to structure this into different PRs; it would probably make sense to have a separate PR for implementing uring-per-thread, and then individual PRs for `File`, `TcpStream`, etc. Any guidance here would also be appreciated :)