internal/poll: transparently support new linux io_uring interface #31908
It should be feasible to fit this approach into our current netpoll framework. For some programs I think it would reduce the number of threads doing file I/O, and it could also reduce the number of system calls required to read from the network.

I'm concerned about coordinating access to the ring. The approach seems designed for high-performance communication between the application and the kernel, but it seems easiest to use for an application that uses a single thread for I/O, or in which each I/O thread uses its own ring. In Go, of course, each goroutine is acting independently, and it seems infeasible for each thread to have a separate ring. So that means that goroutines will need to coordinate their access to the I/O ring. That's fine, but on high core count systems that coordination could itself become a point of contention.
Thinking about this further, it's not clear that it makes sense to use this new interface for network I/O. It seems that it can only handle a fixed number of concurrent I/O requests, and it's going to be quite hard to make that work transparently and efficiently in Go programs. Without knowing what the Go program plans to do, we can't allocate the ring appropriately.

If that is true, we would only use it for file I/O, where we can reasonably delay I/O operations when the ring is full. In that case it would not fit into the netpoll system. Instead, we might in effect send all file I/O requests to a single goroutine which would add them to the ring, while a second goroutine would sleep until events were ready and wake up the goroutine waiting for them. That should limit the number of threads we need for file I/O and reduce the number of system calls.
I was wondering if we could use the map approach: when it becomes full we allocate a new, bigger one, and start submitting requests to the new one. Once the old one doesn't have pending requests anymore we deallocate it.
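A minimal sketch of that grow-and-drain idea, with a hypothetical ring type standing in for a real io_uring wrapper (setup, trySubmit, pending, and close here are assumptions for illustration, not an existing API):

```go
package ringgrow

import "errors"

var errRingFull = errors.New("io_uring submission queue full")

// ring is a hypothetical wrapper around one io_uring instance; setup would
// call io_uring_setup(2), trySubmit would report false when the SQ ring is
// full, and pending would count in-flight requests whose CQEs haven't arrived.
type ring struct{ entries int }

func setup(entries int) (*ring, error) { return &ring{entries: entries}, nil }
func (r *ring) trySubmit(op any) bool  { return true }
func (r *ring) pending() int           { return 0 }
func (r *ring) close()                 {}

// grower keeps submitting to the current ring; when it fills up, it allocates
// a bigger one and retires the old ring once its in-flight requests drain.
type grower struct {
	cur *ring
	old []*ring // full rings still draining in-flight requests
}

func (g *grower) submit(op any) error {
	g.reapOld()
	if g.cur.trySubmit(op) {
		return nil
	}
	bigger, err := setup(g.cur.entries * 2)
	if err != nil {
		return err
	}
	g.old = append(g.old, g.cur)
	g.cur = bigger
	if !g.cur.trySubmit(op) {
		return errRingFull // shouldn't happen with a fresh, larger ring
	}
	return nil
}

// reapOld deallocates retired rings that no longer have pending requests.
func (g *grower) reapOld() {
	kept := g.old[:0]
	for _, r := range g.old {
		if r.pending() > 0 {
			kept = append(kept, r)
			continue
		}
		r.close()
	}
	g.old = kept
}
```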
Can you elaborate on the "infeasible" part? Assuming having multiple rings is feasible, wouldn't having per-P rings work (with the appropriate fallback slow paths in case the per-P ring is full)? I'm not so familiar with the poller; is the problem that the model mismatch is too big?
File I/O is the main reason I'm interested in this. Whether we have a ring per P (as @CAFxX suggests) or just a single thread dedicated to managing a ring, either solution seems fine unless there's some measurable advantage to one over the other.
Using a ring per P seems possible, but it means that when a P is idle some M will have to be sleeping in an io_uring getevents call. And then there will have to be a way to wake up that M if we need it, and some way to avoid confusion if the P gets assigned to a different M. It may be doable but it seems pretty complicated.
I think it is a killer feature of Linux 5.1.
Apparently io_uring brings many benefits to networked I/O as well. Is that not the case, or is it just hard to accommodate the size of the ring?
@dahankzter See the discussion above. When I looked at io_uring, the concern for network I/O was the fixed ring size and the difficulty of sizing it appropriately without knowing what the Go program is going to do.
The fixed limit has been removed, and another networking concern was the fixed CQ ring size and the difficulty in sizing that. The latter was fixed with the IORING_FEAT_NODROP support, which guarantees that completion events are never dropped. I think that should take care of most of the concerns?

One thing I've been toying with is the idea of having a shared backend in terms of threads between rings. I think that'd work nicely and would allow you to just set up one ring per thread and not worry too much about overhead.
@axboe Thanks for the update. Where is the current API documented?
I wrote an update here: https://kernel.dk/io_uring-whatsnew.pdf and the man pages in liburing are also up-to-date.
I'm working on rio, a pure-Rust misuse-resistant io_uring library, and have been thinking a lot about edge cases that folks are likely to hit. One is exactly the overflow issue @axboe mentions, which is addressed with IORING_FEAT_NODROP.

In testing io_uring with sled I spin up thousands of database instances, each getting their own uring. This quickly causes ENOMEM to be hit, making this approach infeasible under that kind of test (details here). So, you may run into this if you go for a ring per goroutine. Maybe a ring per Go processor (P, as in proc.go) would be OK? Just be aware of this ENOMEM tendency.

Care needs to be taken when working with linked ops. If a linked SQE fails due to, say, a TCP client disconnecting, it will cancel everything down the chain, even if it wrote some bytes into the buffer passed to io_uring. On newer kernel versions you can use HARDLINK instead of LINK to write the partial data into downstream sockets/files/etc. even when the previous op received an error due to disconnection.

Regarding concurrent access, it's not too complex. Just make sure that the shared submission queue is protected by a lock (or only ever written by one goroutine at a time).

io_uring will change everything. It allows us to dramatically reduce our syscalls (even without SQPOLL mode, which spins up a kernel thread and negates the need to do a syscall to submit events). This is vital in a post-Meltdown/Spectre world where our syscalls have gotten much more expensive. It allows us to do real async file IO without O_DIRECT (but it also works great with O_DIRECT if you're building storage engines etc.). It lets us do things like write broadcast proxies where there's a read followed by a DRAIN barrier, then many socket writes that read from that same buffer. The goldrush has begun :)
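As a rough illustration of the LINK vs. HARDLINK point above: the IOSQE_* values below match the kernel's SQE flag bits, but the sqe struct and the prep helpers are hypothetical stand-ins for whatever Go io_uring wrapper is in use.

```go
package linkedops

const (
	IOSQE_IO_LINK     = 1 << 2 // chain to the next SQE; an error/short result severs the chain
	IOSQE_IO_HARDLINK = 1 << 3 // chain to the next SQE even if this one fails or is short
)

type sqe struct {
	opcode   uint8
	flags    uint8
	fd       int32
	userData uint64
	// addr, len, offset, ... omitted
}

// prepRecv and prepSend are hypothetical helpers that fill in an SQE.
func prepRecv(fd int32, buf []byte) sqe { return sqe{fd: fd} }
func prepSend(fd int32, buf []byte) sqe { return sqe{fd: fd} }

// recvThenForward queues a recv followed by a send of the same buffer.
// With IO_LINK, a client disconnect during the recv cancels the send even if
// some bytes were received; with IO_HARDLINK the send is not cancelled, which
// is what the comment above relies on to get partial data downstream.
func recvThenForward(src, dst int32, buf []byte, hard bool) [2]sqe {
	head := prepRecv(src, buf)
	if hard {
		head.flags |= IOSQE_IO_HARDLINK
	} else {
		head.flags |= IOSQE_IO_LINK
	}
	return [2]sqe{head, prepSend(dst, buf)}
}
```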
@spacejam do you have any interesting experiences that you can talk about from looking into using io_uring for async file I/O? I still find that to be the more compelling use case, since Linux did not have a proper story for async file I/O prior to io_uring, in my limited understanding. I'm definitely interested in seeing benchmarks of io_uring for networking compared to other async networking approaches, since if io_uring is implemented for file I/O, I would speculate that it might only be an incremental amount of work to support io_uring for network I/O as well.
@coder543 It actually started as async file IO, as async file IO previously only worked with O_DIRECT on Linux. With io_uring, it works for buffered and O_DIRECT. io_uring also supports networked IO. As of 5.4, you can do the regular read/readv or write/writev on the socket, or you can use sendmsg/recvmsg. 5.5 adds connect/accept support, and 5.6 will have support for send/recv as well.
@coder543 For bulk writing and reading on my 7th-gen Lenovo X1 Carbon with NVMe + full-disk encryption + ext4, using io_uring (without registered buffers, registered file descriptors, or SQPOLL mode), I can hit 5gbps while writing sequentially with O_DIRECT and 6.5gbps while reading sequentially with O_DIRECT.
High level overview: https://lwn.net/Articles/810414/
I think the focus should be on file I/O only, to begin with. Network I/O works just fine using netpoll already. If the model works out for file I/O, we can try it for network I/O also (modulo complications related to deadlines, but those don't seem insurmountable).

A ring per P sounds good to me. Each ring would maintain a cache of off-heap completion tokens which identify the parked goroutine and provide a slot to write the result of the operation to. These tokens would function much like the runtime's existing pollDesc structures. To perform I/O, a goroutine acquires the usual locks through the internal/poll FD, submits a request to its P's ring, and parks until the completion arrives.

Some pseudocode. The completion token:
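The pseudocode itself is not reproduced in this thread, so the following Go-flavored sketches are only a guess at its shape based on the description above, not the author's actual code; the sqe, cqe, and ring types are hypothetical stand-ins for the mmap'd submission and completion queues.

```go
// Hypothetical shapes only; field names are made up.
type completionToken struct {
	g     uintptr // the parked goroutine, stored off-heap (much like pollDesc is)
	res   int32   // CQE result: bytes transferred, or -errno
	flags uint32  // CQE flags
}

type sqe struct {
	opcode   uint8
	fd       int32
	userData uint64 // points at the completionToken for this request
	// addr, len, offset, flags, ... omitted
}

type cqe struct {
	userData uint64
	res      int32
	flags    uint32
}

type ring struct {
	// mmap'd SQ and CQ rings for one P, set up via io_uring_setup(2)
}
```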
SQ submission, parking the goroutine doing the submission:
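Continuing the same hypothetical sketch: fill an SQE whose user_data points at the token, publish it, enter the kernel, and park (gopark here stands in for the runtime's parking machinery).

```go
// Hypothetical sketch: submit one SQE for this token and wait for its CQE.
func (r *ring) submitAndWait(tok *completionToken, req sqe) (int32, error) {
	req.userData = uint64(uintptr(unsafe.Pointer(tok)))
	r.pushSQE(req) // write the SQE and publish the new SQ tail
	r.enter()      // io_uring_enter(2); could be batched with other submitters
	gopark(tok)    // park this goroutine until handleCompletions fills tok in
	if tok.res < 0 {
		return 0, syscall.Errno(-tok.res)
	}
	return tok.res, nil
}
```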
Handling completions:
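And the completion side of the same sketch: drain the CQ ring, copy each result into its token, and wake the parked goroutine (goready stands in for the runtime's wakeup path).

```go
// Hypothetical sketch: drain pending CQEs, copy results into tokens, wake goroutines.
func (r *ring) handleCompletions() {
	for {
		c, ok := r.peekCQE() // next unread CQE, if any
		if !ok {
			return
		}
		tok := (*completionToken)(unsafe.Pointer(uintptr(c.userData)))
		tok.res = c.res
		tok.flags = c.flags
		goready(tok.g)    // make the goroutine parked in submitAndWait runnable again
		r.advanceCQHead() // publish the new CQ head so the kernel can reuse the slot
	}
}
```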
The question of who / what handles completions remains. If I understand things correctly, after parking the current goroutine, we enter the scheduler and we are executing findrunnable.

I think we can leverage the existing netpoll infrastructure to solve this problem. We associate a non-blocking event file descriptor (as in eventfd(2)) with each ring, so that posted completions make the eventfd readable and the netpoller can watch it like any other file descriptor.

I have insufficient knowledge of the runtime to know whether what I am describing is, in fact, feasible. It is all very hand-wavy, and I haven't attempted an implementation (yet?). What do you think?
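A rough sketch of that eventfd idea, written outside the runtime: assume ringFD already came from io_uring_setup(2); the IORING_REGISTER_EVENTFD value is the kernel's, but everything else here is illustrative and minimally error-handled.

```go
package main

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

const ioringRegisterEventfd = 4 // IORING_REGISTER_EVENTFD (kernel ABI)

// registerEventfd attaches a non-blocking eventfd to the ring so that every
// completion posted to the CQ ring also makes the eventfd readable. A poller
// (epoll, or the runtime's netpoller) can then watch that fd and drain the CQ
// ring whenever it fires.
func registerEventfd(ringFD int) (int, error) {
	efd, err := unix.Eventfd(0, unix.EFD_NONBLOCK|unix.EFD_CLOEXEC)
	if err != nil {
		return -1, err
	}
	efd32 := int32(efd)
	_, _, errno := unix.Syscall6(unix.SYS_IO_URING_REGISTER,
		uintptr(ringFD), ioringRegisterEventfd,
		uintptr(unsafe.Pointer(&efd32)), 1, 0, 0)
	if errno != 0 {
		unix.Close(efd)
		return -1, errno
	}
	return efd, nil
}

func main() {}
```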
Went with the "single thread for I/O" approach and built a small POC here: https://github.com/agnivade/frodo :) Nothing serious, just something for me to learn about io_uring and do something with it. There are definitely some hairy edge cases due to my very limited experience with cgo. But it's at a stage where one can start playing around. I have thrown in a benchmark just for the sake of it, but it does not mean anything, as it's not an apples-to-apples comparison, and the cgo boundary alone brings an order-of-magnitude difference.
I've also been working on a pure Go library that has initial tests passing for reading from files. I wanted to have somewhat working code before looking at where to integrate it into the runtime. However, I've run into a few things that are rather difficult with Go's memory model regarding memory barriers and dealing with multiple writers to the submission ring, which would be less of an issue in the solution proposed by @acln0.
The git link points to Facebook Workplace; is this intentional?
|
I saw ScyllaDB adopted it; they talked about it in this article: https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/
I'm working on an easy-to-use io_uring library for Go (https://github.com/iceber/iouring-go). Both file IO and socket IO work fine, but the testing and documentation aren't perfect yet!
@Iceber I don't think it's an option to have the runtime or standard library depend on a third-party io_uring package for this, though.
@mvdan Yes, this is just an experiment for now, and ideally the standard library would provide these features, but really integrating it is a separate matter. I'm also going to try to develop it further.
Hi, I am working on a go-uring library. In addition to the liburing port itself, it provides a backend for I/O operations (named reactor). With its help, you can implement the net.Listener and net.Conn interfaces and compare the reactor (with an io_uring inside) against the standard mechanism, the netpoller. There is a benchmark (using the example of an echo server). In addition, that benchmark includes a comparison of the go-uring library against liburing. The results suggest that the ring can at least be an interesting alternative to the netpoller.
Windows 11 is introducing a new ioring API that seems almost identical [1] to io_uring, so any work in this area might be applicable on both Windows and Linux.
As of September 2022, can the Go runtime detect whether the kernel supports io_uring (and switch to using it)?
No. This issue is still open. |
Today I read an experience report on Hacker News where someone describes a way they found to work with io_uring that resulted in actual performance improvements over using epoll(2): https://news.ycombinator.com/item?id=35548968. Might be worth keeping in mind when experimenting with this in the netpoller.
Hi. I wrote a web framework that is based on io_uring: https://github.com/pawelgaczynski/gain. It is entirely written in Go and achieves really high performance. In a test environment based on an m6i.xlarge AWS EC2 instance, kernel 5.15.0-1026-aws and Go 1.19, it achieved 490k req/s in the plaintext TechEmpower benchmark.

Gain uses an internal port of the liburing library. It is not a full port. It focuses primarily on the networking part, but I don't see any problem extending the implementation to include the rest of the liburing-supported operations and publishing the port on GitHub as a standalone package. It would also be worthwhile to create an additional layer of abstraction to use the liburing port in a more idiomatic way, as currently the port is very close to the prototype implemented in C and may not be the most intuitive for Go programmers.

If anyone is interested in using my liburing port or would like to help develop it, please contact me by creating an issue in the Gain repository.
Hi. I have implemented and published an almost full port of the liburing library to the Go programming language (no cgo): https://github.com/pawelgaczynski/giouring. The giouring API is very similar to the liburing API, so the liburing documentation is also valid for giouring (see README.md for more details).
Are there still no plans to use io_uring instead of epoll in Go's runtime?
I wasn't sure whether to create a new issue or comment here, but we've run into some significant bottlenecks in the current netpoll implementation on high core-count machines.

I did a full writeup here, but the tl;dr is that in our use case we've got >1,500 TCP connections to ScyllaDB shards and >1,000 client connections talking to our ConnectRPC service. Everything is basically rack-local, so the requests are generally sub-millisecond, and hundreds if not >1k sockets become ready simultaneously, causing a block on EPollWait.

We had to solve this by breaking up our binary across many containers on the same host (8 containers, each pinned to 24 different cores), which made EPollWait completely disappear from our CPU profiles.
I brought this up in the Bluesky thread that the above came from, but one thing to consider is that Docker has now disabled io_uring in its default seccomp profile.
I think it's extremely common for Go code to be running within containers, and if a large bulk of them can't use the feature, that seems pretty unfortunate. Hopefully things get worked out such that io_uring becomes usable inside containers by default again.
Any update on this one? It seems like a massive win: async threads can submit into a central ring and sleep until a single syscall wakes them up. The gains here should be huge for hundreds of async tasks.
@KyleSanderson See the previous comment. It appears that io_uring is disabled by default in common container environments such as Docker.

Also, note that for Go it seems likely to help mainly for programs that do a lot of parallel local file I/O. I don't see why it would be a big win for network I/O, which the existing poller already handles reasonably well.
Ah I feel for you with bad information... This is already heavily used on storage and networking appliances. Being disallowed in certain environments is pretty normal, especially with large paradigm shifts.
I'm going to explain some rationale as clearly as I can. I'm sure it's going to be way too specific and way too vague all at the same time. 😸

Go networking today: a main function, an accept loop, and a per-connection goroutine (see the sketch below).
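For concreteness, here is a minimal sketch of that standard pattern, not anyone's production code: a listener, an accept loop, and a goroutine per connection, where today each Read and Write ends up as its own syscall serviced through netpoll.

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for { // accept loop
		conn, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handle(conn) // per-connection goroutine
	}
}

func handle(c net.Conn) {
	defer c.Close()
	buf := make([]byte, 4096)
	for {
		n, err := c.Read(buf) // each Read/Write is an individual syscall today
		if err != nil {
			if err != io.EOF {
				log.Print(err)
			}
			return
		}
		if _, err := c.Write(buf[:n]); err != nil {
			return
		}
	}
}
```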
There's a fundamental misunderstanding in this thread that this is just for polling. io_uring performs batched syscalls through a single syscall. This includes reading from and writing to descriptors. Reading a single byte from a stream has the same overhead as reading a single byte from 1000 streams. The power here is obviously huge, because the wakeup event includes the bytes you're about to request, from any descriptor, potentially at any position, shared between userspace and the kernel. There's a reason why Windows also added this API (IoRing): to stay competitive in the market.

Multi-threaded applications can now be a single thread if using the API natively, reading from thousands of descriptors simultaneously and then sending back on the thousand sockets, in a single call. The call pattern could be: on the syscall, submit the event through a channel (if there are multiple goroutines running) to a single io_uring worker, which would then wake up, add the work, and submit the group of syscalls again. Easier said than done, but testing this out in the language should be fairly hard to lose.

Anyway, happy to discuss if there are office hours or similar, but this is going to elevate Go even further once added, because it will natively beat naive C applications using old non-batched APIs.
Thanks for the information. This issue is about transparent use of io_uring by the runtime and standard library.

Also, since this is about the standard library, I do think that polling is the relevant consideration here. I don't yet see how to map the standard library operations onto io_uring's batched submission model.
Certainly. There would need to be a probe at the start of the application to understand what's available; same deal as calling cpuid.
At this call site, https://cs.opensource.google/go/x/sys/+/refs/tags/v0.24.0:unix/syscall_unix.go;l=166, as opposed to calling read directly, create a return channel, send the opcode and data reference in a struct to your io_uring ingestion loop, add it to the bundle of work, and resubmit your batch syscall request. When the referenced work comes back (identified by its context / user data), send the result back through the return channel and close it. The Read call has been waiting on the channel the entire time, and returns appropriately.

There are likely better ways to do it, but it should help the scheduler immensely, as opposed to spurious wakeups all over the place.
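Here is a rough sketch of that plumbing. Everything ring-related (the ring interface and its queue, submit, and completions methods) is a hypothetical stand-in; the point is the request/response channel pattern around a single ingestion goroutine, not a working io_uring binding.

```go
package uringdispatch

import "syscall"

type ioRequest struct {
	fd   int
	buf  []byte
	done chan ioResult // per-request return channel, closed after one send
}

type ioResult struct {
	n   int
	err error
}

// cqe mirrors the shape of an io_uring completion entry.
type cqe struct {
	userData uint64
	res      int32 // bytes transferred, or -errno
}

// ring is a hypothetical wrapper around one io_uring instance.
type ring interface {
	queue(fd int, buf []byte, userData uint64) // fill an SQE tagged with userData
	submit()                                   // io_uring_enter(2): one syscall for the whole batch
	completions() <-chan cqe                   // completed CQEs, however they are surfaced
}

// Read replaces a direct read(2) at the call site: package the request, hand
// it to the ingestion loop, and block on the return channel.
func Read(submit chan<- *ioRequest, fd int, p []byte) (int, error) {
	req := &ioRequest{fd: fd, buf: p, done: make(chan ioResult, 1)}
	submit <- req
	res := <-req.done
	return res.n, res.err
}

// Ingest is the single io_uring worker goroutine: it adds incoming requests
// to the ring, resubmits the batch, and completes requests as CQEs arrive.
func Ingest(submit <-chan *ioRequest, r ring) {
	inflight := map[uint64]*ioRequest{}
	var next uint64
	for {
		select {
		case req := <-submit:
			next++
			inflight[next] = req
			r.queue(req.fd, req.buf, next)
			r.submit()
		case c := <-r.completions():
			req := inflight[c.userData]
			delete(inflight, c.userData)
			if c.res < 0 {
				req.done <- ioResult{err: syscall.Errno(-c.res)}
			} else {
				req.done <- ioResult{n: int(c.res)}
			}
			close(req.done)
		}
	}
}
```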
While axboe/liburing/wiki: io_uring and networking in 2023 delivers guidance and best practices for converting networking applications to use io_uring, IMHO …
On the topic of becoming more amenable to transparent support for batching requests, there may be a bit of news cooking:
https://git.kernel.dk/cgit/linux/commit/?h=io_uring-sched-submit&id=f008f64f52496567bec199a47e9df224d71d963f |
A document on the latest Linux I/O syscall interface, io_uring, has made the rounds of discussion forums on the internet:
http://kernel.dk/io_uring.pdf
I wanted to open a discussion on whether we could (and should) add transparent support for this on supported Linux kernels.
LWN article: https://lwn.net/Articles/776703/