asyncio: Added async i/o APIs. #10605
Any specific reason why a callback-based system was chosen? The alternative would be a poll-based interface. With a callback-based system this gets complex (you have to maintain a multithread-safe hash table for all operations, and the callback sets the value).
This is one of those "you can't please everyone" things. I started with a poll interface and was asked to change it to callbacks. You could just have an array of bools and have the callback set each one to true as pieces finish loading, and each frame check for true instead of polling for DONE...or something like that.
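For illustration, a minimal sketch of that flag-array idea (the callback shape, with a void pointer carrying a piece index, is hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIECES 16
static atomic_bool piece_done[NUM_PIECES];  /* atomics avoid torn reads across threads */

/* Called on an i/o thread when one piece finishes loading; the callback
   signature here is illustrative, not real SDL API. */
static void OnPieceLoaded(void *userdata)
{
    const int idx = (int) (intptr_t) userdata;
    atomic_store(&piece_done[idx], true);
}

/* Called once per frame on the main thread. */
static bool AllPiecesLoaded(void)
{
    for (int i = 0; i < NUM_PIECES; i++) {
        if (!atomic_load(&piece_done[i])) {
            return false;
        }
    }
    return true;
}
```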
This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?
Looks pretty good. I do have some input:
While this is probably fine heuristically, I would probably add a hard maximum of 64. There are a lot of people gaming on old workstations whose core counts would exceed that cap, and they'd be unlikely to benefit from more threads in the typical case of I/O on a single drive.
This is fine. A note in the documentation that callbacks should try to be brief and forward heavy processing to other threads may be warranted, because many assets do need reasonably heavy processing.
I don't disagree, but it might be more straightforward to start with an implementation that can portably share logic. A relatively thin shim over pread/pwrite on the Unix-likes and ReadFile/WriteFile on Win32 can enable a single implementation with better characteristics than the generic implementation. Emscripten would benefit most from its own implementation; IIRC all the libc synchronous I/O actually gets proxied to the main runtime thread in order to simplify handling the promises from the underlying JS interfaces.
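For illustration, a minimal sketch of that shim idea, assuming one hypothetical internal entry point (SysPRead and IOHandle are not real SDL internals):

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
typedef HANDLE IOHandle;
/* Positional read: supplying an OVERLAPPED offset to a handle opened *without*
   FILE_FLAG_OVERLAPPED performs a synchronous read at that offset. */
static int64_t SysPRead(IOHandle h, void *buf, uint32_t len, uint64_t offset)
{
    OVERLAPPED ov = { 0 };
    DWORD bytes_read = 0;
    ov.Offset = (DWORD) (offset & 0xFFFFFFFFu);
    ov.OffsetHigh = (DWORD) (offset >> 32);
    if (!ReadFile(h, buf, (DWORD) len, &bytes_read, &ov)) {
        return -1;
    }
    return (int64_t) bytes_read;
}
#else
#include <unistd.h>
typedef int IOHandle;
/* pread() never moves the file position, so threads can share one descriptor. */
static int64_t SysPRead(IOHandle h, void *buf, uint32_t len, uint64_t offset)
{
    return (int64_t) pread(h, buf, (size_t) len, (off_t) offset);
}
#endif
```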
Android doesn't have a thing for this. Mac and iOS have dispatch_io, which just doesn't mesh well with this interface (in particular, there is no way to supply a read buffer) and, at least when it was new, wasn't any better for regular files than a simpler threadpool anyway.
Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?
Having a fixed number of not-enough threads makes the system unusable for many real projects, and having a few too many idle threads is always cheaper than creating them on-demand. This is what POSIX AIO implementations generally got wrong, and why nobody really uses them for heavy I/O (except for FreeBSD's implementation, I guess). Tunability as a range, with a heuristic selection within that range, might be appropriate. All of the factors that would determine the optimal number of threads for an I/O-busy program are not really statically knowable, but a user could profile their particular I/O pattern across a good variety of machines and come up with a range that might be better than relying on the heuristic alone.
LoadFileAsync is useful because it has to open the file, figure out how large the file is, allocate memory for it, and read into it. WriteFileAsync would only need to open the file and write an existing buffer, so it's only saving one step. But we could definitely add it! Storage just needs an equivalent of LoadFileAsync; in the built-in implementation, it will literally just call SDL_LoadFileAsync. Current thinking was not to implement positional reads/writes at the storage level, but I'm still trying to understand what an SDL_Storage interface does when you have a file like Spider-Man 2's New York map, which is presumably dozens of gigabytes.
Right now it spins up threads on-demand if there aren't any idle ones, up to a maximum count, and ones that are idle for more than 30 seconds will start terminating until only one is left (we always keep one around so there isn't a failure case when no threads can start). Until you use async i/o, it won't initialize anything, so there's no resource cost at all until you touch it. The idea being that you likely have a flood of startup loading, and then there's not a lot of value in keeping idle threads around. There's a FIXME to add a hint, and we definitely should tune the defaults in any case.
Okay, single-threaded Emscripten support is in, so this is Good Enough to ship and improve with system-specific APIs later...but do we really want this callback API? It forces people to deal with locking, and to worry that they'll clog the thread pool if they don't move quickly enough. Maybe it would be better if we gave them something like: SDL_GetAsyncResults(&details). Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks; you just get the next one in the completed list. The call doesn't block, and returns false if nothing is ready. This dramatically simplifies SDL's code and the mental load an app needs to manage. The app can still toss work to a background thread to process the loaded data if they want...but if they just didn't want to block on i/o, they can otherwise still benefit in a single-threaded app.
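To sketch how that might look from the app side (SDL_GetAsyncResults is the name proposed above; the details struct and its fields are hypothetical guesses):

```c
#include <stdbool.h>
#include <stddef.h>

/* Shape of the proposal; none of these are real SDL symbols, and the
   struct fields are guesses for illustration. */
typedef struct AsyncResultDetails {
    void *app_userdata;        /* whatever the app attached to the request */
    bool success;              /* did the i/o complete without error? */
    void *buffer;              /* loaded data, for read requests */
    size_t bytes_transferred;
} AsyncResultDetails;

extern bool SDL_GetAsyncResults(AsyncResultDetails *details);      /* proposed above, hypothetical */
extern void HandleLoadedAsset(const AsyncResultDetails *details);  /* hypothetical app code */

void RunOneFrame(void)
{
    AsyncResultDetails details;
    /* Drain everything that completed since last frame; never blocks,
       returns false when nothing is ready. */
    while (SDL_GetAsyncResults(&details)) {
        HandleLoadedAsset(&details);
    }
    /* ...rest of the frame... */
}
```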
Sure, I buy it, let's do that instead.
The one thing that would be nice is a way to wait for multiple I/O operations to complete: WaitForCompletion(array of operations).
You probably want a separate handle for each request. For example, if a library does async loading, it wants to get its own data separate from the app's.
Allowing for separate completion queues (probably specified when creating an async file object, with NULL referring to some default completion queue) would probably be the most useful form of this approach, since different threads would have different interests. That also conveniently translates very directly to how IOCP and io_uring both deal with completions. It should also be convenient for tying I/O ops directly to different asset pipelines.
I think allowing separate queues of I/O operations, and then enumerating within each queue, would work well.
(force-pushed from d04bd1b to 5e312e7)
Okay, this now has queryable tasks and "task groups", and the background thread callbacks are gone. I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

CategoryIOAsync: SDL offers a way to perform i/o asynchronously. This allows an app to read or write files without waiting for data to actually transfer; the functions that request i/o never block while the request is fulfilled. Instead, the data moves in the background and the app can check for results at its leisure. This is more complicated than just reading and writing files in a synchronous way, but it can allow for more efficiency, and never having framerate drops as the hard drive catches up, etc. The general usage pattern for async i/o is:
If you want to track several tasks at once, regardless of the order in which they complete:
This all works, without blocking, in a single thread, but one can also use a task group in a background thread and sleep until new results have arrived:
And, of course, to match the synchronous SDL_LoadFile, we offer SDL_LoadFileAsync as a convenience function. This will handle allocating a buffer, slurping in the file data, and null-terminating it; you still get a task handle to check later or add to a task group.
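For illustration, a minimal sketch of the pattern the excerpt describes, using the queue-based names the API settles on later in this thread rather than the "task group" naming above (ConsumeFileData is a hypothetical app function):

```c
#include <SDL3/SDL.h>

extern void ConsumeFileData(void *data, Uint64 len);  /* hypothetical app function */

static SDL_AsyncIOQueue *queue = NULL;

/* At startup: create a queue and kick off a load; nothing here blocks on i/o. */
bool StartAsyncLoads(void)
{
    queue = SDL_CreateAsyncIOQueue();
    if (!queue) {
        return false;
    }
    /* Convenience path: allocates a buffer, reads the whole file, null-terminates it. */
    return SDL_LoadFileAsync("assets/level1.dat", queue, NULL);
}

/* Once per frame: drain whatever has finished, in completion order. */
void PollAsyncLoads(void)
{
    SDL_AsyncIOOutcome outcome;
    while (SDL_GetAsyncIOResult(queue, &outcome)) {  /* returns false if nothing is ready */
        if (outcome.result == SDL_ASYNCIO_COMPLETE) {
            ConsumeFileData(outcome.buffer, outcome.bytes_transferred);
        }
        SDL_free(outcome.buffer);  /* allocated by SDL_LoadFileAsync; SDL_free(NULL) is safe */
    }
}
```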
(force-pushed from 5e312e7 to 320a93f)
Nice!
(force-pushed from 320a93f to 78bc0d2)
Hi - I was sent this direction because I've been peripherally involved in SDL_Gpu (doing reviews focused on performance and answering questions about PCI and DMA). I happen to build storage APIs professionally, though, so looking at this pull request is far more up my alley. I had a really long review I was working on, but the recent changes you've made addressed about 80% of what I was going to say, so this will now be much shorter (but still probably long). I think for now I'll stick to just public API feedback, and maybe I'll post in a few separate posts.

First, I do not think it is possible to implement the API calls that do not require a task group efficiently on Linux. On Linux, your asynchronous I/O choices are either Linux AIO (not to be confused with POSIX AIO) or io_uring. Both of these APIs require you to first create some per-thread context (e.g. a task group) in order to submit I/O. Creating these on demand per-file is expensive and may run into OS-imposed limits on the number of io_urings or aio_contexts available. I strongly suggest that you only expose the APIs that require a task group - it's the only way to get true asynchronous I/O. I've personally made a similar mistake with APIs released 10+ years ago, and it was really painful to find my way out of the corner I painted myself into.

Second, opening and closing the file should be asynchronous operations too (and they should be submitted to the task group). These can actually secretly block for a long time in some cases. Of course, on many operating systems there isn't an asynchronous open/close option at all, and the implementations there will be blocking. But at least on Linux with io_uring you can asynchronously open and close files, so the API should allow for the best behavior rather than forcing the worst.

More review to follow.
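To make the per-context point concrete, here is a minimal liburing sketch (an illustration, not code from this PR): every submission and completion flows through a ring that must be set up in advance, which maps naturally onto a task group object.

```c
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>

int ring_read_example(const char *path, void *buf, unsigned len)
{
    /* The per-context setup cost lives here; a ring like this maps
       naturally onto a "task group" created up front. */
    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) != 0) {
        return -1;
    }

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        io_uring_queue_exit(&ring);
        return -1;
    }

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);  /* read len bytes at offset 0 */
    io_uring_submit(&ring);                    /* hand the submission to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);            /* completion arrives asynchronously */
    int result = cqe->res;                     /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return result;
}
```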
(Sorry, I just squashed this down to a single commit right before you posted that; it's all the same stuff, just in one commit.) I'm skeptical that async opens are worth the API complexity, but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight. As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to move data from the OS's single context/completion port to our own task groups, and mostly sleeps. If that's not a terrible idea, then it doesn't matter whether the app wants to query a single task or a task group; it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.
(Also, more review is totally welcome at any length you're willing to write, because I'm 100% winging it over here like I know what I'm doing.)
The shift to support "task groups" is an excellent and necessary change. However, I do still have a few pieces of feedback on the task group design.
(More coming)
Using a thread is, unequivocally, a terrible idea (sorry, that sounds really harsh and I promise I don't mean it to be, but I've been at this a long time and everyone wants to use a thread). The hardware (and I've helped design NVMe from close to the beginning) wants you to create a separate "queue" per thread. The hardware accepts command submissions in parallel (and they're all asynchronous too!). And the operating systems have spent the last decade re-designing themselves to do parallel, asynchronous I/O internally to match the hardware. You are supposed to create a per-thread "context" (io_uring, aio_context, ioRing, iocp, whatever) at thread startup time, and use that exclusively for the I/O operations you perform on that thread. This will allow for parallel, lockless I/O submission and provide the lowest latency.
I have noticed file open latency before, but I've never seen it be significant for a single file, and the many-file use case is probably covered adequately by SDL_LoadFileAsync and perhaps similar composite operations. I'd agree it's probably not worth the complexity.
An async sync-and-close would be useful for save files. System APIs typically will not wait for cached writes to hit the disk before closing a file, so having this behavior be explicit as a composite operation is useful.
Windows will block inside an Overlapped ReadFile when the underlying disk driver's queue is full. It's not that hard to hit this with random reads, because Windows disk drivers have shallow queues compared to e.g. Linux. This is unfortunate because other pending reads to the same device might still succeed without blocking by hitting the filesystem cache, and when combined with other uses of Overlapped I/O (e.g. sockets) that should never block, this is obviously even worse. It will also block on Overlapped WriteFile for writes that extend a file, which is a pretty normal thing to happen. Because of this behavior, the typical procedure is to still use a threadpool to issue overlapped I/O. Probably a lesser concern, but Wine also doesn't attempt to provide asynchronous I/O to regular files at all.

As far as io_uring goes, I definitely agree with jaedan that this would be a suboptimal use of it, but I also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic. A ring wrapped in a mutex for each task group is probably the least bad approach here, although I definitely lack jaedan's level of experience with io_uring. Extending the API with batching operations would probably bring that kind of implementation closer to optimal (also an improvement for other implementations), but that can be a later enhancement IMO.
We could create a heavyweight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so users get the benefit of that without having to explicitly track things themselves. It might be worth adding an io_uring and I/O Ring implementation to validate that this API works efficiently with those interfaces. The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't. This seems like it's intended as a low-level, high-performance interface to disk files. If that's true, then let's bump this out of the ABI milestone.
On Linux for many, many years (via Linux aio), and on Windows for the last three, this is no longer the standard procedure. We shouldn't make the API inefficient on modern operating systems by design, but rather accept the inefficiency only on operating systems with poor asynchronous IO support. Let's imagine we have a system with a background thread that does the IO processing. This thread may spin while there's IO outstanding for efficiency, but it certainly can't just spin all the time. IO in games is sporadic, so most commonly this thread will be asleep when we next start a burst of IO. So we queue up the operation into the shared queue or ring, and then signal the background thread to wake. On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if the thread reacts instantly, just the mechanics of restoring its state take a significant chunk of time. All of this is added to our IO request's latency, yet the underlying SSD will most likely complete the request in 20-100 microseconds. Why would we 10x our latency because of software design? It makes no sense.
I really can't disagree more on both points. These objects are not cheap to create, because they usually do memory allocations and system calls, but actual resource use is very minimal: usually it's 64 bytes * the number of queue entries. And you make one of these per thread that wants to perform IO. If you have true asynchronous IO, then you only need a very limited number of threads in your game doing asset loading. We're talking kilobytes of total resource use here. And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your IO operations; at end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works, and it would be how an optimal networking API would function as well.
Removing SDL_AsyncIOTask cleans it up nicely, but does it make sense to add a function to cancel all the I/O on a queue?
I was thinking about that as an option. One still needs to wait for the completion, though, to free buffers that are in flight, so all it gains you is that maybe some tasks finish faster, or maybe a little less disk i/o gets started. We could have it as a flag to SDL_DestroyAsyncIOQueue, to say "I'm going down and I don't care if this stuff finishes or not, get me out of here asap."
Hey, @madebr: I'm getting test failures like this:
But running |
Actually, this doesn't need a flag. Since you can't get the results without the queue, it's reasonable to cancel everything by default to make the queue destruction finish more quickly; the app has to assume all of it failed anyway, since any of it could fail without warning.
...of course, we don't actually have a list of pending tasks for the queue to cancel them, so screw it, I'm leaving it alone. I'll probably remove the cancel stuff from the implementation once this merges, so we can reasonably find it in revision control if we want to restore it later.
Still need to figure out the memory leaks in the test program, but otherwise, I think we should merge this. The only thing pending is what to do with the Storage interface, but it's becoming clear we'll want to rev that interface for other reasons, too, so it doesn't make sense to hold this up in the meantime. |
(Ping @madebr)
Run with the command line option |
Configure your CMake project with [...]. I was hoping CI would show the stack traces when building with [...]. It's a long shot, but maybe.
Ah, awesome, thanks!
This appears to be an SDL_AddTimer leak...I'm not sure why this PR is triggering it when main isn't. I'll dig in for a second though. |
Interestingly,
To be clear, SDL_Quit() legit cleans up the completed timer list, which we intentionally keep as an allocation pool until SDL_Quit is called, so even SDL_RemoveTimer() will "leak" in this regard, as it currently stands.
Does allocation and deallocation happen through SDL_malloc/SDL_free?
Ugh, how mortifying:

```diff
diff --git a/src/SDL.c b/src/SDL.c
index 0fdf9713f..5e6493588 100644
--- a/src/SDL.c
+++ b/src/SDL.c
@@ -624,7 +625,7 @@ void SDL_Quit(void)
     SDL_DBus_Quit();
 #endif

-    SDL_QuitTimers();
+    SDL_QuitAsyncIO();

     SDL_SetObjectsInvalid();
     SDL_AssertionsQuit();
```
Okay, this is the moment, let's merge this...! |
Is there a specific reason to use `__kernel_time64_t`?

```diff
--- a/src/file/io_uring/SDL_asyncio_liburing.c
+++ b/src/file/io_uring/SDL_asyncio_liburing.c
@@ -284,3 +284,3 @@ static SDL_AsyncIOTask *liburing_asyncioqueue_wait_results(void *userdata, Sint32 timeoutMS)
     } else {
-        struct __kernel_timespec ts = { (__kernel_time64_t) timeoutMS / SDL_MS_PER_SECOND, (long long) SDL_MS_TO_NS(timeoutMS % SDL_MS_PER_SECOND) };
+        struct __kernel_timespec ts = { (Sint64) timeoutMS / SDL_MS_PER_SECOND, (Sint64) SDL_MS_TO_NS(timeoutMS % SDL_MS_PER_SECOND) };
         liburing.io_uring_wait_cqe_timeout(&queuedata->ring, &cqe, &ts);
```
I think it's just what the struct uses, at least on the headers I looked at, but the PR to change this should be fine.
Just adding a "best practices" section to the async i/o docs. @jaedan and @nfries88, does this cover it well?

Best Practices

Simple non-blocking i/o--for an app that just wants to pick up data whenever it's ready without losing framerate waiting on disks to spin--can use whatever pattern works well for the program. In this case, simply call SDL_ReadAsyncIO, or maybe SDL_LoadFileAsync, as needed. Once a frame, call SDL_GetAsyncIOResult to check for any completed tasks and deal with the data as it arrives.

If two separate pieces of the same program need their own i/o, it is legal for each to create their own queue. This will prevent either piece from accidentally consuming the other's completed tasks. Each queue does require some amount of resources, but it is not an overwhelming cost.

Do not make a queue for each task, however. It is better to put many tasks into a single queue. They will be reported in order of completion, not in the order they were submitted, so it doesn't generally matter what order tasks are started.

One async i/o queue can be shared by multiple threads, or one thread can have more than one queue, but the most efficient way--if ruthless efficiency is the goal--is to have one queue per thread, with multiple threads working in parallel, and attempt to keep each queue loaded with tasks that are both started by and consumed by the same thread. On modern platforms that can use newer interfaces, this can keep data flowing as efficiently as possible all the way from storage hardware to the app, with no contention between threads for access to the same queue.

Written data is not guaranteed to make it to physical media by the time a closing task is completed, unless SDL_CloseAsyncIO is called with its flush parameter set to true.
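For illustration, a minimal sketch of the one-queue-per-thread pattern described above, using SDL_WaitAsyncIOResult (the blocking counterpart of SDL_GetAsyncIOResult); LoaderThread and the file list are hypothetical:

```c
#include <SDL3/SDL.h>

/* Started via SDL_CreateThread(LoaderThread, "loader", file_list);
   file_list is a NULL-terminated array of path strings. */
static int SDLCALL LoaderThread(void *userdata)
{
    const char **files = (const char **) userdata;
    SDL_AsyncIOQueue *queue = SDL_CreateAsyncIOQueue();  /* owned by this thread */
    if (!queue) {
        return -1;
    }

    int pending = 0;
    for (int i = 0; files[i] != NULL; i++) {
        if (SDL_LoadFileAsync(files[i], queue, NULL)) {
            pending++;
        }
    }

    SDL_AsyncIOOutcome outcome;
    while (pending > 0 && SDL_WaitAsyncIOResult(queue, &outcome, -1)) {  /* sleeps until a result arrives */
        /* hand outcome.buffer off to the asset pipeline here... */
        SDL_free(outcome.buffer);
        pending--;
    }

    SDL_DestroyAsyncIOQueue(queue);
    return 0;
}
```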
I would probably add a note that for very simple cases, or where the data is needed immediately, it is generally more efficient and much simpler to just use normal blocking I/O. But aside from that, the advice looks pretty complete.
This is still in-progress, but I've been heads-down on this for awhile, so I'm putting it in a draft PR.
This implements an async i/o API, and a "generic" backend that implements it with a pool of SDL threads that can block on synchronous SDL_IOStreams. This is likely "good enough" for most things, and it's usable on everything but single-threaded Emscripten builds.
The downsides to the generic backend:

- It uses a limited number of threads ((SDL_GetCPUCount * 2) + 1 at most), and it uses the same thread pool that is used for notifying apps that i/o is complete, which is to say that if the app uses that callback for heavy processing, until the callback returns, it won't be available for doing more i/o. In practice, you'd have to have lots of reads/writes that also have very heavy callbacks before you'd see bottlenecks, I think, but system-specific APIs can just get all the i/o cooking separately regardless of callback weight. (EDIT: this isn't true now. The generic backend uses up to 8 threads for i/o, but they are only limited by disk bandwidth, and no longer have to worry about the app hogging them in a callback.)
- It can't ever work for single-threaded Emscripten, because it needs threads. (EDIT: it works single-threaded in Emscripten now...it's just synchronous, because Emscripten's filesystem API is synchronous and almost always just a memcpy from MEMFS anyhow.)

This would benefit from a Windows backend that uses win32 IO Completion Ports to solve these limitations. Linux can use io_uring if that's generally available at this point. Emscripten could use actual async file APIs and run stuff between mainloop iterations if single-threaded (EDIT: see above), or hand off from there to the usual thread pool if not. I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Fixes #1374.
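For reference, a hypothetical sketch of the thread-cap heuristic mentioned above, combined with the hard maximum of 64 suggested earlier in the review (not the literal code from the PR):

```c
#include <SDL3/SDL.h>

static int GetMaxAsyncIOThreads(void)
{
    int max_threads = (SDL_GetCPUCount() * 2) + 1;
    if (max_threads > 64) {
        max_threads = 64;  /* old many-core workstations rarely benefit past this */
    }
    return max_threads;
}
```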