asyncio: Added async i/o APIs. #10605


Merged: 5 commits merged into libsdl-org:main from sdl3-asyncio on Dec 3, 2024

Conversation

icculus
Collaborator

@icculus icculus commented Aug 28, 2024

This is still in progress, but I've been heads-down on this for a while, so I'm putting it in a draft PR.

This implements an async i/o API, and a "generic" backend that implements it with a pool of SDL threads that can block on synchronous SDL_IOStreams. This is likely "good enough" for most things, and it's usable on everything but single-threaded Emscripten builds.

The downsides to the generic backend:

  • While i/o happens in a background thread, you can't do two accesses to the same open file in parallel, because the SDL_IOStream needs to seek for each one, so the file position state would race. There's a mutex protecting it per opened file. Separate files (even two opens of the same file) can work in parallel, though.
  • It can't cancel in-flight i/o; if a read/write is happening, it can't be stopped before it completes. In practice, this isn't likely a big deal.
  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) + 1 at most), and it uses the same thread pool that is used for notifying apps that i/o is complete, which is to say that if the app uses that callback for heavy processing, the thread won't be available for more i/o until the callback returns. In practice, you'd have to have lots of reads/writes that also have very heavy callbacks before you'd see bottlenecks, I think, but system-specific APIs can just get all the i/o cooking separately regardless of callback weight. (EDIT: this isn't true now. The generic backend uses up to 8 threads for i/o, but they are only limited by disk bandwidth, and no longer have to worry about the app hogging them in a callback.)
  • It can't ever work for single-threaded Emscripten, because it needs threads. (EDIT: it works in single-threaded Emscripten now...it's just synchronous, because Emscripten's filesystem API is synchronous and almost always just a memcpy from MEMFS anyhow.)

This would benefit from a Windows backend, that uses win32 IO Completion Ports, to solve these limitations. Linux can use io_uring if that's generally available at this point. Emscripten could use actual async file APIs and run stuff between mainloop iterations if single-threaded (EDIT: see above), or hand off from there to the usual thread pool if not. I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Fixes #1374.
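
To make the generic backend concrete, here is a minimal sketch (not SDL's actual internals) of what one pool thread does for a read request, under the per-file mutex described above. The AsyncReadTask struct and DoAsyncRead helper are hypothetical names used only for illustration; SDL_SeekIO, SDL_ReadIO, and SDL_LockMutex are the real synchronous SDL calls the backend leans on.

#include <SDL3/SDL.h>

/* Hypothetical task record; the real backend's bookkeeping differs. */
typedef struct AsyncReadTask
{
    SDL_IOStream *io;    /* synchronous stream backing this async file */
    SDL_Mutex *lock;     /* per-opened-file lock protecting the seek position */
    Uint64 offset;       /* where to read from */
    void *buffer;        /* where the data goes */
    size_t size;         /* how much to read */
    size_t transferred;  /* filled in on completion */
    bool failed;
} AsyncReadTask;

/* Runs on one of the i/o pool threads. */
static void DoAsyncRead(AsyncReadTask *task)
{
    SDL_LockMutex(task->lock);  /* two tasks on the same open file can't overlap... */
    if (SDL_SeekIO(task->io, (Sint64) task->offset, SDL_IO_SEEK_SET) < 0) {
        task->failed = true;
    } else {
        task->transferred = SDL_ReadIO(task->io, task->buffer, task->size);
        task->failed = (task->transferred == 0 && task->size > 0);
    }
    SDL_UnlockMutex(task->lock);  /* ...but tasks on different files run in parallel. */
    /* The real backend would now mark the task complete and notify the app. */
}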

@icculus icculus added this to the 3.0 ABI milestone Aug 28, 2024
@namandixit
Contributor

Any specific reason why a callback-based system was chosen?

An alternative would be SDL_ReadIOAsync, etc., returning a handle that the application could poll to ask whether the operation has finished. This is the interface I have in my engine, and I feel it is more natural for loading screens, etc. (store handles for all in-flight operations in an array, iterate and poll every frame, keep showing the loading screen unless all return DONE).

With a callback-based system, this gets complex (maintain a multithread-safe hash table for all operations, the callback sets the value DONE for the asset ID as key, the application iterates over the hash table); especially if the platform layer and game layer are reasonably decoupled.

@icculus
Collaborator Author

icculus commented Aug 28, 2024

This is one of those "you can't please everyone" things. I started with a poll interface and was asked to change it to callbacks.

You could just have an array of bools and have the callback set each one to true as pieces finish loading, and each frame check for true instead of DONE...or something like that.
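
For example (a sketch only, in app code; AssetLoaded stands in for whatever completion callback the app would register with the callback-style API being discussed here, which was later dropped):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ASSETS 3
static atomic_bool asset_done[NUM_ASSETS];

/* Hypothetical completion callback: runs on an i/o thread, so it only flips a flag. */
static void AssetLoaded(void *userdata)
{
    atomic_store(&asset_done[(intptr_t) userdata], true);
}

/* Each frame of the loading screen, check whether everything has landed. */
static bool AllAssetsLoaded(void)
{
    for (int i = 0; i < NUM_ASSETS; i++) {
        if (!atomic_load(&asset_done[i])) {
            return false;
        }
    }
    return true;
}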

@slouken
Collaborator

slouken commented Aug 28, 2024

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

@nfries88

nfries88 commented Aug 28, 2024

Looks pretty good. I do have some input:

  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) +1 at most)

While this is probably fine heuristically, I would probably add a hard maximum of 64. There are a lot of people gaming on old workstations that would exceed that cap and be unlikely to benefit from more threads in the typical case of I/O on a single drive.

and it uses the same thread pool that is used for notifying apps that i/o is complete... in practice, you'd have to have lots of read/writes that also have very heavy callbacks before you'd see bottlenecks, I think

This is fine. A note in the documentation that callbacks should try to be brief and forward heavy processing to other threads may be warranted, because many assets do need reasonably heavy processing.

This would benefit from ... IO Completion Ports ... io_uring ... Emscripten could use actual async file APIs

I don't disagree, but it might be more straightforward to start with an implementation that can portably share logic. A relatively thin shim over pread/pwrite on the Unix-likes and ReadFile/WriteFile on Win32 can enable a single implementation with better characteristics than the generic implementation.

Emscripten would benefit most from its own implementation, IIRC all the libc synchronous I/O actually gets proxied to the main runtime thread in order to simplify handling the promises from the underlying JS interfaces.

I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Android doesn't have a thing for this.

Mac and iOS have dispatch_io, which just doesn't mesh well with this interface (in particular, there is no way to supply a read buffer), and at least when it was new it wasn't any better for regular files than a simpler thread pool anyway.

@slouken
Collaborator

slouken commented Aug 28, 2024

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

@nfries88

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

Having a fixed number of threads that turns out to be too few makes the system unusable for many real projects, and having a few too many idle threads is always cheaper than creating them on demand. This is what POSIX AIO implementations generally got wrong and why nobody really uses them for heavy I/O (except for FreeBSD's implementation, I guess).

Tunability as a range with a heuristic selection within that range might be appropriate. All of the factors that would determine the optimal number of threads for an I/O-busy program are not really statically knowable, but a user could profile their particular I/O pattern across a good variety of machines and come up with a range that might be better than just relying on the heuristic alone.

@icculus
Collaborator Author

icculus commented Aug 28, 2024

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

LoadFileAsync is useful because it has to open the file, figure out how large the file is, allocate memory for it, and read into it. WriteFileAsync would only need to open the file and write an existing buffer, so it's only saving one step. But we could definitely add it!

Storage just needs an equivalent of LoadFileAsync. In the built-in implementation, it literally will just call SDL_LoadFileAsync. Current thinking was not to implement positional reads/writes at the storage level, but I'm still trying to understand how data access through an SDL_Storage interface should work when you have a file like Spider-Man 2's New York map, which is presumably dozens of gigabytes.

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

Right now it spins up threads on demand if there aren't any idle ones, up to a maximum count, and ones that are idle for more than 30 seconds will start terminating until only one is left (we always keep one around so there isn't a failure case when no threads can start). Until you use async i/o, it won't initialize anything, so there's no resource cost at all until you touch it. The idea being that you likely have a flood of startup loading and then there's not a lot of value in keeping idle threads around.

There's a FIXME to add a hint, and we definitely should tune the defaults in any case.

@icculus
Collaborator Author

icculus commented Aug 29, 2024

Okay, single-threaded Emscripten support is in, so this is Good Enough to ship and do improvements with system-specific APIs later...but do we really want this callback API?

It forces people to deal with locking, and with worries that they'll clog the thread pool if they don't move quickly enough.

Maybe it would be better if we give them something like:

SDL_GetAsyncResults(&details)

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

This dramatically simplifies SDL's code and the mental load an app needs to manage. The app can still toss work to a background thread to process the loaded data if they want...but if they just didn't want to block on i/o, they can otherwise still benefit in a single-threaded app.

@slouken
Collaborator

slouken commented Aug 29, 2024

Sure, I buy it, let’s do that instead.

@slouken
Collaborator

slouken commented Aug 29, 2024

The one thing that would be nice is a way to wait for multiple I/O operations to complete. WaitForCompletion(array of operations)

@slouken
Collaborator

slouken commented Aug 29, 2024

You probably want a separate handle for each request. For example if a library does async loading it wants to get its own data separate from the app.

@nfries88

nfries88 commented Aug 30, 2024

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

Allowing for separate completion queues (probably specified when creating an async file object, with NULL referring to some default completion queue) would probably be the most useful form of this approach, since different threads would have different interests. That also conveniently translates very directly to how IOCP and io_uring both deal with completions. It should also be convenient for tying I/O ops directly to different asset pipelines.

@namandixit
Contributor

namandixit commented Aug 30, 2024

I think allowing separate queues of I/O operations and then enumerating within each queue with SDL_GetAsyncResults(queue, &details) will be the best option. If someone doesn't want to store all handles, they can just use a single queue. If someone wants to deal with each I/O operation separately, they can have one queue per operation. And it allows future features and optimisations (queue priority, separate queue for local vs remote resources, different queues for different LOD levels, decompression on particular queues, etc.).

@icculus icculus force-pushed the sdl3-asyncio branch 2 times, most recently from d04bd1b to 5e312e7 Compare September 1, 2024 22:24
@icculus
Collaborator Author

icculus commented Sep 1, 2024

Okay, this now has queryable tasks and "task groups" and the background thread callbacks are gone.

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

CategoryIOAsync

SDL offers a way to perform i/o asynchronously. This allows an app to read or write files without waiting for data to actually transfer; the functions that request i/o never block while the request is fulfilled.

Instead, the data moves in the background and the app can check for results at their leisure.

This is more complicated than just reading and writing files in a synchronous way, but it can allow for more efficiency, and never having framerate drops as the hard drive catches up, etc.

The general usage pattern for async i/o is:

  • Open a file with SDL_IOAsyncFromFile.
  • Start i/o tasks to that file with SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Later on, use SDL_GetIOAsyncTaskResult to see if the task is finished.
  • When your tasks are done, close the file with SDL_CloseIOAsync.

If you want to track several tasks at once, regardless of the order in which they complete:

  • Create a task group with SDL_CreateIOAsyncGroup.
  • Specify this group when calling SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Use SDL_GetIOAsyncGroupResult to see if anything has finished yet.

This all works, without blocking, in a single thread, but one can also use a task group in a background thread and sleep until new results have arrived:

  • Call SDL_WaitIOAsyncGroup from one or more threads to efficiently block until new tasks complete.
  • When shutting down, call SDL_SignalIOAsyncGroup to unblock sleeping threads despite there being no new tasks completed.

And, of course, to match the synchronous SDL_LoadFile, we offer SDL_LoadFileAsync as a convenience function. This will handle allocating a buffer, slurping in the file data, and null-terminating it; you still get a task handle to check later or add to a task group.

@slouken
Collaborator

slouken commented Sep 1, 2024

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

Nice!

@icculus icculus marked this pull request as ready for review September 1, 2024 23:09
@jaedan

jaedan commented Sep 2, 2024

Hi - I was sent this direction because I've been peripherally involved in SDL_Gpu (doing reviews focused on performance and answering questions about PCI and DMA). I happen to build storage APIs professionally though, so looking at this pull request is far more up my alley. I had a really long review I was working on, but the recent changes you've made addressed about 80% of what I was going to say so this will now be much shorter (but still probably long).

I think for now I'll stick to just public API feedback, and maybe I'll post in a few separate posts.

First, I do not think it is possible to implement the API calls that do not require a task group efficiently on Linux. On Linux, your asynchronous I/O choices are either Linux AIO (not to be confused with POSIX AIO) or io_uring. Both of these APIs require you to first create some per-thread context (e.g. a task group) in order to submit I/O. Creating these on demand per-file is expensive and may run into OS-imposed limits on the number of io_urings or aio_contexts available. I strongly suggest that you only expose the APIs that require a task group - it's the only way to get true asynchronous I/O. I've personally made a similar mistake with APIs released 10+ years ago and it was really painful to find my way out of the corner I painted myself into.

Second, opening and closing the file should be asynchronous operations too (and they should be submitted to the task group). These can actually secretly block a long time in some cases. Of course, on many operating systems there isn't an asynchronous open/close option at all, and the implementations there will be blocking. But at least on Linux with io_uring you can asynchronously open and close files, so the API should allow for the best behavior rather than forcing the worst.

More review to follow.

@icculus
Collaborator Author

icculus commented Sep 2, 2024

(Sorry, I just squashed this down to a single commit right before you posted that; it's all the same stuff, just in one commit.)

I'm skeptical that async opens are worth the API complexity, but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread, mostly asleep, that lives just to move data from the OS (in a single context/completion port) into our own task groups. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group; it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

@icculus
Collaborator Author

icculus commented Sep 2, 2024

(Also, more review is totally welcome at any length you're willing to write, because I'm 100% winging it over here like I know what I'm doing.)

@jaedan

jaedan commented Sep 2, 2024

The shift to support "task groups" is an excellent and necessary change. However, I do still have a few pieces of feedback on the task group design.

  1. One major source of overhead in storage APIs is the system calls themselves. The only way to work around this is to batch operations and submit them in a single system call. Some operating systems now have facilities for this (e.g. io_uring, and I think there's something on Windows called ioRing now). But in your API it isn't really clear whether this sort of batching would be allowed. I guess you could make an SDL_ReadIOAsync call simply queue onto the group and then only really submit the operation to the OS in SDL_GetIOAsyncGroupResult, but as the functions are named now, it isn't clear enough to the user that they need to call SDL_GetIOAsyncGroupResult once to even start the operations. I think you need a separate SDL_QueueReadIOAsync and then an SDL_SubmitIO type of setup (a liburing sketch of this kind of batched submission follows this list).

  2. Task groups should NOT be thread safe. The whole purpose of io_uring, or Linux aio, or Windows ioRing (and to some extent IOCP), or even NVMe multiple submission/completion queues or Linux blk-mq, is that you don't need locks. Instead, instruct users to create one task group per thread and use it for every operation on that thread for its entire duration. This needs to work essentially the same way as SDL_Gpu's command buffers, because the underlying storage hardware happens to work exactly the same way as the underlying GPU hardware in terms of command submission and completion.

  3. The name task group indicates to me something lighter weight than what I think this really needs to represent. A task group needs to represent a Linux io_uring, or a Windows iocp, etc. These are not cheap things to create and destroy. In fact, we only want to do it once per thread and at thread start up time, then re-use it. Most of the modern OS APIs call it a "ring" (because it's literally a circular submission ring of commands) or a "queue". I've historically called it an "io_channel".
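
For reference, this is roughly what batched submission looks like with liburing directly (a standalone sketch, not SDL code; "asset.dat" and the fixed four reads are just for illustration): several reads get queued locally, then all of them are started with a single io_uring_submit() system call.

#include <liburing.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    static char bufs[4][4096];

    if (io_uring_queue_init(8, &ring, 0) != 0) {   /* the per-thread "ring"/context */
        return 1;
    }

    int fd = open("asset.dat", O_RDONLY);
    if (fd < 0) {
        io_uring_queue_exit(&ring);
        return 1;
    }

    /* Queue four reads; nothing is submitted to the kernel yet. */
    for (int i = 0; i < 4; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof(bufs[i]), (__u64) i * 4096);
        io_uring_sqe_set_data(sqe, (void *) (long) i);  /* like per-task userdata */
    }
    io_uring_submit(&ring);   /* one syscall starts all four */

    /* Reap completions in whatever order they finish. */
    for (int i = 0; i < 4; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %ld finished with result %d\n",
               (long) io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}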

(More coming)

@jaedan

jaedan commented Sep 2, 2024

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Using a thread is, unequivocally, a terrible idea (sorry, that sounds really harsh and I promise I don't mean it to be, but I've been at this a long time and everyone wants to use a thread). The hardware (and I've helped design NVMe from close to the beginning) wants you to create a separate "queue" per thread. The hardware accepts command submissions in parallel (and they're all asynchronous too!). And the operating systems have spent the last decade re-designing themselves to do parallel, asynchronous I/O internally to match the hardware.

You are supposed to create a per-thread "context" (io_uring, aio_context, ioRing, iocp, whatever) at thread start up time, and use that exclusively for the I/O operations you perform on that thread. This will allow for parallel, lockless I/O submission and provide the lowest latency.

@nfries88

nfries88 commented Sep 2, 2024

I'm skeptical that async opens are worth the API complexity

I have noticed file open latency before. I've never seen it be significant for a single file, and the many-file use case is probably covered adequately enough by SDL_LoadFileAsync and perhaps similar composite operations. I'd agree it's probably not worth the complexity.

but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

An async sync-and-close would be useful for savefiles. System APIs typically will not wait for cached writes to hit the disk before closing a file, so having this behavior be explicit as a composite operation is useful.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Windows will block inside an overlapped ReadFile when the underlying disk driver's queue is full. It's not that hard to hit this with random reads, because Windows disk drivers have shallow queues compared to, e.g., Linux. This is unfortunate because other pending reads to the same device might still succeed without blocking by hitting the filesystem cache, and when combined with other uses of overlapped I/O (e.g. sockets) that should never block, this is obviously even worse. It will also block on an overlapped WriteFile for writes that extend a file, which is a pretty normal thing to happen. Because of this behavior, the typical procedure is to still use a thread pool to issue overlapped I/O. Probably a lesser concern, but Wine also doesn't attempt to provide asynchronous I/O to regular files at all.

As far as io_uring goes, I definitely agree with jaedan that this would be a suboptimal use of it, but I also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic. A ring wrapped in a mutex for each task group is probably the least bad approach here, although I definitely lack jaedan's level of experience with io_uring. Extending the API with batching operations would probably bring that kind of implementation closer to optimal (also an improvement for other implementations), but that can be a later enhancement IMO.

@slouken
Collaborator

slouken commented Sep 2, 2024

We could create a heavyweight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so users get the benefit of that without having to explicitly track things themselves.

It might be worth adding an io_uring and an I/O Ring implementation to validate that this API works efficiently with those interfaces.

The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't. This seems like it's intended as a low level high performance interface to disk files. If that's true, then let's bump this out of the ABI milestone.

@jaedan

jaedan commented Sep 2, 2024

Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O.

On Linux for many many years (via Linux aio) and on Windows for the last 3, this is no longer the standard procedure. We shouldn't make the API inefficient on modern operating systems by design, but rather accept the inefficiency only for operating systems with poor asynchronous IO support.

Let's imagine we have a system with a background thread that does the IO processing. This thread may spin while there's IO outstanding for efficiency, but it certainly can't just spin all the time. IO in games is sporadic, so most commonly this thread will be asleep when we next start a burst of IO. So we queue up the operation into the shared queue or ring and then signal the background thread to wake. On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if it reacts instantly, just the mechanics of restoring thread state to wake it take a significant chunk of time. All of this gets added to our IO request's latency. But then the underlying SSD will complete the request in 20-100 microseconds, most likely. Why would we 10x our latency because of software design? It makes no sense.

but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic

I really can't disagree more on both points. These objects are not cheap to create, because they usually do memory allocations and system calls, but their actual resource use is very minimal. Usually it's 64 bytes * the number of queue entries. And you make one of these per thread that wants to perform IO. If you have true asynchronous IO, then you only need a very limited number of threads in your game that are doing asset loading. We're talking kilobytes of total resource use here.

And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your IO operations. At end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works too, and it would be how an optimal networking API would function as well.

@slouken
Collaborator

slouken commented Oct 8, 2024

Removing SDL_AsyncIOTask cleans it up nicely, but does it make sense to add a function to cancel all the I/O on a queue?

@icculus
Collaborator Author

icculus commented Oct 8, 2024

Removing SDL_AsyncIOTask cleans it up nicely, but does it make sense to add a function to cancel all the I/O on a queue?

I was thinking about that as an option. One still needs to wait for the completion, though, to free buffers that are in-flight, so all it gains you is maybe some tasks will finish faster, or maybe a little less disk i/o gets started.

We could have it as a flag to SDL_DestroyAsyncIOQueue, to say "I'm going down and I don't care if this stuff finishes or not, get me out of here asap."

@icculus
Collaborator Author

icculus commented Oct 8, 2024

Hey, @madebr: I'm getting test failures like this:

22: Allocation 350: 64 bytes
22: 	0x7fbecacd5cd8: ???
22: 	0x7fbecacdb7f2: ???
22: 	0x55a9c282a089: ???
22: 	0x55a9c27f09d0: ???
22: 	0x7fbeca829d90: ???
22: 	0x7fbeca829e40: ???
22: 	0x55a9c27f0a85: ???

But running ctest or just testautomation on my machine doesn't trigger it. Is there something I have to do to turn on memory leak checking?

@icculus
Collaborator Author

icculus commented Oct 8, 2024

We could have it as a flag to SDL_DestroyAsyncIOQueue, to say "I'm going down and I don't care if this stuff finishes or not, get me out of here asap."

Actually, this doesn't need a flag. Since you can't get the results without the queue, it's reasonable to cancel everything by default to make the queue destruction finish more quickly; the app has to assume all of it failed anyway, since any of it could fail without warning.

@icculus
Collaborator Author

icculus commented Oct 8, 2024

...of course, we don't actually have a list of pending tasks for the queue to cancel them, so screw it, I'm leaving it alone. I'll probably remove the cancel stuff from the implementation once this merges, so we can reasonably find it in revision control if we want to restore it later.

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Still need to figure out the memory leaks in the test program, but otherwise, I think we should merge this. The only thing pending is what to do with the Storage interface, but it's becoming clear we'll want to rev that interface for other reasons, too, so it doesn't make sense to hold this up in the meantime.

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Hey, @madebr: I'm getting test failures like this:

22: Allocation 350: 64 bytes
22: 0x7fbecacd5cd8: ???
22: 0x7fbecacdb7f2: ???
22: 0x55a9c282a089: ???
22: 0x55a9c27f09d0: ???
22: 0x7fbeca829d90: ???
22: 0x7fbeca829e40: ???
22: 0x55a9c27f0a85: ???

But running ctest or just testautomation on my machine doesn't trigger it. Is there something I have to do to turn on memory leak checking?

(Ping @madebr)

@slouken
Collaborator

slouken commented Dec 3, 2024

Run with the command line option --trackmem

@madebr
Contributor

madebr commented Dec 3, 2024

Hey, @madebr: I'm getting test failures like this:
22: Allocation 350: 64 bytes
22: 0x7fbecacd5cd8: ???
22: 0x7fbecacdb7f2: ???
22: 0x55a9c282a089: ???
22: 0x55a9c27f09d0: ???
22: 0x7fbeca829d90: ???
22: 0x7fbeca829e40: ???
22: 0x55a9c27f0a85: ???
But running ctest or just testautomation on my machine doesn't trigger it. Is there something I have to do to turn on memory leak checking?

(Ping @madebr)

Configure your CMake project with -DSDL_TRACKMEM=ON (like CI does), and run ctest again.
Or run your tests with --trackmem as slouken suggested.

I was hoping CI would show the stack traces when building with the RelWithDebInfo CMake build type.
Does building with the Debug build type also show nothing?

It's far-fetched, but maybe.
In 4c00433, I moved the stack collection to report time so the tests run faster.
This late report generation might fail if the leaking library has been unloaded already.

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Ah, awesome, thanks!

Allocation 0: 64 bytes
	0x75cdb81f06cf: SDL_malloc_REAL+0x2f
	0x75cdb81f8904: SDL_CreateTimer+0xe0
	0x75cdb81f8a9a: SDL_AddTimer_REAL+0x39
	0x75cdb81294d3: SDL_AddTimer+0x31
	0x578f4daf3cc9: SDLTest_SetTestTimeout+0x78
	0x578f4daf3e39: SDLTest_RunTest+0x10b
	0x578f4daf4d16: SDLTest_ExecuteTestSuiteRunner+0xc83
	0x578f4da9ec92: main+0x359
	0x75cdb7e9c1ca: __libc_init_first+0x8a
	0x75cdb7e9c28b: __libc_start_main+0x8b
	0x578f4da9e835: _start+0x25

This appears to be an SDL_AddTimer leak...I'm not sure why this PR is triggering it when main isn't. I'll dig in for a second though.

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Interestingly, valgrind --leak-check=full ./test/testtimer --trackmem does not think this is leaking, but the --trackmem output does:

Memory allocations:
Allocation 0: 64 bytes
	0x498e6cf: SDL_malloc_REAL+0x2f
	0x49968ce: SDL_CreateTimer+0xe0
	0x4996a97: SDL_AddTimerNS_REAL+0x31
	0x48c7509: SDL_AddTimerNS+0x34
	0x10dc95: main+0x345
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 1: 40 bytes
	0x498e744: SDL_calloc_REAL+0x49
	0x4b4f8c7: SDL_CreateMutex_REAL+0x2a
	0x4996560: SDL_InitTimers+0x39
	0x49967ec: SDL_CheckInitTimers+0xd
	0x4996850: SDL_CreateTimer+0x62
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dbf3: main+0x2a3
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 2: 64 bytes
	0x498e6cf: SDL_malloc_REAL+0x2f
	0x49968ce: SDL_CreateTimer+0xe0
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dfc9: main+0x679
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 3: 32 bytes
	0x498e6cf: SDL_malloc_REAL+0x2f
	0x4b5013f: SDL_CreateSemaphore_REAL+0x19
	0x4996583: SDL_InitTimers+0x5c
	0x49967ec: SDL_CheckInitTimers+0xd
	0x4996850: SDL_CreateTimer+0x62
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dbf3: main+0x2a3
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 4: 9 bytes
	0x498e6cf: SDL_malloc_REAL+0x2f
	0x4992b0d: SDL_strdup_REAL+0x30
	0x49952a6: SDL_CreateThreadWithPropertiesRuntime_REAL+0x138
	0x49953c2: SDL_CreateThreadRuntime_REAL+0x85
	0x49965e0: SDL_InitTimers+0xb9
	0x49967ec: SDL_CheckInitTimers+0xd
	0x4996850: SDL_CreateTimer+0x62
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dbf3: main+0x2a3
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 5: 112 bytes
	0x498e744: SDL_calloc_REAL+0x49
	0x499525e: SDL_CreateThreadWithPropertiesRuntime_REAL+0xf0
	0x49953c2: SDL_CreateThreadRuntime_REAL+0x85
	0x49965e0: SDL_InitTimers+0xb9
	0x49967ec: SDL_CheckInitTimers+0xd
	0x4996850: SDL_CreateTimer+0x62
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dbf3: main+0x2a3
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Allocation 6: 64 bytes
	0x498e6cf: SDL_malloc_REAL+0x2f
	0x49968ce: SDL_CreateTimer+0xe0
	0x4996a64: SDL_AddTimer_REAL+0x39
	0x48c74d3: SDL_AddTimer+0x31
	0x10dbf3: main+0x2a3
	0x4ca01ca: __libc_init_first+0x8a
	0x4ca028b: __libc_start_main+0x8b
	0x10d665: _start+0x25
Total: 0.38 Kb in 7 allocations
==305100== 
==305100== HEAP SUMMARY:
==305100==     in use at exit: 714,589 bytes in 168 blocks
==305100==   total heap usage: 3,294 allocs, 3,126 frees, 1,972,232 bytes allocated
==305100== 
==305100== 288 bytes in 1 blocks are possibly lost in loss record 23 of 52
==305100==    at 0x484D953: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==305100==    by 0x40145AB: calloc (rtld-malloc.h:44)
==305100==    by 0x40145AB: allocate_dtv (dl-tls.c:370)
==305100==    by 0x40145AB: _dl_allocate_tls (dl-tls.c:629)
==305100==    by 0x4D13606: allocate_stack (allocatestack.c:429)
==305100==    by 0x4D13606: pthread_create@@GLIBC_2.34 (pthread_create.c:655)
==305100==    by 0x4B4F4CF: SDL_SYS_CreateThread (SDL_systhread.c:115)
==305100==    by 0x499530B: SDL_CreateThreadWithPropertiesRuntime_REAL (SDL_thread.c:383)
==305100==    by 0x49953C1: SDL_CreateThreadRuntime_REAL (SDL_thread.c:403)
==305100==    by 0x49965DF: SDL_InitTimers (SDL_timer.c:230)
==305100==    by 0x49967EB: SDL_CheckInitTimers (SDL_timer.c:295)
==305100==    by 0x499684F: SDL_CreateTimer (SDL_timer.c:309)
==305100==    by 0x4996A63: SDL_AddTimer_REAL (SDL_timer.c:363)
==305100==    by 0x48C74D2: SDL_AddTimer (SDL_dynapi_procs.h:61)
==305100==    by 0x10DBF2: main (testtimer.c:143)
==305100== 
==305100== 2,617 bytes in 1 blocks are definitely lost in loss record 25 of 52
==305100==    at 0x484DB80: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==305100==    by 0x498E499: real_realloc (SDL_malloc.c:6323)
==305100==    by 0x11A419: SDLTest_LogAllocations (SDL_test_memory.c:443)
==305100==    by 0x1110E9: SDLTest_CommonDestroyState (SDL_test_common.c:745)
==305100==    by 0x10E29C: main (testtimer.c:285)
==305100== 
==305100== LEAK SUMMARY:
==305100==    definitely lost: 2,617 bytes in 1 blocks
==305100==    indirectly lost: 0 bytes in 0 blocks
==305100==      possibly lost: 288 bytes in 1 blocks
==305100==    still reachable: 711,684 bytes in 166 blocks
==305100==         suppressed: 0 bytes in 0 blocks
==305100== Reachable blocks (those to which a pointer was found) are not shown.
==305100== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==305100== 
==305100== For lists of detected and suppressed errors, rerun with: -s
==305100== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
icculus@lucha:~/projects/SDL-icculus/buildbot$ 

To be clear, SDL_Quit() legit cleans up the completed timer list, which we intentionally keep as an allocation pool until SDL_Quit is called, so even SDL_RemoveTimer() will "leak" in this regard, as it currently stands.

@madebr
Contributor

madebr commented Dec 3, 2024

Does allocation and deallocation happen through SDL_malloc/SDL_free?
Does deallocation happen after SDL_TrackAllocations?

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Ugh, how mortifying:

diff --git a/src/SDL.c b/src/SDL.c
index 0fdf9713f..5e6493588 100644
--- a/src/SDL.c
+++ b/src/SDL.c
@@ -624,7 +625,7 @@ void SDL_Quit(void)
     SDL_DBus_Quit();
 #endif
 
-    SDL_QuitTimers();
+    SDL_QuitAsyncIO();
 
     SDL_SetObjectsInvalid();
     SDL_AssertionsQuit();

@icculus
Collaborator Author

icculus commented Dec 3, 2024

Okay, this is the moment, let's merge this...!

@icculus icculus merged commit 46f43c2 into libsdl-org:main Dec 3, 2024
40 checks passed
@icculus icculus deleted the sdl3-asyncio branch December 3, 2024 22:33
@sezero
Contributor

sezero commented Dec 4, 2024

Is there a specific reason to use __kernel_time64_t and long long in SDL_asyncio_liburing.c? Can we not simply use Sint64 instead? E.g.:

--- a/src/file/io_uring/SDL_asyncio_liburing.c
+++ b/src/file/io_uring/SDL_asyncio_liburing.c
@@ -284,3 +284,3 @@ static SDL_AsyncIOTask *liburing_asyncioqueue_wait_results(void *userdata, Sint3
     } else {
-        struct __kernel_timespec ts = { (__kernel_time64_t) timeoutMS / SDL_MS_PER_SECOND, (long long) SDL_MS_TO_NS(timeoutMS % SDL_MS_PER_SECOND) };
+        struct __kernel_timespec ts = { (Sint64) timeoutMS / SDL_MS_PER_SECOND, (Sint64) SDL_MS_TO_NS(timeoutMS % SDL_MS_PER_SECOND) };
         liburing.io_uring_wait_cqe_timeout(&queuedata->ring, &cqe, &ts);

@icculus
Collaborator Author

icculus commented Dec 4, 2024

I think it's just what the struct uses, at least on the headers I looked at, but the PR to change this should be fine.

@icculus
Collaborator Author

icculus commented Dec 4, 2024

Just adding a "best practices" section to the async i/o docs. @jaedan and @nfries88, does this cover it well?

Best Practices

Simple non-blocking i/o--for an app that just wants to pick up data whenever it's ready without losing framerate waiting on disks to spin--can use whatever pattern works well for the program. In this case, simply call SDL_ReadAsyncIO, or maybe SDL_LoadFileAsync, as needed. Once a frame, call SDL_GetAsyncIOResult to check for any completed tasks and deal with the data as it arrives.

If two separate pieces of the same program need their own i/o, it is legal for each to create their own queue. This will prevent either piece from accidentally consuming the other's completed tasks. Each queue does require some amount of resources, but it is not an overwhelming cost. Do not make a queue for each task, however. It is better to put many tasks into a single queue. They will be reported in order of completion, not in the order they were submitted, so it doesn't generally matter what order tasks are started.

One async i/o queue can be shared by multiple threads, or one thread can have more than one queue, but the most efficient way--if ruthless efficiency is the goal--is to have one queue per thread, with multiple threads working in parallel, and attempt to keep each queue loaded with tasks that are both started by and consumed by the same thread. On modern platforms that can use newer interfaces, this can keep data flowing as efficiently as possible all the way from storage hardware to the app, with no contention between threads for access to the same queue.

Written data is not guaranteed to make it to physical media by the time a closing task is completed, unless SDL_CloseAsyncIO is called with its flush parameter set to true, which is to say that a successful result here can still result in lost data during an unfortunately-timed power outage if not flushed. However, flushing will take longer and may be unnecessary, depending on the app's needs.
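
To make the "once a frame" pattern above concrete, here is a minimal polling sketch. It is illustrative only: "level1.dat", the pending-file counter, and the helper names are invented, and the SDL_AsyncIOOutcome struct and field names reflect the merged headers as I read them rather than the text above.

#include <SDL3/SDL.h>

static SDL_AsyncIOQueue *queue = NULL;
static int files_pending = 0;

static void StartLoading(void)
{
    queue = SDL_CreateAsyncIOQueue();
    if (SDL_LoadFileAsync("level1.dat", queue, (void *) "level1.dat")) {
        files_pending++;
    }
}

/* Call once per frame; returns true while work is still outstanding. */
static bool PumpLoading(void)
{
    SDL_AsyncIOOutcome outcome;
    while (SDL_GetAsyncIOResult(queue, &outcome)) {   /* non-blocking */
        if (outcome.result == SDL_ASYNCIO_COMPLETE) {
            /* outcome.buffer holds the whole (null-terminated) file;
               hand it off to whatever needs it, then SDL_free() it. */
            SDL_Log("loaded %s (%llu bytes)", (const char *) outcome.userdata,
                    (unsigned long long) outcome.bytes_transferred);
            SDL_free(outcome.buffer);
        } else {
            SDL_Log("load of %s failed", (const char *) outcome.userdata);
        }
        files_pending--;
    }
    return files_pending > 0;
}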

@nfries88

nfries88 commented Jan 7, 2025

I would probably add a note that, for very simple cases or where the data is needed immediately, it is generally more efficient and much simpler to just use normal blocking I/O.

But aside from that, the advice looks pretty complete.
