Description
This is just an idea I had that I'd like to document somewhere so it does not get lost. It will only really become actionable after Stageless is implemented.
What problem does this solve or what need does it fill?
Spawning a task onto the multithreaded executor for every system is currently a major source of overhead in Bevy. Ironically, many Bevy projects perform better if their schedule is set to single-threaded execution.
In Bevy, it is considered "idiomatic" (and we often encourage users) to write many small systems, each of which frequently has little or no work to do, in order to "improve parallelism opportunities". I don't think this is bad advice, and I don't think we should change what we consider "idiomatic"; it also leads to clean code.
However, in practice, this leads to bad performance. A typical trace is full of "bubbles" and "system execution overhead" from Bevy preparing and spawning tasks for every system, only for many of them to do little or no actual work when they run. We need to address this overhead.
There are two separate fronts to attack here:
- Improving the quality of our implementation to reduce the performance overhead of running systems with multithreading. There are many great efforts and PRs for this, but that work is outside the scope of this issue.
- Giving users the APIs/tools/mechanisms they need to control the runtime/execution of their systems and save themselves the overhead when possible. This is the scope of this issue.
Bevy needs better APIs for allowing users to control and reduce the overheads of multithreaded execution. One major tool ("run conditions") will come with Stageless. This issue proposes one more such mechanism.
Additional Context: How will Stageless improve things?
Stageless "Run Conditions" (which I helped design with the same perf considerations as I described above) give users the ability to tell Bevy when to run or not run their systems. Run Conditions are evaluated by the executor itself, without spawning a task (and therefore without the associated perf overhead), and a task for running the actual system will only be spawned if the system should run. Unlike our legacy "Run Criteria", users can easily add and compose multiple "Run Conditions", allowing for flexibility and precise control.
After Stageless, a common workflow for optimizing a Bevy game would be: "if you know when your system shouldn't run, go add some run conditions to control it!". Users will have the power to prevent systems that have nothing to do in a given frame from being run by Bevy and slowing things down.
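To make this concrete, here is a minimal sketch of what that could look like, assuming a `run_if`-style API along the lines of the stageless proposal. The resource, system, and registration names here are illustrative, not final:

```rust
use bevy::prelude::*;

#[derive(Resource)]
struct EnemyCount(usize);

// A run condition: an ordinary function that reads world data and returns
// a bool. The executor evaluates it inline, without spawning a task.
fn enemies_exist(count: Res<EnemyCount>) -> bool {
    count.0 > 0
}

// The "real" system. A task for it is only spawned when the condition
// above returns true, so frames with no enemies pay (almost) nothing.
fn update_enemy_ai(/* queries, resources, ... */) {
    // ... per-enemy work ...
}

fn main() {
    App::new()
        .insert_resource(EnemyCount(0))
        .add_systems(Update, update_enemy_ai.run_if(enemies_exist))
        .run();
}
```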
What solution would you like?
I would like to propose one more mechanism for giving users control over the overhead of parallel system execution: the ability to tell Bevy not to spawn a task for a given system, and instead run it inline inside the executor itself (similar to how Run Conditions and exclusive systems work).
This only makes sense if the user knows that a certain system, even when it has work to do, does very little and will always finish quickly; in other words, when the overhead of parallel execution costs more than the actual work the system performs.
Like other optimizations (such as ECS table/sparse-set storage), it is a trade-off and a balancing act. It is a lever that users can pull in niche scenarios when they consider it worth doing.
Setting this property on a system removes the task-spawning overhead for that system, but may bottleneck the executor task if the system takes too long to run, holding it up from dispatching other systems that may have become ready.
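As a purely hypothetical sketch of the ergonomics (none of this API exists; `run_inline` is a made-up name used only to illustrate the shape of the proposal), usage could look something like this:

```rust
use bevy::prelude::*;

#[derive(Component)]
struct Cooldown(f32);

// A tiny bookkeeping system: the work it does each frame is far cheaper
// than the overhead of spawning a dedicated task for it.
fn tick_cooldowns(mut query: Query<&mut Cooldown>) {
    for mut cooldown in &mut query {
        cooldown.0 = (cooldown.0 - 1.0 / 60.0).max(0.0);
    }
}

fn main() {
    App::new()
        // HYPOTHETICAL: `run_inline` does not exist. The idea is to ask the
        // executor to run this system inline, on the executor's own task,
        // instead of spawning a new one. Task-based systems could still run
        // in parallel with it.
        .add_systems(Update, tick_cooldowns.run_inline())
        .run();
}
```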
What alternative(s) have you considered?
Using exclusive systems. They already run the way I described above ("inline" inside the executor, without spawning a task onto the thread pool), and always on the main thread (a restriction that would not be necessary for the "inline" systems proposed here). They can therefore also accomplish the goal of being efficient and not introducing runtime overhead.
However, by definition, they do not allow other systems to run alongside them; they completely disable parallel system execution while they run.
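For reference, an exclusive system is simply one that takes `&mut World` (the exact registration syntax differs between Bevy versions, so it is omitted here):

```rust
use bevy::prelude::*;

// An exclusive system: `&mut World` gives it full mutable access, so the
// executor runs it inline on the main thread, and no other system can run
// at the same time.
fn log_entity_count(world: &mut World) {
    let count = world.entities().len();
    info!("world currently holds {count} entities");
}
```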
The solution proposed by this issue is better because it allows for a middle ground: a system marked with the proposed "inline execution" flag runs inside the executor, but other systems (run via tasks) can still execute in parallel with it.