Skip to content

Unify and simplify command and system error handling #18351

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alice-i-cecile
Copy link
Member

@alice-i-cecile alice-i-cecile commented Mar 17, 2025

Objective

  • ECS error handling is a lovely flagship feature for Bevy 0.16, all in the name of reducing panics and encouraging better error handling (Our API suggests that panicking should be the default #14275).
  • Currently though, command and system error handling are completely disjoint and use different mechanisms.
  • Additionally, there's a number of distinct ways to set the default/fallback/global error handler that have limited value. As far as I can tell, this will be cfg flagged to toggle between dev and production builds in 99.9% of cases, with no real value in more granular settings or helpers.
  • Fixes Unify system and command error handling #17272

Solution

  • Standardize error handling on the OnceLock global error mechanisms ironed out in Improved Command Errors #17215
    • As discussed there, there are serious performance concerns there, especially for commands
    • I also think this is a better fit for the use cases, as it's truly global
  • Move from SystemErrorContext to a more general purpose ErrorContext, which can handle observers and commands more clearly
  • Cut the superfluous setter methods on App and SubApp
  • Rename the limited (and unhelpful) fallible_systems example to error_handling, and add an example of command error handling

Testing

Ran the error_handling example.

Notes for reviewers

  • Do you see a clear way to allow commands to retain &mut World access in the per-command custom error handlers? IMO that's a key feature here (allowing the ad-hoc creation of custom commands), but I'm not sure how to get there without exploding complexity.
  • I've removed the feature gate on the default_error_handler: contrary to @cart's opinion in Improved Command Errors #17215 I think that virtually all apps will want to use this. Can you think of a category of app that a) is extremely performance sensitive b) is fine with shipping to production with the panic error handler? If so, I can try to gather performance numbers and/or reintroduce the feature flag. UPDATE: see benches at the end of this message.
  • OnceLock is in std: @bushrat011899 what should we do here?
  • Do you have ideas for more automated tests for this collection of features?

Benchmarks

I checked the impact of the feature flag introduced: benchmarks might show regressions. This bears more investigation. I'm still skeptical that there are users who are well-served by a fast always panicking approach, but I'm going to re-add the feature flag here to avoid stalling this out.

image

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Usability A targeted quality-of-life change that makes Bevy easier to use X-Contentious There are nontrivial implications that should be thought through D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Mar 17, 2025
@alice-i-cecile alice-i-cecile added this to the 0.16 milestone Mar 17, 2025
@alice-i-cecile alice-i-cecile requested a review from cart March 17, 2025 04:57
Copy link
Contributor

You added a new example but didn't add metadata for it. Please update the root Cargo.toml file.

@bushrat011899
Copy link
Contributor

Re. OnceLock: I've already got a suitable alternative implemented in bevy_platform_support::sync::OnceLock! Should be a drop-in replacement that'll defer to std when possible and use a spinlock when it can't.

Copy link
Contributor

@bushrat011899 bushrat011899 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good direction to move error handling towards. There's definitely room for iteration, but this is the most flexible and most approachable form of error handling thus far. For example, it may be nice to have fallible plugins (e.g., Plugin::build could return a Result), etc. in the future.

Some minor typos and a suggestion to use bevy_platform_support::sync::OnceLock for no_std compatibility reasons. I do also raise the idea of making the OnceLock itself private and just having a function to initialize the handler. While I think that's a nicer API, I don't see it as a blocker.

/// # Warning
///
/// As this can *never* be overwritten, library code should never set this value.
pub static GLOBAL_ERROR_HANDLER: OnceLock<fn(BevyError, ErrorContext)> = OnceLock::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fact that this is a OnceLock should be a hidden implementation detail users don't interact with, allowing us to experiment with the implementation without breaking user code needlessly. Perhaps exposing a simple set_error_handler(...) function to go with the default_error_handler() function that is already added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other direction we can go is to remove default_error_handler completely. I don't think it adds a ton without the feature flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to cut default_error_handler: it doesn't make this more misuse resistant, and end users generally won't be creating custom error handling code so the convenience argument doesn't hold water for me either.

pub last_run: Tick,
/// Context for a [`BevyError`] to aid in debugging.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum ErrorContext {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future PR: Experimenting with this context type will be very helpful in the future. For example, systems and observers could include the SystemId, allowing a more rich error handler to have differing behaviour based on the system that failed, but entirely up to the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I think that sticking complex app-specific error handling in the global error handler is likely to be a bit of a mess. Instead, users should be trying to handle errors as close to where they're generated as possible, and we should provide them the tools required to do that.

Copy link
Contributor

@NthTensor NthTensor Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Alice here: complex conditional logic in the default handler will probably be an anti-pattern. We can expect users to want to have complex error handling for their code but much less precision is possible for engine code or ecosystem code. For code you write yourself, there's probably better ways to handle errors (like piping). What the default handler gives is consistency: you don't have to worry about one ecosystem developer choosing to panic and another choosing to silently fail without logging.

That said, we should still probably let users do it. Providing the system ID can't hurt. As an aside, I'd probably like to see the source-location-tracking exposed here. But all of that is firmly in the category of follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agreed on source location tracking, and no objection to exposing system id etc. Seems handy for tooling! Just slap on some "please don't omni-error handling" advice in the docs :)

alice-i-cecile and others added 2 commits March 17, 2025 13:43
Co-authored-by: Zachary Harrold <zac@harrold.com.au>
Co-authored-by: Zachary Harrold <zac@harrold.com.au>
Copy link
Contributor

You added a new example but didn't add metadata for it. Please update the root Cargo.toml file.

Copy link
Contributor

@NthTensor NthTensor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First code review from me in a while, sorry if I'm a bit rusty. I like seeing all the red, and I think what you've ended up on is really simple and clear. I don't really have any faults with the implementation, but I do still have some questions.

Do you see a clear way to allow commands to retain &mut World access...

The issue is not that we don't have world-access when handling errors in commands or systems, right? The issue is observers. Is there a reason we can't go with something like the following:

fn error_handler(error: BevyError, ctx: ErrorContext, world: DeferredWorld)

Commands and systems appear to have &mut World access when they call GLOBAL_ERROR_HANDLER.get_or_init(|| panic), which means it should be safe to transmute into UnsafeWorldCell and DeferredWorld, no? And observers already have a deferred world they can use. I'm guess I am probably missing something, but it seems like this would be a way to preserve access to commands.

I've removed the feature gate on the default_error_handler...

I think this is probably the right call, but we should benchmark as it could have perf implications. Unfortunately I do think we should wait to merge this until we get either positive benchmarks or hear Cart's stance, which is the only reason I'm marking this as changes requested.

OnceLock is in std...

The once_cell crate has no_std support via critical_section. It would also let us swap out the handler at runtime. Perhaps we should consider using it instead, although that can be follow-up IMO.

Do you have ideas for more automated tests for this collection of features?

Honestly, I think we've landed on a pretty simple implementation. There doesn't seem to be much room for logic errors. Performance is the major concern.

/// # Warning
///
/// As this can *never* be overwritten, library code should never set this value.
pub static GLOBAL_ERROR_HANDLER: OnceLock<fn(BevyError, ErrorContext)> = OnceLock::new();
Copy link
Contributor

@NthTensor NthTensor Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on hiding this from users and having a App::set_error_handler fn which can be called multiple times (overwriting the previous value), and then which actually sets this in App::start? We could even make this private and expose it only through a get_error_handler() function.

I think this might be closer to what bevy users expect from our API, and would eliminate the problems caused by calling it conflicting handlers or trying to change it while the app is running. Or IDK maybe we want to be able to swap it out at runtime for the editor, in which case a OnceLock isn't the right primitive either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work for standalone bevy_ecs users: they don't have access to App. We also really want this to be truly global, not set per SubApp.

That said, I am down to make this a bit more opaque to give us flexibility and a nicer API. Let me see about newtyping it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think the current is probably fine TBH. The API works, it was just a thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does work, but I don't like it! 😂

Copy link
Contributor

@NthTensor NthTensor Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could always provide a wrapper as part of App that hides the less nice ECS implementation from most users. But keep the current thing (or something like it) in place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO there should be one way to set this: no need to complicate things.

@alice-i-cecile
Copy link
Member Author

The issue is not that we don't have world-access when handling errors in commands or systems, right? The issue is observers. Is there a reason we can't go with something like the following:

Originally yes. I'm increasingly convinced that allowing &mut World access in the global handler is a bit of a trap though. Without a really good motivating use case I'm worried that providing that will encourage omni-handlers. However, I'd really like it in Commands::queue_handled, so you can respond to failed commands right at the call site nice and easily.

That can easily wait for followup though.

Honestly, I think we've landed on a pretty simple implementation. There doesn't seem to be much room for logic errors. Performance is the major concern.

Valid. I can try and spend some time benching this.

@alice-i-cecile
Copy link
Member Author

@NthTensor I'm seeing in the ballpark of 10% perf regressions on some of the command benchmarks with the configurable error handling enabled. I'm not 100% convinced that's real, but that's enough to sell me on feature-flagging this again for this PR.

Copy link
Contributor

You added a new feature but didn't add a description for it. Please update the root Cargo.toml file.

/// The default error handler. This defaults to [`panic()`],
/// but if set, the [`GLOBAL_ERROR_HANDLER`] will be used instead, enabling error handler customization.
/// The `configurable_error_handler` feature must be enabled to change this from the panicking default behavior,
/// as there may be runtime overhead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to #[inline] this instead of marking it #[cold]? It's the performance of the branch that dosn't make this call that we have to worry about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was Cart's original design 🤷🏽‍♀️ I think that's reasonable, but I don't feel like running more benches here right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this can happen during the RC

Copy link
Contributor

@NthTensor NthTensor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it's in a good state for merge now, especially given that we can still remove the feature flag during the RC if we change our minds.

I wish cargo's feature overrides were easier so we could justify this being the default. I also wonder if using cold might move the performance needle at all. Either way, I very much want this in for the first RC.

@alice-i-cecile alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Mar 18, 2025
@alice-i-cecile
Copy link
Member Author

Merging: let's see if I managed to fix CI finally...

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Mar 18, 2025
Merged via the queue into bevyengine:main with commit 5d0505a Mar 18, 2025
36 checks passed
mockersf pushed a commit that referenced this pull request Mar 18, 2025
# Objective

- ECS error handling is a lovely flagship feature for Bevy 0.16, all in
the name of reducing panics and encouraging better error handling
(#14275).
- Currently though, command and system error handling are completely
disjoint and use different mechanisms.
- Additionally, there's a number of distinct ways to set the
default/fallback/global error handler that have limited value. As far as
I can tell, this will be cfg flagged to toggle between dev and
production builds in 99.9% of cases, with no real value in more granular
settings or helpers.
- Fixes #17272

## Solution

- Standardize error handling on the OnceLock global error mechanisms
ironed out in #17215
- As discussed there, there are serious performance concerns there,
especially for commands
- I also think this is a better fit for the use cases, as it's truly
global
- Move from `SystemErrorContext` to a more general purpose
`ErrorContext`, which can handle observers and commands more clearly
- Cut the superfluous setter methods on `App` and `SubApp`
- Rename the limited (and unhelpful) `fallible_systems` example to
`error_handling`, and add an example of command error handling

## Testing

Ran the `error_handling` example.

## Notes for reviewers

- Do you see a clear way to allow commands to retain &mut World access
in the per-command custom error handlers? IMO that's a key feature here
(allowing the ad-hoc creation of custom commands), but I'm not sure how
to get there without exploding complexity.
- I've removed the feature gate on the default_error_handler: contrary
to @cart's opinion in #17215 I think that virtually all apps will want
to use this. Can you think of a category of app that a) is extremely
performance sensitive b) is fine with shipping to production with the
panic error handler? If so, I can try to gather performance numbers
and/or reintroduce the feature flag. UPDATE: see benches at the end of
this message.
- ~~`OnceLock` is in `std`: @bushrat011899 what should we do here?~~
- Do you have ideas for more automated tests for this collection of
features?

## Benchmarks

I checked the impact of the feature flag introduced: benchmarks might
show regressions. This bears more investigation. I'm still skeptical
that there are users who are well-served by a fast always panicking
approach, but I'm going to re-add the feature flag here to avoid
stalling this out.


![image](https://github.com/user-attachments/assets/237f644a-b36d-4332-9b45-76fd5cbff4d0)

---------

Co-authored-by: Zachary Harrold <zac@harrold.com.au>
github-merge-queue bot pushed a commit that referenced this pull request Mar 24, 2025
…ia the GLOBAL_ERROR_HANDLER (#18454)

# Objective

There are two related problems here:

1. Users should be able to change the fallback behavior of *all*
ECS-based errors in their application by setting the
`GLOBAL_ERROR_HANDLER`. See #18351 for earlier work in this vein.
2. The existing solution (#15500) for customizing this behavior is high
on boilerplate, not global and adds a great deal of complexity.

The consensus is that the default behavior when a parameter fails
validation should be set based on the kind of system parameter in
question: `Single` / `Populated` should silently skip the system, but
`Res` should panic. Setting this behavior at the system level is a
bandaid that makes getting to that ideal behavior more painful, and can
mask real failures (if a resource is missing but you've ignored a system
to make the Single stop panicking you're going to have a bad day).

## Solution

I've removed the existing `ParamWarnPolicy`-based configuration, and
wired up the `GLOBAL_ERROR_HANDLER`/`default_error_handler` to the
various schedule executors to properly plumb through errors .

Additionally, I've done a small cleanup pass on the corresponding
example.

## Testing

I've run the `fallible_params` example, with both the default and a
custom global error handler. The former panics (as expected), and the
latter spams the error console with warnings 🥲

## Questions for reviewers

1. Currently, failed system param validation will result in endless
console spam. Do you want me to implement a solution for warn_once-style
debouncing somehow?
2. Currently, the error reporting for failed system param validation is
very limited: all we get is that a system param failed validation and
the name of the system. Do you want me to implement improved error
reporting by bubbling up errors in this PR?
3. There is broad consensus that the default behavior for failed system
param validation should be set on a per-system param basis. Would you
like me to implement that in this PR?

My gut instinct is that we absolutely want to solve 2 and 3, but it will
be much easier to do that work (and review it) if we split the PRs
apart.

## Migration Guide

`ParamWarnPolicy` and the `WithParamWarnPolicy` have been removed
completely. Failures during system param validation are now handled via
the `GLOBAL_ERROR_HANDLER`: please see the `bevy_ecs::error` module docs
for more information.

---------

Co-authored-by: MiniaczQ <xnetroidpl@gmail.com>
mockersf pushed a commit that referenced this pull request Mar 25, 2025
…ia the GLOBAL_ERROR_HANDLER (#18454)

There are two related problems here:

1. Users should be able to change the fallback behavior of *all*
ECS-based errors in their application by setting the
`GLOBAL_ERROR_HANDLER`. See #18351 for earlier work in this vein.
2. The existing solution (#15500) for customizing this behavior is high
on boilerplate, not global and adds a great deal of complexity.

The consensus is that the default behavior when a parameter fails
validation should be set based on the kind of system parameter in
question: `Single` / `Populated` should silently skip the system, but
`Res` should panic. Setting this behavior at the system level is a
bandaid that makes getting to that ideal behavior more painful, and can
mask real failures (if a resource is missing but you've ignored a system
to make the Single stop panicking you're going to have a bad day).

I've removed the existing `ParamWarnPolicy`-based configuration, and
wired up the `GLOBAL_ERROR_HANDLER`/`default_error_handler` to the
various schedule executors to properly plumb through errors .

Additionally, I've done a small cleanup pass on the corresponding
example.

I've run the `fallible_params` example, with both the default and a
custom global error handler. The former panics (as expected), and the
latter spams the error console with warnings 🥲

1. Currently, failed system param validation will result in endless
console spam. Do you want me to implement a solution for warn_once-style
debouncing somehow?
2. Currently, the error reporting for failed system param validation is
very limited: all we get is that a system param failed validation and
the name of the system. Do you want me to implement improved error
reporting by bubbling up errors in this PR?
3. There is broad consensus that the default behavior for failed system
param validation should be set on a per-system param basis. Would you
like me to implement that in this PR?

My gut instinct is that we absolutely want to solve 2 and 3, but it will
be much easier to do that work (and review it) if we split the PRs
apart.

`ParamWarnPolicy` and the `WithParamWarnPolicy` have been removed
completely. Failures during system param validation are now handled via
the `GLOBAL_ERROR_HANDLER`: please see the `bevy_ecs::error` module docs
for more information.

---------

Co-authored-by: MiniaczQ <xnetroidpl@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events C-Usability A targeted quality-of-life change that makes Bevy easier to use D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it X-Contentious There are nontrivial implications that should be thought through
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unify system and command error handling
3 participants