[Feature] Improved Panic Handling #2927

kaimast · 2025-09-17T19:55:35Z

This PR is a follow-up on #2813.

Context: Handling Fatal Errors in snarkVM

Generally, snarkVM and snarkOS aim to never panic if it is possible to avoid. However, there is always the possibility of bugs that were overlooked or ending up in a state where the code cannot recover safely.

If the code panics, there are two different scenarios:

If the panic happens inside a program execution (i.e., inside try_vm_runtime) we can safely log it and continue operating normally. snarkVM will simply treat this as a failed execution.
If the panic happens outside a program execution (i.e., anywhere else in snarkVM or in snarkOS), we most likely do not want to simply continue execution, and we definitely want to notify the user and developers about the error.
We had several instances where a thread or tokio task panicked, and, while information about the panic was printed to stdout, there was no mention of it in the logs.

While the former is already implemented. The latter was partially introduced by #2813, but this PR adds some important improvement and fixes. The behavior of try_vm_runtime remains unchanged by this, albeit its implementation changes slightly.

Improved Panic Handling Mechanisms

Panic handling as implemented in #2813 (and also before but to a limited extent) has a major shortcoming: it did not work well with multi-threading. That is because the panic handler is set per-process but, while executing a snarkVM program another thread, unrelated to execution, might panic, and snarkVM may falsely log it as a VM error.

The proposed mechanism in this PR is to store backtraces and error messages in a thread-local variable. try_vm_runtime is changed to retrieve the error message from this thread-local variable.
Additionally, there is now a catch_unwind function that wraps around the function of the same name from the standard library and returns the error message and backtrace on error. The latter is useful when printing error messages about panics, for example, in snarkOS.

Utilities for Task Management

tokio allows propagating panics through JoinHandles. This PR also provides wrappers around tokio's spawn and spawn_blocking functions that contain the boilerplate for propagating such panics.
This replaces logic that before resided in snarkos-node-bft but will be useful to other crates.

Usage

In snarkVM

snarkVM has one notable instance of spawning a background task, the sequential ops thread (see the changes in #2975). 80c81d8 adds catch_unwind to that thread to ensure future panics in the storage thread are logged properly.

In snarkOS

For snarkOS, the plan is to use catch_unwind for important background tasks such as block synchronization. Again, this is to ensure that a panic in block synchronization does not happen silently and block synchronization stops for no apparent reason.

ProvableHQ/snarkOS#3874 contains the full set of proposed changes for snarkOS.

Notes

For panic handling, there is a potential overhead of storing backtraces that might not always be needed. It is my understanding that panics do not happen very frequently in production, but if this understanding is wrong, please let me know.
Panic handling for "detached" tokio task (those without JoinHandles) does not work well (yet). Future improvements to tokio will change this, but we also do not use detached tasks much.

utilities/src/errors.rs

utilities/src/task.rs

ledger/store/src/helpers/memory/internal/nested_map.rs

utilities/src/errors.rs

vicsn · 2025-09-21T16:37:24Z

and snarkVM may falsely log it as VM error.

Is this the main and only usecase for the thread local panic logic of this PR? I'm trying to better understand whether the refactor is worth it at the moment.

I thought we printed the error in the VM panic handler. Therefore:

if it were to catch a non-VM panic, I'd assume we would clearly observe this in the logs. Clumsy, but no harm done.
if it were to catch a VM panic of the wróng thread, would there be a risk of any of the following:
a. verification of the wrong transaction being logged as wrong?
b. verification of the wrong transaction being returned as Err?
c. the above being non-deterministic, and therefore creating a risk of a fork on the network?

Cargo.toml

ledger/store/src/helpers/rocksdb/internal/nested_map.rs

utilities/src/task.rs

…eaded setting

kaimast · 2025-12-04T21:03:36Z

@vicsn I extended the PR description

kaimast force-pushed the feat/track-error branch from 34a44e4 to cec41dd Compare September 17, 2025 20:13

kaimast changed the title ~~[Feature]~~ [Feature] Improved Panic Handling Sep 17, 2025

kaimast marked this pull request as ready for review September 17, 2025 20:47

raychu86 reviewed Sep 17, 2025

View reviewed changes

utilities/src/errors.rs Show resolved Hide resolved

raychu86 reviewed Sep 17, 2025

View reviewed changes

utilities/src/errors.rs Outdated Show resolved Hide resolved

vicsn reviewed Sep 18, 2025

View reviewed changes

utilities/src/task.rs Show resolved Hide resolved

vicsn reviewed Sep 18, 2025

View reviewed changes

ledger/store/src/helpers/memory/internal/nested_map.rs Show resolved Hide resolved

vicsn reviewed Sep 18, 2025

View reviewed changes

utilities/src/errors.rs Show resolved Hide resolved

kaimast mentioned this pull request Sep 19, 2025

[WIP] Track errors ProvableHQ/snarkOS#3874

Draft

vicsn requested a review from ljedrz September 22, 2025 19:14

ljedrz reviewed Sep 22, 2025

View reviewed changes

Cargo.toml Outdated Show resolved Hide resolved

ljedrz reviewed Sep 22, 2025

View reviewed changes

ledger/store/src/helpers/rocksdb/internal/nested_map.rs Show resolved Hide resolved

ljedrz reviewed Sep 22, 2025

View reviewed changes

utilities/src/task.rs Show resolved Hide resolved

kaimast mentioned this pull request Oct 6, 2025

[Feature] LoggableError trait #2944

Merged

vicsn marked this pull request as draft October 8, 2025 17:01

kaimast force-pushed the feat/track-error branch from 1fc8711 to 5b52e96 Compare October 31, 2025 16:19

kaimast force-pushed the feat/track-error branch 2 times, most recently from 56de046 to ac0550c Compare November 13, 2025 23:27

kaimast added 2 commits November 26, 2025 18:07

feat(utilities): add utilities for propagating errors across async tasks

4295231

feat(utils/error): make catching panics work correctly in a multi-thr…

6fc9ebb

…eaded setting

kaimast force-pushed the feat/track-error branch from ac0550c to 6fc9ebb Compare November 27, 2025 02:08

feat(synth/vm): catch panics in sequential ops thread

80c81d8

kaimast marked this pull request as ready for review November 28, 2025 16:49

mitchmindtree mentioned this pull request Dec 5, 2025

[Bug] Panic on when using assert_eq on arrays ProvableHQ/leo#28992

Open

vicsn requested review from ljedrz and raychu86 December 5, 2025 09:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature] Improved Panic Handling #2927

[Feature] Improved Panic Handling #2927

Uh oh!

kaimast commented Sep 17, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vicsn commented Sep 21, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaimast commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[Feature] Improved Panic Handling #2927

Are you sure you want to change the base?

[Feature] Improved Panic Handling #2927

Uh oh!

Conversation

kaimast commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context: Handling Fatal Errors in snarkVM

Improved Panic Handling Mechanisms

Utilities for Task Management

Usage

In snarkVM

In snarkOS

Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vicsn commented Sep 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaimast commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

kaimast commented Sep 17, 2025 •

edited

Loading

vicsn commented Sep 21, 2025 •

edited

Loading