Skip to content

Conversation

@kaimast
Copy link
Collaborator

@kaimast kaimast commented Sep 17, 2025

This PR is a follow-up on #2813.

Context: Handling Fatal Errors in snarkVM

Generally, snarkVM and snarkOS aim to never panic if it is possible to avoid. However, there is always the possibility of bugs that were overlooked or ending up in a state where the code cannot recover safely.

If the code panics, there are two different scenarios:

  • If the panic happens inside a program execution (i.e., inside try_vm_runtime) we can safely log it and continue operating normally. snarkVM will simply treat this as a failed execution.
  • If the panic happens outside a program execution (i.e., anywhere else in snarkVM or in snarkOS), we most likely do not want to simply continue execution, and we definitely want to notify the user and developers about the error.
    We had several instances where a thread or tokio task panicked, and, while information about the panic was printed to stdout, there was no mention of it in the logs.

While the former is already implemented. The latter was partially introduced by #2813, but this PR adds some important improvement and fixes. The behavior of try_vm_runtime remains unchanged by this, albeit its implementation changes slightly.

Improved Panic Handling Mechanisms

Panic handling as implemented in #2813 (and also before but to a limited extent) has a major shortcoming: it did not work well with multi-threading. That is because the panic handler is set per-process but, while executing a snarkVM program another thread, unrelated to execution, might panic, and snarkVM may falsely log it as a VM error.

The proposed mechanism in this PR is to store backtraces and error messages in a thread-local variable. try_vm_runtime is changed to retrieve the error message from this thread-local variable.
Additionally, there is now a catch_unwind function that wraps around the function of the same name from the standard library and returns the error message and backtrace on error. The latter is useful when printing error messages about panics, for example, in snarkOS.

Utilities for Task Management

tokio allows propagating panics through JoinHandles. This PR also provides wrappers around tokio's spawn and spawn_blocking functions that contain the boilerplate for propagating such panics.
This replaces logic that before resided in snarkos-node-bft but will be useful to other crates.

Usage

In snarkVM

snarkVM has one notable instance of spawning a background task, the sequential ops thread (see the changes in #2975). 80c81d8 adds catch_unwind to that thread to ensure future panics in the storage thread are logged properly.

In snarkOS

For snarkOS, the plan is to use catch_unwind for important background tasks such as block synchronization. Again, this is to ensure that a panic in block synchronization does not happen silently and block synchronization stops for no apparent reason.

ProvableHQ/snarkOS#3874 contains the full set of proposed changes for snarkOS.

Notes

  • For panic handling, there is a potential overhead of storing backtraces that might not always be needed. It is my understanding that panics do not happen very frequently in production, but if this understanding is wrong, please let me know.
  • Panic handling for "detached" tokio task (those without JoinHandles) does not work well (yet). Future improvements to tokio will change this, but we also do not use detached tasks much.

@kaimast kaimast changed the title [Feature] [Feature] Improved Panic Handling Sep 17, 2025
@kaimast kaimast marked this pull request as ready for review September 17, 2025 20:47
@vicsn
Copy link
Collaborator

vicsn commented Sep 21, 2025

and snarkVM may falsely log it as VM error.

Is this the main and only usecase for the thread local panic logic of this PR? I'm trying to better understand whether the refactor is worth it at the moment.

I thought we printed the error in the VM panic handler. Therefore:

  1. if it were to catch a non-VM panic, I'd assume we would clearly observe this in the logs. Clumsy, but no harm done.
  2. if it were to catch a VM panic of the wróng thread, would there be a risk of any of the following:
    a. verification of the wrong transaction being logged as wrong?
    b. verification of the wrong transaction being returned as Err?
    c. the above being non-deterministic, and therefore creating a risk of a fork on the network?

@vicsn vicsn requested a review from ljedrz September 22, 2025 19:14
@vicsn vicsn marked this pull request as draft October 8, 2025 17:01
@kaimast kaimast force-pushed the feat/track-error branch 2 times, most recently from 56de046 to ac0550c Compare November 13, 2025 23:27
@kaimast kaimast marked this pull request as ready for review November 28, 2025 16:49
@kaimast
Copy link
Collaborator Author

kaimast commented Dec 4, 2025

@vicsn I extended the PR description

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants