Skip to content

Conversation

@kim
Copy link
Contributor

@kim kim commented Apr 16, 2025

PR #2550 removed performing the reducer call inside spawn_blocking,
thereby introducing the regression to never invoke the on_panic
callback.

This causes a database to continue accepting writes even if it is unable
to persist the transactions via its Durability impl. Instead, the
on_panic callback should remove the database from the
HostController, which renders this database unavailable, but does
not affect any other databases managed by the controller.

The behavior is fine in panic = abort environments, yet until we have
a better way to propagate durability errors as values to the
ModuleHost, we can't abort the server in multi-tenant environments.

Thus, restore the original behavior (using spawn instead of the more
expensive spawn_blocking).

API and ABI breaking changes

  • no

Expected complexity level and risk

1

Testing

I don't know how to write an automated test for this, because I don't know how
to produce a module wasm blob inside a test.

Instead, the behavior can be tested by modifying the code:

  • remove the panic hook in standalone main
  • modify the local durability impl to panic append_tx once
    self.durable_tx_offset() returns, say > 10
  • deploy a module and run a number of reducer calls until the panic occurs
  • observe that subsequent calls return a server error

Note that standalone's lazy database instantiation logic will cause the database
to be re-started on each subsequent reducer call after the durability impl, only
to then fail the call. That's ok, because standalone actually aborts the server,
so we'll never do this in practice. In a replicated setting, the crash should
cause a new leader to be elected.

PR #2550 removed performing the reducer call inside `spawn_blocking`,
thereby introducing the regression to never invoke the `on_panic`
callback.

This causes a database to continue accepting writes even if it is unable
to persist the transactions via its `Durability` impl. Instead, the
`on_panic` callback should remove the database from the
`HostController`, which renders _this_ database unavailable, but does
not affect any other databases managed by the controller.

The behavior is fine in `panic = abort` environments, yet until we have
a better way to propagate durability errors as values to the
`ModuleHost`, we can't abort the server in multi-tenant environments.

Thus, restore the original behavior (using `spawn` instead of the more
expensive `spawn_blocking`).
@kim kim requested review from coolreader18, gefjon and jsdt April 16, 2025 08:02
@kim kim added this pull request to the merge queue Apr 16, 2025
Merged via the queue into master with commit a3614ce Apr 16, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants