
Swing-store gets unexpectedly committed between blocks #8423

Closed
mhofman opened this issue Oct 3, 2023 · 3 comments · Fixed by #8429
Assignees: mhofman
Labels: bug, chain-incident, cosmic-swingset, swing-store, SwingSet

Comments

@mhofman
Member

mhofman commented Oct 3, 2023

Describe the bug

Polkachu reported on Discord that one of their backup nodes experienced a wrong AppHash after committing block 11880370 on mainnet.

Stopping peer for error err="blockchainReactor validation error: wrong Block.Header.AppHash.  Expected 7B24CAE35F91A752BAF214680DA88793D282316596E0A636FC6B407D5BF4C6D7, got 02971C1ECE60372762967309A4FC585BFBF01A6E6A657B8BCAE3D3E1DB05EE6A" module=p2p peer={"Data":{},"Logger":{}}

After a long analysis we reproduced the problem and confirmed it is caused by the swing-store getting committed mid-block. In this case it got committed after run-0 (completing left-over work), run-1 (high-priority push price), and run-2 (timer poll), but before run-3 (first action queue item). If a restart of the node happens after such a mid-block commit, it results in an execution divergence.

Because the hangover inconsistency logic assumes that mid-block commits do not happen, the saved block height only gets updated at the end of "end block", once all runs have completed. As such, the mechanism to detect and handle hangovers does not realize that the current block was partially processed, and attempts re-execution starting from a partial block state. In this case the swing-store state ends up identical, which suggests only run-0 had been committed, but the data exported to the cosmos DB (for state-sync and consistency) does not contain the same entries as on a node that executed normally (it's missing the run-0 related exports), resulting in an AppHash error.

To Reproduce

Artifacts related to this issue are in a shared GDrive folder.

Steps to reproduce the behavior:

  1. Download agoric-11880351.tgz and extract it to ~/.agoric, following the usual steps to restore a cosmos snapshot
    1. You will likely need to set up some p2p config, such as an addr book, or configure a seed node
  2. Check out agoric-upgrade-11 and build
  3. Modify packages/cosmic-swingset/src/launch-chain.js with the patch in 0001-Pause-during-block-11880370.patch
    diff --git a/packages/cosmic-swingset/src/launch-chain.js b/packages/cosmic-swingset/src/launch-chain.js
    index f7e153152..197e76e81 100644
    --- a/packages/cosmic-swingset/src/launch-chain.js
    +++ b/packages/cosmic-swingset/src/launch-chain.js
    @@ -567,6 +567,10 @@ export async function launch({
        keepGoing = await runSwingset();
        if (!keepGoing) return;
    
    +    if (blockHeight === 11880370) {
    +      await new Promise(resolve => setTimeout(resolve, 15 * 60 * 1000));
    +    }
    +
        // Finally, process as much as we can from the actionQueue.
        await processActions(actionQueue);
      }
  4. agd start
  5. When the node begins processing block 11880370, wait 30s and killall node
    1. The halt-mid-11880370.tgz state was taken at this point
  6. agd start once again after block 11880370 commit
  7. Observe the AppHash error:
    ERR Error in validation err="wrong Block.Header.AppHash.  Expected 7B24CAE35F91A752BAF214680DA88793D282316596E0A636FC6B407D5BF4C6D7, got 02971C1ECE60372762967309A4FC585BFBF01A6E6A657B8BCAE3D3E1DB05EE6A" module=blockchain
    

The state directory captured by Polkachu after the AppHash error is also included in the GDrive folder, as well as a genesis export (modified to ignore missing replay artifacts) and a slog extracted from the flight-recorder.
Also included is a slog extract of the diverging block from reference follower node.

Expected behavior

The swing-store does not get committed in the middle of blocks.

Platform Environment

  • agoric-upgrade-11 tag.
  • Normal docker environment

Additional context

The restart slogs show that run-0 is empty, but that run-1 (PushPrice from oracle) executes like on a normal node (besides the vat replay, which is expected), which means the unexpected commit happened between those two runs, and not after run-1 or run-2, which were also executed before the restart.

One mitigation to detect this state earlier (before commit of full block) is detailed in #8424

Screenshots

See slogs and other logs in the GDrive

@mhofman mhofman added the bug, SwingSet, cosmic-swingset, and swing-store labels Oct 3, 2023
@mhofman mhofman self-assigned this Oct 3, 2023
@mhofman mhofman changed the title Polkachu wrong AppHash block 11880370 Swing-store gets unexpectedly committed between blocks Oct 3, 2023
@mhofman
Member Author

mhofman commented Oct 4, 2023

Found the issue. SQLite autocommits when releasing a savepoint that is not wrapped in a transaction. See https://stackoverflow.com/q/38622587 and https://www.sqlite.org/c3ref/get_autocommit.html
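For illustration, a minimal sketch (assuming better-sqlite3, which swing-store uses, and an in-memory database) of that behavior: a savepoint opened with no enclosing transaction starts an implicit one, and releasing it commits immediately.

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    db.exec('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)');

    // No BEGIN has been issued, so this savepoint opens an implicit transaction.
    db.exec('SAVEPOINT crank');
    console.log(db.inTransaction); // true
    db.exec("INSERT INTO kv VALUES ('k', 'v')");

    // Releasing the outermost savepoint commits the implicit transaction.
    db.exec('RELEASE SAVEPOINT crank');
    console.log(db.inTransaction); // false -- the insert is already committed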

warner added a commit that referenced this issue Oct 4, 2023
The second test fails if I comment out the #8423 fix, as expected.
@mergify mergify bot closed this as completed in #8429 Oct 4, 2023
@mhofman
Member Author

mhofman commented Oct 4, 2023

Here is what happened:

  • cosmic-swingset's "unit of update" is the block. A full block should either be executed, or not.
  • A block is a host concept, so swing-store provides a generic commit API the host can use.
  • SQLite autocommits changes unless a transaction is explicitly started with a BEGIN statement
  • The swing-store does not provide an API to explicitly start a transaction. Instead, it lazily creates one before any mutating operation
  • A block is composed of multiple runs.
    • A run executes multiple SwingSet cranks (each roughly representing a single delivery to a vat), until either there is no more work to do, or a run policy limit is reached (to avoid doing too much work in a single block)
    • Most runs start by adding some external input to SwingSet, except for the first run, which simply continues any leftover work from the previous block (if the run policy limit was reached)
  • To support proceeding forward on failed cranks while discarding the effects the failed crank had on the system, the swing-store leverages SQL savepoints.
    • A savepoint is created when starting a crank, and released when a crank completes successfully
    • Creating a savepoint is a mutating operation; however, it would NOT lazily start the block transaction
  • For runs which start by adding external input, that modification is what would start the block transaction (the bug)
  • However, for the run continuing previous work, it would just start processing cranks.
    • This is the first run of the block, so no other operations happen before it
    • For each crank executed by that first run, a savepoint is created, and when released it would get auto-committed since no block transaction was started
  • Then for any subsequent run in the block, adding input is a mutating operation performed outside a crank, so it would lazily start the block transaction at that point

This explains why only the first run would have been committed. As such, this problem occurs when there is leftover work from the previous block AND the host is interrupted after at least one crank of that leftover work has completed, but before the block commit is reached. It does not technically matter whether that block has any other actions to process; however, it would be very difficult to interrupt the host before reaching the commit point without such extra work.
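To make the shape of the bug concrete, here is a simplified sketch (hypothetical code, not the actual swing-store implementation) of the lazy-transaction pattern described above, with the savepoint path that bypassed it:

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    db.exec('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)');

    // ensureTxn lazily opens the block transaction before any mutating write.
    const ensureTxn = () => {
      if (!db.inTransaction) db.exec('BEGIN IMMEDIATE TRANSACTION');
    };

    const store = {
      set(key, value) {
        ensureTxn(); // e.g. adding external input at the start of a later run
        db.prepare('REPLACE INTO kv (key, value) VALUES (?, ?)').run(key, value);
      },
      establishCrankSavepoint(name) {
        // The gap: no ensureTxn() here. If the first crank of run-0 reaches this
        // before any other mutation, the savepoint opens an implicit txn...
        db.exec(`SAVEPOINT ${name}`);
      },
      endCrank(name) {
        // ...and releasing it auto-commits that implicit txn mid-block.
        db.exec(`RELEASE SAVEPOINT ${name}`);
      },
      commitBlock() {
        if (db.inTransaction) db.exec('COMMIT');
      },
    };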

Interestingly, implementing the defensive check detailed in #8424 would actually have solved the issue, and would prevent any similar issue, since it would effectively force a block transaction to be started when beginning a block.

@warner
Member

warner commented Oct 4, 2023

More notes for future consideration:

  • our ensureTxn() function queries db.inTransaction to decide whether it needs to do BEGIN IMMEDIATE TRANSACTION or not, instead of remembering its own flag (tested and maybe set in ensureTxn, cleared in commit)
  • it's ok to do BEGIN TRANSACTION and then SAVEPOINT
  • but running SAVEPOINT when you aren't already in a transaction will start a new (auto-committing) one
    • db.inTransaction now reports true
    • running BEGIN TRANSACTION causes a SqliteError: cannot start a transaction within a transaction
  • RELEASE SAVEPOINT clears the db.inTransaction flag, and commits the automatically-created txn
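For reference, a short better-sqlite3 sketch (an assumed in-memory setup, not swing-store code) that exercises those observations:

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');

    // BEGIN then SAVEPOINT: the savepoint nests inside the explicit transaction.
    db.exec('BEGIN TRANSACTION');
    db.exec('SAVEPOINT sp1');
    db.exec('RELEASE SAVEPOINT sp1');
    db.exec('COMMIT');

    // A bare SAVEPOINT starts an implicit, auto-committing transaction.
    db.exec('SAVEPOINT sp2');
    console.log(db.inTransaction); // true
    try {
      db.exec('BEGIN TRANSACTION');
    } catch (err) {
      console.log(err.message); // cannot start a transaction within a transaction
    }
    db.exec('RELEASE SAVEPOINT sp2'); // commits the implicit transaction
    console.log(db.inTransaction); // false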

The buggy sequence was:

  • run-0 starts, no other swingstore APIs have been called yet (e.g. device input function injecting run-queue messages)
    • kernel.js does kernelStorage.beginCrank() (which doesn't change the DB)
  • SAVEPOINT (when kernel does kernelStorage.establishCrankSavepoint())
    • so now db.inTransaction is true, but it's an implicit txn, and by the rules of SQLite savepoints it will auto-commit when the savepoint is released
  • first delivery happens, DB changes
    • ensureTxn() is called by all other swingstore mutating API functions
    • it observes db.inTransaction is true, does nothing
  • RELEASE SAVEPOINT at end of the controller.run() loop (in a finally block)
    • DB commits, db.inTransaction is cleared
  • no other DB changes happen before we return to the top of the run() loop, which starts a new savepoint
    • (if we did any DB write activity outside the kernelKeeper.establishCrankSavepoint() / .endCrank() pair, then we'd have an explicit transaction established when SAVEPOINT was called, but we don't, because everything is driven by the run-queue and we want failed deliveries to get pushed back onto the queue)

If our ensureTxn used its own flag "swingStoreInTransaction", instead of relying on the DB connection's db.inTransaction, then the sequence would be:

  • run-0 starts
  • SAVEPOINT
    • db.inTransaction === true, but swingStoreInTransaction === false
  • first delivery happens, mutating swingStore API called, it calls ensureTxn() as usual
    • ensureTxn observes swingStoreInTransaction === false
    • so it does BEGIN IMMEDIATE TRANSACTION
    • that crashes with SqliteError: cannot start a transaction within a transaction

That defense would still depend upon having ensureTxn in mutating API functions; however, the error would stand a decent chance of throwing (and thus alerting us to the issue) even if we missed a few.
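A rough sketch of that own-flag variant (hypothetical code; the real ensureTxn and commit live inside swing-store):

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    let swingStoreInTransaction = false;

    function ensureTxn() {
      if (!swingStoreInTransaction) {
        // If something else (e.g. a bare SAVEPOINT) already put the connection
        // into an implicit transaction, this BEGIN throws
        // "cannot start a transaction within a transaction" and surfaces the bug.
        db.prepare('BEGIN IMMEDIATE TRANSACTION').run();
        swingStoreInTransaction = true;
      }
    }

    function commit() {
      if (swingStoreInTransaction) {
        db.prepare('COMMIT').run();
        swingStoreInTransaction = false;
      }
    }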

Of course, we had no idea that SAVEPOINT had this behavior, so we had no reason to be paranoid about whether db.inTransaction was changing without an explicit BEGIN TRANSACTION in the first place.

mhofman pushed a commit that referenced this issue Nov 8, 2023
The second test fails if I comment out the #8423 fix, as expected.
mhofman pushed a commit that referenced this issue Nov 8, 2023
The second test fails if I comment out the #8423 fix, as expected.
mhofman pushed a commit that referenced this issue Nov 10, 2023
The second test fails if I comment out the #8423 fix, as expected.