
Swing-store gets unexpectedly committed between blocks #8423

Closed
mhofman opened this issue Oct 3, 2023 · 3 comments · Fixed by #8429
Assignees: mhofman
Labels: bug, chain-incident, cosmic-swingset, swing-store, SwingSet

Comments

@mhofman
Member

mhofman commented Oct 3, 2023

Describe the bug

Polkachu reported on Discord that one of their backup nodes experienced a wrong AppHash after committing block 11880370 on mainnet.

Stopping peer for error err="blockchainReactor validation error: wrong Block.Header.AppHash.  Expected 7B24CAE35F91A752BAF214680DA88793D282316596E0A636FC6B407D5BF4C6D7, got 02971C1ECE60372762967309A4FC585BFBF01A6E6A657B8BCAE3D3E1DB05EE6A" module=p2p peer={"Data":{},"Logger":{}}

After a long analysis we reproduced the problem and confirmed it is caused by the swing-store getting committed mid-block. In this case it got committed after run-0 (completing left-over work), run-1 (high-priority push price), and run-2 (timer poll), but before run-3 (first action queue item). If a restart of the node happens after such a mid-block commit, it results in an execution divergence.

Because the hangover inconsistency logic assumes that mid-block commits do not happen, the saved block height only gets updated at the end of "end block", once all runs have completed. As such, the mechanism to detect and handle hangovers does not realize that the current block was partially processed, and attempts re-execution starting from a partial block state. In this case the swing-store state ends up identical, which suggests only run-0 had been committed, but the data exported to the cosmos DB (for state-sync and consistency) does not contain the same entries as on a node that executed normally (it's missing the run-0 related exports), resulting in an AppHash error.

To Reproduce

Artifacts related to this issue are in a shared GDrive folder.

Steps to reproduce the behavior:

  1. Download agoric-11880351.tgz and extract it to ~/.agoric, following the usual steps to restore a cosmos snapshot
    1. You will likely need to set up some p2p config, such as an addr book, or configure a seed node
  2. Check out agoric-upgrade-11 and build
  3. Modify packages/cosmic-swingset/src/launch-chain.js with the patch in 0001-Pause-during-block-11880370.patch
    diff --git a/packages/cosmic-swingset/src/launch-chain.js b/packages/cosmic-swingset/src/launch-chain.js
    index f7e153152..197e76e81 100644
    --- a/packages/cosmic-swingset/src/launch-chain.js
    +++ b/packages/cosmic-swingset/src/launch-chain.js
    @@ -567,6 +567,10 @@ export async function launch({
        keepGoing = await runSwingset();
        if (!keepGoing) return;
    
    +    if (blockHeight === 11880370) {
    +      await new Promise(resolve => setTimeout(resolve, 15 * 60 * 1000));
    +    }
    +
        // Finally, process as much as we can from the actionQueue.
        await processActions(actionQueue);
      }
  4. agd start
  5. When the node begins processing block 11880370, wait 30s and killall node
    1. The halt-mid-11880370.tgz state was taken at this point
  6. agd start once again after block 11880370 commit
  7. Observe the AppHash error:
    ERR Error in validation err="wrong Block.Header.AppHash.  Expected 7B24CAE35F91A752BAF214680DA88793D282316596E0A636FC6B407D5BF4C6D7, got 02971C1ECE60372762967309A4FC585BFBF01A6E6A657B8BCAE3D3E1DB05EE6A" module=blockchain
    

The state directory captured by Polkachu after the AppHash error is also included in the GDrive folder, as well as a genesis export (modified to ignore missing replay artifacts) and a slog extracted from the flight-recorder.
Also included is a slog extract of the diverging block from reference follower node.

Expected behavior

The swing-store does not get committed in the middle of blocks.

Platform Environment

  • agoric-upgrade-11 tag.
  • Normal docker environment

Additional context

The restart slogs show that run-0 is empty, but that run-1 (PushPrice from oracle) executes like on a normal node (besides the vat replay, which is expected), which means the unexpected commit happened between those two runs, and not after run-1 or run-2, which were also executed before the restart.

One mitigation to detect this state earlier (before commit of full block) is detailed in #8424

Screenshots

See slogs and other logs in the GDrive

@mhofman mhofman added the bug, SwingSet, cosmic-swingset, and swing-store labels Oct 3, 2023
@mhofman mhofman self-assigned this Oct 3, 2023
@mhofman mhofman changed the title Polkachu wrong AppHash block 11880370 Swing-store gets unexpectedly committed between blocks Oct 3, 2023
@mhofman
Member Author

mhofman commented Oct 4, 2023

Found the issue. SQLite autocommits when releasing a savepoint that is not wrapped in a transaction. See https://stackoverflow.com/q/38622587 and https://www.sqlite.org/c3ref/get_autocommit.html
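For illustration, a minimal sketch (assuming better-sqlite3, which swing-store uses, and an in-memory database) of that behavior: a savepoint opened with no enclosing transaction starts an implicit one, and releasing it commits immediately.

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    db.exec('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)');

    // No BEGIN has been issued, so this savepoint opens an implicit transaction.
    db.exec('SAVEPOINT crank');
    console.log(db.inTransaction); // true
    db.exec("INSERT INTO kv VALUES ('k', 'v')");

    // Releasing the outermost savepoint commits the implicit transaction.
    db.exec('RELEASE SAVEPOINT crank');
    console.log(db.inTransaction); // false -- the insert is already committed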

warner added a commit that referenced this issue Oct 4, 2023
The second test fails if I comment out the #8423 fix, as expected.
@mergify mergify bot closed this as completed in #8429 Oct 4, 2023
@mhofman
Member Author

mhofman commented Oct 4, 2023

Here is what happened:

  • cosmic-swingset's "unit of update" is the block. A full block should either be executed, or not.
  • A block is a host concept, so swing-store provides a generic commit API the host can use.
  • SQLite autocommits changes unless a transaction is explicitly started with a BEGIN statement
  • The swing-store does not provide an API to explicitly start a transaction. Instead, it lazily creates one before any mutating operation
  • A block is composed of multiple runs.
    • A run executes multiple SwingSet cranks (each roughly representing a single delivery to a vat), until either there is no more work to do, or a run policy limit is reached (to avoid doing too much work in a single block)
    • Most runs start by adding some external input to SwingSet, except for the first run, which simply continues any leftover work from the previous block (if the run policy limit was reached)
  • To support proceeding forward on failed cranks while discarding the effects the failed crank had on the system, the swing-store leverages SQL savepoints.
    • A savepoint is created when starting a crank, and released when a crank completes successfully
    • Creating a savepoint is a mutating operation; however, it would NOT lazily start the block transaction
  • For runs which start by adding external input, that modification is what would start the block transaction (the bug)
  • However, for the run continuing previous work, it would just start processing cranks.
    • This is the first run of the block, so no other operations happen before it
    • For each crank executed by that first run, a savepoint is created, and when released it would get auto-committed since no block transaction was started
  • Then for any subsequent run in the block, adding input is a mutating operation performed outside a crank, so it would lazily start the block transaction at that point

This explains why only the first run would have been committed. As such, this problem occurs when there is leftover work from the previous block AND the host is interrupted after at least one crank of that leftover work has completed, but before the block commit is reached. It does not technically matter whether that block has any other actions to process; however, it would be very difficult to interrupt the host before reaching the commit point without such extra work.
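To make the shape of the bug concrete, here is a simplified sketch (hypothetical code, not the actual swing-store implementation) of the lazy-transaction pattern described above, with the savepoint path that bypassed it:

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    db.exec('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)');

    // ensureTxn lazily opens the block transaction before any mutating write.
    const ensureTxn = () => {
      if (!db.inTransaction) db.exec('BEGIN IMMEDIATE TRANSACTION');
    };

    const store = {
      set(key, value) {
        ensureTxn(); // e.g. adding external input at the start of a later run
        db.prepare('REPLACE INTO kv (key, value) VALUES (?, ?)').run(key, value);
      },
      establishCrankSavepoint(name) {
        // The gap: no ensureTxn() here. If the first crank of run-0 reaches this
        // before any other mutation, the savepoint opens an implicit txn...
        db.exec(`SAVEPOINT ${name}`);
      },
      endCrank(name) {
        // ...and releasing it auto-commits that implicit txn mid-block.
        db.exec(`RELEASE SAVEPOINT ${name}`);
      },
      commitBlock() {
        if (db.inTransaction) db.exec('COMMIT');
      },
    };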

Interestingly, implementing the defensive check detailed in #8424 would actually have solved the issue, and would prevent any similar issue, since it would effectively force a block transaction to be started when beginning a block.

@warner
Member

warner commented Oct 4, 2023

More notes for future consideration:

  • our ensureTxn() function queries db.inTransaction to decide whether it needs to do BEGIN IMMEDIATE TRANSACTION or not, instead of remembering its own flag (tested and maybe set in ensureTxn, cleared in commit)
  • it's ok to do BEGIN TRANSACTION and then SAVEPOINT
  • but running SAVEPOINT when you aren't already in a transaction will start a new (auto-committing) one
    • db.inTransaction now reports true
    • running BEGIN TRANSACTION causes a SqliteError: cannot start a transaction within a transaction
  • RELEASE SAVEPOINT clears the db.inTransaction flag, and commits the automatically-created txn
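For reference, a short better-sqlite3 sketch (an assumed in-memory setup, not swing-store code) that exercises those observations:

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');

    // BEGIN then SAVEPOINT: the savepoint nests inside the explicit transaction.
    db.exec('BEGIN TRANSACTION');
    db.exec('SAVEPOINT sp1');
    db.exec('RELEASE SAVEPOINT sp1');
    db.exec('COMMIT');

    // A bare SAVEPOINT starts an implicit, auto-committing transaction.
    db.exec('SAVEPOINT sp2');
    console.log(db.inTransaction); // true
    try {
      db.exec('BEGIN TRANSACTION');
    } catch (err) {
      console.log(err.message); // cannot start a transaction within a transaction
    }
    db.exec('RELEASE SAVEPOINT sp2'); // commits the implicit transaction
    console.log(db.inTransaction); // false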

The buggy sequence was:

  • run-0 starts, no other swingstore APIs have been called yet (e.g. device input function injecting run-queue messages)
    • kernel.js does kernelStorage.beginCrank() (which doesn't change the DB)
  • SAVEPOINT (when kernel does kernelStorage.establishCrankSavepoint())
    • so now db.inTransaction is true, but it's an implicit txn, and by the rules of SQLite savepoints it will auto-commit when the savepoint is released
  • first delivery happens, DB changes
    • ensureTxn() is called by all other swingstore mutating API functions
    • it observes db.inTransaction is true, does nothing
  • RELEASE SAVEPOINT at end of the controller.run() loop (in a finally block)
    • DB commits, db.inTransaction is cleared
  • no other DB changes happen before we return to the top of the run() loop, which starts a new savepoint
    • (if we did any DB write activity outside the kernelKeeper.establishCrankSavepoint() / .endCrank() pair, then we'd have an explicit transaction established when SAVEPOINT was called, but we don't, because everything is driven by the run-queue and we want failed deliveries to get pushed back onto the queue)

If our ensureTxn used its own flag "swingStoreInTransaction", instead of relying on the DB connection's db.inTransaction, then the sequence would be:

  • run-0 starts
  • SAVEPOINT
    • db.inTransaction === true, but swingStoreInTransaction === false
  • first delivery happens, mutating swingStore API called, it calls ensureTxn() as usual
    • ensureTxn observes swingStoreInTransaction === false
    • so it does BEGIN IMMEDIATE TRANSACTION
    • that crashes with SqliteError: cannot start a transaction within a transaction

That defense would still depend upon having ensureTxn in mutating API functions; however, the error would stand a decent chance of throwing (and thus alerting us to the issue) even if we missed a few.
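A rough sketch of that own-flag variant (hypothetical code; the real ensureTxn and commit live inside swing-store):

    import Database from 'better-sqlite3';

    const db = new Database(':memory:');
    let swingStoreInTransaction = false;

    function ensureTxn() {
      if (!swingStoreInTransaction) {
        // If something else (e.g. a bare SAVEPOINT) already put the connection
        // into an implicit transaction, this BEGIN throws
        // "cannot start a transaction within a transaction" and surfaces the bug.
        db.prepare('BEGIN IMMEDIATE TRANSACTION').run();
        swingStoreInTransaction = true;
      }
    }

    function commit() {
      if (swingStoreInTransaction) {
        db.prepare('COMMIT').run();
        swingStoreInTransaction = false;
      }
    }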

Of course, we had no idea that SAVEPOINT had this behavior, so we had no reason to be paranoid about whether db.inTransaction was changing without an explicit BEGIN TRANSACTION in the first place.

mhofman pushed a commit that referenced this issue Nov 8, 2023
The second test fails if I comment out the #8423 fix, as expected.
mhofman pushed a commit that referenced this issue Nov 8, 2023
The second test fails if I comment out the #8423 fix, as expected.
mhofman pushed a commit that referenced this issue Nov 10, 2023
The second test fails if I comment out the #8423 fix, as expected.