Skip to content

Conversation

@kaimast
Copy link
Contributor

@kaimast kaimast commented Dec 5, 2025

This PR aims to fix #3911 and #3636.

Recently, we added CheckBlockError as a machine-readable response to failures in block checks. The proposed changes use this new error type to differentiate between invalid blocks, and cases where the ledger has already advanced, and the proposed block is contained in the ledger already.
The snarkVM changes also ensure that calls to the ledger are atomic, but provides no transactional guarantees across multiple invocations of Ledger functions.

BFT has a lock field that aims to protect, block advancement. This lock already existed before this change, but was not accessible from outside the struct. A problem with this setup is that check_block_content and advance_to_next_block are dependent operations. Generally, one assumes that if the former succeeds, the latter will also. However, without additional locking, BFT can call advance_to_block between Sync's calls of check_block_content and advance_to_next_block. In this case, Sync second call will fail.

The implementation implements an approached based on optimistic concurrency control to address this. Sync builds a chain of PendingBlocks without grabbing the BFT lock (otherwise, Sync would hold the lock for long periods of time).
Once a chain of pending blocks has reached the availability threshold, it will grab the lock, check their contents for correctness, and advance the ledger. In this step, check_block_content will verify the height of the block again and, if the height is below the current ledger height, Sync will abort and retry.

The main change for BFT is that update_dag does not grab the lock anymore, but callers are required to grab the lock before calling update_dag. There is a new method BFT::pause_for_block_sync that Sync calls before trying to commit blocks.
I removed the ALLOW_LEDGER_ACCESS parameter from update_dag to remove complexity from this function. This flag was only set to false for some tests, and does not seem to be needed any more even in this case, as the LedgerService abstraction allows to simply ignore writes. I would like to eventually have separate update_dag functions for sync and BFT, but this PR already reduces the complexity of the function noticeably.

@kaimast kaimast changed the base branch from staging to fix/revert-revert-pending-blocks December 5, 2025 18:25
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from e8a483a to c32278a Compare December 5, 2025 18:49
Base automatically changed from fix/revert-revert-pending-blocks to staging December 5, 2025 22:03
@kaimast kaimast force-pushed the fix/sync-race-conditions branch 6 times, most recently from 648a2f7 to fe03df4 Compare December 8, 2025 16:58
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from fe03df4 to bd73f5f Compare December 8, 2025 18:40
@kaimast kaimast marked this pull request as ready for review December 8, 2025 18:41
@kaimast kaimast requested review from ljedrz, raychu86 and vicsn December 8, 2025 19:37
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from b6b59ba to 31fbc06 Compare December 8, 2025 19:44
@vicsn vicsn removed request for ljedrz, raychu86 and vicsn December 9, 2025 13:18
@vicsn vicsn marked this pull request as draft December 9, 2025 13:18
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from bcbd681 to 0214517 Compare December 9, 2025 22:27
@kaimast kaimast force-pushed the fix/sync-race-conditions branch from 0214517 to c4f62ba Compare December 9, 2025 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Errors during sync race conditions

2 participants