[Fix] Pause BFT block advancement during sync #4039
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR aims to fix #3911 and #3636.
Recently, we added
CheckBlockErroras a machine-readable response to failures in block checks. The proposed changes use this new error type to differentiate between invalid blocks, and cases where the ledger has already advanced, and the proposed block is contained in the ledger already.The snarkVM changes also ensure that calls to the ledger are atomic, but provides no transactional guarantees across multiple invocations of
Ledgerfunctions.BFThas alockfield that aims to protect, block advancement. This lock already existed before this change, but was not accessible from outside the struct. A problem with this setup is thatcheck_block_contentandadvance_to_next_blockare dependent operations. Generally, one assumes that if the former succeeds, the latter will also. However, without additional locking,BFTcan calladvance_to_blockbetweenSync's calls ofcheck_block_contentandadvance_to_next_block. In this case,Syncsecond call will fail.The implementation implements an approached based on optimistic concurrency control to address this. Sync builds a chain of
PendingBlocks without grabbing theBFTlock (otherwise, Sync would hold the lock for long periods of time).Once a chain of pending blocks has reached the availability threshold, it will grab the lock, check their contents for correctness, and advance the ledger. In this step,
check_block_contentwill verify the height of the block again and, if the height is below the current ledger height,Syncwill abort and retry.The main change for
BFTis thatupdate_dagdoes not grab thelockanymore, but callers are required to grab the lock before callingupdate_dag. There is a new methodBFT::pause_for_block_syncthatSynccalls before trying to commit blocks.I removed the
ALLOW_LEDGER_ACCESSparameter fromupdate_dagto remove complexity from this function. This flag was only set tofalsefor some tests, and does not seem to be needed any more even in this case, as theLedgerServiceabstraction allows to simply ignore writes. I would like to eventually have separateupdate_dagfunctions for sync and BFT, but this PR already reduces the complexity of the function noticeably.