feat: sequencer auto recover when meet an unexpected shutdown #166
Merged: owen-reorg merged 14 commits into bnb-chain:develop from krish-nr:sequencer_recover_fix on Nov 13, 2024
Prepare for v0.5.0 release
krish-nr force-pushed the sequencer_recover_fix branch from b0dca36 to 5aa9b46 (September 10, 2024 04:06)
bnoieh reviewed (Sep 14, 2024)
krish-nr force-pushed the sequencer_recover_fix branch from 38d7a87 to a11dbb4 (October 29, 2024 10:52)
krish-nr force-pushed the sequencer_recover_fix branch 3 times, most recently from 1327903 to 074a40f (November 5, 2024 09:13)
krish-nr force-pushed the sequencer_recover_fix branch from 074a40f to 487b722 (November 5, 2024 09:19)
bnoieh reviewed (Nov 7, 2024)
bnoieh previously approved these changes (Nov 11, 2024)
owen-reorg reviewed (Nov 12, 2024)
andyzhang2023 previously approved these changes (Nov 12, 2024)
LGTM
krish-nr force-pushed the sequencer_recover_fix branch from b621ed8 to c270b45 (November 13, 2024 03:18)
bnoieh approved these changes (Nov 13, 2024)
owen-reorg approved these changes (Nov 13, 2024)
Description
This PR aims to fix the issue where the sequencer node fails to recover after a crash (specifically when the sequencer is a PBSS node).
related node PR
Rationale
When a sequencer node crashes, it may fail to persist the journal in time. As a result, when Geth is restarted, the journal data cannot be read, leading to the loss of recent state data. This causes the sequencer to fail during the buildPayload process, rendering it unable to continue operating. The diagram below illustrates the sequencer block production flow, with the red sections highlighting the logic that cannot proceed due to the crash.
In the diagram, sequencerAction alternates between stages (1) and (2). In stage (1), the payload is constructed, and in stage (2), the data is persisted. Stage (1) does not block synchronously: filling the payload's transactions (txs) and some other tasks run asynchronously, while the payload itself is returned synchronously and immediately, guarded by a condition lock (cond). Therefore, until the `update` in stage (1) completes, the `getPayload` in stage (2) stays in a `cond.wait` state.
After the sequencer crashes, recent state data is lost. For example, if the sequencer crashes at block height 34123456 and the next block to be built is 34123457, the `prepareWork` phase depends on the state data of block 34123456. Because that state data was lost in the crash, this phase fails and the system never enters the `update` logic. As a result, the process remains blocked indefinitely at `getPayload`, unable to make progress.
The fix adds two routines: one to perform the recovery and one to monitor its progress. When the generate phase fails with specific error conditions, a fix routine is started. To avoid blocking the main process, a separate routine is also started to monitor the specific block being repaired. Once the data recovery is complete, the update process is retried, allowing the system to restore the state from before the crash and continue making progress.
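The blocking behaviour and the recover-then-retry idea described above can be sketched in a few lines of Go. This is a toy model, not the code in this PR: `payload`, `stateStore`, `errMissingState`, and the bodies of `prepareWork` and `buildPayload` are hypothetical stand-ins, and the "recovery" is simulated by a short sleep that puts the missing state back.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errMissingState is a hypothetical error standing in for "parent state lost".
var errMissingState = errors.New("missing parent state")

// payload is returned immediately and filled asynchronously.
type payload struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
	block uint64
}

func newPayload() *payload {
	p := &payload{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// update marks the payload as filled and wakes any waiter in get.
func (p *payload) update(block uint64) {
	p.mu.Lock()
	p.block, p.ready = block, true
	p.mu.Unlock()
	p.cond.Broadcast()
}

// get blocks until update has run (stage 2 waiting on stage 1).
func (p *payload) get() uint64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.ready {
		p.cond.Wait()
	}
	return p.block
}

// stateStore is a toy stand-in for the node's state database.
type stateStore struct {
	mu   sync.Mutex
	have map[uint64]bool
}

func (s *stateStore) has(n uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.have[n]
}

func (s *stateStore) set(n uint64) {
	s.mu.Lock()
	s.have[n] = true
	s.mu.Unlock()
}

// prepareWork fails when the parent state is missing, as after a crash.
func prepareWork(s *stateStore, parent uint64) error {
	if !s.has(parent) {
		return errMissingState
	}
	return nil
}

// buildPayload returns at once; filling, recovery, and retry run in the background.
func buildPayload(s *stateStore, parent uint64) *payload {
	p := newPayload()
	go func() {
		err := prepareWork(s, parent)
		if err == nil {
			p.update(parent + 1)
			return
		}
		if !errors.Is(err, errMissingState) {
			return // other errors are out of scope for this sketch
		}
		// Fix routine: restore the missing state (simulated).
		go func() {
			time.Sleep(10 * time.Millisecond)
			s.set(parent)
		}()
		// Monitor routine: wait for the repaired block, then retry the update.
		go func() {
			for !s.has(parent) {
				time.Sleep(time.Millisecond)
			}
			if prepareWork(s, parent) == nil {
				p.update(parent + 1)
			}
		}()
	}()
	return p
}

func main() {
	s := &stateStore{have: map[uint64]bool{}} // state of 34123456 is lost
	p := buildPayload(s, 34123456)
	fmt.Println(p.get()) // prints 34123457 once recovery and retry finish
}
```

Without the fix and monitor goroutines, `get` would wait forever, because nothing ever calls `update`; that is the hang described above.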
There are two scenarios for recovery: recovering from local data or from peers (for the sequencer, peers are its backup nodes). In most cases, the data can be recovered locally. However, there is a corner case where local recovery fails: if the sequencer has already gossiped the latest block to peer nodes but crashes during the local persistence process, the sequencer may fall behind the peer by one block. In this extreme situation, the sequencer must recover from the peer.
This is the complete sequencer recovery flow, as illustrated below.
Example
add an example CLI or API response...
Changes
Notable changes: