-
Notifications
You must be signed in to change notification settings - Fork 11.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify handle_checkpoint_from_consensus
#12212
Simplify handle_checkpoint_from_consensus
#12212
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
@longbowlu Wondering if you can provide more detail on what this is fixing? |
If we don't think Stepping back, it may be best efforts, but I think having it is much better than doing two extra DB looks ups on every call. We may be able to fix it by letting consensus insert to that table as well. |
I think we don't want to insert to this table in the consensus path - the purpose of this table was to make it faster for fullnodes to read old, uncached checkpoints, both for execution and for serving via state sync. On validators we can have a much stronger assumption of temporal locality - they should only be reading / serving checkpoints that were created very recently and are still in cache. |
Also, I don't think full_checkpoint_contents needs to be considered as far as watermarks go. The watermark says that the checkpoint contents are available. It doesn't say from which table it is available. Whenever we attempt to read a checkpoint, we try full_checkpoint_contents first, and fall back to scattered reads if it is unavailable. |
691f44d
to
1ea422d
Compare
1ea422d
to
89d6407
Compare
updated the code base on feedback - now we directly bump both highest_verified and highest_synced watermark, but skip the DB lookup. |
89d6407
to
8edc1b7
Compare
handle_checkpoint_from_consensus
let prev_digest = *self.store.get_checkpoint_by_sequence_number(checkpoint.sequence_number() - 1) | ||
.expect("store operation should not fail") | ||
.unwrap_or_else(|| panic!("Got checkpoint {} from consensus but cannot find checkpoint {} in certified_checkpoints", checkpoint.sequence_number(), checkpoint.sequence_number() - 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this should be fine even with the genesis checkpoint since consensus wont send us the genesis checkpoint right? If not i'm sure we could tweak this to account for checkpoint 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I believe so, consensus won't send state sync genesis checkpoint because it's already certified
if *checkpoint.sequence_number() > next_sequence_number { | ||
debug!( | ||
"consensus sent too new of a checkpoint, expecting: {}, got: {}", | ||
next_sequence_number, | ||
checkpoint.sequence_number() | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your PR description and the comment below mentions that this shouldn't happen right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, being a bit nimble here because there could be bugs somewhere that we overlook
8edc1b7
to
54d7cb6
Compare
This reverts commit f298251.
This reverts commit f298251.
## Description In #12212, we refactored `handle_checkpoint_from_consensus`. The PR missed one thing: when a last checkpoint of an epoch is certified in consensus and sent to state sync, we do not update the committee store. This is problematic because state sync does not reconfig. It means if a validator is slow in reconfiguration (where committee store would also be updated) and gets checkpoints in the next epoch from peers, it will panic in that it couldn't find the committee. This PR fixes it. ## Test Plan ``` for f in `seq 1 20`; do MSIM_TEST_NUM=5 MSIM_TEST_SEED=98$f RUST_LOG=off simt test_simulated_load_reconfig_restarts --no-capture > log-$f 2>&1 &; done ``` ``` grep FAIL log-* ``` --- If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process. ### Type of Change (Check all that apply) - [ ] protocol change - [ ] user-visible impact - [ ] breaking change for a client SDKs - [ ] breaking change for FNs (FN binary must upgrade) - [ ] breaking change for validators or node operators (must upgrade binaries) - [ ] breaking change for on-chain data layout - [ ] necessitate either a data wipe or data migration ### Release notes Insert committee when handling checkpoint from consensus.
Description
Based on feedback, now this PR simplifies
handle_checkpoint_from_consensus
. Previously there is a branch to handle when the checkpoint is too new (newer than the highest_verified_checkpoint). I think this case wouldn't happen because consensus sends checkpoint in order. So we consolidate the two cases into one, which is, whenever consensus sends a checkpoint, we do the following things:Old
Description
When consensus sends checkpoint y, it means all prior checkpoints are verified and sent, so
highest_verified_checkpoint
can be safely updated without looking at the checkpoint store. However it's a bit tricky forhighest_synced_checkpoint
, because there are a few cfs (e.g.full_checkpoint_contents
) that could only be updated in state sync checkpoint contents. If we bump this watermarks, those cfs may not be filled at all.Test Plan
CI
If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.
Type of Change (Check all that apply)
Release notes
Do not update highest_synced_checkpoint when consensus sends a too new checkpoint