Simplify `handle_checkpoint_from_consensus` #12212

longbowlu · 2023-05-25T21:34:26Z

Description

Based on feedback, now this PR simplifies handle_checkpoint_from_consensus. Previously there is a branch to handle when the checkpoint is too new (newer than the highest_verified_checkpoint). I think this case wouldn't happen because consensus sends checkpoint in order. So we consolidate the two cases into one, which is, whenever consensus sends a checkpoint, we do the following things:

check the previous digest matches. This is to ensure state sync and consensus do not brain-split.
if the checkpoint is older, ignore and exit
update metrics and bump watermark
notify others

Old

Description

When consensus sends checkpoint y, it means all prior checkpoints are verified and sent, so highest_verified_checkpoint can be safely updated without looking at the checkpoint store. However it's a bit tricky for highest_synced_checkpoint, because there are a few cfs (e.g. full_checkpoint_contents) that could only be updated in state sync checkpoint contents. If we bump this watermarks, those cfs may not be filled at all.

Test Plan

CI

If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.

Type of Change (Check all that apply)

protocol change
user-visible impact
breaking change for a client SDKs
breaking change for FNs (FN binary must upgrade)
breaking change for validators or node operators (must upgrade binaries)
breaking change for on-chain data layout
necessitate either a data wipe or data migration

Release notes

Do not update highest_synced_checkpoint when consensus sends a too new checkpoint

vercel · 2023-05-25T21:34:34Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
explorer	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 1, 2023 7:07pm
explorer-storybook	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 1, 2023 7:07pm
offline-signer-helper	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 1, 2023 7:07pm
sui-wallet-kit	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 1, 2023 7:07pm
wallet-adapter	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 1, 2023 7:07pm

aschran · 2023-05-30T18:50:09Z

@longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.

longbowlu · 2023-05-30T20:25:10Z

@longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.

If we don't think full_checkpoint_contents needs to be filled then we can bump highest_synced_checkpoint directly too. The current code looks into the DB which is not needed.

Stepping back, it may be best efforts, but I think having it is much better than doing two extra DB looks ups on every call. We may be able to fix it by letting consensus insert to that table as well.

mystenmark · 2023-05-30T20:45:14Z

@longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.

If we don't think full_checkpoint_contents needs to be filled then we can bump highest_synced_checkpoint directly too. The current code looks into the DB which is not needed.

Stepping back, it may be best efforts, but I think having it is much better than doing two extra DB looks ups on every call. We may be able to fix it by letting consensus insert to that table as well.

I think we don't want to insert to this table in the consensus path - the purpose of this table was to make it faster for fullnodes to read old, uncached checkpoints, both for execution and for serving via state sync. On validators we can have a much stronger assumption of temporal locality - they should only be reading / serving checkpoints that were created very recently and are still in cache.

mystenmark · 2023-05-30T20:47:17Z

Also, I don't think full_checkpoint_contents needs to be considered as far as watermarks go. The watermark says that the checkpoint contents are available. It doesn't say from which table it is available. Whenever we attempt to read a checkpoint, we try full_checkpoint_contents first, and fall back to scattered reads if it is unavailable.

longbowlu · 2023-05-31T03:44:57Z

updated the code base on feedback - now we directly bump both highest_verified and highest_synced watermark, but skip the DB lookup.

bmwill · 2023-06-01T14:59:20Z

crates/sui-network/src/state_sync/mod.rs

+        let prev_digest = *self.store.get_checkpoint_by_sequence_number(checkpoint.sequence_number() - 1)
+            .expect("store operation should not fail")
+            .unwrap_or_else(|| panic!("Got checkpoint {} from consensus but cannot find checkpoint {} in certified_checkpoints", checkpoint.sequence_number(), checkpoint.sequence_number() - 1))


IIRC this should be fine even with the genesis checkpoint since consensus wont send us the genesis checkpoint right? If not i'm sure we could tweak this to account for checkpoint 0

yeah I believe so, consensus won't send state sync genesis checkpoint because it's already certified

bmwill · 2023-06-01T15:00:53Z

crates/sui-network/src/state_sync/mod.rs

+        if *checkpoint.sequence_number() > next_sequence_number {
            debug!(
                "consensus sent too new of a checkpoint, expecting: {}, got: {}",
                next_sequence_number,
                checkpoint.sequence_number()
            );
+        }


Your PR description and the comment below mentions that this shouldn't happen right?

right, being a bit nimble here because there could be bugs somewhere that we overlook

This reverts commit f298251.

## Description In #12212, we refactored `handle_checkpoint_from_consensus`. The PR missed one thing: when a last checkpoint of an epoch is certified in consensus and sent to state sync, we do not update the committee store. This is problematic because state sync does not reconfig. It means if a validator is slow in reconfiguration (where committee store would also be updated) and gets checkpoints in the next epoch from peers, it will panic in that it couldn't find the committee. This PR fixes it. ## Test Plan ``` for f in `seq 1 20`; do MSIM_TEST_NUM=5 MSIM_TEST_SEED=98$f RUST_LOG=off simt test_simulated_load_reconfig_restarts --no-capture > log-$f 2>&1 &; done ``` ``` grep FAIL log-* ``` --- If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process. ### Type of Change (Check all that apply) - [ ] protocol change - [ ] user-visible impact - [ ] breaking change for a client SDKs - [ ] breaking change for FNs (FN binary must upgrade) - [ ] breaking change for validators or node operators (must upgrade binaries) - [ ] breaking change for on-chain data layout - [ ] necessitate either a data wipe or data migration ### Release notes Insert committee when handling checkpoint from consensus.

vercel bot deployed to Preview – offline-signer-helper May 25, 2023 21:34 View deployment

Base automatically changed from do-not-insert-checkpoint-when-get_checkpoint_by_sequence_number-returns-data to main May 30, 2023 06:55

longbowlu requested review from andll, aschran, bmwill and mystenmark May 30, 2023 17:52

longbowlu changed the title ~~do not update highest_synced_checkpoint in some cases~~ do not update highest_synced_checkpoint when consensus sends a too new checkpoint May 30, 2023

vercel bot deployed to Preview – offline-signer-helper May 30, 2023 17:56 View deployment

vercel bot temporarily deployed to Preview – sui-wallet-kit May 30, 2023 17:56 Inactive

vercel bot temporarily deployed to Preview – wallet-adapter May 30, 2023 17:56 Inactive

vercel bot deployed to Preview – explorer May 30, 2023 17:56 View deployment

vercel bot temporarily deployed to Preview – explorer-storybook May 30, 2023 17:57 Inactive

longbowlu marked this pull request as ready for review May 30, 2023 18:02

longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 691f44d to 1ea422d Compare May 31, 2023 03:22

vercel bot deployed to Preview – explorer May 31, 2023 03:23 View deployment

vercel bot temporarily deployed to Preview – sui-wallet-kit May 31, 2023 03:23 Inactive

vercel bot temporarily deployed to Preview – wallet-adapter May 31, 2023 03:24 Inactive

vercel bot temporarily deployed to Preview – explorer-storybook May 31, 2023 03:24 Inactive

vercel bot deployed to Preview – offline-signer-helper May 31, 2023 03:24 View deployment

longbowlu changed the base branch from main to releases/sui-v1.2.0-release May 31, 2023 03:26

longbowlu changed the base branch from releases/sui-v1.2.0-release to main May 31, 2023 03:26

longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 1ea422d to 89d6407 Compare May 31, 2023 03:43

vercel bot deployed to Preview – offline-signer-helper May 31, 2023 03:43 View deployment

vercel bot temporarily deployed to Preview – explorer-storybook May 31, 2023 03:44 Inactive

vercel bot deployed to Preview – explorer May 31, 2023 03:44 View deployment

vercel bot temporarily deployed to Preview – wallet-adapter May 31, 2023 03:44 Inactive

vercel bot temporarily deployed to Preview – sui-wallet-kit May 31, 2023 03:45 Inactive

longbowlu marked this pull request as draft May 31, 2023 04:07

longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 89d6407 to 8edc1b7 Compare May 31, 2023 23:31

vercel bot temporarily deployed to Preview – wallet-adapter May 31, 2023 23:31 Inactive

vercel bot deployed to Preview – explorer May 31, 2023 23:32 View deployment

vercel bot temporarily deployed to Preview – sui-wallet-kit May 31, 2023 23:32 Inactive

vercel bot temporarily deployed to Preview – explorer-storybook May 31, 2023 23:32 Inactive

vercel bot deployed to Preview – offline-signer-helper May 31, 2023 23:32 View deployment

longbowlu changed the title ~~do not update highest_synced_checkpoint when consensus sends a too new checkpoint~~ Simplify handle_checkpoint_from_consensus May 31, 2023

longbowlu marked this pull request as ready for review June 1, 2023 00:01

bmwill approved these changes Jun 1, 2023

View reviewed changes

simplify handle_checkpoint_from_consensus

54d7cb6

longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 8edc1b7 to 54d7cb6 Compare June 1, 2023 19:05

vercel bot temporarily deployed to Preview – wallet-adapter June 1, 2023 19:05 Inactive

vercel bot deployed to Preview – offline-signer-helper June 1, 2023 19:05 View deployment

vercel bot temporarily deployed to Preview – explorer-storybook June 1, 2023 19:05 Inactive

vercel bot deployed to Preview – explorer June 1, 2023 19:06 View deployment

vercel bot temporarily deployed to Preview – sui-wallet-kit June 1, 2023 19:06 Inactive

longbowlu merged commit f298251 into main Jun 1, 2023

longbowlu deleted the do-not-update-highest-synced-checkpoint-in-some-cases branch June 1, 2023 22:16

mwtian added a commit to mwtian/sui that referenced this pull request Jun 2, 2023

Revert "Simplify handle_checkpoint_from_consensus (MystenLabs#12212)"

ad07269

This reverts commit f298251.

mystenmark added a commit that referenced this pull request Jun 2, 2023

Revert "Simplify handle_checkpoint_from_consensus (#12212)"

80099f2

This reverts commit f298251.

longbowlu mentioned this pull request Jun 2, 2023

insert committee when handling checkpoint from consensus #12324

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify `handle_checkpoint_from_consensus` #12212

Simplify `handle_checkpoint_from_consensus` #12212

longbowlu commented May 25, 2023 •

edited

Loading

vercel bot commented May 25, 2023 •

edited

Loading

aschran commented May 30, 2023

longbowlu commented May 30, 2023

mystenmark commented May 30, 2023

mystenmark commented May 30, 2023

longbowlu commented May 31, 2023

bmwill Jun 1, 2023

longbowlu Jun 1, 2023

bmwill Jun 1, 2023

longbowlu Jun 1, 2023

Simplify handle_checkpoint_from_consensus #12212

Simplify handle_checkpoint_from_consensus #12212

Conversation

longbowlu commented May 25, 2023 • edited Loading

Description

Old

Description

Test Plan

Type of Change (Check all that apply)

Release notes

vercel bot commented May 25, 2023 • edited Loading

aschran commented May 30, 2023

longbowlu commented May 30, 2023

mystenmark commented May 30, 2023

mystenmark commented May 30, 2023

longbowlu commented May 31, 2023

bmwill Jun 1, 2023

Choose a reason for hiding this comment

longbowlu Jun 1, 2023

Choose a reason for hiding this comment

bmwill Jun 1, 2023

Choose a reason for hiding this comment

longbowlu Jun 1, 2023

Choose a reason for hiding this comment

Simplify `handle_checkpoint_from_consensus` #12212

Simplify `handle_checkpoint_from_consensus` #12212

longbowlu commented May 25, 2023 •

edited

Loading

vercel bot commented May 25, 2023 •

edited

Loading