
Simplify handle_checkpoint_from_consensus #12212

Merged

Conversation

longbowlu
Contributor

@longbowlu longbowlu commented May 25, 2023

Description

Based on feedback, this PR simplifies handle_checkpoint_from_consensus. Previously there was a branch to handle the case where the checkpoint is too new (newer than the highest_verified_checkpoint). I don't think this case can happen, because consensus sends checkpoints in order. So we consolidate the two cases into one: whenever consensus sends a checkpoint, we do the following (a rough sketch appears after the list):

  1. Check that the previous checkpoint's digest matches what we already hold. This ensures state sync and consensus have not split-brained.
  2. If the checkpoint is older than the current watermark, ignore it and return.
  3. Update metrics and bump the watermarks.
  4. Notify others that a new checkpoint is available.
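
A minimal, self-contained sketch of this flow is below. The types and field names are simplified stand-ins for illustration, not the actual Sui code:

```
use std::collections::BTreeMap;

// Simplified stand-in types; only the order of the four steps matters here.
struct Checkpoint {
    sequence_number: u64,
    digest: [u8; 32],
    previous_digest: Option<[u8; 32]>,
}

struct StateSync {
    certified: BTreeMap<u64, Checkpoint>, // checkpoints we already hold
    highest_verified: u64,
    highest_synced: u64,
}

impl StateSync {
    fn handle_checkpoint_from_consensus(&mut self, checkpoint: Checkpoint) {
        // 1. The previous checkpoint's digest must match what we already hold,
        //    otherwise consensus and state sync have diverged.
        if let Some(prev_digest) = checkpoint.previous_digest {
            let prev = self
                .certified
                .get(&(checkpoint.sequence_number - 1))
                .expect("previous checkpoint must already be certified");
            assert_eq!(prev.digest, prev_digest, "fork between consensus and state sync");
        }

        // 2. Older (already-processed) checkpoints are ignored.
        if checkpoint.sequence_number <= self.highest_verified {
            return;
        }

        // 3. Consensus delivers checkpoints in order, so both watermarks can be
        //    bumped directly, with no extra DB lookups.
        self.highest_verified = checkpoint.sequence_number;
        self.highest_synced = checkpoint.sequence_number;
        self.certified.insert(checkpoint.sequence_number, checkpoint);

        // 4. In the real code this is also where metrics are updated and
        //    subscribers are notified of the new checkpoint.
    }
}
```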

Old

Description

When consensus sends checkpoint y, it means all prior checkpoints have been verified and sent, so highest_verified_checkpoint can be safely updated without looking at the checkpoint store. However, it's trickier for highest_synced_checkpoint, because there are a few column families (e.g. full_checkpoint_contents) that are only written when state sync fetches checkpoint contents. If we bump this watermark anyway, those column families may never be filled at all.

Test Plan

CI


If your changes are not user-facing and not a breaking change, you can skip the following section. Otherwise, please indicate what changed, and then add to the Release Notes section as highlighted during the release process.

Type of Change (Check all that apply)

  • protocol change
  • user-visible impact
  • breaking change for client SDKs
  • breaking change for FNs (FN binary must upgrade)
  • breaking change for validators or node operators (must upgrade binaries)
  • breaking change for on-chain data layout
  • necessitate either a data wipe or data migration

Release notes

Do not update highest_synced_checkpoint when consensus sends a too new checkpoint


Base automatically changed from do-not-insert-checkpoint-when-get_checkpoint_by_sequence_number-returns-data to main May 30, 2023 06:55
@longbowlu longbowlu changed the title do not update highest_synced_checkpoint in some cases do not update highest_synced_checkpoint when consensus sends a too new checkpoint May 30, 2023
@longbowlu longbowlu marked this pull request as ready for review May 30, 2023 18:02
@aschran
Contributor

aschran commented May 30, 2023

@longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.

@longbowlu
Contributor Author

> @longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.

If we don't think full_checkpoint_contents needs to be filled, then we can bump highest_synced_checkpoint directly too. The current code looks into the DB, which is not needed.

Stepping back, it may be best-effort, but I think having it is much better than doing two extra DB lookups on every call. We may be able to fix it by letting consensus insert into that table as well.

@mystenmark
Contributor

> @longbowlu Wondering if you can provide more detail on what this is fixing? full_checkpoint_contents was always meant as a best-effort source. i.e. "use it if you have it, otherwise fall back to individual tx lookups". So it's not clear to me why we want to design the watermarks around whether it would be available or not? I think there is probably some nuance I am missing.
>
> If we don't think full_checkpoint_contents needs to be filled, then we can bump highest_synced_checkpoint directly too. The current code looks into the DB, which is not needed.
>
> Stepping back, it may be best-effort, but I think having it is much better than doing two extra DB lookups on every call. We may be able to fix it by letting consensus insert into that table as well.

I think we don't want to insert into this table in the consensus path - the purpose of this table was to make it faster for fullnodes to read old, uncached checkpoints, both for execution and for serving via state sync. On validators we can have a much stronger assumption of temporal locality - they should only be reading/serving checkpoints that were created very recently and are still in cache.

@mystenmark
Contributor

Also, I don't think full_checkpoint_contents needs to be considered as far as watermarks go. The watermark says that the checkpoint contents are available. It doesn't say from which table it is available. Whenever we attempt to read a checkpoint, we try full_checkpoint_contents first, and fall back to scattered reads if it is unavailable.
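
For context, the read-side fallback described above might look roughly like this. The types and table names are simplified stand-ins, not the actual Sui storage API:

```
use std::collections::HashMap;

type TransactionDigest = [u8; 32];

#[derive(Clone)]
struct Transaction;

// Simplified stand-ins for the relevant tables.
struct CheckpointStore {
    // Best-effort, pre-assembled contents keyed by checkpoint sequence number.
    full_checkpoint_contents: HashMap<u64, Vec<Transaction>>,
    // The digests making up each checkpoint, and the authoritative per-tx table.
    contents_digests: HashMap<u64, Vec<TransactionDigest>>,
    transactions: HashMap<TransactionDigest, Transaction>,
}

impl CheckpointStore {
    fn read_checkpoint_contents(&self, seq: u64) -> Option<Vec<Transaction>> {
        // Fast path: use full_checkpoint_contents if it was ever written.
        if let Some(contents) = self.full_checkpoint_contents.get(&seq) {
            return Some(contents.clone());
        }
        // Fallback: scattered reads, one per transaction digest.
        let digests = self.contents_digests.get(&seq)?;
        digests
            .iter()
            .map(|d| self.transactions.get(d).cloned())
            .collect()
    }
}
```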

@longbowlu longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 691f44d to 1ea422d Compare May 31, 2023 03:22
@longbowlu longbowlu changed the base branch from main to releases/sui-v1.2.0-release May 31, 2023 03:26
@longbowlu longbowlu changed the base branch from releases/sui-v1.2.0-release to main May 31, 2023 03:26
@longbowlu longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 1ea422d to 89d6407 Compare May 31, 2023 03:43
@longbowlu
Contributor Author

updated the code based on feedback: now we directly bump both the highest_verified and highest_synced watermarks, but skip the DB lookup.

@longbowlu longbowlu marked this pull request as draft May 31, 2023 04:07
@longbowlu longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 89d6407 to 8edc1b7 Compare May 31, 2023 23:31
@longbowlu longbowlu changed the title do not update highest_synced_checkpoint when consensus sends a too new checkpoint Simplify handle_checkpoint_from_consensus May 31, 2023
@longbowlu longbowlu marked this pull request as ready for review June 1, 2023 00:01
Comment on lines +447 to +449
let prev_digest = *self.store.get_checkpoint_by_sequence_number(checkpoint.sequence_number() - 1)
    .expect("store operation should not fail")
    .unwrap_or_else(|| panic!("Got checkpoint {} from consensus but cannot find checkpoint {} in certified_checkpoints", checkpoint.sequence_number(), checkpoint.sequence_number() - 1))
Contributor

IIRC this should be fine even with the genesis checkpoint, since consensus won't send us the genesis checkpoint, right? If not, I'm sure we could tweak this to account for checkpoint 0.

Contributor Author

yeah, I believe so; consensus won't send state sync the genesis checkpoint because it's already certified
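
If that assumption ever changed, a hypothetical tweak (not part of this PR, with stand-in names and a stand-in digest type) could simply skip the predecessor check for checkpoint 0:

```
// Hypothetical guard: only look up and compare against a predecessor when one
// exists; sequence number 0 (genesis) has no predecessor to check.
fn check_previous_digest(
    seq: u64,
    expected_prev_digest: Option<[u8; 32]>,
    lookup_digest: impl Fn(u64) -> [u8; 32],
) {
    if let Some(prev_seq) = seq.checked_sub(1) {
        let expected = expected_prev_digest
            .expect("non-genesis checkpoint must carry a previous digest");
        assert_eq!(lookup_digest(prev_seq), expected);
    }
}
```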

Comment on lines +467 to +473
if *checkpoint.sequence_number() > next_sequence_number {
    debug!(
        "consensus sent too new of a checkpoint, expecting: {}, got: {}",
        next_sequence_number,
        checkpoint.sequence_number()
    );
}
Contributor

Your PR description and the comment below mention that this shouldn't happen, right?

Contributor Author

right, being a bit defensive here because there could be bugs somewhere that we've overlooked

@longbowlu longbowlu force-pushed the do-not-update-highest-synced-checkpoint-in-some-cases branch from 8edc1b7 to 54d7cb6 Compare June 1, 2023 19:05
@longbowlu longbowlu merged commit f298251 into main Jun 1, 2023
@longbowlu longbowlu deleted the do-not-update-highest-synced-checkpoint-in-some-cases branch June 1, 2023 22:16
mwtian added a commit to mwtian/sui that referenced this pull request Jun 2, 2023
mystenmark added a commit that referenced this pull request Jun 2, 2023
longbowlu added a commit that referenced this pull request Jun 3, 2023
## Description 

In #12212, we refactored
`handle_checkpoint_from_consensus`. The PR missed one thing: when the last
checkpoint of an epoch is certified by consensus and sent to state sync,
we do not update the committee store. This is problematic because state
sync does not go through reconfiguration. It means that if a validator is
slow to reconfigure (which is when the committee store would otherwise be
updated) and receives checkpoints for the next epoch from peers, it will
panic because it cannot find the committee. This PR fixes that.
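
A minimal sketch of that idea, using simplified stand-in types rather than the real Sui `Committee`/`EndOfEpochData` structures:

```
use std::collections::BTreeMap;

#[derive(Clone)]
struct Committee {
    epoch: u64,
    voting_rights: BTreeMap<String, u64>, // validator name -> stake, simplified
}

struct EndOfEpochData {
    next_epoch_committee: Committee,
}

struct Checkpoint {
    epoch: u64,
    end_of_epoch_data: Option<EndOfEpochData>,
}

struct CommitteeStore {
    committees: BTreeMap<u64, Committee>,
}

impl CommitteeStore {
    // When the last checkpoint of an epoch arrives from consensus, record the
    // next epoch's committee so that checkpoints received from peers for epoch
    // N + 1 can be verified even before local reconfiguration has finished.
    fn insert_committee_from_checkpoint(&mut self, checkpoint: &Checkpoint) {
        if let Some(end_of_epoch) = &checkpoint.end_of_epoch_data {
            self.committees
                .entry(checkpoint.epoch + 1)
                .or_insert_with(|| end_of_epoch.next_epoch_committee.clone());
        }
    }
}
```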


## Test Plan 

```
for f in `seq 1 20`; do MSIM_TEST_NUM=5 MSIM_TEST_SEED=98$f RUST_LOG=off simt test_simulated_load_reconfig_restarts --no-capture > log-$f 2>&1 &; done
```

```
grep FAIL log-*
```
---
If your changes are not user-facing and not a breaking change, you can
skip the following section. Otherwise, please indicate what changed, and
then add to the Release Notes section as highlighted during the release
process.

### Type of Change (Check all that apply)

- [ ] protocol change
- [ ] user-visible impact
- [ ] breaking change for client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade
binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration

### Release notes
Insert committee when handling checkpoint from consensus.