Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Oct 24, 2020

Problem

Some persisted tower related assertion incorrectly causes panics.

Currently, we incorrectly don't allow saved tower where root_slot and the first slot in the lockouts is same.
For normal vote accounts on the chain, this cannot be possible to occur.
But for tower, it's possible in this following scenario:

  1. a voting validator starts for the first time with new identity, booting from a snapshot (say, at slot 123) with empty ledger and creating a new vote account and a tower.
  2. the validator block processor finishes at the slot 123 because it has no ledger (this is also the root in the brand-new tower).
  3. the validator goes full-functioning and starts to vote and saves the tower with (root=123, and slot=123) and onwards (slot=124, 125, 126, ...).
  4. the validator is stopped before it votes enough to update the root in tower (i.e. less than 32 votes)
  5. the validator is started, restoring the saved tower.
  6. the panic occurs.

initial tower:

[2020-10-24T18:00:12.438130961Z INFO  solana_core::validator] Ignoring expected failed tower restore because this is the initial validator start with the vote account...
[2020-10-24T18:00:12.438151868Z INFO  solana_core::consensus] vote account(GtgBX5drw33wm5UdDdgwAQscFqkWYv6sj4z83bnuBobv) not found in bank (slot=1203)
[2020-10-24T18:00:12.438160385Z INFO  solana_core::validator] Tower state: Tower { node_pubkey: GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif, threshold_depth: 8, threshold_size: 0.6666666666666666, lockouts: VoteState { node_pubkey: 11111111111111111111111111111111, authorized_withdrawer: 11111111111111111111111111111111, commission: 0, votes: [], root_slot: Some(1203), authorized_voters: AuthorizedVoters { authorized_voters: {} }, prior_voters: CircBuf { buf: [(11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0)], idx: 31, is_empty: true }, epoch_credits: [], last_timestamp: BlockTimestamp { slot: 0, timestamp: 0 } }, last_vote: Vote { slots: [], hash: 11111111111111111111111111111111, timestamp: None }, last_timestamp: BlockTimestamp { slot: 0, timestamp: 0 }, path: "/home/ryoqun/work/solana/solana/root-assert-tower-error/tower-GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif.bin", tmp_path: "/home/ryoqun/work/solana/solana/root-assert-tower-error/tower-GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif.bin.new", stray_restored_slot: None, last_switch_threshold_check: None }
[2020-10-24T18:00:12.438203414Z INFO  solana_core::validator] Starting validator with working bank slot 1203

on validator restart (before this pr):

[2020-10-24T18:00:17.812515569Z INFO  solana_ledger::blockstore_processor] ledger processed in 68ms. 23 MB allocated. root=1203, 1 fork at 1232, with 30 frozen banks
[2020-10-24T18:00:17.812909139Z INFO  solana_core::consensus] adjusting lockouts (after replay up to 1203): [1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230] tower root: 1203 replayed root: 1203
thread 'main' panicked at 'slot_in_tower(1203) < checked_slot(1203)', core/src/consensus.rs:931:21
stack backtrace:

on validator restarrt (after this pr):

[2020-10-24T18:15:01.469104441Z INFO  solana_ledger::blockstore_processor] ledger processed in 0ms. 0 MB allocated. root=1603, 1 fork at 1603, with 1 frozen bank
[2020-10-24T18:15:01.469630231Z INFO  solana_core::consensus] adjusting lockouts (after replay up to 1603): [1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230] tower root: 1203 replayed root: 1603
[2020-10-24T18:15:01.469646407Z INFO  solana_core::consensus] adjusted tower's anchored slot: Some(1230)
[2020-10-24T18:15:01.469651590Z INFO  solana_core::consensus] All restored votes were behind; resetting root_slot and last_vote in tower!
[2020-10-24T18:15:01.469656690Z INFO  solana_core::validator] Tower state: Tower { node_pubkey: GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif, threshold_depth: 8, threshold_size: 0.6666666666666666, lockouts: VoteState { node_pubkey: 11111111111111111111111111111111, authorized_withdrawer: 11111111111111111111111111111111, commission: 0, votes: [], root_slot: Some(1603), authorized_voters: AuthorizedVoters { authorized_voters: {} }, prior_voters: CircBuf { buf: [(11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0), (11111111111111111111111111111111, 0, 0)], idx: 31, is_empty: true }, epoch_credits: [], last_timestamp: BlockTimestamp { slot: 0, timestamp: 0 } }, last_vote: Vote { slots: [], hash: 11111111111111111111111111111111, timestamp: None }, last_timestamp: BlockTimestamp { slot: 1230, timestamp: 1603562413 }, path: "/home/ryoqun/work/solana/solana/root-assert-tower-error/tower-GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif.bin", tmp_path: "/home/ryoqun/work/solana/solana/root-assert-tower-error/tower-GkscD4UDMkFo1isx6QL88mfdmFcB4rqhZfDwZAS7ckif.bin.new", stray_restored_slot: None, last_switch_threshold_check: None }

Well, I made the assertion too strict because of my misunderstanding....

And this is the example panic.

BarcaR5KvTdESod5UZTbPVdcjZD2ZSLFMQws3N5Gsqdp validator main panicked at 'assertion failed: *slot_in_tower < checked_slot', core/src/consensus.rs:865:21

Summary of Changes

Simply allow such a saved tower because this is completely sane behavior of validator's voting.

Fixes #13128

Context: #10718

This is spin-off of #12671

@ryoqun ryoqun added the v1.4 label Oct 24, 2020
@ryoqun ryoqun requested a review from carllin October 24, 2020 18:51
Comment on lines -924 to -928
// This is really special, only if tower is initialized (root = slot 0) for genesis and contains
// a vote (= slot 0) for the genesis, the slot 0 can repeat only once
let voting_from_genesis = *slot_in_tower == checked_slot && *slot_in_tower == 0;
// This is really special, only if tower is initialized and contains
// a vote for the root, the root slot can repeat only once
let voting_for_root =
*slot_in_tower == checked_slot && *slot_in_tower == tower_root;

if !voting_from_genesis {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering this false panic, it's so tempting to just give up this asserting entirely and call self.initialize_lockouts(|v| v.slot > tower_root); and get done by forcibly erasing any odd state. But, I rather want to keep this assertion by tweaking like this.

This is to catch a really ill-state of badly ordered vote account / tower if it ever happens in the future.

@codecov
Copy link

codecov bot commented Oct 24, 2020

Codecov Report

Merging #13135 into master will decrease coverage by 0.2%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##           master   #13135     +/-   ##
=========================================
- Coverage    82.1%    81.9%   -0.3%     
=========================================
  Files         372      396     +24     
  Lines       86530    87057    +527     
  Branches        0      203    +203     
=========================================
+ Hits        71085    71319    +234     
- Misses      15445    15689    +244     
- Partials        0       49     +49     

@carllin
Copy link
Contributor

carllin commented Oct 26, 2020

good find!

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 26, 2020

buildkite/solana/downstream-projects — Failed (exit status 1)

I'm merging this in spite of the quoted failing build. I'm sure this isn't related to this pr.

@ryoqun ryoqun merged commit 66c7a98 into solana-labs:master Oct 26, 2020
mergify bot pushed a commit that referenced this pull request Oct 26, 2020
ryoqun added a commit that referenced this pull request Oct 26, 2020
(cherry picked from commit 66c7a98)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Solana validator v1.4.2 is not starting: thread 'main' panicked at 'slot_in_tower(43526861) < checked_slot(43526861)'

2 participants