Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Oct 5, 2020

Problem

there are various persisted tower assertions which shouldn't be triggered (thanks for reporting, @sakridge )

All of them is related to flushing timing issue or unintended ledger manual manipulation:

FKFE8cSNZ5AzTNq4VsC7y6wEeoYWbzVRfxuzc36Sqr5y validator solana-replay-stage panicked at 'assertion failed: !last_vote_ancestors.contains(&switch_slot)', core/src/consensus.rs:540:17
twP716sMjTFynrZRSCKkHhoeHaQEXnR2bzqDEzuxP81 validator main panicked at 'at least 1 parent slot must be found', core/src/consensus.rs:1094:13

Summary of Changes

Adjust those assertions to accomodate any kind of incomplete unsync between tower <=> blockstore. Also, support a future tower which is descendant of ledger root.

Fixes #13128

context: #10718

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 5, 2020

FYI @carllin

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 7, 2020

FKFE8cSNZ5AzTNq4VsC7y6wEeoYWbzVRfxuzc36Sqr5y validator solana-replay-stage panicked at 'assertion failed: !last_vote_ancestors.contains(&switch_slot)', core/src/consensus.rs:540:17

After some extensitve testing, I've failed to reproduce with #12663. So, I think the pr manages to fix this assertion as well.

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 9, 2020

FKFE8cSNZ5AzTNq4VsC7y6wEeoYWbzVRfxuzc36Sqr5y
validator
solana-replay-stage
panicked at 'assertion failed: !last_vote_ancestors.contains(&switch_slot)', core/src/consensus.rs:540:17

After some extensitve testing, I've failed to reproduce with #12663. So, I think the pr manages to fix this assertion as well.

I've finally managed to produce this: https://github.com/solana-labs/solana/pull/12350/files#r502317201

@codecov
Copy link

codecov bot commented Oct 15, 2020

Codecov Report

Merging #12671 into master will increase coverage by 0.0%.
The diff coverage is 71.1%.

@@           Coverage Diff           @@
##           master   #12671   +/-   ##
=======================================
  Coverage    82.2%    82.2%           
=======================================
  Files         376      376           
  Lines       87180    87214   +34     
=======================================
+ Hits        71665    71707   +42     
+ Misses      15515    15507    -8     

@ryoqun ryoqun force-pushed the persisted-tower-asserts branch 2 times, most recently from 6f32430 to 19c1556 Compare October 19, 2020 15:49
@ryoqun ryoqun force-pushed the persisted-tower-asserts branch from 19c1556 to aed865a Compare October 21, 2020 02:26
@ryoqun ryoqun mentioned this pull request Oct 23, 2020
1 task
@ryoqun ryoqun force-pushed the persisted-tower-asserts branch 2 times, most recently from 977ebc9 to e7e4d44 Compare October 25, 2020 11:45
@ryoqun ryoqun changed the title Adjust persisted tower assertions Fix tower/blockstore unsync due to external causes Oct 25, 2020
@ryoqun ryoqun marked this pull request as ready for review October 25, 2020 12:17
@ryoqun ryoqun force-pushed the persisted-tower-asserts branch from e7e4d44 to 3625843 Compare October 29, 2020 04:56
@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 29, 2020

@carllin Thanks for the review! I think I've addressed all your comments.

// validator SEGV, OS/HW crash, or plain No Free Space FS error.

// When we're in this clause, it basically means validator is badly running
// with a future tower and RE-CREATING a block for one of past slots as the leader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the validator doesn't even have to be leader/recreating a block for this to happen right? For example in this example here: #12671 (comment), validator A doesn't necessarily have to have generated blocks 1-4, those could be any validator's blocks, the key is it will still try to call the switching logic on a previously seen block.

Side note, could we include a simple diagram like the one in #12671 (comment) in the comment? Helps to clarify these points 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the validator doesn't even have to be leader/recreating a block for this to happen right?

You're correct. it seems validator doesn't have to be leader. I got false impression of leader requirement. Because, unstaked validators seldom reached to this closure while testing locally and manually.

Comment updated: 677887a

Now, I improved test_future_tower to test for leader and non-leader to validate this: 7d60a0b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note, could we include a simple diagram like the one in #12671 (comment) in the comment? Helps to clarify these points smiley

I'll do this later: #12671 (comment)

.map(|last_voted_slot| {
let root = self.root();
let empty_ancestors = HashSet::default();
let empty_ancestors_due_to_minor_unsynced_ledger = || {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I like this new structure! I think if possible we should add a simple diagram/explanation to the comment to help clarify these points/add proof to the correctness.

Maybe we can import from here

  1. Safety: Persistent tower #10718 (comment)

  2. Corner case consideration: https://github.com/solana-labs/solana/pull/10718/files#r478740859

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll do this later: #12671 (comment)

carllin
carllin previously approved these changes Oct 29, 2020
Copy link
Contributor

@carllin carllin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with some nits!

@mergify mergify bot dismissed carllin’s stale review October 30, 2020 06:33

Pull request has been modified.

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 30, 2020

@carllin FYI, I'm merging this albeit there is still desirable comment additions and adjustments because I believe there should be no remaining outstanding implementation change and I want to deliver this rather now. Currently, I'm rather busy with inflation inspection/tweak, vote program change, and solana-genesis debugging...

I'll create another further follow-up comment-only pr later after my task queue got shallow. :) Also note that I have already 1 nano bug fix pr and logging cleaning pr relating to tower. So, there are 3 remaining PRs I'd like to create.

@ryoqun ryoqun merged commit 1df15d8 into solana-labs:master Oct 30, 2020
mergify bot pushed a commit that referenced this pull request Oct 30, 2020
* Fix tower/blockstore unsync due to external causes

* Add and clean up long comments

* Clean up test

* Comment about warped_slot_history

* Run test_future_tower with master-only/master-slave

* Update comments about false leader condition

(cherry picked from commit 1df15d8)
@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 30, 2020

@carllin Anyway, thanks so much for careful review!

// at the banking stage: https://github.com/solana-labs/solana/issues/8232)
//
// To be specific, the replay stage is tricked into a false perception where
// last_vote_ancestors is AVAILABLE for descendant-of-`switch_slot`, stale, and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

improve wording here

// following logic
return Err(TowerError::TooOldTower(
if tower_root <= replayed_root {
// Normally, we goes into this clause with possible help of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly with.

mergify bot added a commit that referenced this pull request Oct 30, 2020
* Fix tower/blockstore unsync due to external causes

* Add and clean up long comments

* Clean up test

* Comment about warped_slot_history

* Run test_future_tower with master-only/master-slave

* Update comments about false leader condition

(cherry picked from commit 1df15d8)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
* Fix tower/blockstore unsync due to external causes

* Add and clean up long comments

* Clean up test

* Comment about warped_slot_history

* Run test_future_tower with master-only/master-slave

* Update comments about false leader condition
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
* Fix tower/blockstore unsync due to external causes

* Add and clean up long comments

* Clean up test

* Comment about warped_slot_history

* Run test_future_tower with master-only/master-slave

* Update comments about false leader condition
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
* Fix tower/blockstore unsync due to external causes

* Add and clean up long comments

* Clean up test

* Comment about warped_slot_history

* Run test_future_tower with master-only/master-slave

* Update comments about false leader condition
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Solana validator v1.4.2 is not starting: thread 'main' panicked at 'slot_in_tower(43526861) < checked_slot(43526861)'

2 participants