-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Fix tower/blockstore unsync due to external causes #12671
Conversation
|
FYI @carllin |
After some extensitve testing, I've failed to reproduce with #12663. So, I think the pr manages to fix this assertion as well. |
I've finally managed to produce this: https://github.com/solana-labs/solana/pull/12350/files#r502317201 |
75a6059 to
b5bd6ae
Compare
Codecov Report
@@ Coverage Diff @@
## master #12671 +/- ##
=======================================
Coverage 82.2% 82.2%
=======================================
Files 376 376
Lines 87180 87214 +34
=======================================
+ Hits 71665 71707 +42
+ Misses 15515 15507 -8 |
6f32430 to
19c1556
Compare
19c1556 to
aed865a
Compare
977ebc9 to
e7e4d44
Compare
e7e4d44 to
3625843
Compare
|
@carllin Thanks for the review! I think I've addressed all your comments. |
core/src/consensus.rs
Outdated
| // validator SEGV, OS/HW crash, or plain No Free Space FS error. | ||
|
|
||
| // When we're in this clause, it basically means validator is badly running | ||
| // with a future tower and RE-CREATING a block for one of past slots as the leader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, the validator doesn't even have to be leader/recreating a block for this to happen right? For example in this example here: #12671 (comment), validator A doesn't necessarily have to have generated blocks 1-4, those could be any validator's blocks, the key is it will still try to call the switching logic on a previously seen block.
Side note, could we include a simple diagram like the one in #12671 (comment) in the comment? Helps to clarify these points 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the validator doesn't even have to be leader/recreating a block for this to happen right?
You're correct. it seems validator doesn't have to be leader. I got false impression of leader requirement. Because, unstaked validators seldom reached to this closure while testing locally and manually.
Comment updated: 677887a
Now, I improved test_future_tower to test for leader and non-leader to validate this: 7d60a0b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note, could we include a simple diagram like the one in #12671 (comment) in the comment? Helps to clarify these points smiley
I'll do this later: #12671 (comment)
| .map(|last_voted_slot| { | ||
| let root = self.root(); | ||
| let empty_ancestors = HashSet::default(); | ||
| let empty_ancestors_due_to_minor_unsynced_ledger = || { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, I like this new structure! I think if possible we should add a simple diagram/explanation to the comment to help clarify these points/add proof to the correctness.
Maybe we can import from here
-
Corner case consideration: https://github.com/solana-labs/solana/pull/10718/files#r478740859
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll do this later: #12671 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with some nits!
|
@carllin FYI, I'm merging this albeit there is still desirable comment additions and adjustments because I believe there should be no remaining outstanding implementation change and I want to deliver this rather now. Currently, I'm rather busy with inflation inspection/tweak, vote program change, and I'll create another further follow-up comment-only pr later after my task queue got shallow. :) Also note that I have already 1 nano bug fix pr and logging cleaning pr relating to tower. So, there are 3 remaining PRs I'd like to create. |
* Fix tower/blockstore unsync due to external causes * Add and clean up long comments * Clean up test * Comment about warped_slot_history * Run test_future_tower with master-only/master-slave * Update comments about false leader condition (cherry picked from commit 1df15d8)
|
@carllin Anyway, thanks so much for careful review! |
| // at the banking stage: https://github.com/solana-labs/solana/issues/8232) | ||
| // | ||
| // To be specific, the replay stage is tricked into a false perception where | ||
| // last_vote_ancestors is AVAILABLE for descendant-of-`switch_slot`, stale, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improve wording here
| // following logic | ||
| return Err(TowerError::TooOldTower( | ||
| if tower_root <= replayed_root { | ||
| // Normally, we goes into this clause with possible help of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly with.
* Fix tower/blockstore unsync due to external causes * Add and clean up long comments * Clean up test * Comment about warped_slot_history * Run test_future_tower with master-only/master-slave * Update comments about false leader condition (cherry picked from commit 1df15d8) Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
* Fix tower/blockstore unsync due to external causes * Add and clean up long comments * Clean up test * Comment about warped_slot_history * Run test_future_tower with master-only/master-slave * Update comments about false leader condition
* Fix tower/blockstore unsync due to external causes * Add and clean up long comments * Clean up test * Comment about warped_slot_history * Run test_future_tower with master-only/master-slave * Update comments about false leader condition
* Fix tower/blockstore unsync due to external causes * Add and clean up long comments * Clean up test * Comment about warped_slot_history * Run test_future_tower with master-only/master-slave * Update comments about false leader condition
Problem
there are various persisted tower assertions which shouldn't be triggered (thanks for reporting, @sakridge )
All of them is related to flushing timing issue or unintended ledger manual manipulation:
Summary of Changes
Adjust those assertions to accomodate any kind of incomplete unsync between tower <=> blockstore. Also, support a future tower which is descendant of ledger root.
Fixes #13128
context: #10718