-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Better tower logs for SwitchForkDecision and etc #12875
Conversation
| NoSwitch, | ||
| FailedSwitchThreshold, | ||
| SameFork, | ||
| FailedSwitchThreshold(u64, u64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's start passing around locked_out_stake and total_stake for logging. I found this very useful to debug.
core/src/consensus.rs
Outdated
| // (newer snapshot, or only a saved tower is moved over to new setup?) | ||
| // compare slots not to error! just because of newer snapshots | ||
| if self.is_first_switch_check() && switch_slot < last_voted_slot { | ||
| error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't error! here at #10718 is that doing so printed so many duplicate errors because check_switch_threshold is called frequently. Thanks to is_first_switch_check() (and last_switch_threshold_check behind it) this isn't true anymore.
| SwitchForkDecision::FailedSwitchThreshold => None, | ||
| SwitchForkDecision::NoSwitch => Some(vote_instruction::vote( | ||
| SwitchForkDecision::FailedSwitchThreshold(_, total_stake) => { | ||
| assert_ne!(*total_stake, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
place some sanity check as a bonus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that, this assumes there is no way to control total_stake remotely. So there should be no risk of dos.
| ); | ||
| let new_check = Some((switch_slot, decision.clone())); | ||
| if new_check != self.last_switch_threshold_check { | ||
| trace!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted to trace! this. But because of very frequent call of check_switch_threshold with same arguments with same check result, I have to implement this very simple cache for old new_check.
Codecov Report
|
| // This condition shouldn't occur under normal validator operation, indicating | ||
| // something unusual happened. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, this is the well-written proof of this:
pleb:
This is my line of reasoning/understanding of why that log shouldn't ever trigger under normal operation.https://github.com/solana-labs/solana/pull/12875/files#diff-0b654209e209924bfc70e8dd4c84d0979e36b6ca76e85a5899ba4572521a8170R550, so here: last_voted_slot is a stray with no ancestors.
So assume last_voted_slot is a stray with no ancestors. Assume that when the vote for last_voted_slot was made, the snapshot was S_old. So when we restart from a snapshot S:
- S we start from is > last_voted_slot. But then the warning added in that PR will not trigger because switch_slot > S > last_voted_slot
- If the snapshot S <= last_voted_slot then
a) if S >= S_old is the latest snapshot across all restarts, then you're guaranteed to have ancestors (unless something went wrong) because last_voted_slot is descended from S and the ledger should contain all the descendants of S.
b) If S < S_old, the do_adjust_lockouts_after_replay() should panic because it can't find any anchored slotsSo in all cases under normal operation, that log shouldn't tigger, is that right?
And this proof is correct as far as I peer-reviewed this.
* Better tower logs for SwitchForkDecision and etc * nits * Update comment (cherry picked from commit a44e4d3)
* Better tower logs for SwitchForkDecision and etc * nits * Update comment
* Better tower logs for SwitchForkDecision and etc * nits * Update comment
* Better tower logs for SwitchForkDecision and etc * nits * Update comment
Problem
When writing test for #12350, debugging was so painful that I've added some logs here and there.
Summary of Changes
All of this is just some logging improvements. I separated this form the original pr to make logically separated prs independent.
There is no functional change.
Fixes #