Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Oct 14, 2020

Problem

When writing test for #12350, debugging was so painful that I've added some logs here and there.

Summary of Changes

All of this is just some logging improvements. I separated this form the original pr to make logically separated prs independent.

There is no functional change.

Fixes #

NoSwitch,
FailedSwitchThreshold,
SameFork,
FailedSwitchThreshold(u64, u64),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's start passing around locked_out_stake and total_stake for logging. I found this very useful to debug.

// (newer snapshot, or only a saved tower is moved over to new setup?)
// compare slots not to error! just because of newer snapshots
if self.is_first_switch_check() && switch_slot < last_voted_slot {
error!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't error! here at #10718 is that doing so printed so many duplicate errors because check_switch_threshold is called frequently. Thanks to is_first_switch_check() (and last_switch_threshold_check behind it) this isn't true anymore.

SwitchForkDecision::FailedSwitchThreshold => None,
SwitchForkDecision::NoSwitch => Some(vote_instruction::vote(
SwitchForkDecision::FailedSwitchThreshold(_, total_stake) => {
assert_ne!(*total_stake, 0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

place some sanity check as a bonus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that, this assumes there is no way to control total_stake remotely. So there should be no risk of dos.

);
let new_check = Some((switch_slot, decision.clone()));
if new_check != self.last_switch_threshold_check {
trace!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to trace! this. But because of very frequent call of check_switch_threshold with same arguments with same check result, I have to implement this very simple cache for old new_check.

@ryoqun ryoqun marked this pull request as ready for review October 14, 2020 09:48
@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 14, 2020

@carllin Could you review this? (Cool, it seems that you've always reading this pr..) This blocks #12671 and #12350. So I rather want to merge this quickly. :)

@ryoqun ryoqun added the v1.4 label Oct 14, 2020
@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #12875 into master will decrease coverage by 0.0%.
The diff coverage is 72.3%.

@@            Coverage Diff            @@
##           master   #12875     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         360      360             
  Lines       84925    84951     +26     
=========================================
  Hits        69580    69580             
- Misses      15345    15371     +26     

@ryoqun ryoqun requested a review from carllin October 15, 2020 04:24
Comment on lines +533 to +534
// This condition shouldn't occur under normal validator operation, indicating
// something unusual happened.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, this is the well-written proof of this:

pleb:
This is my line of reasoning/understanding of why that log shouldn't ever trigger under normal operation.

https://github.com/solana-labs/solana/pull/12875/files#diff-0b654209e209924bfc70e8dd4c84d0979e36b6ca76e85a5899ba4572521a8170R550, so here: last_voted_slot is a stray with no ancestors.

So assume last_voted_slot is a stray with no ancestors. Assume that when the vote for last_voted_slot was made, the snapshot was S_old. So when we restart from a snapshot S:

  1. S we start from is > last_voted_slot. But then the warning added in that PR will not trigger because switch_slot > S > last_voted_slot
  2. If the snapshot S <= last_voted_slot then
    a) if S >= S_old is the latest snapshot across all restarts, then you're guaranteed to have ancestors (unless something went wrong) because last_voted_slot is descended from S and the ledger should contain all the descendants of S.
    b) If S < S_old, the do_adjust_lockouts_after_replay() should panic because it can't find any anchored slots

So in all cases under normal operation, that log shouldn't tigger, is that right?

And this proof is correct as far as I peer-reviewed this.

@ryoqun ryoqun merged commit a44e4d3 into solana-labs:master Oct 15, 2020
mergify bot pushed a commit that referenced this pull request Oct 15, 2020
* Better tower logs for SwitchForkDecision and etc

* nits

* Update comment

(cherry picked from commit a44e4d3)
mergify bot added a commit that referenced this pull request Oct 15, 2020
* Better tower logs for SwitchForkDecision and etc

* nits

* Update comment

(cherry picked from commit a44e4d3)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
carllin pushed a commit to carllin/solana that referenced this pull request Oct 17, 2020
* Better tower logs for SwitchForkDecision and etc

* nits

* Update comment
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
* Better tower logs for SwitchForkDecision and etc

* nits

* Update comment
CriesofCarrots pushed a commit to CriesofCarrots/solana that referenced this pull request Dec 4, 2020
* Better tower logs for SwitchForkDecision and etc

* nits

* Update comment
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants