Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

@ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Oct 5, 2020

Problem

When a booting validator is fetching snapshot from the cluster, it's possible to fetch older snapshot than local...
Depending on gossip status, it's possible...

If that's the case, the persisted tower ultimately vomits a panic because it now can detect root got warped to the past.

Summary of Changes

Don't ever download older snapshot.

Fixes #12591

@ryoqun ryoqun requested a review from carllin October 5, 2020 03:36
&eligible_rpc_peers[thread_rng().gen_range(0, eligible_rpc_peers.len())];
return (contact_info.clone(), highest_snapshot_hash);
} else {
retry_reason = Some("No snapshots available".to_owned());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more clearer place to put this message in terms of control flow..

Comment on lines 272 to 278
if eligible_rpc_peers.is_empty() {
retry_reason = Some(format!(
"Wait for newer snapshot than local: {:?}",
highest_snapshot_hash
));
continue;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍖

Copy link
Contributor

@carllin carllin Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If local has a newer snapshot why don't we just use it.

We can do this generically when trusted validators exist: if all available snapshots from trusted validators are older than what local, use local. There could even be a new cli argument, --use-local-snapshot-if-newer-than <NUMBER_OF_SLOTS> that would use a local snapshot if it's at most NUMBER_OF_SLOTS slots behind the newest trusted validator. Default value of NUMBER_OF_SLOTS is maybe 100-500.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin

Nit: I this log message doesn't account for:

Well, I misread the code too but these continues are for the inner loop: https://github.com/solana-labs/solana/blob/master/validator/src/main.rs#L226

Copy link
Contributor Author

@ryoqun ryoqun Oct 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If local has a newer snapshot why don't we just use it.

The main reason is that original snapshot RPC selection logic doesn't wait for the full set of specified rpc trusted nodes to be available. So, we can't determine that predicate to begin with.

So, if a booting validator sees a subset of trusted rpc nodes with older snapshot, it considers the snapshot to be the newest one the cluster has to offer and fetches it.

Definitely, we can wait to determine if all available snapshots from trusted validators are older than what local, ..., but this requires 100% liveness of trusted validators. Given two or three trusted nodes for mainnet-beta/testnet, using 90% or 80% liveness threshold wouldn't quite work well.

Also, it's operationally rare to have newer local snapshot assuming solid setup uses --no-snapshot-fetch anyway and/or the connected cluster aren't making newer snapshots.

So this new change is rather for corner-casing. Also it's for development purpose, where frequent and forcible restarts are rampant. ;)

So, I opted for simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'll consider --use-local-snapshot-if-newer-than for a bit...

Copy link
Contributor Author

@ryoqun ryoqun Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--use-local-snapshot-if-newer-than <NUMBER_OF_SLOTS> that would use a local snapshot if it's at most NUMBER_OF_SLOTS slots behind the newest trusted validator. Default value of NUMBER_OF_SLOTS is maybe 100-500.

Well, sadly supporting this takes a bit of effort than just plumbling (reasons below). How about merging this pr first? So that v1.4 (=~ testnet) doesn't suffer this known persistent-tower bug? (Heh, I was completely unaware of this snapshot-fetch-related failure scenario...).

The detailed reason of more efforts:

First, persistent tower never supports older root bank than voted banks to prevent double vote and other general slashing events. So, we can't use local snapshots even if it's just slightly older (practically, like 100-500) while voting is enabled. (If we really want to relax this restriction, I think we have to adjust persistent tower and general voting behavior; but I don't think this worth to do.).

Persistent tower always assumes local availability of voted slots which aren't isolated like snapshot's root banks. The rationale is that you should have the slots and have replayed them if you voted on them (duh).

The only exception is to start over newer snapshots. This is for some disk failure or corrupted rocksdb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvines Could you check my previous comment?

Especially,

Well, sadly supporting this takes a bit of effort than just plumbling (reasons below). How about merging this pr first? So that v1.4 (=~ testnet) doesn't suffer this known persistent-tower bug? (Heh, I was completely unaware of this snapshot-fetch-related failure scenario...).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. This gets us is more in the right direction


let mut highest_snapshot_hash: Option<(Slot, Hash)> = None;
let mut highest_snapshot_hash: Option<(Slot, Hash)> =
get_highest_snapshot_archive_path(ledger_path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩

retry_reason
.as_ref()
.map(|s| format!(" (Retrying: {})", s))
.unwrap_or_default()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly improved logging to indicate retries more clearly.

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 5, 2020

And sadly, there is no tests...

@codecov
Copy link

codecov bot commented Oct 5, 2020

Codecov Report

Merging #12663 into master will decrease coverage by 0.0%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master   #12663     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         359      359             
  Lines       83989    83989             
=========================================
- Hits        68869    68857     -12     
- Misses      15120    15132     +12     

@ryoqun ryoqun added the v1.3 label Oct 5, 2020
@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 5, 2020

Although, v1.3 doesn't cause persistent-tower-triggered panics because of this bug, I'm going to backport this there anyway because this is still a correctness error in the v1.3 as well even if v1.3 doesn't implement persistent tower.

carllin
carllin previously approved these changes Oct 5, 2020
Copy link
Contributor

@carllin carllin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

@mergify mergify bot dismissed carllin’s stale review October 6, 2020 04:54

Pull request has been modified.

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 7, 2020

ryoqun added the v1.3 label 2 days ago

I'll wait this for a while. I'll backport this after the mainnet-beta validators have fully transitioned to the v1.3.

@ryoqun
Copy link
Contributor Author

ryoqun commented Oct 8, 2020

ryoqun added the v1.3 label 2 days ago

I'll wait this for a while. I'll backport this after the mainnet-beta validators have fully transitioned to the v1.3.

Condition cleared! :)

@mvines mvines added this to the v1.4.0 milestone Oct 8, 2020
@ryoqun ryoqun merged commit 81489cc into solana-labs:master Oct 9, 2020
mergify bot pushed a commit that referenced this pull request Oct 9, 2020
* Only fetch snapshot if it's newer than local

* Prefer as_ref over clone

* More nits

* Don't wait forwever for newer snapshot

(cherry picked from commit 81489cc)
mergify bot pushed a commit that referenced this pull request Oct 9, 2020
* Only fetch snapshot if it's newer than local

* Prefer as_ref over clone

* More nits

* Don't wait forwever for newer snapshot

(cherry picked from commit 81489cc)
mergify bot added a commit that referenced this pull request Oct 9, 2020
* Only fetch snapshot if it's newer than local

* Prefer as_ref over clone

* More nits

* Don't wait forwever for newer snapshot

(cherry picked from commit 81489cc)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
mergify bot added a commit that referenced this pull request Oct 9, 2020
* Only fetch snapshot if it's newer than local

* Prefer as_ref over clone

* More nits

* Don't wait forwever for newer snapshot

(cherry picked from commit 81489cc)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash from tower root assertion on validator startup

3 participants