Only fetch snapshot if it's newer than local #12663

ryoqun · 2020-10-05T03:36:29Z

Problem

When a booting validator fetches a snapshot from the cluster, it can end up fetching a snapshot older than the local one, depending on gossip status.

If that happens, the persisted tower ultimately panics, because it can now detect that the root got warped into the past.

Summary of Changes

Never download a snapshot older than the local one.

Fixes #12591
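
As a rough illustration of the rule (not the actual code in validator/src/main.rs), the selection could be sketched like this. `PeerSnapshot` and `eligible_peers` are hypothetical names made up for the example, and treating a snapshot at the same slot as "not newer" is an assumption.

```rust
/// Slot number. Assumed here to be a plain `u64`.
type Slot = u64;

/// Hypothetical stand-in for a snapshot advertised by an RPC peer.
#[derive(Debug, Clone, PartialEq)]
struct PeerSnapshot {
    peer: String,
    slot: Slot,
}

/// Returns the peers advertising the highest snapshot slot, but only if that
/// slot is strictly newer than the local snapshot. An empty result means
/// "nothing worth downloading yet, retry later".
fn eligible_peers(peers: &[PeerSnapshot], local_slot: Option<Slot>) -> Vec<&PeerSnapshot> {
    // Drop every peer whose snapshot is not newer than the local one.
    // With no local snapshot, every peer qualifies.
    let newer: Vec<&PeerSnapshot> = peers
        .iter()
        .filter(|p| local_slot.map_or(true, |local| p.slot > local))
        .collect();

    // Among the remaining peers, keep only those at the highest slot.
    let highest = match newer.iter().map(|p| p.slot).max() {
        Some(slot) => slot,
        None => return Vec::new(),
    };
    newer.into_iter().filter(|p| p.slot == highest).collect()
}
```

With a local snapshot at slot 150 and peers at slots 100, 200 and 200, only the two peers at slot 200 are returned. With a local snapshot at slot 200, the result is empty and the caller would keep waiting.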

ryoqun · 2020-10-05T03:42:21Z

validator/src/main.rs

                &eligible_rpc_peers[thread_rng().gen_range(0, eligible_rpc_peers.len())];
            return (contact_info.clone(), highest_snapshot_hash);
+        } else {
+            retry_reason = Some("No snapshots available".to_owned());


I think this is a clearer place to put this message in terms of control flow.

ryoqun · 2020-10-05T03:42:41Z

validator/src/main.rs

+                    if eligible_rpc_peers.is_empty() {
+                        retry_reason = Some(format!(
+                            "Wait for newer snapshot than local: {:?}",
+                            highest_snapshot_hash
+                        ));
+                        continue;
+                    }


Nit: I think this log message doesn't account for:

trusted_snapshot_hashes is empty: https://github.com/solana-labs/solana/blob/master/validator/src/main.rs#L221-L222 or

rpc_peers is empty: https://github.com/solana-labs/solana/blob/master/validator/src/main.rs#L181-L185

If local has a newer snapshot, why don't we just use it?

We can do this generically when trusted validators exist: if all available snapshots from trusted validators are older than local, use local. There could even be a new CLI argument, --use-local-snapshot-if-newer-than <NUMBER_OF_SLOTS>, that would use a local snapshot if it's at most NUMBER_OF_SLOTS slots behind the newest trusted validator. The default value of NUMBER_OF_SLOTS could be maybe 100-500.
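
A minimal sketch of the predicate such a flag might evaluate. The flag does not exist, and `should_use_local_snapshot` and its parameters are hypothetical names for illustration only.

```rust
/// Slot number. Assumed here to be a plain `u64`.
type Slot = u64;

/// Hypothetical check for the proposed --use-local-snapshot-if-newer-than flag:
/// use the local snapshot when it is at most `max_slots_behind` slots behind
/// the newest snapshot offered by a trusted validator.
fn should_use_local_snapshot(
    local_slot: Slot,
    newest_trusted_slot: Slot,
    max_slots_behind: u64,
) -> bool {
    // saturating_sub yields 0 when the local snapshot is already newer,
    // so a newer local snapshot always qualifies.
    newest_trusted_slot.saturating_sub(local_slot) <= max_slots_behind
}
```

For example, with NUMBER_OF_SLOTS = 100, a local snapshot at slot 900 would be used when the newest trusted snapshot is at slot 1000, but a local snapshot at slot 899 would not.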

@carllin

Nit: I think this log message doesn't account for:

Well, I misread the code too but these continues are for the inner loop: https://github.com/solana-labs/solana/blob/master/validator/src/main.rs#L226

If local has a newer snapshot, why don't we just use it?

The main reason is that the original snapshot RPC selection logic doesn't wait for the full set of specified trusted RPC nodes to be available. So we can't evaluate that predicate to begin with.

So, if a booting validator sees only a subset of trusted RPC nodes, all with older snapshots, it considers the newest of those to be the newest snapshot the cluster has to offer and fetches it.

Definitely, we can wait to determine whether all available snapshots from trusted validators are older than local, ..., but this requires 100% liveness of the trusted validators. Given only two or three trusted nodes for mainnet-beta/testnet, using a 90% or 80% liveness threshold wouldn't work well.

Also, it's operationally rare to have a newer local snapshot, assuming a solid setup uses --no-snapshot-fetch anyway and/or the connected cluster isn't making newer snapshots.

So this change is really for a corner case. It's also for development purposes, where frequent and forcible restarts are rampant. ;)

So, I opted for simplicity.

Also, I'll consider --use-local-snapshot-if-newer-than for a bit...

--use-local-snapshot-if-newer-than <NUMBER_OF_SLOTS> that would use a local snapshot if it's at most NUMBER_OF_SLOTS slots behind the newest trusted validator. Default value of NUMBER_OF_SLOTS is maybe 100-500.

Well, sadly, supporting this takes a bit more effort than just plumbing (reasons below). How about merging this PR first, so that v1.4 (=~ testnet) doesn't suffer from this known persistent-tower bug? (Heh, I was completely unaware of this snapshot-fetch-related failure scenario...)

The detailed reasons for the extra effort:

First, the persistent tower never supports a root bank older than the voted banks, to prevent double votes and other general slashing events. So we can't use a local snapshot even if it's just slightly older (practically, by 100-500 slots) while voting is enabled. (If we really want to relax this restriction, I think we'd have to adjust the persistent tower and general voting behavior, but I don't think that's worth doing.)

The persistent tower always assumes that voted slots are available locally and are not isolated the way a snapshot's root bank is. The rationale is that if you voted on slots, you should have them and have replayed them (duh).

The only exception is starting over from a newer snapshot. This covers cases like a disk failure or a corrupted rocksdb.
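
To make the restriction concrete, here is a minimal sketch of the kind of check involved. It is not the real tower code: `check_root_not_in_past` is a hypothetical helper, and comparing against the tower's saved root (rather than each voted slot) is a simplifying assumption.

```rust
/// Slot number. Assumed here to be a plain `u64`.
type Slot = u64;

/// Hypothetical boot-time check: the root restored from a snapshot must not
/// be older than the root recorded in the persisted tower.
fn check_root_not_in_past(tower_root: Slot, snapshot_root: Slot) -> Result<(), String> {
    if snapshot_root < tower_root {
        // Booting from an older snapshot would warp the root into the past.
        return Err(format!(
            "snapshot root {} is older than tower root {}",
            snapshot_root, tower_root
        ));
    }
    // An equal or newer root is fine; a newer one is the "start over from a
    // newer snapshot" exception.
    Ok(())
}
```

Under these assumptions, a tower saved at root 500 rejects a snapshot rooted at slot 400 and accepts one rooted at slot 500 or 600.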

@mvines Could you check my previous comment?

Especially,

Well, sadly, supporting this takes a bit more effort than just plumbing (reasons below). How about merging this PR first, so that v1.4 (=~ testnet) doesn't suffer from this known persistent-tower bug? (Heh, I was completely unaware of this snapshot-fetch-related failure scenario...)

Works for me. This gets us more in the right direction.

ryoqun · 2020-10-05T03:43:08Z

validator/src/main.rs


-        let mut highest_snapshot_hash: Option<(Slot, Hash)> = None;
+        let mut highest_snapshot_hash: Option<(Slot, Hash)> =
+            get_highest_snapshot_archive_path(ledger_path)


ryoqun · 2020-10-05T03:44:19Z

validator/src/main.rs

+            retry_reason
+                .as_ref()
+                .map(|s| format!(" (Retrying: {})", s))
+                .unwrap_or_default()


I slightly improved logging to indicate retries more clearly.
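
For reference, the hunk above can be exercised in isolation roughly like this. `retry_suffix` is a hypothetical wrapper added only for the example; the chained calls are the ones from the diff.

```rust
/// Builds the optional suffix appended to the log line.
/// `retry_suffix` is a made-up name; the body mirrors the diff hunk.
fn retry_suffix(retry_reason: &Option<String>) -> String {
    retry_reason
        .as_ref() // borrow the inner String instead of cloning the Option
        .map(|s| format!(" (Retrying: {})", s))
        .unwrap_or_default() // no reason recorded: empty suffix
}
```

On the first attempt `retry_reason` is `None` and nothing is appended. After a failed attempt the log line ends with something like " (Retrying: No snapshots available)".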

ryoqun · 2020-10-05T03:53:52Z

And sadly, there are no tests...

codecov · 2020-10-05T04:57:01Z

Codecov Report

Merging #12663 into master will decrease coverage by 0.0%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master   #12663     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         359      359             
  Lines       83989    83989             
=========================================
- Hits        68869    68857     -12     
- Misses      15120    15132     +12

ryoqun · 2020-10-05T05:03:11Z

Although v1.3 doesn't hit persistent-tower-triggered panics because of this bug, I'm going to backport this there anyway: it's still a correctness error in v1.3, even though v1.3 doesn't implement the persistent tower.

carllin

Awesome, thanks!

Pull request has been modified.

ryoqun · 2020-10-07T06:03:39Z

ryoqun added the v1.3 label 2 days ago

I'll wait on this for a while. I'll backport it after the mainnet-beta validators have fully transitioned to v1.3.

ryoqun · 2020-10-08T09:08:02Z

ryoqun added the v1.3 label 2 days ago

I'll wait on this for a while. I'll backport it after the mainnet-beta validators have fully transitioned to v1.3.

Condition cleared! :)

* Only fetch snapshot if it's newer than local * Prefer as_ref over clone * More nits * Don't wait forwever for newer snapshot (cherry picked from commit 81489cc)

* Only fetch snapshot if it's newer than local * Prefer as_ref over clone * More nits * Don't wait forwever for newer snapshot (cherry picked from commit 81489cc) Co-authored-by: Ryo Onodera <ryoqun@gmail.com>

Only fetch snapshot if it's newer than local

a8837d5

ryoqun requested a review from carllin October 5, 2020 03:36

Prefer as_ref over clone

b1abfc1

ryoqun commented Oct 5, 2020

More nits

e0aa14a

ryoqun added the v1.3 label Oct 5, 2020

carllin previously approved these changes Oct 5, 2020

Don't wait forwever for newer snapshot

1d4e6a0

ryoqun mentioned this pull request Oct 7, 2020

Fix tower/blockstore unsync due to external causes #12671

Merged

mvines added the v1.4 label Oct 8, 2020

mvines added this to the v1.4.0 milestone Oct 8, 2020

mvines approved these changes Oct 8, 2020

ryoqun merged commit 81489cc into solana-labs:master Oct 9, 2020

mergify bot mentioned this pull request Oct 9, 2020

Only fetch snapshot if it's newer than local (bp #12663) #12751

Merged

mergify bot mentioned this pull request Oct 9, 2020

Only fetch snapshot if it's newer than local (bp #12663) #12752

Merged
