-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Follow up to persistent tower with tests and API cleaning #12350
Conversation
Codecov Report
@@ Coverage Diff @@
## master #12350 +/- ##
=========================================
- Coverage 81.9% 81.9% -0.1%
=========================================
Files 362 362
Lines 85221 85236 +15
=========================================
+ Hits 69862 69863 +1
- Misses 15359 15373 +14 |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
|
This stale pull request has been automatically closed. Thank you for your contributions. |
3f5ecc9 to
a607473
Compare
|
I already want #12739 ;) |
core/src/consensus.rs
Outdated
| // Should never consider switching to an ancestor | ||
| // of your last vote | ||
| assert!(!last_vote_ancestors.contains(&switch_slot)); | ||
| // Generally, should never consider switching to an ancestor of your last vote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should go to #12671
core/src/consensus.rs
Outdated
| // meaning some inconsistency between saved tower and ledger. | ||
| // (newer snapshot, or only a saved tower is moved over to new setup?) | ||
| // compare slots not to error! just because of newer snapshots | ||
| if self.last_switch_threshold_check.is_none() && switch_slot < last_voted_slot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define should_warn_on_first_switch_check()
| SameFork, | ||
| FailedSwitchThreshold(u64, u64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract this into separate pr.
core/src/replay_stage.rs
Outdated
| ); | ||
| if switch_fork_decision == SwitchForkDecision::FailedSwitchThreshold { | ||
| let a = Some((heaviest_bank.slot(), switch_fork_decision.clone())); | ||
| if tower.last_switch_threshold_check != a { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to tower
core/src/replay_stage.rs
Outdated
| && propagation_confirmed | ||
| && switch_fork_decision != SwitchForkDecision::FailedSwitchThreshold | ||
| { | ||
| if let SwitchForkDecision::FailedSwitchThreshold(_, _) = switch_fork_decision { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deduplicate with the following else clause.
|
@ryoqun - should we bring this into v1.4 as well? I'm thinking why not, since v1.4 will live for a couple months |
Yeah, definitely, I'll do. :) |
803ca2a to
d65328d
Compare
d65328d to
416c5c3
Compare
|
@carllin Could you review this? |
| // With the persisted tower: | ||
| // `A` should not be able to generate a switching proof. | ||
| // | ||
| fn do_test_optimistic_confirmation_violation_with_or_without_tower(with_tower: bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, this looked really involved/painful 🤕. Thanks for implementing that hacky suggestion for a test, hopefully this makes us feel better about the optimistic confirmation + saved tower interaction!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, to be frank, writing this was very hard and needed more logging improvement resulting in #12875. ;) Well, I'm actually surprised that I could write one. xD
| ); | ||
| blockstore.set_dead_slot(prev_voted_slot).unwrap(); | ||
|
|
||
| std::fs::remove_file(Tower::get_filename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this no longer necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin Yeah. As I figured out this in hard way, saved tower doesn't protect us from violating opt. conf., if voted slot is marked dead, forcibly. So, as you guessed, i'm effectively making this test harder to pass (=actually violate) by not removing the saved tower and indeed this test still passes. :)
Well, I found it's very hard to make a validator violate optimistic confirmation if and only if tower is removed.
The hard part is that marking a slot as dead makes a validator to violate the optimistic conf. even if tower is persited. And eventually, the cluster makes new roots.
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added explicit comment mentioning for the saved tower: https://github.com/solana-labs/solana/pull/12350/files#r506909261
local-cluster/tests/local_cluster.rs
Outdated
| &opt, | ||
| ) | ||
| .unwrap(); | ||
| std::fs::remove_file(Tower::get_filename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract this and other similar calls to a delete_tower() function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
local-cluster/tests/local_cluster.rs
Outdated
| )) | ||
| .unwrap(); | ||
|
|
||
| let blockstore = Blockstore::open_with_access_type( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can extract this open -> purge logic into a separate function because the same logic is reused below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
local-cluster/tests/local_cluster.rs
Outdated
| loop { | ||
| sleep(Duration::from_millis(100)); | ||
| let highest_bank = client | ||
| .get_slot_with_commitment(CommitmentConfig::recent()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to assert here that validator b actually voted on next_slot_on_a (maybe a variation of last_vote_in_tower() that checks tower_contains_vote()?
I think CommitmentConfig::recent() implies the validator voted, but want to double check in case that recent() guarantee ever changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this suggestion makes sense. well, I was just lazy... ;)
local-cluster/tests/local_cluster.rs
Outdated
| )) | ||
| .unwrap(); | ||
|
|
||
| // For some reason, fork_choice always selects slot 27 over votes_on_c_fork. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun I think I see what's happening.
If you don't delete slot 27 from the ledger, then the validator A will immediately vote for 27 on restart, because it hasn't gotten the heavier fork from validator C yet. Then it will be stuck on 27 unable to switch because C doesn't have enough stake to generate a switching proof
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. that makes sense. The hard part of writing this was that validator actually try very hard to move forward the chain and resolve any divergent. I guess I can't hold validator from voting after restart, right? I'll update this comment of for some reason with your comment. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun yeah no way to prevent the validator from voting on restart, so I think this is the correct thing!
| let node_stakes = vec![31, 36, 33, 0]; | ||
|
|
||
| // Each pubkeys are prefixed with A, B, C and D. | ||
| // D is needed to avoid NoPropagatedConfirmation erorrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin oh, btw, NoPropagatedConfirmation can be worked around just by not staked votes by non-existing vote account. So, any validator can be tricked into the false illusion of propagation. And attacker can do this without fear of slashing, maybe? This might be actually be a bug? Should we exclude 0-lamport votes and introduce some threshold like 1 (or N) lamport(s) or 0.01% stake?
| } | ||
|
|
||
| pub fn to_base58_string(&self) -> String { | ||
| // Remove .iter() once we're rust 1.47+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: @t-nelson Another LengthAtMost32 artificial shackle. I'm looking forward to #12739. :)
error[E0277]: arrays only have std trait implementations for lengths 0..=32
--> sdk/src/signature.rs:51:22
|
51 | bs58::encode(&self.0.to_bytes()).into_string()
| ^^^^^^^^^^^^^^^^^^ the trait `std::array::LengthAtMost32` is not implemented for `[u8; 64]`
|
::: /home/.cargo/registry/src/github.com-1ecc6299db9ec823/bs58-0.3.1/src/lib.rs:207:18
|
207 | pub fn encode<I: AsRef<[u8]>>(input: I) -> encode::EncodeBuilder<'static, I> {
| ----------- required by this bound in `bs58::encode`
|
= note: required because of the requirements on the impl of `std::convert::AsRef<[u8]>` for `[u8; 64]`
= note: required because of the requirements on the impl of `std::convert::AsRef<[u8]>` for `&[u8; 64]`
| // marking this voted slot as dead makes the saved tower garbage | ||
| // effectively. That's because its stray last vote becomes stale (= no | ||
| // ancestor in bank forks). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
@carllin I think this pr is ready for another round of review. :) |
we'll take a look at this later. I'll first merge this as this lgtm-ed by @carllin via discord. |
* Follow up to persistent tower * Ignore for now... * Hard-code validator identities for easy reasoning * Add a test for opt. conf violation without tower * Fix compile with rust < 1.47 * Remove unused method * More move of assert tweak to the asser pr * Add comments * Clean up * Clean the test addressing various review comments * Clean up a bit (cherry picked from commit 54517ea)
…12972) * Follow up to persistent tower * Ignore for now... * Hard-code validator identities for easy reasoning * Add a test for opt. conf violation without tower * Fix compile with rust < 1.47 * Remove unused method * More move of assert tweak to the asser pr * Add comments * Clean up * Clean the test addressing various review comments * Clean up a bit (cherry picked from commit 54517ea) Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
| } | ||
|
|
||
| fn purge_slots(blockstore: &Blockstore, start_slot: Slot, slot_count: Slot) { | ||
| blockstore.purge_from_next_slots(start_slot, start_slot + slot_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the correct order (cf: #13065 )
…s#12350) * Follow up to persistent tower * Ignore for now... * Hard-code validator identities for easy reasoning * Add a test for opt. conf violation without tower * Fix compile with rust < 1.47 * Remove unused method * More move of assert tweak to the asser pr * Add comments * Clean up * Clean the test addressing various review comments * Clean up a bit
…s#12350) * Follow up to persistent tower * Ignore for now... * Hard-code validator identities for easy reasoning * Add a test for opt. conf violation without tower * Fix compile with rust < 1.47 * Remove unused method * More move of assert tweak to the asser pr * Add comments * Clean up * Clean the test addressing various review comments * Clean up a bit
…s#12350) * Follow up to persistent tower * Ignore for now... * Hard-code validator identities for easy reasoning * Add a test for opt. conf violation without tower * Fix compile with rust < 1.47 * Remove unused method * More move of assert tweak to the asser pr * Add comments * Clean up * Clean the test addressing various review comments * Clean up a bit
| // actually saved tower must have at least one vote. | ||
| let last_vote = Tower::restore(&ledger_path, &node_pubkey) | ||
| .unwrap() | ||
| .last_voted_slot() | ||
| .unwrap(); | ||
| Some(last_vote) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @CriesofCarrots noticed (https://discord.com/channels/428295358100013066/560503042458517505/858178548916027402):
Also, why does this method [2] call Tower::restore twice?
this is oversight according to this local commit (fyi, I retain all edits since i cloned the repo locally for accountability...)
The root ancesty of the redundant Tower::restore calls originated from this commit and then evolved significantly, without the bug not being noticed to this day... ;)
$ git show 765ec9c462dbe4e2282426f9230b941dbf5cb8f6
commit 765ec9c462dbe4e2282426f9230b941dbf5cb8f6
Author: Ryo Onodera <ryoqun@gmail.com>
Date: Thu Oct 8 04:13:46 2020 +0900
save
diff --git a/local-cluster/tests/local_cluster.rs b/local-cluster/tests/local_cluster.rs
index f07d021614..128e5071c4 100644
--- a/local-cluster/tests/local_cluster.rs
+++ b/local-cluster/tests/local_cluster.rs
@@ -1717,6 +1717,8 @@ fn do_test_optimistic_confirmation_violation_with_or_without_tower(with_tower: b
let mut bad_vote_detected = false;
let mut retry = 100;
loop {
+ let tower = Tower::restore(&val_a_ledger_path, &val_a_7B);
+ if tower.is_err() { continue }
let s = Tower::restore(&val_a_ledger_path, &val_a_7B).unwrap().last_voted_slot().unwrap();
if val_c_slots.contains(&s) {
bad_vote_detected = true;
Problem
Tower::root() can get away of
Option<_>.And there is postponed local test addition suggested from last review at #10718.
Summary of Changes
Let's simplify code and reduce combinatorial complexes for humans.
And some minor follow up fixes.
Also, adds long-awaited interesting test: #10718 (comment)
There should be no functional change.
Context
follow up #10718