Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add switch threshold local cluster tests + fix check_switch_threshold #10343

Closed
wants to merge 5 commits into from

Conversation

carllin
Copy link
Contributor

@carllin carllin commented May 31, 2020

Problem

  1. No local cluster integration tests for switch threshold logic
  2. https://github.com/solana-labs/solana/blob/master/core/src/consensus.rs#L451-L454 is not longer a correct check because BankForks.ancestors() can now return with keys < root.
  3. https://github.com/solana-labs/solana/blob/master/core/src/consensus.rs#L415 to check only tips of forks can cause liveness problems if the tip of the fork is not frozen/lockouts not computed

Summary of Changes

  1. Add local cluster tests to check switch threshold success/failure when there is less than and greater than maximal failures
  2. Fix check_switch_threshold checks, update unit tests

Fixes #

@carllin carllin requested a review from aeyakovenko May 31, 2020 06:45
@carllin carllin added the v1.2 label May 31, 2020
Comment on lines +510 to +511
let alive_stake_1 = total_alive_stake / 2;
let alive_stake_2 = total_alive_stake - alive_stake_1;
Copy link
Contributor Author

@carllin carllin May 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aeyakovenko, @sakridge I think this test highlights a potential liveness problem caused by the interaction of the fork choice rule and the switching threshold (you can adjust the max_failures stake and the distribution of alive_stake_1 and alive_stake_2 to test the below example).

Recall the current behavior when a validator on fork A detects a heavier fork B, the validator will

  1. Stop voting on its own fork A (to prevent liveness issues where you can never generate a switching proof b/c your own vote is a moving target)
  2. Continue generating blocks on its own fork A
  3. Wait for a switching proof to fork B, at which point it will vote and reset to B

Now consider the below example, where the switching threshold is 38% (need >38% to switch). In the example, a partition occurs such that:

  1. One partition has <=38% of the stake but is heavier (it has a lucky run of slots)
  2. One partition has >62% but is lighter
         4- 5- 6- 7-8-9-10  (38%)
        / 
1----2
        \
          3-11 (62%)

In the above example, during the partition there was a brief run where the 38% partition got more blocks.

Then when the partitions resolve, partition 2 will stop voting and wait for a switching proof to the heavier partition 1 (code here: https://github.com/solana-labs/solana/blob/master/core/src/replay_stage.rs#L1451-L1459) but 1 will never have enough stake for a switching proof).

Note that even though partition 2 will eventually generate more blocks than partition 1, validators on partition 2 are not voting, so the weight doesn't catch up to partition 1 (original design was based on fact that fork with more stake/slots would eventually catch up due to generating more blocks, but this reasoning seems flawed) because partitioin 1 is both generating blocks and voting (adding a voting bank increases fork weight by a factor of 2^n, simply adding unvoted banks only increases the fork weight linearly)

Ideally we would want to adjust the fork choice rule such that 38% switch over to the 62% fork because it has more of the most recent stake voting/building on it, even though our current fork weight may say otherwise.

For example, what if we tracked the most recent votes landed by each validator, and updated these as validators voted on blocks on same/different forks? Then when considering two forks, traverse the tree, picking the fork that has the greatest amount of most recent stake voting on top of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i see. What's weird is that 38% fork can construct a switching proof to the 62% fork, but its lighter

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the accumulation of weight ignores the expiration of votes. that's part of the problem. because eventually the 38% fork would expire all their votes and the 62 will catch up.

@carllin carllin requested a review from sakridge May 31, 2020 10:17
@codecov
Copy link

codecov bot commented May 31, 2020

Codecov Report

Merging #10343 into master will increase coverage by 0.0%.
The diff coverage is 58.3%.

@@           Coverage Diff            @@
##           master   #10343    +/-   ##
========================================
  Coverage    81.2%    81.2%            
========================================
  Files         288      288            
  Lines       67093    67171    +78     
========================================
+ Hits        54481    54596   +115     
+ Misses      12612    12575    -37     

@stale
Copy link

stale bot commented Jun 8, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 8, 2020
@stale
Copy link

stale bot commented Jun 15, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jun 15, 2020
@carllin carllin removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants