Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Devs SBAT be sure of Istanbul liveness #116

Closed
timmoreton opened this issue Feb 26, 2019 · 9 comments · Fixed by #366
Closed

Devs SBAT be sure of Istanbul liveness #116

timmoreton opened this issue Feb 26, 2019 · 9 comments · Fixed by #366
Assignees

Comments

@timmoreton
Copy link
Contributor

Istanbul has a liveness issue described in:
https://arxiv.org/pdf/1901.07160.pdf
https://drive.google.com/open?id=1v5kOvu5eppGHXdHqHoCIUMVmV7eAsDwpLljDtvVBR_U

and also described here:
ethereum/EIPs#650 (comment)

Istanbul learns of new blocks via miner/worker.go line 428, which results in a new FinalCommittedEvent being posted to the engine, handled at backend/handler.go line 123. Currently this just starts a new round.

@timmoreton
Copy link
Contributor Author

I believe (but haven't checked) that the Pantheon Istanbul implementation fixes this -- see here for more info: https://github.com/PegaSysEng/pantheon/tree/master/consensus/ibft

@nambrot
Copy link
Contributor

nambrot commented Mar 5, 2019

The liveness issue seems to be as following.

  • Assume n validators, with Quorum(n) as the byzantine fault threshold.
  • Assume n-1 honest validators, and one byzantine validator F (doesn't even have to be malicious, just
  • Set W contains Quorum(n)-1 honest validators and the byzantine validator
  • Set V contains all the other honest validators, the size of V is by definition strictly smaller than Quorum(n)

The scenario is as following:

  • A valid proposer proposes Block B1 for height h, sends the PREPREPARE and gets everyone to send a PREPARE based upon that
  • Only the nodes in V receive QUORUM(n) PREPARE, and will thus lock onto B1 for height h, the nodes in W never receive enough PREPAREs for the lock
  • Everyone eventually will issue a ROUND-CHANGE after the timeout
  • Byzantine node F will propose block B2 (it never locked on B1)
  • All nodes in W will PREPARE on B2 and with F will achieve QUORUM(n) and lock onto B2
  • All nodes in V will issue ROUND-CHANGE as they have already locked on B1

No progress can be made since no one will unlock

@nambrot
Copy link
Contributor

nambrot commented Mar 5, 2019

Also interesting comment openethereum/parity-ethereum#9298 (comment) and Consensys/quorum#305

@nambrot
Copy link
Contributor

nambrot commented Mar 5, 2019

My understanding of the original locking mechanism in IBFT is to ensure the safety during round changes. Otherwise byzantine nodes can send multiple PREPARE and COMMITs to different subsets of honest nodes which will then cause honest nodes to agree on different blocks at the same height, although I'm not entirely sure how, if honest nodes only ever send out one PREPARE

It seems to me that the locking mechanism is inherently impossible to remedy with liveness, an example being Consensys/quorum#305.

The solution above as well as the second on in the correctness analysis paper is to basically implement what tendermint has, a proof that it is safe for the nodes to unlock.

The first solution in the paper is get rid of locking completely apparently, but not sure I completely understand it. To quote:

The key change to IBFT is to remove the locking mechanism and ensure safety between round changes via a round change sub-protocol similar to the one in the PBFT protocol

@nambrot
Copy link
Contributor

nambrot commented Mar 6, 2019

I believe this is the equivalent liveness PR in pantheon PegaSysEng/pantheon#872

@trianglesphere
Copy link
Contributor

trianglesphere commented Mar 9, 2019

As noted in the arxiv paper (section 6.2), there are two solutions to the liveness problem proposed in the paper.

1. Remove the locking and use a PBFT like round change.

Per this article IBFT 2.0 from PegaSys is going to use the PBFT style mechanism.

The round change modification addresses the liveness issues in IBFT 1.0. In IBFT 1.0, a round change always results in new block proposals. In IBFT 2.0, an enhanced round change protocol ensures that if any validator commits in a round, then the block proposed in any successive round matches the committed block. This mechanism, inspired by the PBFT view change protocol, removes the need for the block locking mechanism used in IBFT 1.0 which caused liveness issues.

Removing the locking mechanism solves the problem of getting a split locking situation, but it removes the safety guarantee.

IBFT 2.0 reintroduces the safety guarantee by introducing the following requirements on round changes. (PBFT view changes are more general, see section 4.4. of PBFT paper).

  1. Each round change has a prepared certificate which includes proof that a block has 2n/3 valid prepare messages for it.
  2. When the next proposer proposes a block, it must include a pre-prepare message with the new round message.
  3. The pre-prepare message from the new proposer must propose the highest block that has a valid prepare certificate. If no such certificate exists, the new proposer can propose any block that it would like to.

In a PBFT context with multiple actions in a single view (analogous to a round), the new proposer needs to use prepare certificates rather than commit certificates (proof that a node has commited an action) because it appears possible for the new proposer to not get a commit certificate and thus safety is lost if a node has locally committed, but the proposer won't. Here is a good explanation of the view change in PBFT and why it needs the prepare certificates.

Looking into PegaSys's source code. When a message is broadcast, it can either be in round 0, or a later round. If my understanding is correct that at each block height a new instance of IBFT block finalization is started, then round starts off at 0 for each new block. Here is PegaSys's block validating for ibft2.0. This thing to pay attention to is validateProposalAndRoundChangeAreConsistent which requires no round change certificate in the case of the first round or having a valid round change cert in the case of actually changing a round. IBFT 2.0 may have relaxed prepare certificate to commit certificate, but I'm not quite sure yet.

2. Allow relocking on newer round like tendermint does.

The lock causes problems because some nodes end up locked forever because of network partitioning.

The reason that tendermint has locking and proof of lock change (polka) is explained here (thesis on tendermint) on page 23. I've just quoted relevant sections that explain why the lock and proof of lock change is required.

Explanation of locking

These rules can be understood more intuitively by way of examples. Consider four validators, A, B, C, D, and suppose there is a proposal for blockX at round R. Suppose there is a polka for blockX, but A doesn’t see it, and pre-commits nil, while the others pre-commit for blockX. Now suppose the only one to see all pre-commits is D, while the others, say, don’t see D’s pre-commit (they only see their two pre-commits and A’s pre-commit nil). D will now commit the block, while the others go to round R + 1. Since any of the validators might be the new proposer, if they can propose and vote for any new block, say blockY , then they might commit it and compromise safety, since D already committed blockX. Note that there isn’t even any Byzantine behaviour here, just asynchrony! Locking solves the problem by forcing validators to stick with the block they pre-committed, since other validators might have committed based on those pre-commits (as D did in this example). In essence, once more than two-thirds pre-commit a block in a round, the network is locked on that block, which is to say it must be impossible to produce a valid polka for a different block at a higher round. This is direct motivation for Prevote-the-Lock.

Explanation of why polka needs to exist

Prevote-the-Lock is not sufficient, however. There must be a way to unlock, lest we sacrifice liveness. Consider a round where A and B precommitted blockX while C and D pre-committed nil - a split vote. They all move to the next round, and blockY is proposed, which C and D prevote for. Suppose A is Byzantine, and prevotes for blockY as well (despite being locked on blockX), resulting in a polka. Suppose B does not see the polka and pre-commits nil, while A goes off-line and C and D pre-commit blockY . They move to the next round, but B is still locked on blockX, while C and D are now locked on blockY , and since A is offline, they can never get a polka. Hence, we’ve compromised liveness with less than a third (here, only one) Byzantine validators. The obvious justification for unlocking is a polka. Once B sees the polka for blockY (which C and D used to jusitfy their pre-commits for blockY ), it ought to be able to unlock, and hence pre-commit blockY . This is the motivation for Unlock-on-Polka, which allows validators to unlock (and precommit a new block), if they have seen a polka in a round greater than that in which they locked.

@trianglesphere trianglesphere self-assigned this Mar 13, 2019
@trianglesphere
Copy link
Contributor

I've started a small test to show how istanbul can lock up. The test doesn't run, but it has a comment explaining how to set up the test. The branch is trianglesphere/istanbul-liveness-failure

@nambrot
Copy link
Contributor

nambrot commented Apr 23, 2019

Fun fact, quorum put Consensys/quorum#305 on their april deliverable

@nambrot
Copy link
Contributor

nambrot commented Jul 19, 2019

There was some additional conversation on Consensys/quorum#305 with the author of the correctness analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants