QBFT Bonsai 24.3.3 private network halted #7642

Open
carlossrp opened this issue Sep 19, 2024 · 4 comments
Labels
non mainnet (private networks): not related to mainnet features, covers privacy, permissioning, IBFT2, QBFT; P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumption); QBFT: QBFT consensus related

Comments

@carlossrp

Description

Hi guys, we have a production private QBFT PoA network running Besu 24.3.3 that halted this week. The network architecture is:

  • 4 RPC Bonsai Full nodes
  • 1 Archive Forest Full node
  • 4 Validator Bonsai Full nodes
  • 4 more RPC Bonsai Full nodes that act as “standby validators”.

The nodes are distributed across different AZs and regions: each validator is in its own AZ, two of them in one region and the other two in another. The same layout applies to the “standby validators” and the RPC nodes.
This week the network simply halted. The validators, which were running at INFO log level, stopped producing blocks with no error message and eventually stopped logging anything at all.
To recover quickly, we manually restarted the 4 validators, which brought the network back online. Validator logs are attached.
The network had been running for roughly 4 months without any issues.
Curiously, some RPC nodes came back and some did not. We restarted 2 of the main RPC nodes (the ones applications connect to) and they started syncing blocks again, while the other 2 recovered by themselves, without intervention.
However, there are still “standby validators” that we didn’t restart and that remain halted. We changed their log level to DEBUG and TRACE (see the sketch below) but didn’t get much useful information out of it. Both logs are attached as well.
One last important detail: around the time the network froze, we were deploying an application contract to the network. The smart contract is Solidity 0.8.24, deployed with Hardhat.
This contract had been deployed before on one of our development/QA networks without any issues, and after the crash and restart we were able to deploy it to this network as well, but it is the major event that happened near the crash time besides the “business as usual” load on other contracts. The applications kept running as usual throughout this deployment.
Our guess is that the deployment may have triggered a bug, or it could simply be a bug in 24.3.3. We are already planning to upgrade to 24.9.1, but we wanted to share this case with the community because it could be a bug, and we’re afraid it could be something related to Bonsai and QBFT.
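For anyone hitting the same situation: the log level can be changed on a running node without a restart through the ADMIN API. Below is only a minimal sketch, assuming `--rpc-http-api` includes `ADMIN` and that `http://localhost:8545` is a placeholder endpoint; it uses ethers v6 since that is already in our stack.

```ts
// Sketch: change a running Besu node's log level via the ADMIN API.
// Assumes the ADMIN API is enabled; the URL below is a placeholder.
import { JsonRpcProvider } from "ethers";

async function setLogLevel(rpcUrl: string, level: "INFO" | "DEBUG" | "TRACE") {
  const provider = new JsonRpcProvider(rpcUrl);
  // admin_changeLogLevel takes the new level (and optionally a package/class filter).
  const result = await provider.send("admin_changeLogLevel", [level]);
  console.log(`Log level on ${rpcUrl} set to ${level}:`, result);
}

setLogLevel("http://localhost:8545", "TRACE").catch(console.error);
```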
Thank you

Steps to Reproduce

We’re unable to reproduce the issue in our test/QA environments. We have a QA network that has been running for about the same amount of time with no issues.

Versions

  • Software Version: Besu 24.3.3
  • Docker image: official hyperledger/besu:24.3.3, arm image
  • AWS EKS: 1.29
  • EC2: m6g.xlarge

Smart contract information

  • Solidity version: 0.8.24
  • ethers v6
  • hardhat 2.22.3

Additional information

validator-1_-_Copy
validator-2_-_Copy
validator-4_-_Copy

@jframe
Contributor

jframe commented Sep 30, 2024

Do you have the DEBUG RPC API enabled? If so, can you use debug_getBadBlocks to retrieve any bad blocks?
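For example, something along these lines should work against a node’s HTTP endpoint (just a sketch, assuming the DEBUG API is enabled; it uses ethers v6, which you already have, and the URL is a placeholder):

```ts
// Sketch: call Besu's debug_getBadBlocks and dump the result.
// Assumes the DEBUG API is enabled on the node; the URL is a placeholder.
import { JsonRpcProvider } from "ethers";

async function fetchBadBlocks(rpcUrl: string) {
  const provider = new JsonRpcProvider(rpcUrl);
  // debug_getBadBlocks takes no parameters and returns blocks the node has rejected.
  const badBlocks = await provider.send("debug_getBadBlocks", []);
  console.log(JSON.stringify(badBlocks, null, 2));
}

fetchBadBlocks("http://localhost:8545").catch(console.error);
```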

@carlossrp
Author

Hi @jframe, we have DEBUG enabled. I ran it against a still-failed node (validator-7, which is still frozen and generating no logs, but with its RPC port open) and against one active validator, validator-1. Validator-7 didn’t return anything, but validator-1 returned lots of info. Both outputs are attached.
validator-7-debug-getbadblocks.txt
validator-1-debug-getbadblocks.txt

@jflo added the non mainnet (private networks), QBFT, and P3 Medium labels on Oct 8, 2024
@jflo
Contributor

jflo commented Oct 8, 2024

@matthew1001 we think this is a P3, what do you think?

@matthew1001
Contributor

Yeah, that sounds about right to me @jflo. Worth saying that we’ve tried to improve logging around this sort of scenario without needing DEBUG enabled. From 24.8.0 onwards (it might actually be 24.7.1 onwards), when BFT nodes aren’t mining blocks they report updates like this:

2024-10-08 16:24:55.617+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundTimer | BFT round 5 expired. Moved to round 6 which will expire in 320 seconds
2024-10-08 16:24:55.626+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | BFT round summary (quorum = 3)
2024-10-08 16:24:55.626+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xa0685b78b74b63841651855f23d8161482e3bd1b  Round: 6 (Local node)
2024-10-08 16:24:55.626+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0x33b38bbb139832055630194c0673a43b6075663f  Round: 6 
2024-10-08 16:26:37.671+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | BFT round summary (quorum = 3)
2024-10-08 16:26:37.672+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xf4ffcf181a94328e848a9622d2d668d9a7035089  Round: 1 
2024-10-08 16:26:37.672+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xa0685b78b74b63841651855f23d8161482e3bd1b  Round: 6 (Local node)
2024-10-08 16:26:37.672+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0x33b38bbb139832055630194c0673a43b6075663f  Round: 6 
2024-10-08 16:26:47.683+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | BFT round summary (quorum = 3)
2024-10-08 16:26:47.684+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xf4ffcf181a94328e848a9622d2d668d9a7035089  Round: 2 
2024-10-08 16:26:47.684+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xa0685b78b74b63841651855f23d8161482e3bd1b  Round: 6 (Local node)
2024-10-08 16:26:47.685+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0x33b38bbb139832055630194c0673a43b6075663f  Round: 6 
2024-10-08 16:27:07.691+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | BFT round summary (quorum = 3)
2024-10-08 16:27:07.692+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xf4ffcf181a94328e848a9622d2d668d9a7035089  Round: 3 
2024-10-08 16:27:07.692+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0xa0685b78b74b63841651855f23d8161482e3bd1b  Round: 6 (Local node)
2024-10-08 16:27:07.692+01:00 | BftProcessorExecutor-QBFT-0 | INFO  | RoundChangeManager | Address: 0x33b38bbb139832055630194c0673a43b6075663f  Round: 6 

which helps indicate whether the nodes are trying to agree on a new round. Often, resetting the rounds to 0 (i.e. restarting the nodes) is the quickest way to resolve this. At some point I do have in mind to make this more automated and quicker to recover.
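Until something like that lands, an external watcher is one simple way to spot the condition early. The snippet below is only an illustrative sketch, not part of Besu: it polls each validator’s chain head with ethers v6 and warns when the head stops advancing, which is the point at which the round-change summaries above become interesting. The endpoints and interval are placeholders.

```ts
// Illustrative sketch only: warn when a validator's chain head stops advancing.
// Endpoints and polling interval are placeholders, not part of Besu.
import { JsonRpcProvider } from "ethers";

const validatorUrls = ["http://validator-1:8545", "http://validator-2:8545"]; // placeholder URLs
const POLL_MS = 30_000; // poll every 30 seconds

const providers = validatorUrls.map((url) => ({ url, provider: new JsonRpcProvider(url) }));
const lastSeen: Record<string, number> = {};

async function poll(): Promise<void> {
  for (const { url, provider } of providers) {
    try {
      const height = await provider.getBlockNumber();
      if (lastSeen[url] !== undefined && height === lastSeen[url]) {
        console.warn(`${url}: head still at block ${height}; check the BFT round summaries`);
      }
      lastSeen[url] = height;
    } catch (err) {
      console.error(`${url}: unreachable`, err);
    }
  }
}

setInterval(poll, POLL_MS);
```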
