Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive resource utilization when ethstats reporting is enabled #3896

Closed
nicexe opened this issue Mar 22, 2022 · 15 comments
Closed

Excessive resource utilization when ethstats reporting is enabled #3896

nicexe opened this issue Mar 22, 2022 · 15 comments
Assignees

Comments

@nicexe
Copy link

nicexe commented Mar 22, 2022

Describe the bug
Nethermind is using excessive resources

To Reproduce
Steps to reproduce the behavior:

  1. Setup an AuRa network such xdaichain/posdao-testnet-prepare
  2. Before launching the nodes for the first time (./run_all.sh), segregate the nodes into a different server for each one.
    1. Modify each node's docker-compose.yml and update NETHERMIND_NETWORKCONFIG_EXTERNALIP and NETHERMIND_ETHSTATSCONFIG_SERVER environment variables
    2. Optionally modify run_all.sh and stop_all.sh to only run/stop the specific node or ethstats for each server
  3. Start nodes
  4. On archive server, deploy staking contract cd .. # go back to the `compose` directory and docker-compose start && docker-compose logs -f
  5. Observe system load. This is the baseline with ethstats reporting enabled
  6. Stop the nodes
  7. Modify each node's docker-compose.yml and update NETHERMIND_ETHSTATSCONFIG_ENABLED to false
  8. Observe system load. This is with ethstats reporting disabled

Screenshots
Here you can see some system metrics with ethstats reporting enabled
Focus on system load average
The only outlier is validator1 which seems to report the wrong information yet has the least resource utilization
Screenshot 2022-03-22 at 4 40 14 PM
image
Here you can see some system metrics with ethstats reporting disabled
Screenshot 2022-03-22 at 5 01 07 PM

System Information:

Additional notes
I cannot really explain the discrepancy between validator1 and the other validators in the ethstats enabled scenario.
All validators have almost identical environment variables set. They only differ on node name, keys and external ip.
Each node lives on it own VM, each on a different host machine, some host machines on a different datacenters.
All VMs have identical specs.

@nicexe
Copy link
Author

nicexe commented Mar 22, 2022

In the ethstats enabled scenario, validators are starting to struggle to parse/produce blocks on time and this results in lots of benign AuRa reports

@nicexe
Copy link
Author

nicexe commented Mar 22, 2022

Logs in the ethstats disabled scenario:
Screenshot 2022-03-22 at 5 33 09 PM

Logs in the ethstats enabled scenario
Screenshot 2022-03-22 at 5 47 10 PM

@nicexe
Copy link
Author

nicexe commented Mar 23, 2022

I forgot to mention the VM orientation shown in my screenshots:

archive validator1 validator2
validator3 validator4 validator5

@varasev
Copy link
Contributor

varasev commented Mar 24, 2022

@nicexe Did you try that with v1.12.3?

@nicexe
Copy link
Author

nicexe commented Mar 24, 2022

Hi @varasev thanks for the reply.
I just tried 1.12.3 but the nodes cannot start anymore and I'm getting those 2 errors:

2022-03-24 13:55:05.4288|ERROR|13|Step InitializeBlockchainAuRa failed after 151ms Nethermind.Trie.TrieException: Node 0x3adfe602f80cfa36db0ea3b41a64819071037a685399bf9381508d255e8d9f71 is missing from the DB
   at Nethermind.Trie.Pruning.TrieStore.LoadRlp(Keccak keccak, IKeyValueStore keyValueStore) in /src/Nethermind/Nethermind.Trie/Pruning/TrieStore.cs:line 331
   at Nethermind.Trie.Pruning.ReadOnlyTrieStore.LoadRlp(Keccak hash) in /src/Nethermind/Nethermind.Trie/Pruning/ReadOnlyTrieStore.cs:line 41
   at Nethermind.Trie.TrieNode.ResolveNode(ITrieNodeResolver tree) in /src/Nethermind/Nethermind.Trie/TrieNode.cs:line 234
   at Nethermind.Trie.PatriciaTree.Run(Span`1 updatePath, Int32 nibblesCount, Byte[] updateValue, Boolean isUpdate, Boolean ignoreMissingDelete, Keccak startRootHash) in /src/Nethermind/Nethermind.Trie/PatriciaTree.cs:line 401
   at Nethermind.Trie.PatriciaTree.Get(Span`1 rawKey, Keccak rootHash) in /src/Nethermind/Nethermind.Trie/PatriciaTree.cs:line 319
   at Nethermind.State.StateTree.Get(Address address, Keccak rootHash) in /src/Nethermind/Nethermind.State/StateTree.cs:line 49
   at Nethermind.State.StateProvider.GetState(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 659
   at Nethermind.State.StateProvider.GetAndAddToCache(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 674
   at Nethermind.State.StateProvider.GetThroughCache(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 695
   at Nethermind.State.StateProvider.GetCodeHash(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 311
   at Nethermind.State.IReadOnlyStateProvider.IsContract(Address address) in /src/Nethermind/Nethermind.State/IReadOnlyStateProvider.cs:line 39
   at Nethermind.Evm.TransactionProcessing.ReadOnlyTransactionProcessor.IsContractDeployed(Address address) in /src/Nethermind/Nethermind.Evm/TransactionProcessing/ReadOnlyTransactionProcessor.cs:line 55
   at Nethermind.Blockchain.Contracts.Contract.ConstantContract.CallRaw(CallInfo callInfo, IReadOnlyTransactionProcessor readOnlyTransactionProcessor) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 111
   at Nethermind.Blockchain.Contracts.Contract.ConstantContract.Call(CallInfo callInfo) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 104
   at Nethermind.Blockchain.Contracts.Contract.IConstantContract.Call[T](CallInfo callInfo) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 43
   at Nethermind.Blockchain.Contracts.Contract.IConstantContract.Call[T](BlockHeader parentHeader, String functionName, Address sender, Object[] arguments) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 52
   at Nethermind.Consensus.AuRa.Contracts.ValidatorContract.GetValidators(BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Contracts/ValidatorContract.cs:line 104
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator.LoadValidatorsFromContract(BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 246
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator.SetFinalizationManager(IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 84
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator..ctor(IValidatorContract validatorContract, IBlockTree blockTree, IReceiptFinder receiptFinder, IValidatorStore validatorStore, IValidSealerStrategy validSealerStrategy, IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader, ILogManager logManager, Int64 startBlockNumber, Int64 posdaoTransition, Boolean forSealing) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 71
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.<CreateValidatorProcessor>g__GetContractBasedValidator|19_2(<>c__DisplayClass19_0& ) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 104
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.CreateValidatorProcessor(Validator validator, BlockHeader parentHeader, Nullable`1 startBlock) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 131
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.CreateValidator(Int64 finalizedAtBlockNumber, Validator validatorPrototype, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 222
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetCurrentValidator(Int64 finalizedAtBlockNumber, Validator validatorPrototype, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 204
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetCurrentValidator(KeyValuePair`2 validatorInfo, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 196
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.InitCurrentValidator(Int64 blockNumber, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 75
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetFinalizationManager(IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 186
   at Nethermind.Consensus.AuRa.Validators.MultiValidator..ctor(Validator validator, IAuRaValidatorFactory validatorFactory, IBlockTree blockTree, IValidatorStore validatorStore, IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader, ILogManager logManager, Boolean forSealing) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 66
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.CreateValidatorProcessor(Validator validator, BlockHeader parentHeader, Nullable`1 startBlock) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 145
   at Nethermind.Consensus.AuRa.InitializationSteps.InitializeBlockchainAuRa.CreateAuRaValidator(IBlockProcessor processor, IReadOnlyTxProcessorSource readOnlyTxProcessorSource) in /src/Nethermind/Nethermind.Consensus.AuRa/InitializationSteps/InitializeBlockchainAuRa.cs:line 131
   at Nethermind.Consensus.AuRa.InitializationSteps.InitializeBlockchainAuRa.CreateBlockProcessor() in /src/Nethermind/Nethermind.Consensus.AuRa/InitializationSteps/InitializeBlockchainAuRa.cs:line 94
   at Nethermind.Init.Steps.InitializeBlockchain.InitBlockchain() in /src/Nethermind/Nethermind.Init/Steps/InitializeBlockchain.cs:line 220
   at Nethermind.Init.Steps.InitializeBlockchain.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/InitializeBlockchain.cs:line 69
2022-03-24 13:55:05.4476|ERROR|10|Error during ethereum runner start Nethermind.Init.Steps.StepDependencyException: BlockchainProcessor
   at Nethermind.Init.Steps.StartBlockProcessor.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/StartBlockProcessor.cs:line 37
   at Nethermind.Init.Steps.EthereumStepsManager.RunOneRoundOfInitialization(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 140
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 108
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 51

I'll try keeping all nodes to latest except one where I would configure it to use 1.12.3 and sync from 0

@nicexe
Copy link
Author

nicexe commented Mar 24, 2022

@varasev 1.12.3 definitely helped

system load
image

ethstats
image

I restored the archive node and validators from 1 to 4 and configured them to use 1.12.6. I removed all data for validator 5 and configured it to use 1.12.3.

After validator 5 synced with the rest of the nodes I let it rest for a few minutes and then checked the system load as seen in the screenshot.

Validator 5 has a system load average of about 1 on a 4 core VM and the other nodes have a system load average of about 8-10 on a 4 core VM each.

Again, the only outlier is validator 1 which has significantly lower CPU utilization and system load average of about 2 on a 4 core VM but it is also the only node reporting invalid stats to the ethstat server.

@LukaszRozmej
Copy link
Member

Hi @nicexe looking into your issue, if 1.12.3 solves it then there is slight chance that the actual reason is different and related to: #3901

@varasev
Copy link
Contributor

varasev commented Mar 27, 2022

@nicexe Can you try to run the setup using the nethermindeth/nethermind:local_txs_broadcasting docker image? It is the build from #3901

@LukaszRozmej
Copy link
Member

Also pushed #3906

@nicexe
Copy link
Author

nicexe commented Mar 28, 2022

I tried to use nethermindeth/nethermind:local_txs_broadcasting on validator 5. This brings the VM that validator 5 runs on to a system load of about 9 and 100% CPU utilization just like the rest of the nodes as opposed to a system load of about 1 when using 1.12.3

@varasev
Copy link
Contributor

varasev commented Mar 29, 2022

Also pushed #3906

@LukaszRozmej Can you please make a docker image for this?

@LukaszRozmej
Copy link
Member

LukaszRozmej commented Mar 30, 2022

@nicexe I fail to actually reproduce this:
image
image

Would you be able to provide a performance trace from one of the misbehaving nodes? I am inquiring our devops on how to do that on nodes running inside docker as that is more complicated.

@matilote would you be able to provide a docker image + instructions on how to use dottrace here?

@LukaszRozmej
Copy link
Member

LukaszRozmej commented Mar 31, 2022

@nicexe can you help us gather data here?

  1. Please use image nethermindeth/nethermind:test
  2. After running it on one or multiple misbehaving nodes please do:
    sudo docker exec -it xdaitestnet-validatorX /bin/bash
    mkdir dumps
    /opt/tools/dottrace attach PID --save-to=dumps
  3. After some time like a minute or two close dottrace with Ctrl+C, it will take the snapshot and save it to dumps folder
  4. Now you need to send us the dump, you can do it different ways. one potential is:
    apt update
    apt install curl
    apt install zip
    zip -9 -r perf.zip dump
    curl --upload-file ./perf.zip https://transfer.sh/perf.zip
    The last one would give you url with the file being available for download.

@MarekM25
Copy link
Contributor

MarekM25 commented Jul 8, 2022

@nicexe we found one potential issue with our ethstats integration. It could be connected with your problem. We will release the fix soon. The fix is merged to master

@LukaszRozmej
Copy link
Member

#4249

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants