Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BONSAI: Heal started directly during checkpoint sync (missing first step) #6173

Open
non-fungible-nelson opened this issue Nov 15, 2023 · 1 comment
Labels
bonsai bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) syncing

Comments

@non-fungible-nelson
Copy link
Contributor

non-fungible-nelson commented Nov 15, 2023

Description

As a user, I am encountering invalid slots in the bonsai tree during normal operation of a validating node. This triggers an auto-heal, but would like this to be avoidable in the future. This occurred on a rocketpool node but not using the rocketpool configs.

Regarding logs it seems that node restarted three times at the beginning. I'm not sure why these restarts of Besu occurred. this triggered a bug in the checkpoint sync, which then started directly in heal mode, corresponding to the last step. Since the first step is missing, the state is not completely valid

we need to find a way to avoid starting directly the heal if the steps before are not finished

DOCKER environment for 23.10.1.

Discord context

Acceptance Criteria

  • Bonsai PARTIAL and FULL do not encounter invalid slots during normal operation, and the DB does not get corrupted because checkpoint should never start the heal if the steps before are not finished yet

Logs (if a bug)

From the user -

sync-mode = "X_CHECKPOINT"
data-storage-format = "BONSAI"

Log during initial sync

2023-11-11 05:20:20.127+00:00 | BesuCommand-Shutdown-Hook | INFO  | EthProtocolManager | Stopping eth Subprotocol.
2023-11-11 05:20:20.128+00:00 | BesuCommand-Shutdown-Hook | INFO  | WorldDownloadState | World state download cancelled
2023-11-11 05:20:21.128+00:00 | BesuCommand-Shutdown-Hook | INFO  | EthProtocolManager | eth Subprotocol stopped.
2023-11-11 05:20:21.129+00:00 | BesuCommand-Shutdown-Hook | INFO  | NetworkRunner | Network stopped.
2023-11-11 05:20:21.130+00:00 | BesuCommand-Shutdown-Hook | INFO  | AutoTransactionLogBloomCachingService | Shutting down Auto transaction logs caching service.
2023-11-11 05:20:21.131+00:00 | BesuCommand-Shutdown-Hook | INFO  | NatService | No NAT environment detected so no service could be stopped
2023-11-11 05:20:26.723+00:00 | main | INFO  | Besu | Starting Besu
2023-11-11 05:20:26.968+00:00 | main | WARN  | Besu | --graphql-http-host has been ignored because --graphql-http-enabled was not defined on the command line.
2023-11-11 05:20:26.969+00:00 | main | WARN  | Besu | --rpc-ws-host has been ignored because --rpc-ws-enabled was not defined on the command line.
2023-11-11 05:20:27.026+00:00 | main | INFO  | Besu |
....
2023-11-11 05:20:27.027+00:00 | main | INFO  | Besu | Connecting to 0 static nodes.
2023-11-11 05:20:27.027+00:00 | main | INFO  | Besu | Security Module: localfile
2023-11-11 05:20:27.052+00:00 | main | INFO  | Besu | Using the native implementation of alt bn128
2023-11-11 05:20:27.073+00:00 | main | INFO  | Besu | Using the native implementation of modexp
2023-11-11 05:20:27.073+00:00 | main | INFO  | Besu | Using the native implementation of the signature algorithm
2023-11-11 05:20:27.075+00:00 | main | INFO  | Besu | Using the native implementation of the blake2bf algorithm
2023-11-11 05:20:27.165+00:00 | main | INFO  | KeyPairUtil | Loaded public key 0xe7d0ff6a448b0c072f955450e1bf04b41a1a425e0b020e6b7385b9bcf48541c3c55fc005faed00348d348ef871906706310006b0d79a09857457b84d6bffa20c from /var/lib/besu/data/key
2023-11-11 05:20:27.242+00:00 | main | INFO  | ProtocolScheduleBuilder | Protocol schedule created with milestones: [Frontier:0, Homestead:1150000, DaoRecoveryInit:1920000, DaoRecoveryTransition:1920001, Homestead:1920010, TangerineWhistle:2463000, SpuriousDragon:2675000, Byzantium:4370000, Petersburg:7280000, Istanbul:9069000, MuirGlacier:9200000, Berlin:12244000, London:12965000, ArrowGlacier:13773000, GrayGlacier:15050000, Shanghai:1681338455]
2023-11-11 05:20:27.248+00:00 | main | INFO  | ProtocolScheduleBuilder | Protocol schedule created with milestones: [Paris:0, Paris:1150000, Paris:1920000, Paris:1920001, Paris:1920010, Paris:2463000, Paris:2675000, Paris:4370000, Paris:7280000, Paris:9069000, Paris:9200000, Paris:12244000, Paris:12965000, Paris:13773000, Paris:15050000, Shanghai:1681338455]
2023-11-11 05:20:27.451+00:00 | main | INFO  | DatabaseMetadata | Lookup database metadata file in data directory: /var/lib/besu/data
2023-11-11 05:20:27.479+00:00 | main | INFO  | RocksDBKeyValueStorageFactory | Existing database detected at /var/lib/besu/data. Version 2. Compacting database...
2023-11-11 05:20:28.264+00:00 | main | INFO  | BonsaiWorldStateKeyValueStorage | Bonsai flat db mode found PARTIAL
2023-11-11 05:20:28.309+00:00 | main | INFO  | TransactionPoolFactory | Transaction pool disabled while initial sync in progress
2023-11-11 05:20:28.310+00:00 | main | INFO  | EthPeers | Updating the default best peer comparator
2023-11-11 05:20:28.316+00:00 | main | INFO  | BesuControllerBuilder | TTD difficulty is present, creating initial sync for PoS
2023-11-11 05:20:28.329+00:00 | main | INFO  | CheckpointDownloaderFactory | Checkpoint sync start with block 11052984 and hash 0x44bca881b07a6a09f83b130798072441705d9a665c5ac8bdf2f39a3cdf3bee29
2023-11-11 05:20:28.334+00:00 | main | INFO  | TransitionBesuControllerBuilder | TTD present, creating DefaultSynchronizer that stops propagating after finalization
2023-11-11 05:20:28.345+00:00 | main | INFO  | RunnerBuilder | Detecting NAT service.
2023-11-11 05:20:28.463+00:00 | main | INFO  | Runner | Starting external services ...
2023-11-11 05:20:28.463+00:00 | main | INFO  | MetricsHttpService | Starting metrics http service on 0.0.0.0:9545
2023-11-11 05:20:28.502+00:00 | vert.x-eventloop-thread-1 | INFO  | MetricsHttpService | Metrics service started and listening on 0.0.0.0:9545
2023-11-11 05:20:28.502+00:00 | main | INFO  | JsonRpcHttpService | Starting JSON-RPC service on 0.0.0.0:8545
2023-11-11 05:20:28.508+00:00 | vert.x-eventloop-thread-1 | INFO  | JsonRpcHttpService | JSON-RPC service started and listening on 0.0.0.0:8545
2023-11-11 05:20:28.508+00:00 | main | INFO  | EngineJsonRpcService | Starting JSON-RPC service on 0.0.0.0:8551
2023-11-11 05:20:28.510+00:00 | vert.x-eventloop-thread-1 | INFO  | EngineJsonRpcService | JSON-RPC service started and listening on 0.0.0.0:8551
2023-11-11 05:20:28.512+00:00 | main | INFO  | AutoTransactionLogBloomCachingService | Starting auto transaction log bloom caching service.
2023-11-11 05:20:28.512+00:00 | main | INFO  | LogBloomCacheMetadata | Lookup cache metadata file in data directory: /var/lib/besu/data/caches
2023-11-11 05:20:28.516+00:00 | main | INFO  | Runner | Starting Ethereum main loop ...
2023-11-11 05:20:28.516+00:00 | main | INFO  | NatService | No NAT environment detected so no service could be started
2023-11-11 05:20:28.516+00:00 | main | INFO  | NetworkRunner | Starting Network.
2023-11-11 05:20:28.517+00:00 | main | INFO  | DefaultP2PNetwork | Starting DNS discovery with URL enrtree://AKA3AM6LPBYEUDMVNU3BSVQJ5AD45Y7YPOHJLEF6W26QOE4VTUDPE@all.mainnet.ethdisco.net
2023-11-11 05:20:28.536+00:00 | nioEventLoopGroup-2-1 | INFO  | RlpxAgent | P2P RLPx agent started and listening on /0.0.0.0:50200.
2023-11-11 05:20:28.537+00:00 | main | INFO  | PeerDiscoveryAgent | Starting peer discovery agent on host=0.0.0.0, port=50200
2023-11-11 05:20:28.542+00:00 | vert.x-eventloop-thread-1 | INFO  | VertxPeerDiscoveryAgent | Started peer discovery agent successfully, on effective host=0.0.0.0 and port=50200
2023-11-11 05:20:28.542+00:00 | vert.x-eventloop-thread-1 | INFO  | PeerDiscoveryAgent | P2P peer discovery agent started and listening on /0.0.0.0:50200
2023-11-11 05:20:28.557+00:00 | vert.x-eventloop-thread-1 | INFO  | PeerDiscoveryAgent | Writing node record to disk. NodeRecord{seq=3, publicKey=0x02e7d0ff6a448b0c072f955450e1bf04b41a1a425e0b020e6b7385b9bcf48541c3, udpAddress=Optional[/127.0.0.1:50200], tcpAddress=Optional[/127.0.0.1:50200], asBase64=-Jq4QCWil-QSSWYyEYDgjuRFciH2XSVabYYubB6b_bO4R0lHBj2kYt_FVhTQrXqgnotvKeWkBJId-vruwlqTE5Vpf_ADg2V0aMrJhOAp6ZGDutQggmlkgnY0gmlwhH8AAAGJc2VjcDI1NmsxoQLn0P9qRIsMBy-VVFDhvwS0GhpCXgsCDmtzhbm89IVBw4N0Y3CCxBiDdWRwgsQY, nodeId=0x2fa58bfe37839dd68b1bfac09be074107888e5c21c2651c6a956fdf0a22b9c78, customFields={tcp=50200, udp=50200, ip=0x7f000001, eth=[[0xe029e991, 0xbad420]], id=V4, secp256k1=0x02e7d0ff6a448b0c072f955450e1bf04b41a1a425e0b020e6b7385b9bcf48541c3}}
2023-11-11 05:20:28.575+00:00 | main | INFO  | DefaultP2PNetwork | Enode URL enode://e7d0ff6a448b0c072f955450e1bf04b41a1a425e0b020e6b7385b9bcf48541c3c55fc005faed00348d348ef871906706310006b0d79a09857457b84d6bffa20c@127.0.0.1:50200
2023-11-11 05:20:28.576+00:00 | main | INFO  | DefaultP2PNetwork | Node address 0x9be074107888e5c21c2651c6a956fdf0a22b9c78
2023-11-11 05:20:28.582+00:00 | main | INFO  | NetworkRunner | Supported capabilities: [eth/63, eth/64, eth/65, eth/66, eth/67, eth/68], [snap/1]
2023-11-11 05:20:28.582+00:00 | main | INFO  | DefaultSynchronizer | Starting synchronizer.
2023-11-11 05:20:28.583+00:00 | main | INFO  | FastSyncDownloader | Starting sync
2023-11-11 05:20:28.587+00:00 | main | INFO  | SnapWorldStateDownloader | Downloading world state from peers for pivot block 18546455 (0xb016d0574313c56cc7bc97854adc43d32ffbbad1859c0c0d68a0ec8dd8f859eb). State root 0xc8ac4510acf1c00335dc4d06ef330652d81da6769b238cef8b0eb52d770e91bc pending requests 0
2023-11-11 05:20:28.699+00:00 | main | INFO  | FastSyncTargetManager | Unable to find sync target. Currently checking 0 peers for usefulness.
2023-11-11 05:20:28.703+00:00 | main | INFO  | Runner | Ethereum main loop is up.
2023-11-11 05:20:39.120+00:00 | vert.x-worker-thread-0 | INFO  | BackwardSyncContext | Starting a new backward sync session
2023-11-11 05:22:26.996+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 958 world state trie nodes, Peer count: 2
2023-11-11 05:22:28.706+00:00 | nioEventLoopGroup-3-6 | INFO  | FastSyncTargetManager | Unable to find sync target. Currently checking 3 peers for usefulness.
2023-11-11 05:23:26.995+00:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO  | SnapsyncMetricsManager | Healed 124504 world state trie nodes, Peer count: 3

Had about 30 of these errors over about 3 hours:

2023-11-15 04:29:01.400+00:00 | vert.x-worker-thread-0 | WARN  | BonsaiWorldStateProvider | Invalid slot found for account 0xddac6d3a2a787b1f4bf26ab6faf519ae3f1a94cf at location 0x
2023-11-15 12:40:12.050+00:00 | ForkJoinPool.commonPool-worker-13 | WARN  | BonsaiWorldStateProvider | Invalid slot found for account 0x0c5853bfe1dfaafb2142b2b9fa68e00c923654f7 at location 0x
2023-11-15 12:46:37.308+00:00 | nioEventLoopGroup-3-3 | WARN  | BonsaiWorldStateProvider | Invalid slot found for account 0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2 at location 0x

Everything was fine keeping up with the chain and looked to be synced for about five hours. Then had another error with the world state.

And have been down now for about an hour while the synchronizer is downloading.

Any idea how I can validate that I wont continue with this issue?

zfs pools dont show any disk errors.

# Your node will fetch the correct data from peers to repair the problem.             
# Starting the sync pipeline...                                                                    
2023-11-15 19:46:12.398+00:00 | vert.x-worker-thread-0 | INFO  | TransactionPoolFactory | Disabling transaction handling during re-sync
2023-11-15 19:46:12.399+00:00 | vert.x-worker-thread-0 | WARN  | BonsaiWorldStateProvider | Invalid slot found for account 0x966fa7acf1b6c732458e4d3264fd2393aec840ba at location 0x
2023-11-15 19:46:12.399+00:00 | vert.x-worker-thread-0 | INFO  | BonsaiWorldStateKeyValueStorage | Bonsai flat db mode found PARTIAL
2023-11-15 19:46:12.563+00:00 | vert.x-worker-thread-0 | INFO  | DefaultSynchronizer | Starting synchronizer.
2023-11-15 19:46:12.563+00:00 | vert.x-worker-thread-0 | INFO  | FastSyncDownloader | Starting sync
2023-11-15 19:46:12.814+00:00 | nioEventLoopGroup-3-1 | INFO  | SnapWorldStateDownloader | Downloading world state from peers for pivot block 18577783 (0x246d9f028b787aa2b90c6b22322a3a031bfab9211af0579d42b9c251742149ee). State root 0x92a7221f3912bb329c363afe24ecc02883a14195eed91035efcf2a6742ae0bc6 pending requests 0
2023-11-15 19:46:13.495+00:00 | nioEventLoopGroup-3-1 | INFO  | SnapWorldDownloadState | Running world state heal process from peers with pivot block 18577783
2023-11-15 19:46:14.701+00:00 | vert.x-worker-thread-0 | INFO  | BackwardSyncContext | Starting a new backward sync session

2023-11-15 19:46:18.121+00:00 | nioEventLoopGroup-3-1 | WARN  | DefaultBlockchain | Chain Reorganization +0 new / -1675 old
       Old - hash: 0xd195107e81b11ca67746207a1fb9becfdea256c5c28c3802485ab4e359de96e1, height: 18579458
       New - hash: 0x246d9f028b787aa2b90c6b22322a3a031bfab9211af0579d42b9c251742149ee, height: 18577783
  Ancestor - hash: 0x246d9f028b787aa2b90c6b22322a3a031bfab9211af0579d42b9c251742149ee, height: 18577783

Versions (Add all that apply)

  • Software version: 23.10.1
  • Docker ENV
@matkt matkt changed the title BONSAI: Invalid Slot Found for account at 0x BONSAI: Heal started directly during checkpoint sync (missing first step) Nov 16, 2023
@non-fungible-nelson
Copy link
Contributor Author

During the review of the user's logs, it was found that a dirty restart confused the checkpoint sync to start its heal process before the worldstate download was completed. This caused a few differing issues.

@non-fungible-nelson non-fungible-nelson added P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) TeamChupa GH issues worked on by Chupacabara Team labels Nov 20, 2023
@jflo jflo removed the TeamChupa GH issues worked on by Chupacabara Team label Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bonsai bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) syncing
Projects
None yet
Development

No branches or pull requests

2 participants