Description
Hey team, we've been unable to complete snap sync on our mainnet (https://github.com/maticnetwork/bor). To be more specific, the node completes the state sync phase but then stays in a never-ending healing phase. We are aware that, in order to finish, the sync has to run faster than the chain produces blocks and changes state. The block time and block gas limit on our mainnet are 2s and 30M respectively. We also know that the process is I/O and network heavy, and we've allocated more than enough of both on these machines. We have performed the following experiments and checks at our end:
- We ran a node in snap sync on our mumbai testnet to make sure the issue has nothing to do with the other components of our PoS chain, specifically the consensus. It works well on the testnet.
- We tried scaling up some of the parameters involved in the sync mechanism, such as the pivot marker, the size (in bytes) of the data requested in snap sync trie node and storage requests, and the dynamic timeout, to observe how the behaviour changes. We did not see any significant change. We're also not sure which metrics would be appropriate to check while modifying these parameters. To be specific, modifying the pivot parameters didn't work at all: it stopped the healing phase and the node stopped syncing entirely.
- We conducted an experiment where we took a fully synced mainnet node, disconnected it from the outside world, and let a single fresh node sync from it using snap sync. The goal was to see whether the issue is caused by the state moving too fast. We saw a lot of peer connectivity issues, and after some point the snap sync node was no longer able to connect to the fully synced peer (maybe it figured out that the other peer was stale?).
We're currently trying to understand the mechanism and its internals through tests, since we think that just tweaking the parameters might not help and would make the process much longer. It would be great if you could suggest some important places to look at to dig further, or some experiments to conduct and ways to run them (for example, finding more internal details about the trie nodes and the rate at which they're being produced vs. downloaded).
Let us know if there's anything more we can share from our end. Thanks!
EDIT: The tag was automatically set to "docs". I'd put it under "help wanted".