"Unexpected trie node in disk" with path-based storage #27983
Comments
Just a note: the screenshot above is from a node that was still snap syncing (heal phase), where the state trie is not yet fully assembled. I just confirmed that once the sync completed, the error messages disappeared. Interestingly, Lighthouse should not be querying anything in checkpoint sync mode, but perhaps the root of the annoying logs is the restart of Lighthouse in normal mode before Geth finished syncing. We should try to repro that and see whether it makes some weird RPC calls, and why we're serving them instead of just ignoring the unservable calls (missing root node). |
I'm seeing this after upgrading geth, removing and rebuilding the DB. Lighthouse is the consensus client. CLI args: --syncmode snap --state.scheme path. geth finished syncing, but this error has remained for the past 96 hours.
|
What's the geth version you are using? What's the original geth version you used?
Did you upgrade geth in the middle of snap sync? Or did you delete the entire geth database and sync from scratch? |
Do you have the log during the sync? Please attach it if possible. |
I was on geth 1.10.23 which kept getting out of sync/requiring a restart. I purged the DB (rm -rf data-dir) before I started the new version geth 1.13.5. This error doesn't appear to be affecting the validators/beacon. I've attached the log output 10 minutes before the error appeared in the log until the first instance(s). |
Do you have the full log for sync?
Unfortunately, your state is corrupted because of an "unsuccessful" state sync. The node can still operate without printing any other errors as long as the corrupted state is not touched. The only way to fix it is to resync the node (of course, you can still keep the ancient data, which is correct). But I would appreciate it if you could run this branch [EDIT: it's merged into the master branch now] on top of your existing datadir, just to dump more information for debugging. |
Correction: I was on ... instance=Geth/v1.11.5-stable-a38f4108/linux-amd64/go1.20.2. I will share more logs from when I started rebuilding. Here it is -- I had a few mis-starts in the log, but each time I purged the DB |
Thanks. Your node took a very long time to finish the initial sync (roughly 40 hours). It might have hit some corner case that eventually resulted in the corrupted state. Would you mind running the latest master branch on top of your existing node to expose more information? |
|
Another catch is that there are tons of block truncations, which I believe shouldn't occur. Maybe it's not related to the issue itself, but we should still fix it.
|
I had a toml file, but after running into invalid config issues I removed it. After changing the config I wiped the DB. This is on an EC2 box, so maybe my EBS volume's IOPS aren't very high. |
But the weird thing is: log prints |
I will need to set up the build environment for the binary -- will likely be tomorrow before I can run it. |
Sure no hurry, thanks for it. |
Here it is
|
This is the rlpdump output of your corrupted trie node, with hash
This is the rlpdump output of the correct trie node, with hash
The wrong node contains an unexpected child at index 1. I will try to think about why this can happen; it is probably related to storage deletion. |
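The comparison described above can be sketched in Python. This is only an illustration, assuming both dumps are RLP-encoded 17-item branch nodes (the thread does not state the node type). The decoder is a minimal hand-written one rather than geth's rlp package, the function names are hypothetical, and the sample bytes are synthetic, not the nodes from this issue.

```python
def _take(data: bytes, start: int, n: int):
    """Return (data[start:start+n], remainder), failing on truncated input."""
    end = start + n
    if end > len(data):
        raise ValueError("truncated RLP input")
    return data[start:end], data[end:]


def _decode_list(payload: bytes):
    """Decode the concatenated items that make up a list payload."""
    items = []
    while payload:
        item, payload = _decode_item(payload)
        items.append(item)
    return items


def _decode_item(data: bytes):
    """Decode one RLP item and return (item, remaining bytes)."""
    if not data:
        raise ValueError("empty RLP input")
    b0 = data[0]
    if b0 < 0x80:                      # single byte, encoded as itself
        return data[:1], data[1:]
    if b0 < 0xB8:                      # string of 0..55 bytes
        return _take(data, 1, b0 - 0x80)
    if b0 < 0xC0:                      # long string, length-of-length prefix
        ll = b0 - 0xB7
        n = int.from_bytes(data[1:1 + ll], "big")
        return _take(data, 1 + ll, n)
    if b0 < 0xF8:                      # list with payload of 0..55 bytes
        payload, rest = _take(data, 1, b0 - 0xC0)
        return _decode_list(payload), rest
    ll = b0 - 0xF7                     # long list, length-of-length prefix
    n = int.from_bytes(data[1:1 + ll], "big")
    payload, rest = _take(data, 1 + ll, n)
    return _decode_list(payload), rest


def rlp_decode(data: bytes):
    """Decode exactly one RLP item into nested lists of bytes."""
    item, rest = _decode_item(data)
    if rest:
        raise ValueError("trailing bytes after RLP item")
    return item


def diff_branch_children(bad: bytes, good: bytes):
    """Return the child indices at which two branch nodes differ."""
    bad_node, good_node = rlp_decode(bad), rlp_decode(good)
    for node in (bad_node, good_node):
        if not isinstance(node, list) or len(node) != 17:
            raise ValueError("expected a 17-item branch node")
    return [i for i in range(17) if bad_node[i] != good_node[i]]


if __name__ == "__main__":
    # Synthetic data: an all-empty branch node versus the same node
    # with an extra 32-byte child reference in slot 1.
    good = bytes([0xD1]) + b"\x80" * 17
    bad = bytes([0xF1, 0x80, 0xA0]) + b"\x11" * 32 + b"\x80" * 15
    print(diff_branch_children(bad, good))
```

Running the two real node blobs through a check like this would list every slot that differs, which makes it easier to confirm that index 1 is the only discrepancy.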
So with this error -- is my only recourse rebuilding/syncing? |
You can try this approach as a hotfix
I have no idea whether this node is the only corrupted one, but it's definitely worth a try. I will investigate why it can happen in the meantime |
The corresponding account is |
Did you try my suggested approach by any chance? Also can you please try to load this database entry |
geth db put seems to have cleared that error when resuming the node -- many thanks! Here's the output of the put command
Here's the output of the get command
|
As I said, I am not sure whether it's the only corruption. Let's give it a try anyway, and please let me know once the snapshot is fully generated with log |
OK -- is the recurring sequence of "Generating state snapshot"/ "Aborting state snapshot" / "Resuming state snapshot" indicative of corruption?
|
It's expected: background snapshot generation is aborted as the chain progresses and is resumed later. |
I observed the |
Cool. I have also identified the issue that occurred in your case; thanks for your support with debugging. Regarding your node: since the state is now fully generated, you can keep using the current Go binary in production. |
I will close this ticket as the original issue is already resolved. We can open a new one if this kind of error shows up again, perhaps in other corner cases. |
I'm running geth version 1.13.0-unstable-5976e584-20230818 with state.scheme=path on a Mac and getting some "Unexpected trie node in disk" errors. Steps I used:
geth removedb and deleted the chaindata folder
geth state.scheme=path