
"Unexpected trie node in disk" with path-based storage #27983

Closed
JustinDrake opened this issue Aug 23, 2023 · 27 comments

@JustinDrake
Contributor

JustinDrake commented Aug 23, 2023

I'm running geth version 1.13.0-unstable-5976e584-20230818 with state.scheme=path on a Mac and getting some "Unexpected trie node in disk" errors. Steps I used:

  1. started a fresh checkpoint sync with Lighthouse v4.3.0
  2. ran geth removedb and deleted the chaindata folder
  3. ran geth with --state.scheme=path
  4. shut down the Lighthouse instance that was running with checkpoint sync flags
  5. restarted Lighthouse without checkpoint sync flags

[Screenshot: photo_2023-08-23_10-39-32]

JustinDrake changed the title from "Unexpected trie node in disk" to "Unexpected trie node in disk" with path-based storage on Aug 23, 2023
@karalabe
Member

Just a note: the above screenshot shows a node whilst it is still snap syncing (heal phase), where the state trie is not yet fully assembled. Justin confirmed that once sync completed, the error messages disappeared.

Interestingly, Lighthouse should not be querying anything in checkpoint sync mode, but perhaps the root of the annoying logs is the restart of Lighthouse in normal mode before Geth finished syncing. We should try to repro that and see if it makes some weird RPC calls, and why we're serving them instead of just ignoring the unservable calls (missing root node).

@molaughlen

molaughlen commented Nov 20, 2023

I'm seeing this after upgrading geth, removing and rebuilding the DB. Lighthouse is the consensus client.

CLI args: --syncmode snap --state.scheme path

geth finished syncing, but this error has remained for the past 96 hours.

   slots=659,803,107             storage=50.89GiB   dangling=0 elapsed=74h34m12.257s eta=131h14m44.871s
Nov 20 12:21:00 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:00.346] Imported new potential chain segment     number=18,612,963 hash=296aab..f8682a blocks=1  txs=143   mgas=11.844 elapsed=264.770ms     mgasps=44.731  snapdiffs=320.09KiB triediffs=183.38MiB triedirty=56.53MiB
Nov 20 12:21:00 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:00.401] Unexpected trie node in disk             owner=5cc0a4..667982 path="[12 5 9 3 7]"             expect=8b09b1..e87152 got=99f9a0..b9f78f
Nov 20 12:21:00 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:00.401] State snapshotter failed to iterate trie err=nil
Nov 20 12:21:00 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:00.948] Chain head was updated                   number=18,612,963 hash=296aab..f8682a root=6b5095..844674 elapsed=19.059446ms
Nov 20 12:21:13 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:13.817] Resuming state snapshot generation       root=88a2ed..86827b in=5cc0a4..667982 at=c528e6..3f228f accounts=81,827,943           slots=659,806,109             storage=50.89GiB   dangling=0 elapsed=74h34m25.752s eta=131h15m8.623s
Nov 20 12:21:13 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:13.864] Imported new potential chain segment     number=18,612,964 hash=16504b..a6a093 blocks=1  txs=256   mgas=20.011 elapsed=761.573ms     mgasps=26.276  snapdiffs=324.66KiB triediffs=183.94MiB triedirty=56.80MiB
Nov 20 12:21:13 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:13.992] Unexpected trie node in disk             owner=5cc0a4..667982 path="[12 5 9 3 7]"             expect=8b09b1..e87152 got=99f9a0..b9f78f
Nov 20 12:21:13 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:13.992] State snapshotter failed to iterate trie err=nil
Nov 20 12:21:14 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:14.244] Chain head was updated                   number=18,612,964 hash=16504b..a6a093 root=35526f..7fe483 elapsed=24.553583ms
Nov 20 12:21:24 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:24.636] Resuming state snapshot generation       root=6cbb62..60d860 in=5cc0a4..667982 at=c528e6..3f228f accounts=81,827,943           slots=659,809,111             storage=50.89GiB   dangling=0 elapsed=74h34m36.571s eta=131h15m27.664s
Nov 20 12:21:24 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:24.665] Imported new potential chain segment     number=18,612,965 hash=833fba..22f170 blocks=1  txs=125   mgas=8.195  elapsed=173.435ms     mgasps=47.249  snapdiffs=335.19KiB triediffs=183.77MiB triedirty=57.10MiB
Nov 20 12:21:24 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:24.728] Unexpected trie node in disk             owner=5cc0a4..667982 path="[12 5 9 3 7]"             expect=8b09b1..e87152 got=99f9a0..b9f78f
Nov 20 12:21:24 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-20|12:21:24.728] State snapshotter failed to iterate trie err=nil
Nov 20 12:21:24 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:24.930] Chain head was updated                   number=18,612,965 hash=833fba..22f170 root=b0587e..adf841 elapsed=4.411775ms
Nov 20 12:21:37 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-20|12:21:37.758] Resuming state snapshot generation       root=771977..09f7db in=5cc0a4..667982 at=c528e6..3f228f accounts=81,827,943   

@rjl493456442
Member

rjl493456442 commented Nov 20, 2023

What's the geth version you are using? What's the original geth version you used?

I'm seeing this after upgrading geth, removing and rebuilding the DB

Did you upgrade geth in the middle of snap sync, or did you delete the entire geth database and sync from scratch?

@rjl493456442
Member

rjl493456442 commented Nov 20, 2023

Do you have the log during the sync? Please attach it if possible.

@molaughlen

molaughlen commented Nov 20, 2023

I was on geth 1.10.23, which kept getting out of sync / requiring a restart. I purged the DB (rm -rf data-dir) before I started the new version, geth 1.13.5. This error doesn't appear to be affecting the validators/beacon.

I've attached the log output 10 minutes before the error appeared in the log until the first instance(s).

geth-unexpected-trie-2023-11-17.txt

@rjl493456442
Member

rjl493456442 commented Nov 21, 2023

Do you have the full log for sync?

This error doesn't appear to be affecting the validators/beacon.

Unfortunately your state is corrupted because of an "unsuccessful" state sync. The node can still operate without printing any other error because the corrupted state hasn't been touched yet. The only way to fix it is to resync the node (of course, you can still keep the ancient store, which is correct).

But I would appreciate it if you could run this branch [EDIT: it's merged into the master branch now] on top of your existing datadir, just to dump more information for debugging.

@molaughlen

molaughlen commented Nov 21, 2023

Correction: I was on ... instance=Geth/v1.11.5-stable-a38f4108/linux-amd64/go1.20.2

I will share more logs from when I started rebuilding.

Here it is -- I had a few false starts in the log, but each time I purged the DB.

sync-log.zip

@rjl493456442
Member

Thanks. Your node took a very long time to finish the initial sync (roughly 40 hours). It might have hit some corner case and eventually ended up with the corrupted state.

Would you mind running the latest master branch on top of your existing node to expose more information?

@rjl493456442
Member

Nov 15 17:47:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|17:47:33.376] Writing custom genesis block

This line is pretty weird: a customized genesis is written instead of the default mainnet block? I guess you are using some config file with a genesis specified?

@rjl493456442
Member

Another catch is that there are tons of block truncations, which I believe shouldn't occur. It may not be relevant to the issue itself, but we should still fix it.

Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: WARN [11-15|18:21:30.430] Rewinding blockchain to block            target=3,223,551
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local header          number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local block           number=0          hash=d4e567..cb8fa3 td=17,179,869,184              age=54y7mo3w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local snap block      number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded last snap-sync pivot marker       number=18,578,974
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset txpool state             err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset blobpool state           err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"

@molaughlen

I had a TOML config file, but after running into invalid-config issues I removed it. After changing the config I wiped the DB.

This is on an EC2 box -- so maybe my EBS volume's IOPS aren't very high.

@rjl493456442
Member

But the weird thing is: the log prints Writing custom genesis block and then starts the sync. They are in the same cycle, I think.

@molaughlen

I will need to set up the build environment for the binary -- will likely be tomorrow before I can run it.

@rjl493456442
Member

Sure, no hurry. Thanks for it.

@molaughlen

Here it is

Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.850] Initialized transaction indexer          limit=2,350,000
Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.850] Loaded local transaction journal         transactions=0 dropped=0
Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.851] Regenerated local transaction journal    transactions=0 accounts=0
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: WARN [11-22|12:33:20.370] Switch sync mode from snap sync to full sync reason="snap sync complete"
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:20.370] Chain post-merge, sync via beacon client
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:20.370] Gasprice oracle is ignoring threshold set threshold=2
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: ERROR[11-22|12:33:20.389] Unexpected trie node in disk             owner=5cc0a4..667982 path="[12 5 9 3 7]" expect=8b09b1..e87152 got=99f9a0..b9f78f
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: ERROR[11-22|12:33:20.389] State snapshotter failed to iterate trie err="missing trie node 8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152 (owner 5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982) (path 0c05090307) unexpected node, loc: disk, node: (5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982 [12 5 9 3 7]), 8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152!=99f9a0c9f954cd0d8cf5bb7df9c2b5e529a1652fcc97824ee446ba9300b9f78f, blob: 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080"

@rjl493456442
Member

rjl493456442 commented Nov 22, 2023

This is the rlpdump output of your corrupted trie node, with hash 99f9a0..b9f78f

(base) ➜  ~ rlpdump -hex 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
[
  "",
  df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9,
  "",
  "",
  "",
  "",
  "",
  "",
  b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae,
  "",
  "",
  "",
  2ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e7,
  "",
  "",
  "",
  "",
]

This is the rlpdump output of the correct trie node, with hash 8b09b1..e87152

(base) ➜  ~ rlpdump -hex 0xf8518080808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
[
  "",
  "",
  "",
  "",
  "",
  "",
  "",
  "",
  b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae,
  "",
  "",
  "",
  2ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e7,
  "",
  "",
  "",
  "",
]

The wrong node contains an unexpected child at index 1. I will try to think about why it can happen; it's probably related to storage deletion.
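
For anyone who wants to reproduce this comparison programmatically, here is a minimal Go sketch (assuming the go-ethereum rlp and crypto packages) that decodes the two blobs from the logs above and diffs their 17 branch slots; it just illustrates the same analysis as the rlpdump output, it is not part of geth.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/crypto"
	"github.com/ethereum/go-ethereum/rlp"
)

func main() {
	// Branch node found on disk (hashes to 99f9a0..b9f78f), copied from the error log above.
	got := common.FromHex("0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080")
	// Expected branch node (hashes to 8b09b1..e87152).
	want := common.FromHex("0xf8518080808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080")

	var gotSlots, wantSlots [][]byte
	if err := rlp.DecodeBytes(got, &gotSlots); err != nil {
		log.Fatal(err)
	}
	if err := rlp.DecodeBytes(want, &wantSlots); err != nil {
		log.Fatal(err)
	}
	// A branch node decodes into 17 items: 16 child slots plus the value slot.
	for i := range gotSlots {
		if !bytes.Equal(gotSlots[i], wantSlots[i]) {
			fmt.Printf("slot %d differs: got %x, want %x\n", i, gotSlots[i], wantSlots[i])
		}
	}
	fmt.Println("on-disk node hash: ", crypto.Keccak256Hash(got))
	fmt.Println("expected node hash:", crypto.Keccak256Hash(want))
}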

@molaughlen

So with this error -- is my only recourse rebuilding/syncing?

@rjl493456442
Member

You can try this approach as a hotfix:

  • Stop your geth properly
  • Manually fix the incorrect data entry with this command:

geth db put 0x4f5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd71410386679820c05090307 0xf8518080808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
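
For context, the key in that command looks like the path-scheme layout for storage trie nodes as I understand it: a one-byte 'O' (0x4f) prefix, the 32-byte storage trie owner (account hash) from the error log, and the raw nibble path [12 5 9 3 7]. A small Go sketch reproducing the key under that assumption:

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
)

func main() {
	// Assumed layout: 'O' prefix + owner (account hash) + one byte per path nibble.
	owner := common.HexToHash("0x5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982")
	path := []byte{12, 5, 9, 3, 7} // path="[12 5 9 3 7]" from the error log

	key := append([]byte("O"), owner.Bytes()...)
	key = append(key, path...)

	// Prints 0x4f5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd71410386679820c05090307,
	// i.e. the key used in the geth db put command above.
	fmt.Printf("0x%x\n", key)
}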

I have no idea whether this node is the only corrupted one or not, but it's definitely worth a try.

I will investigate why it can happen in the meantime.

@rjl493456442
Member

The corresponding account is 0x32400084C286CF3E17e7B677ea9583e60a000324, a huge and busy contract. My hunch is that we have some corner case in sync-cycle restart, e.g. unclean slot persistence.
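
If anyone wants to cross-check that mapping: the storage trie owner in the error log is the keccak256 hash of the account address, so hashing the address should reproduce 0x5cc0a4..667982 if the identification above is right. A quick sketch with go-ethereum's crypto package:

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/crypto"
)

func main() {
	// The storage trie "owner" is the keccak256 hash of the account address.
	addr := common.HexToAddress("0x32400084C286CF3E17e7B677ea9583e60a000324")
	fmt.Println(crypto.Keccak256Hash(addr.Bytes()))
	// Should print 0x5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982
	// if the account identification above is right.
}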

@rjl493456442
Member

rjl493456442 commented Nov 23, 2023

Did you try my suggested approach by any chance?

Also, can you please try to load this database entry: geth db get 0x4f5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd71410386679820c0509030701?

@molaughlen

molaughlen commented Nov 23, 2023

Did you try my suggested approach by any chance?

Also can you please try to load this database entry geth db get 0x4f5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd71410386679820c0509030701?

geth db put seems to have cleared that error when resuming the node -- many thanks!

Here's the output of the put command

Previous value: 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080

Here's the output of the get command

key 0x4f5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd71410386679820c0509030701: 0xf8419e20a8ccaf952498df75fd7dfb93763a000cc976d1512a01bc3afbbbe2ba92a1a06a62a088d03375c29f8c41b3cd5d4e350f25031c4a712bd7ec2f6555b3365cc5

@rjl493456442
Member

As I said, I am not sure if it's the only corruption. Let's have a try anyway, and please let me know once the snapshot is fully generated with the log Generated state snapshot.

@molaughlen

OK -- is the recurring sequence of "Generating state snapshot"/ "Aborting state snapshot" / "Resuming state snapshot" indicative of corruption?

Nov 23 14:30:34 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:34.909] Generating state snapshot                root=b9f34b..c87adc in=8d54b0..a795f8 at=93137d..c11226 accounts=124,774,993 slots=907,006,756 storage=70.72GiB dangling=0 elapsed=1h34m35.609s eta=3h53m12.349s
Nov 23 14:30:38 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:38.827] Aborting state snapshot generation       root=b9f34b..c87adc in=8d5d01..94def6 at=8bca15..30b889 accounts=124,803,826 slots=907,134,814 storage=70.73GiB dangling=0 elapsed=1h34m39.526s eta=3h53m8.268s
Nov 23 14:30:38 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:38.836] Resuming state snapshot generation       root=8b06b9..02da89 in=8d5d01..94def6 at=8bca15..30b889 accounts=124,803,826 slots=907,134,814 storage=70.73GiB dangling=0 elapsed=1h34m39.536s eta=3h53m8.293s
Nov 23 14:30:38 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:38.889] Imported new potential chain segment     number=18,635,039 hash=bf742a..df94a1 blocks=1 txs=152 mgas=12.072 elapsed=746.734ms    mgasps=16.166 snapdiffs=314.50KiB triediffs=163.75MiB triedirty=58.83MiB
Nov 23 14:30:39 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:39.517] Chain head was updated                   number=18,635,039 hash=bf742a..df94a1 root=ba9646..0d9ab3 elapsed=9.859746ms
Nov 23 14:30:46 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:46.859] Generating state snapshot                root=8b06b9..02da89 in=8d6888..9b3084 at=7d11a2..0ebe19 accounts=124,843,310 slots=907,478,931 storage=70.75GiB dangling=0 elapsed=1h34m47.559s eta=3h53m9.019s
Nov 23 14:30:49 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:49.888] Aborting state snapshot generation       root=8b06b9..02da89 at=8d6996..7dd368 accounts=124,846,925 slots=907,682,184 storage=70.77GiB dangling=0 elapsed=1h34m50.588s eta=3h53m14.731s
Nov 23 14:30:49 ip-10-0-0-11.ec2.internal geth[15060]: INFO [11-23|14:30:49.906] Resuming state snapshot generation       root=6597ad..a8e0e2 at=8d6996..7dd368 accounts=124,846,925 slots=907,682,184 storage=70.77GiB dangling=0 elapsed=1h34m50.605s eta=3h53m14.773s

@rjl493456442
Member

It's expected; background snapshot generation is aborted when the chain progresses and resumed later.

@molaughlen

I observed the Generated state snapshot entry, and the Resume/Abort log spew is no longer occurring.

@rjl493456442
Member

Cool. Also, I have identified the issue that occurred in your case; thanks for your support in debugging.

Regarding your node: now that the state is fully generated, you can keep using the current binary for production usage.

@rjl493456442
Member

I will close this ticket as the original issue is already resolved. We can open a new one if this kind of error shows up again, perhaps in other corner cases.
