Unexpected trie node error occurs after initial snap sync #28587
System information
Geth version: 1.13.5
Issue description
Ref: original ticket #27983 (comment)
Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.850] Initialized transaction indexer limit=2,350,000
Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.850] Loaded local transaction journal transactions=0 dropped=0
Nov 22 12:33:19 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:19.851] Regenerated local transaction journal transactions=0 accounts=0
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: WARN [11-22|12:33:20.370] Switch sync mode from snap sync to full sync reason="snap sync complete"
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:20.370] Chain post-merge, sync via beacon client
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: INFO [11-22|12:33:20.370] Gasprice oracle is ignoring threshold set threshold=2
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: ERROR[11-22|12:33:20.389] Unexpected trie node in disk owner=5cc0a4..667982 path="[12 5 9 3 7]" expect=8b09b1..e87152 got=99f9a0..b9f78f
Nov 22 12:33:20 ip-10-0-0-11.ec2.internal geth[30414]: ERROR[11-22|12:33:20.389] State snapshotter failed to iterate trie err="missing trie node 8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152 (owner 5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982) (path 0c05090307) unexpected node, loc: disk, node: (5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982 [12 5 9 3 7]), 8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152!=99f9a0c9f954cd0d8cf5bb7df9c2b5e529a1652fcc97824ee446ba9300b9f78f, blob: 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080"
The node is reported as invalid, with:
- owner: 5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982
- address: 0x32400084C286CF3E17e7B677ea9583e60a000324
- path: [12 5 9 3 7]
- content: 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
- exphash: 8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152
- gothash: 99f9a0c9f954cd0d8cf5bb7df9c2b5e529a1652fcc97824ee446ba9300b9f78f
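When matching these fields against the log lines, note that the same values appear in two forms: the path is printed both as a nibble list (`[12 5 9 3 7]`) and as a hex string (`0c05090307`), and the hashes are shortened in the first ERROR line. A minimal Python sketch of that correspondence follows. The helper names are hypothetical, and the abbreviation rule (first and last 6 hex characters) is inferred from the log lines above, not taken from Geth's source.

```python
def nibbles_to_hex(nibbles):
    """Render a nibble path with one byte per nibble, e.g. [12, 5] -> '0c05'."""
    if any(not 0 <= n <= 15 for n in nibbles):
        raise ValueError("each nibble must be in the range 0..15")
    return bytes(nibbles).hex()


def abbreviate(hash_hex, keep=6):
    """Shorten a hex hash to its first and last `keep` characters.

    Assumption: this mirrors the short form seen in the ERROR line above.
    """
    return f"{hash_hex[:keep]}..{hash_hex[-keep:]}"


# Values copied from the field list above.
PATH = [12, 5, 9, 3, 7]
OWNER = "5cc0a47442e6bc69eb1ec9e2ff1fe0c9657c26dfa5836f560fd7141038667982"
EXPECTED = "8b09b17b3a4e17de5274c52cc6387cf42c1fb25fd97effda757bb9a2cde87152"
GOT = "99f9a0c9f954cd0d8cf5bb7df9c2b5e529a1652fcc97824ee446ba9300b9f78f"

print("path    :", nibbles_to_hex(PATH))
print("owner   :", abbreviate(OWNER))
print("expect  :", abbreviate(EXPECTED))
print("got     :", abbreviate(GOT))
```

The printed values can be compared directly with `owner=`, `path`, `expect=` and `got=` in the two ERROR lines.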
After retrieving the correct node from our benchmark machine, I ran rlpdump on both nodes:
correct node
(base) ➜ ~ rlpdump -hex 0xf8518080808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
[
"",
"",
"",
"",
"",
"",
"",
"",
b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae,
"",
"",
"",
2ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e7,
"",
"",
"",
"",
]
corrupted node
(base) ➜ ~ rlpdump -hex 0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080
[
"",
df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9,
"",
"",
"",
"",
"",
"",
b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae,
"",
"",
"",
2ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e7,
"",
"",
"",
"",
]
The corrupted node has one extra child, at index 1.
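The same comparison can be done programmatically. Below is a minimal sketch in Python using a small hand-written RLP decoder (standard library only). The names `rlp_decode` and `diff_children` are illustrative and are not part of Geth or rlpdump. It decodes both blobs as 17-item branch nodes and lists the child positions that differ.

```python
def _decode_at(data, pos):
    """Decode the RLP item starting at `pos`; return (item, next_pos)."""
    prefix = data[pos]
    if prefix < 0x80:                       # single byte, encoded as itself
        return data[pos:pos + 1], pos + 1
    if prefix < 0xC0:                       # byte string
        if prefix < 0xB8:                   # short string: length is in the prefix
            start, length = pos + 1, prefix - 0x80
        else:                               # long string: length bytes follow
            start = pos + 1 + (prefix - 0xB7)
            length = int.from_bytes(data[pos + 1:start], "big")
        end = start + length
        if end > len(data):
            raise ValueError("string runs past end of input")
        return data[start:end], end
    if prefix < 0xF8:                       # short list: payload length in prefix
        start, length = pos + 1, prefix - 0xC0
    else:                                   # long list: length bytes follow
        start = pos + 1 + (prefix - 0xF7)
        length = int.from_bytes(data[pos + 1:start], "big")
    end = start + length
    if end > len(data):
        raise ValueError("list runs past end of input")
    items, cur = [], start
    while cur < end:
        item, cur = _decode_at(data, cur)
        items.append(item)
    if cur != end:
        raise ValueError("list payload length mismatch")
    return items, end


def rlp_decode(blob_hex):
    """Decode a hex string (optional 0x prefix) holding exactly one RLP item."""
    if blob_hex.startswith("0x"):
        blob_hex = blob_hex[2:]
    data = bytes.fromhex(blob_hex)
    item, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after RLP item")
    return item


def diff_children(good, bad):
    """Return (index, good_child_hex, bad_child_hex) for each differing slot."""
    if len(good) != 17 or len(bad) != 17:
        raise ValueError("expected two 17-item branch nodes")
    return [(i, g.hex(), b.hex()) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


# Blobs copied from the two rlpdump invocations above.
CORRECT = "0xf8518080808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080"
CORRUPTED = "0xf87180a0df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9808080808080a0b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae808080a02ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e780808080"

differences = diff_children(rlp_decode(CORRECT), rlp_decode(CORRUPTED))
for index, good, bad in differences:
    print(f"index {index}: correct={good or '(empty)'} corrupted={bad or '(empty)'}")
```

For these two blobs the only differing slot is index 1, which is empty in the correct node and holds a child hash in the corrupted one.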
I also dumped the parent nodes of this one. They are all full nodes, with no shortNode in the middle of the path, so this is unrelated to the shortNode trick.
This storage trie is huge, with 1.8M slots.
I analyzed the contract. Two functions can mutate its state:
- finalizeEthWithdrawal (0x6c0960f9): example from etherscan
- requestL2Transaction (0xeb672419): example from etherscan
Both of them only create new storage slots and never delete existing ones.
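If slots are only ever created, a branch node in this storage trie should only gain children over time: a child hash may change as its subtree grows, but an occupied position should not become empty. The sketch below, in Python, classifies the difference between the two nodes on that basis. It uses the child hashes from the two rlpdump outputs above; `compare_branch` and the dict layout are hypothetical names for illustration, not Geth code.

```python
# Occupied child slots (index -> child hash), copied from the rlpdump outputs.
H1 = "df5465feffb831b1f31a6184b1efdf75f10f13b2b4900956c22f41a6108c45c9"
H8 = "b1902b4fca66415f63634e3ddeae1bfa7b877a1db5ed4c029730e166ba2031ae"
H12 = "2ded9e78076e79e96fcd5562c7951f678d22a167429cc75c17d30a08705bb6e7"

REFERENCE = {8: H8, 12: H12}         # node from the benchmark machine
LOCAL = {1: H1, 8: H8, 12: H12}      # node found on the failing machine's disk


def compare_branch(reference, local):
    """Classify how the occupied slots of `local` differ from `reference`."""
    shared = set(reference) & set(local)
    return {
        "added": sorted(set(local) - set(reference)),      # only in local
        "removed": sorted(set(reference) - set(local)),    # only in reference
        "changed": sorted(i for i in shared if reference[i] != local[i]),
    }


result = compare_branch(REFERENCE, LOCAL)
print(result)
```

Only `added` is non-empty here, and the shared children are byte-for-byte identical. Under the create-only assumption, that pattern looks like the on-disk node reflecting one additional insertion under nibble 1 relative to the reference node, not a deletion. It does not by itself say where that extra insertion came from.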
There are a few possible explanations for this situation:
- The state sync target was on a fork: the transaction that created the trie node at index 1 was reorged out and never accepted.
  I don't think that is the case here. Geth uses head-64 as the sync target, which is very unlikely to be reorged in the proof-of-stake network.
- A programmatic problem?
The log is attached.
Here it is -- I had a few false starts in the log, but I purged the DB each time.