You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The reorganization code is invoked in geth whenever more than one block is added to the chain at a time. One recent PR handled a bug where blocks were missed when the chain was purely extended (eg, the chain was at 1000, extended to 1003, and we skipped statediffing blocks 1001 and 1002).
The other case is a reorganization in the more formal sense, when we have a different sequence of blocks at the same height, and switch from and "old" chain to a "new" chain. For example, we have blocks [1000:0x123, 1001:0xabc] and are told by the consensus client to switch to [1000:0x123, 1001:0x456, 1002:0x789].
Note: I made the second chain longer, because in practice, the reorg is triggered by the consensus client informing us via engine_forkchoiceUpdatedV2 what should be the next head, which is almost always the next block.
This will cause a similar bug to the one mentioned above, but in this case we will have the wrong block stored in the database at 1001, rather than no block at all. The fix for #397 will not catch this, because that is only able to observe the gap in the sequence, and there is no gap in this instance.
I was able to induce reorgs with some regularity in the fixturenet by pausing the statediffing container for two slots, then unpausing it for one, in a loop.
In the DB, we then see that that this allows us to insert blocks without a parent:
cerc_testing=# select count(*) from eth.header_cids h where not exists (select block_number from eth.header_cids where block_hash = h.parent_hash);
count
-------
32
You can see that 16701 (0x18dc) points to 16700 (0x3d0b), and 16703 (0x710d) points to 16702 (0xdb3f), but 16702 points to a parent (0x406d) that does not exist. Yet there are no duplicated block numbers at 16701, it is just a missing entry.
This can be fixed by explicitly calling statediff_writeStateDiffAt(16701) after which the hierarchy is fixed. We see both the old, orphaned 16701 (0x18dc) and the new, canonical 16701 (0x406d).
It is an important detail that we could have been informed about this, as the reorganization issues not only a ChainEvent for the new head block, but also ChainSideEvents for every block removed from the chain. We do not currently listen for ChainSideEvents however.
The text was updated successfully, but these errors were encountered:
telackey
changed the title
Reorgs can lead to invalid statediff record in database.
Reorgs lead to invalid block hierarchy in the statediff database.
Jul 11, 2023
If we double-checked that we had the parent of each block we insert, it should serve as a fix for both the #397 issue and this one, while being more reliable and resilient than relying purely on event notification.
We'd need to handle the empty DB case to prevent chasing the rabbit all the way back to genesis, but otherwise it should make sure we do not allow gaps to open.
If these additional checks can be done in a performant way, by all means, please add them. Relatedly, we should make sure that https://github.com/cerc-io/eth-ipfs-state-validator would have caught and corrected these issues.
The reorganization code is invoked in geth whenever more than one block is added to the chain at a time. One recent PR handled a bug where blocks were missed when the chain was purely extended (eg, the chain was at 1000, extended to 1003, and we skipped statediffing blocks 1001 and 1002).
The other case is a reorganization in the more formal sense, when we have a different sequence of blocks at the same height, and switch from and "old" chain to a "new" chain. For example, we have blocks
[1000:0x123, 1001:0xabc]
and are told by the consensus client to switch to[1000:0x123, 1001:0x456, 1002:0x789]
.This will cause a similar bug to the one mentioned above, but in this case we will have the wrong block stored in the database at 1001, rather than no block at all. The fix for #397 will not catch this, because that is only able to observe the gap in the sequence, and there is no gap in this instance.
I was able to induce reorgs with some regularity in the fixturenet by pausing the statediffing container for two slots, then unpausing it for one, in a loop.
In the DB, we then see that that this allows us to insert blocks without a parent:
Here is a specific example:
You can see that 16701 (0x18dc) points to 16700 (0x3d0b), and 16703 (0x710d) points to 16702 (0xdb3f), but 16702 points to a parent (0x406d) that does not exist. Yet there are no duplicated block numbers at 16701, it is just a missing entry.
This can be fixed by explicitly calling
statediff_writeStateDiffAt(16701)
after which the hierarchy is fixed. We see both the old, orphaned 16701 (0x18dc) and the new, canonical 16701 (0x406d).It is an important detail that we could have been informed about this, as the reorganization issues not only a
ChainEvent
for the new head block, but alsoChainSideEvent
s for every block removed from the chain. We do not currently listen forChainSideEvent
s however.The text was updated successfully, but these errors were encountered: