Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorgs lead to invalid block hierarchy in the statediff database. #398

Closed
telackey opened this issue Jul 11, 2023 · 3 comments · Fixed by #399
Closed

Reorgs lead to invalid block hierarchy in the statediff database. #398

telackey opened this issue Jul 11, 2023 · 3 comments · Fixed by #399
Assignees

Comments

@telackey
Copy link

telackey commented Jul 11, 2023

The reorganization code is invoked in geth whenever more than one block is added to the chain at a time. One recent PR handled a bug where blocks were missed when the chain was purely extended (eg, the chain was at 1000, extended to 1003, and we skipped statediffing blocks 1001 and 1002).

The other case is a reorganization in the more formal sense, when we have a different sequence of blocks at the same height, and switch from and "old" chain to a "new" chain. For example, we have blocks [1000:0x123, 1001:0xabc] and are told by the consensus client to switch to [1000:0x123, 1001:0x456, 1002:0x789].

Note: I made the second chain longer, because in practice, the reorg is triggered by the consensus client informing us via engine_forkchoiceUpdatedV2 what should be the next head, which is almost always the next block.

This will cause a similar bug to the one mentioned above, but in this case we will have the wrong block stored in the database at 1001, rather than no block at all. The fix for #397 will not catch this, because that is only able to observe the gap in the sequence, and there is no gap in this instance.

I was able to induce reorgs with some regularity in the fixturenet by pausing the statediffing container for two slots, then unpausing it for one, in a loop.

In the DB, we then see that that this allows us to insert blocks without a parent:

cerc_testing=# select count(*) from eth.header_cids h where not exists (select block_number from eth.header_cids where block_hash = h.parent_hash);
 count
-------
    32

Here is a specific example:

cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703);
 block_number |                             block_hash                             |                            parent_hash
--------------+--------------------------------------------------------------------+--------------------------------------------------------------------
        16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6
        16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7
        16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d
(4 rows)

You can see that 16701 (0x18dc) points to 16700 (0x3d0b), and 16703 (0x710d) points to 16702 (0xdb3f), but 16702 points to a parent (0x406d) that does not exist. Yet there are no duplicated block numbers at 16701, it is just a missing entry.

This can be fixed by explicitly calling statediff_writeStateDiffAt(16701) after which the hierarchy is fixed. We see both the old, orphaned 16701 (0x18dc) and the new, canonical 16701 (0x406d).

cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703);
 block_number |                             block_hash                             |                            parent_hash
--------------+--------------------------------------------------------------------+--------------------------------------------------------------------
        16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6
        16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16701 | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7
        16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d
(5 rows)

It is an important detail that we could have been informed about this, as the reorganization issues not only a ChainEvent for the new head block, but also ChainSideEvents for every block removed from the chain. We do not currently listen for ChainSideEvents however.

@telackey telackey changed the title Reorgs can lead to invalid statediff record in database. Reorgs lead to invalid block hierarchy in the statediff database. Jul 11, 2023
@telackey telackey self-assigned this Jul 11, 2023
@telackey
Copy link
Author

If we double-checked that we had the parent of each block we insert, it should serve as a fix for both the #397 issue and this one, while being more reliable and resilient than relying purely on event notification.

We'd need to handle the empty DB case to prevent chasing the rabbit all the way back to genesis, but otherwise it should make sure we do not allow gaps to open.

@AFDudley
Copy link

If these additional checks can be done in a performant way, by all means, please add them. Relatedly, we should make sure that https://github.com/cerc-io/eth-ipfs-state-validator would have caught and corrected these issues.

@telackey
Copy link
Author

telackey commented Jul 11, 2023

I can verify that https://github.com/cerc-io/ipld-eth-db-validator/tree/v5 detects this issue, though it cannot correct it.

The reported error looks like:

ERRO[0003] failed to verify state root at block 1889
FATA[0003] sql: no rows in result set

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants