fix(external-node): delete empty unsealed batch on EN initialization #3125

Merged
9 commits merged into main from daniyar/unsealed/en-bug-2 on Oct 18, 2024

Conversation

itegulov (Contributor)

What ❔

This PR reverts #3088, as I have realized it is going to be very hard to make this fix work by going in that direction. Initializing with an empty unsealed batch causes a lot of issues, and the existing state keeper / external IO flow heavily relies on the unsealed batch containing at least one L2 block at startup to initialize correctly. I will leave more context in the comments.
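For illustration, a minimal sketch of what the fix is meant to do on EN startup, assuming hypothetical types and method names (the real code goes through the DAL and ExternalIO; UnsealedBatch, Storage, latest_unsealed_batch and delete_unsealed_batch below are stand-ins, not the actual API):

```rust
/// Hypothetical view of an unsealed L1 batch row.
struct UnsealedBatch {
    number: u32,
    l2_block_count: u32,
}

/// Hypothetical storage interface; the real code goes through the DAL.
trait Storage {
    fn latest_unsealed_batch(&self) -> Option<UnsealedBatch>;
    fn delete_unsealed_batch(&mut self, number: u32);
}

/// On EN initialization, drop an unsealed batch that has no L2 blocks so the
/// fetcher/ExternalIO flow can re-open it once sync actions arrive.
fn delete_empty_unsealed_batch_on_init(storage: &mut dyn Storage) {
    if let Some(batch) = storage.latest_unsealed_batch() {
        if batch.l2_block_count == 0 {
            storage.delete_unsealed_batch(batch.number);
        }
    }
}
```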

Feel free to review individual commits to skip the revert changelog.

Why ❔

This bug causes the EN to panic.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zkstack dev fmt and zkstack dev lint.

@slowli (Contributor) left a comment

Can a similar situation be reproduced on the main node (i.e., an unsealed batch without blocks / transactions)? Is it handled OK?

  • AFAICT, MempoolIO::initialize() can insert an unsealed L1 batch via ensure_unsealed_l1_batch_exists (as off-topic / bikeshedding, I think this method is too high-level to be placed in DAL). OTOH, we only reach the ensure_unsealed_l1_batch_exists call if L1BatchParamsProvider::load_l1_batch_env() above returns Some(_), i.e. there is at least one L2 block persisted for the batch. So it doesn't look like ensure_unsealed_l1_batch_exists can actually persist a batch without blocks (see the sketch after this list).
  • In MempoolIO::wait_for_new_batch_params(), a new batch is inserted into the storage proactively. If a node crashes immediately afterwards, it looks like the situation can be reproduced.
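A minimal sketch of the control flow from the first bullet, using hypothetical stand-in types (the real MempoolIO::initialize() and the provider/DAL signatures differ):

```rust
/// Hypothetical stand-in for the environment returned for the pending batch.
struct L1BatchEnv {
    number: u32,
}

/// Hypothetical stand-ins for the provider/DAL calls mentioned above.
trait BatchStore {
    /// Returns Some(_) only if at least one L2 block of the batch is persisted.
    fn load_l1_batch_env(&self) -> Option<L1BatchEnv>;
    fn ensure_unsealed_l1_batch_exists(&mut self, number: u32);
}

/// The ensure call sits behind the Some(_) check, so it cannot create an
/// unsealed batch that has no persisted blocks behind it.
fn initialize_sketch(store: &mut dyn BatchStore) {
    if let Some(env) = store.load_l1_batch_env() {
        store.ensure_unsealed_l1_batch_exists(env.number);
    }
}
```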

Resolved review threads on:
  • core/node/node_sync/src/external_io.rs
  • core/node/node_sync/src/sync_action.rs (outdated)
  • core/node/node_sync/src/external_io.rs (outdated)
@itegulov (Contributor Author)

@slowli

Can a similar situation be reproduced on the main node (i.e., an unsealed batch without blocks / transactions)? Is it handled OK?

I believe that is handled OK, as the main node has a slightly different mechanism: MempoolIO::wait_for_new_batch_params checks whether there is an unsealed batch before doing anything else and returns it proactively if there is one. The EN's IO logic is much more tied to the fetcher and relies on a specific order of actions arriving, which makes things complicated. Another strange difference on the EN is that it expects L1 batches to not have protocol_version populated until a certain point, which also complicates things (I still need to figure out what's up with that in a separate PR).

But I will write a test just to validate that everything is ok.
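For reference, a rough sketch of the main-node mechanism described above, with hypothetical types (the real MempoolIO::wait_for_new_batch_params has a different signature):

```rust
/// Hypothetical batch parameters.
struct BatchParams {
    number: u32,
}

/// Hypothetical persistence interface.
trait MainNodeStore {
    fn existing_unsealed_batch(&self) -> Option<BatchParams>;
    fn insert_unsealed_batch(&mut self, params: &BatchParams);
}

/// Restart-safe batch opening: reuse an unsealed batch left over from a
/// previous run instead of opening another one; otherwise persist new params
/// before returning them.
fn wait_for_new_batch_params_sketch(
    store: &mut dyn MainNodeStore,
    new_params: BatchParams,
) -> BatchParams {
    if let Some(existing) = store.existing_unsealed_batch() {
        return existing;
    }
    store.insert_unsealed_batch(&new_params);
    new_params
}
```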

So it doesn't look like ensure_unsealed_l1_batch_exists can actually persist a batch without blocks.

Ok, so the logic is a bit confusing and I should have left a comment, but essentially this is a compatibility mechanism. Imagine that the EN starts for the very first time after the unsealed-batches PR made it in: we have N L2 blocks that belong to the pending batch, but the pending batch is not inserted as unsealed. This bit then makes sure that it is, as the rest of the flow depends on it.
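Roughly, the described backfill amounts to the following, again with hypothetical names (this is not the real ensure_unsealed_l1_batch_exists, only an illustration of its described intent):

```rust
/// Hypothetical DAL subset for the compatibility backfill.
trait CompatDal {
    fn unsealed_batch_exists(&self, number: u32) -> bool;
    fn insert_unsealed_batch(&mut self, number: u32);
}

/// If L2 blocks of the pending batch were written before unsealed-batch
/// tracking existed, insert the missing unsealed row so the rest of the flow
/// can rely on it being there.
fn ensure_unsealed_l1_batch_exists_sketch(dal: &mut dyn CompatDal, pending_batch: u32) {
    if !dal.unsealed_batch_exists(pending_batch) {
        // Pre-feature database: the pending batch's blocks exist, but the
        // batch was never recorded as unsealed. Record it now.
        dal.insert_unsealed_batch(pending_batch);
    }
}
```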

as off-topic / bikeshedding, I think this method is too high-level to be placed in DAL

Fair, this was a bit of a dirty hack, to be honest, to fix stage, which was in a crash loop at the time. This entire logic is IMHO something that should have been a Rust migration script, but AFAICT we don't have anything like that. Anyway, I'll try to refactor this in a separate PR.

If a node crashes immediately afterwards, it looks like the situation can be reproduced.

As I said, there is actually a fail-safe mechanism at the very top of the function that prevents this from happening. Unless you mean something else?

@itegulov itegulov added this pull request to the merge queue Oct 18, 2024
@itegulov (Contributor Author)

@slowli I am merging this PR, but thanks for the critical review; if you still have questions/concerns, please feel free to continue the discussion here. I will deliver extra tests/improvements in a follow-up PR.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 18, 2024
@itegulov itegulov added this pull request to the merge queue Oct 18, 2024
Merged via the queue into main with commit 5d5214b Oct 18, 2024
33 checks passed
@itegulov itegulov deleted the daniyar/unsealed/en-bug-2 branch October 18, 2024 14:37