fix(external-node): delete empty unsealed batch on EN initialization #3125

Merged
9 commits merged into main from daniyar/unsealed/en-bug-2 on Oct 18, 2024

Conversation

itegulov (Contributor)

What ❔

This PR reverts #3088, as I have realized it is going to be very hard to make this fix work by going in that direction. Initializing with an empty unsealed batch causes a lot of issues, and the existing state keeper / external IO flow heavily relies on the unsealed batch containing at least one L2 block at startup to initialize correctly. I will leave more context in the comments.
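For illustration, a minimal sketch of what the fix is meant to do on EN startup, assuming hypothetical types and method names (the real code goes through the DAL and ExternalIO; UnsealedBatch, Storage, latest_unsealed_batch and delete_unsealed_batch below are stand-ins, not the actual API):

```rust
/// Hypothetical view of an unsealed L1 batch row.
struct UnsealedBatch {
    number: u32,
    l2_block_count: u32,
}

/// Hypothetical storage interface; the real code goes through the DAL.
trait Storage {
    fn latest_unsealed_batch(&self) -> Option<UnsealedBatch>;
    fn delete_unsealed_batch(&mut self, number: u32);
}

/// On EN initialization, drop an unsealed batch that has no L2 blocks so the
/// fetcher/ExternalIO flow can re-open it once sync actions arrive.
fn delete_empty_unsealed_batch_on_init(storage: &mut dyn Storage) {
    if let Some(batch) = storage.latest_unsealed_batch() {
        if batch.l2_block_count == 0 {
            storage.delete_unsealed_batch(batch.number);
        }
    }
}
```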

Feel free to review individual commits to skip the revert changelog.

Why ❔

This bug causes the EN to panic.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zkstack dev fmt and zkstack dev lint.

@slowli (Contributor) left a comment

Can a similar situation be reproduced on the main node (i.e., an unsealed batch without blocks / transactions)? Is it handled OK?

  • AFAICT, MempoolIO::initialize() can insert an unsealed L1 batch via ensure_unsealed_l1_batch_exists (as off-topic / bikeshedding, I think this method is too high-level to be placed in DAL). OTOH, we only reach the ensure_unsealed_l1_batch_exists call if L1BatchParamsProvider::load_l1_batch_env() above returns Some(_), i.e. there is at least one L2 block persisted for the batch. So it doesn't look like ensure_unsealed_l1_batch_exists can actually persist a batch without blocks (see the sketch after this list).
  • In MempoolIO::wait_for_new_batch_params(), a new batch is inserted into the storage proactively. If a node crashes immediately afterwards, it looks like the situation can be reproduced.
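A minimal sketch of the control flow from the first bullet, using hypothetical stand-in types (the real MempoolIO::initialize() and the provider/DAL signatures differ):

```rust
/// Hypothetical stand-in for the environment returned for the pending batch.
struct L1BatchEnv {
    number: u32,
}

/// Hypothetical stand-ins for the provider/DAL calls mentioned above.
trait BatchStore {
    /// Returns Some(_) only if at least one L2 block of the batch is persisted.
    fn load_l1_batch_env(&self) -> Option<L1BatchEnv>;
    fn ensure_unsealed_l1_batch_exists(&mut self, number: u32);
}

/// The ensure call sits behind the Some(_) check, so it cannot create an
/// unsealed batch that has no persisted blocks behind it.
fn initialize_sketch(store: &mut dyn BatchStore) {
    if let Some(env) = store.load_l1_batch_env() {
        store.ensure_unsealed_l1_batch_exists(env.number);
    }
}
```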

Resolved review threads on:
  • core/node/node_sync/src/external_io.rs
  • core/node/node_sync/src/sync_action.rs (outdated)
  • core/node/node_sync/src/external_io.rs (outdated)
@itegulov (Contributor Author)

@slowli

Can a similar situation be reproduced on the main node (i.e., an unsealed batch without blocks / transactions)? Is it handled OK?

I believe that is handled OK, as the main node has a slightly different mechanism: MempoolIO::wait_for_new_batch_params checks whether there is an unsealed batch before doing anything else and returns it proactively if there is one. The EN's IO logic is much more tied to the fetcher and relies on a specific order of actions arriving, which makes things complicated. Another strange difference on the EN is that it expects L1 batches to not have protocol_version populated until a certain point, which also complicates things (I still need to figure out what's up with that in a separate PR).

But I will write a test just to validate that everything is ok.
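For reference, a rough sketch of the main-node mechanism described above, with hypothetical types (the real MempoolIO::wait_for_new_batch_params has a different signature):

```rust
/// Hypothetical batch parameters.
struct BatchParams {
    number: u32,
}

/// Hypothetical persistence interface.
trait MainNodeStore {
    fn existing_unsealed_batch(&self) -> Option<BatchParams>;
    fn insert_unsealed_batch(&mut self, params: &BatchParams);
}

/// Restart-safe batch opening: reuse an unsealed batch left over from a
/// previous run instead of opening another one; otherwise persist new params
/// before returning them.
fn wait_for_new_batch_params_sketch(
    store: &mut dyn MainNodeStore,
    new_params: BatchParams,
) -> BatchParams {
    if let Some(existing) = store.existing_unsealed_batch() {
        return existing;
    }
    store.insert_unsealed_batch(&new_params);
    new_params
}
```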

So it doesn't look like ensure_unsealed_l1_batch_exists can actually persist a batch without blocks.

Ok, so the logic is a bit confusing and I should have left a comment, but essentially this is a compatibility mechanism. Imagine that the EN starts for the very first time after the unsealed-batches PR made it in: we have N L2 blocks that belong to the pending batch, but the pending batch is not inserted as unsealed. This bit then makes sure that it is, as the rest of the flow depends on it.
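Roughly, the described backfill amounts to the following, again with hypothetical names (this is not the real ensure_unsealed_l1_batch_exists, only an illustration of its described intent):

```rust
/// Hypothetical DAL subset for the compatibility backfill.
trait CompatDal {
    fn unsealed_batch_exists(&self, number: u32) -> bool;
    fn insert_unsealed_batch(&mut self, number: u32);
}

/// If L2 blocks of the pending batch were written before unsealed-batch
/// tracking existed, insert the missing unsealed row so the rest of the flow
/// can rely on it being there.
fn ensure_unsealed_l1_batch_exists_sketch(dal: &mut dyn CompatDal, pending_batch: u32) {
    if !dal.unsealed_batch_exists(pending_batch) {
        // Pre-feature database: the pending batch's blocks exist, but the
        // batch was never recorded as unsealed. Record it now.
        dal.insert_unsealed_batch(pending_batch);
    }
}
```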

as off-topic / bikeshedding, I think this method is too high-level to be placed in DAL

Fair, this was a bit of a dirty hack, to be honest, to fix stage, which was in a crash loop at the time. This entire logic is IMHO something that should have been a Rust migration script, but AFAICT we don't have anything like that. Anyway, I'll try to refactor this in a separate PR.

If a node crashes immediately afterwards, it looks like the situation can be reproduced.

As I said, there is actually a fail-safe mechanism at the very top of the function that prevents this from happening. Unless you mean something else?

@itegulov itegulov added this pull request to the merge queue Oct 18, 2024
@itegulov (Contributor Author)

@slowli I am merging this PR, but thanks for the critical review; if you still have questions/concerns, please feel free to continue the discussion here. I will deliver extra tests/improvements in a follow-up PR.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 18, 2024
@itegulov itegulov added this pull request to the merge queue Oct 18, 2024
Merged via the queue into main with commit 5d5214b Oct 18, 2024
33 checks passed
@itegulov itegulov deleted the daniyar/unsealed/en-bug-2 branch October 18, 2024 14:37