
feat(vm): Allow switching between VMs for latest protocol version #2508

Conversation

@slowli slowli commented Jul 26, 2024

What ❔

  • Allows using the old (latest) VM by default, the new VM only, or old + new VMs in shadow mode in `MainBatchExecutor`.
  • Allows configuring this mode for a new `VmRunner`-powered component, the VM playground.
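
The three modes can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `VmMode`, its variants, and the two executor stubs are hypothetical names standing in for the real `FastVmMode` / `MainBatchExecutor` wiring in the PR.

```rust
/// Hypothetical stand-in for the VM-selection switch described above;
/// the real type in the PR is `FastVmMode`, whose exact variants may differ.
#[derive(Clone, Copy, Debug, Default)]
enum VmMode {
    /// Old (battle-tested) VM only -- the default.
    #[default]
    Old,
    /// New VM only.
    New,
    /// Run both VMs and compare their results.
    Shadow,
}

// Stub executors standing in for the real VM implementations.
fn execute_old(tx: u64) -> u64 {
    tx.wrapping_mul(2)
}

fn execute_new(tx: u64) -> u64 {
    // Divergence injected for one input, purely to demonstrate detection.
    if tx == 13 { 0 } else { tx.wrapping_mul(2) }
}

/// In shadow mode the old VM's result stays authoritative; a mismatch
/// surfaces as an error instead of silently changing behavior.
fn execute(mode: VmMode, tx: u64) -> Result<u64, String> {
    match mode {
        VmMode::Old => Ok(execute_old(tx)),
        VmMode::New => Ok(execute_new(tx)),
        VmMode::Shadow => {
            let old = execute_old(tx);
            let new = execute_new(tx);
            if old != new {
                return Err(format!("VM divergence: old={old}, new={new}"));
            }
            Ok(old)
        }
    }
}
```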

Why ❔

A separate component seems like a relatively safe place to start the integration.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via `zk fmt` and `zk lint`.

@slowli slowli left a comment


A drawback of adding the switch only to BWIP is that there's no way to run integration tests with a shadowed VM or to run a load test with the new VM. Is this OK, or should I add these options? (This could be dealt with later, of course, but then we could miss divergences triggered by integration tests, or performance degradation on the load test.)

core/lib/config/src/configs/experimental.rs (outdated; resolved)
@slowli slowli marked this pull request as ready for review July 26, 2024 13:28

@popzxc popzxc (Member) left a comment

AFAIU, BWIP performance is crucial for proof generation speed, so it's important not to accidentally worsen it. Given that the VM is single-threaded and isn't too I/O-heavy, it's probably fine. Still, cc @EmilLuta just in case.

core/lib/config/src/configs/experimental.rs (outdated; resolved)
core/lib/config/src/configs/vm_runner.rs (outdated; resolved)
@slowli slowli requested a review from EmilLuta July 29, 2024 08:02

slowli commented Jul 29, 2024

@popzxc @EmilLuta This is a good point regarding BWIP availability. Is it possible to run multiple BWIP instances, or is it a singleton by design? Ideally, we'd want to run an additional instance starting from the first L2 block with the supported protocol version, so that it's not a bottleneck and we don't care much about potential panics caused by VM divergences.

If that's impossible / requires significant changes to the BWIP architecture, it could make sense to go with the initial plan of integrating the new VM into the EN state keeper; in that case it obviously wouldn't be a bottleneck, and VM panics wouldn't need to be handled ASAP. (It would also have the benefit of exercising the new VM in integration tests; see my comment above.) OTOH, it's relatively easy to convert VM panics into logged errors, but then we could get meaningless cascading errors within a single VM run.
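
Converting VM panics into logged errors can be done with `std::panic::catch_unwind`. A minimal sketch, where `run_vm_step` is a hypothetical stand-in for the real execution call, not an API from this PR:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs one VM step and converts a panic (e.g. a divergence assertion in
/// shadow mode) into an error value that can be logged, instead of taking
/// the whole component down. Note the caveat from the discussion above:
/// after one such error, later results in the same run may be meaningless.
fn run_checked<F: FnOnce() -> u64>(run_vm_step: F) -> Result<u64, String> {
    panic::catch_unwind(AssertUnwindSafe(run_vm_step)).map_err(|payload| {
        // Panic payloads are `Box<dyn Any>`; string payloads are the common case.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "non-string panic payload".to_owned());
        format!("VM panicked: {msg}")
    })
}
```

Whether this is preferable to crashing depends on how quickly divergences must be noticed; a logged error is easy to miss without alerting.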


popzxc commented Jul 29, 2024

@slowli From what I understood, BWIP is meant to be a singleton, but it can work in a race mode.
I'm not aware of the implications here, and I'm not sure how easy it would be to deploy multiple BWIPs in our infra, so better to ping @EmilLuta and DevOps in chat about that.

@EmilLuta (Contributor)

Sorry for being late to the party. Let's unpack this one by one. VM Runner is meant to be a singleton, but internally it runs multiple VMs, up to n concurrently (I believe the default setup is 3). @itegulov would be the best person to speak to about the exact details of VM Runner. As for what happens if we run multiple BWIPs, @Artemka374 is the right person, but AFAIR it should be fine.

My assumption is that this is a one-off integration (turning it from off to on, i.e. from the old VM to the new VM). If that's the case, then there's no problem experimenting with BWIP. Yes, it's mission-critical for proofs, but remember that proofs have a long time to get proven (8h soft deadline for other teams, 24h for the SLA).

I think the EN might be more suitable for recurring work, but for a one-off (turn it on, wait for x [where x < 4] weeks, then migrate to the state keeper), BWIP seems the natural place.

@slowli slowli requested review from popzxc and joonazan July 31, 2024 13:13
#[serde(default)]
pub fast_vm_mode: FastVmMode,
/// Path to the RocksDB cache directory.
#[serde(default = "ExperimentalVmPlaygroundConfig::default_db_path")]

Note: `serde(default)` doesn't work with the file-based config that we already use on stage.


@slowli slowli (Contributor, Author) Aug 1, 2024

I've accounted for the defaults in the `protobuf_config` crate, which AFAIU is sufficient to make the file-based config work. I'll test this locally and address it in a follow-up PR if necessary.
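
The gist of mirroring a `serde` default in the proto-parsing layer, where missing fields arrive as `Option`s, can be sketched as follows. All names here are illustrative; the actual `protobuf_config` types and fields differ.

```rust
/// Illustrative config struct; in the real code the default is attached
/// via `#[serde(default = "ExperimentalVmPlaygroundConfig::default_db_path")]`.
#[derive(Debug, PartialEq)]
struct VmPlaygroundConfig {
    db_path: String,
}

impl VmPlaygroundConfig {
    fn default_db_path() -> String {
        "./db/vm_playground".to_owned()
    }
}

/// Proto message with every field optional, as generated code typically is.
struct ProtoVmPlayground {
    db_path: Option<String>,
}

fn read_config(proto: ProtoVmPlayground) -> VmPlaygroundConfig {
    VmPlaygroundConfig {
        // Apply the same fallback here that `serde(default = ...)` applies,
        // so file-based configs missing the field still parse correctly.
        db_path: proto
            .db_path
            .unwrap_or_else(VmPlaygroundConfig::default_db_path),
    }
}
```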

@itegulov itegulov left a comment


Haven't looked through the whole PR, but the VM runner impl looks good to me.

@slowli slowli merged commit 77b6d81 into jms-vm2 Aug 1, 2024
46 of 48 checks passed
@slowli slowli deleted the aov-pla-997-allow-switching-between-vms-for-latest-protocol-version branch August 1, 2024 15:37
slowli added a commit that referenced this pull request Aug 5, 2024
…version – follow ups (#2567)

## What ❔

Various follow-ups after
#2508:

- Adds VM playground config to `etc/env`.
- Adds a health check for the VM playground.
- Runs VM playground in server integration tests and checks it on
teardown.

## Why ❔

Improves maintainability and test coverage.