
feat(vm): Allow switching between VMs for latest protocol version #2508

Conversation

@slowli slowli commented Jul 26, 2024

What ❔

  • Allows using the old (latest) VM by default, the new VM only, or old + new VMs in shadow mode in `MainBatchExecutor`.
  • Allows configuring this mode for a new `VmRunner`-powered component, the VM playground.
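
The three modes can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `VmMode`, its variants, and the two executor stubs are hypothetical names standing in for the real `FastVmMode` / `MainBatchExecutor` wiring in the PR.

```rust
/// Hypothetical stand-in for the VM-selection switch described above;
/// the real type in the PR is `FastVmMode`, whose exact variants may differ.
#[derive(Clone, Copy, Debug, Default)]
enum VmMode {
    /// Old (battle-tested) VM only -- the default.
    #[default]
    Old,
    /// New VM only.
    New,
    /// Run both VMs and compare their results.
    Shadow,
}

// Stub executors standing in for the real VM implementations.
fn execute_old(tx: u64) -> u64 {
    tx.wrapping_mul(2)
}

fn execute_new(tx: u64) -> u64 {
    // Divergence injected for one input, purely to demonstrate detection.
    if tx == 13 { 0 } else { tx.wrapping_mul(2) }
}

/// In shadow mode the old VM's result stays authoritative; a mismatch
/// surfaces as an error instead of silently changing behavior.
fn execute(mode: VmMode, tx: u64) -> Result<u64, String> {
    match mode {
        VmMode::Old => Ok(execute_old(tx)),
        VmMode::New => Ok(execute_new(tx)),
        VmMode::Shadow => {
            let old = execute_old(tx);
            let new = execute_new(tx);
            if old != new {
                return Err(format!("VM divergence: old={old}, new={new}"));
            }
            Ok(old)
        }
    }
}
```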

Why ❔

A separate component seems like a relatively safe place to start the integration.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via `zk fmt` and `zk lint`.

@slowli slowli left a comment


A drawback of adding the switch only to BWIP is that there's no way to run integration tests with a shadowed VM or to run a load test with the new VM. Is this OK, or should I add these options? (This could be dealt with later, of course, but then we could miss divergences triggered by integration tests, or performance degradation on the load test.)

core/lib/config/src/configs/experimental.rs (outdated; resolved)
@slowli slowli marked this pull request as ready for review July 26, 2024 13:28

@popzxc popzxc (Member) left a comment

AFAIU, BWIP performance is crucial for proof generation speed, so it's important not to accidentally worsen it. Given that the VM is single-threaded and isn't too I/O-heavy, it's probably fine. Still, cc @EmilLuta just in case.

core/lib/config/src/configs/experimental.rs (outdated; resolved)
core/lib/config/src/configs/vm_runner.rs (outdated; resolved)
@slowli slowli requested a review from EmilLuta July 29, 2024 08:02

slowli commented Jul 29, 2024

@popzxc @EmilLuta This is a good point regarding BWIP availability. Is it possible to run multiple BWIP instances, or is it a singleton by design? Ideally, we'd want to run an additional instance starting from the first L2 block with the supported protocol version, so that it's not a bottleneck and we don't care much about potential panics caused by VM divergences.

If that's impossible / requires significant changes to the BWIP architecture, it could make sense to go with the initial plan of integrating the new VM into the EN state keeper; in that case it obviously wouldn't be a bottleneck, and VM panics wouldn't need to be handled ASAP. (It would also have the benefit of exercising the new VM in integration tests; see my comment above.) OTOH, it's relatively easy to convert VM panics into logged errors, but then we could get meaningless cascading errors within a single VM run.
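
Converting VM panics into logged errors can be done with `std::panic::catch_unwind`. A minimal sketch, where `run_vm_step` is a hypothetical stand-in for the real execution call, not an API from this PR:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs one VM step and converts a panic (e.g. a divergence assertion in
/// shadow mode) into an error value that can be logged, instead of taking
/// the whole component down. Note the caveat from the discussion above:
/// after one such error, later results in the same run may be meaningless.
fn run_checked<F: FnOnce() -> u64>(run_vm_step: F) -> Result<u64, String> {
    panic::catch_unwind(AssertUnwindSafe(run_vm_step)).map_err(|payload| {
        // Panic payloads are `Box<dyn Any>`; string payloads are the common case.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "non-string panic payload".to_owned());
        format!("VM panicked: {msg}")
    })
}
```

Whether this is preferable to crashing depends on how quickly divergences must be noticed; a logged error is easy to miss without alerting.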


popzxc commented Jul 29, 2024

@slowli From what I understood, BWIP is meant to be a singleton, but it can work in a race mode.
I'm not aware of the implications here, and I'm not sure how easy it would be to deploy multiple BWIPs in our infra, so better to ping @EmilLuta and DevOps in chat about that.

@EmilLuta (Contributor)

Sorry for being late to the party. Let's unpack this one by one. VM Runner is meant to be a singleton, but internally it runs multiple VMs, up to n concurrently (I believe the default setup is 3). @itegulov would be the best person to speak to about the exact details of VM Runner. As for what happens if we run multiple BWIPs, @Artemka374 is the right person, but AFAIR it should be fine.

My assumption is that this is a one-off integration (turning it from off to on, i.e. from the old VM to the new VM). If that's the case, then there's no problem experimenting with BWIP. Yes, it's mission-critical for proofs, but remember that proofs have a long time to get proven (8h soft deadline for other teams, 24h for the SLA).

I think the EN might be more suitable for recurring work, but for a one-off (turn it on, wait for x [where x < 4] weeks, then migrate to the state keeper), BWIP seems the natural place.

@slowli slowli requested review from popzxc and joonazan July 31, 2024 13:13
#[serde(default)]
pub fast_vm_mode: FastVmMode,
/// Path to the RocksDB cache directory.
#[serde(default = "ExperimentalVmPlaygroundConfig::default_db_path")]

Note: `serde(default)` doesn't work with the file-based config that we already use on stage.


@slowli slowli (Contributor, Author) Aug 1, 2024

I've accounted for the defaults in the `protobuf_config` crate, which AFAIU is sufficient to make the file-based config work. I'll test this locally and address it in a follow-up PR if necessary.
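
The gist of mirroring a `serde` default in the proto-parsing layer, where missing fields arrive as `Option`s, can be sketched as follows. All names here are illustrative; the actual `protobuf_config` types and fields differ.

```rust
/// Illustrative config struct; in the real code the default is attached
/// via `#[serde(default = "ExperimentalVmPlaygroundConfig::default_db_path")]`.
#[derive(Debug, PartialEq)]
struct VmPlaygroundConfig {
    db_path: String,
}

impl VmPlaygroundConfig {
    fn default_db_path() -> String {
        "./db/vm_playground".to_owned()
    }
}

/// Proto message with every field optional, as generated code typically is.
struct ProtoVmPlayground {
    db_path: Option<String>,
}

fn read_config(proto: ProtoVmPlayground) -> VmPlaygroundConfig {
    VmPlaygroundConfig {
        // Apply the same fallback here that `serde(default = ...)` applies,
        // so file-based configs missing the field still parse correctly.
        db_path: proto
            .db_path
            .unwrap_or_else(VmPlaygroundConfig::default_db_path),
    }
}
```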

@itegulov itegulov left a comment


Haven't looked through the whole PR, but the VM runner impl looks good to me.

@slowli slowli merged commit 77b6d81 into jms-vm2 Aug 1, 2024
46 of 48 checks passed
@slowli slowli deleted the aov-pla-997-allow-switching-between-vms-for-latest-protocol-version branch August 1, 2024 15:37
slowli added a commit that referenced this pull request Aug 5, 2024
…version – follow ups (#2567)

## What ❔

Various follow-ups after
#2508:

- Adds VM playground config to `etc/env`.
- Adds a health check for the VM playground.
- Runs VM playground in server integration tests and checks it on
teardown.

## Why ❔

Improves maintainability and test coverage.