Conversation

Contributor

@plafer commented May 28, 2025

Partially closes #1558

We now expose a new build_trace(Vec<TraceFragmentContext>), which takes the output of FastProcessor::execute_for_trace() and builds a trace in parallel (only the system, stack, and decoder columns so far).

Specifically, the strategy is to break the execution up into fixed-size fragments of n rows (e.g. 2048 or 4096), with one thread generating each corresponding fragment. The fragments are then copied back into a single MainTrace buffer. Hence, the additional complexity of parallel trace generation, compared to how Process does it, is that we need to be able to start generating the trace from any state, at any clock cycle.
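
Roughly, the shape of this is as follows (a minimal sketch only: build_fragment, the flat u64 buffer, and the struct contents are illustrative stand-ins, not the actual API):

```rust
use rayon::prelude::*;

const NUM_ROWS_PER_FRAGMENT: usize = 2048;

// Illustrative stand-ins for the real types.
struct TraceFragmentContext; // state captured at a fragment boundary
struct Fragment(Vec<u64>);   // a fragment's rows for a single column (the real fragment holds many columns)

// Hypothetical per-fragment builder: resumes execution from the state
// captured in `ctx` and fills in NUM_ROWS_PER_FRAGMENT rows.
fn build_fragment(_ctx: TraceFragmentContext) -> Fragment {
    Fragment(vec![0; NUM_ROWS_PER_FRAGMENT])
}

fn build_trace(contexts: Vec<TraceFragmentContext>) -> Vec<u64> {
    // One worker per fragment; fragments are independent because each context
    // carries everything needed to resume from its starting clock cycle.
    let fragments: Vec<Fragment> =
        contexts.into_par_iter().map(build_fragment).collect();

    // Copy back into a single buffer: fragment `i` lands at row offset
    // `i * NUM_ROWS_PER_FRAGMENT`, so destination ranges never overlap.
    let mut trace = vec![0u64; fragments.len() * NUM_ROWS_PER_FRAGMENT];
    for (i, fragment) in fragments.into_iter().enumerate() {
        let start = i * NUM_ROWS_PER_FRAGMENT;
        trace[start..start + NUM_ROWS_PER_FRAGMENT].copy_from_slice(&fragment.0);
    }
    trace
}
```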

This is tested exhaustively in the test_trace_generation_at_fragment_boundaries() test (currently 25 different rstest cases). We also now check, for every test run with Test::execute() (and similar), that the trace generated by Process is the same as the trace generated by build_trace().

Note to reviewers: in the interest of time, this PR leaves a number of TODO(plafer) comments, which I plan to address in a subsequent PR, after the entire trace is generated in parallel. I plan to open an issue for this before merging this one.

  • TODO: benchmark build_trace()

@plafer changed the title from WIP parallel processor to Implement parallel trace generation on May 28, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 6 times, most recently from 909d72d to 23ce6bf on June 3, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 7 times, most recently from cb9ad2a to e8ca5c9 on June 11, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from cf18eaf to d14734d on June 12, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch from d14734d to 58346f1 on June 22, 2025
Contributor

@bobbinth left a comment

Thank you! Looks good. Not a review yet, but I left some comments inline.

Comment on lines 117 to 122
// Map operations - store the values separately so we can return references
pub map_gets: VecDeque<(RpoDigest, bool)>, // (key, found)
pub map_get_values: VecDeque<Vec<Felt>>, // values for successful gets
pub map_inserts: VecDeque<(Word, Vec<Felt>)>,
Contributor

Do we actually need these? The only way to access advice map data is by putting it onto the advice stack first, and so we should be able to safely ignore any updates to the advice map (or reads directly from it).

Contributor Author

Ah yes, this doesn't make sense - I generated this (with Claude) as a placeholder a while back and didn't come back to clean it up. Basically all of AdviceReplay is placeholder right now.

Comment on lines 122 to 129
// Merkle store operations
pub tree_nodes: VecDeque<(Word, Felt, Felt, Word)>,
pub merkle_paths: VecDeque<(Word, Felt, Felt, MerklePath)>,
pub leaf_depths: VecDeque<(Word, Felt, Felt, u8)>,
pub node_updates: VecDeque<(Word, Felt, Felt, Word, (MerklePath, Word))>,
pub root_merges: VecDeque<(Word, Word, Word)>,
Contributor

Similarly, I think we may get away without keeping track of node_updates and root_merges. Also, do we actually use leaf_depths for anything? Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.

Contributor Author

Similarly, I think we may get away without keeping track of node_updates and root_merges.

Also generated placeholder code, so yes, it will be removed.

Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.

It wasn't used anywhere, so I created #1905.

Comment on lines 416 to 211
#[derive(Debug)]
pub struct HasherReplay {
Contributor

I wonder if we actually need this replay: the fragment should have all the data needed to compute the hashes - so, if I understood correctly, by having this replay we:

  • Save time in the fragment builder by avoiding computing each hash twice.
  • Make the fast processor a bit slower by requiring it to record the hasher replay.

If so, it may be an acceptable tradeoff to take, as being able to execute as fast as possible is probably more important than saving hash computations in a separate thread.

Contributor Author

if I understood correctly, by having this replay we:

Save time in the fragment builder by avoiding computing each hash twice.
Make the fast processor a bit slower by requiring it to record the hasher replay.

That's right. And ultimately we'll find the right balance of what to replay and what to recompute only through benchmarks. But my initial reasoning was that hash computations were orders of magnitude more expensive than a Vec::push(), so it was probably worth adding a Vec::push() to the FastProcessor so that trace generation can go faster (under the assumption that trace generation will be the bottleneck). But we can revisit when we have benchmark data.
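
For context, the record/replay pattern under discussion has roughly this shape (a minimal sketch with illustrative names and digest type, not the PR's actual API):

```rust
use std::collections::VecDeque;

type Digest = [u64; 4]; // illustrative stand-in for the real digest type

#[derive(Debug, Default)]
struct HasherReplay {
    results: VecDeque<Digest>,
}

impl HasherReplay {
    /// Called by the fast processor during execution: a cheap push, instead
    /// of making trace generation recompute the hash later.
    fn record(&mut self, digest: Digest) {
        self.results.push_back(digest);
    }

    /// Called by a fragment builder during trace generation: replays the
    /// recorded result, avoiding an expensive hash recomputation.
    fn replay(&mut self) -> Digest {
        self.results.pop_front().expect("hasher replay exhausted")
    }
}
```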

Comment on lines 10 to 15
if value != ONE {
panic!(
"Assertion failed: expected 1, got {} at clock {}",
value, self.state.system.clk
);
}
Contributor

nit: this could be:

assert!(value == ONE, "assertion failed: expected...");

But also, should we use debug_assert! instead? My understanding is that this shouldn't happen unless something in the fast processor is messed up.

Contributor Author

Yes indeed - switched to debug_assert.
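
For reference, the replacement looks something like this (a sketch; the exact message text may differ):

```rust
debug_assert!(
    value == ONE,
    "assertion failed: expected 1, got {} at clock {}",
    value, self.state.system.clk
);
```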

Comment on lines 69 to 73
BasicBlock {
/// Index of the operation batch within the basic block
batch_index: usize,
/// Index of the operation within the batch
op_idx_in_batch: usize,
/// Whether a RESPAN operation needs to be added before executing this batch. When true,
/// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
/// `op_index_in_batch` MUST be set to 0.
needs_respan: bool,
},
Contributor

Should we avoid starting fragments in the middle of an operation group? There are at most 72 operations per group (and usually much fewer), and so fragment length will never vary by more than 72 cycles.

We could even try to never split basic blocks, but a basic block could be pretty long (potentially thousands of cycles) - and so, I'm much less sure about that.

Contributor Author

Should we avoid starting fragments in the middle of an operation group?

Yeah, I went back and forth on how exactly to design this. The ultimate reason I went down this route was so that all fragments are exactly NUM_ROWS_PER_FRAGMENT long (except for the last one). The main benefit of this is that we can easily parallelize writing those fragments back to a single MainTrace buffer (i.e. the exact location in the buffer where each fragment is written is determined directly from the fragment index).

I think it was also solving another problem I was having, but I'm now forgetting what it was (I'll update this thread if/when I remember).

So I'll leave it as-is for now, but we can for sure revisit if the downsides turn out to outweigh the benefits.
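
To illustrate the needs_respan invariant documented on the variant above, a fragment builder resuming inside a basic block might look roughly like this (everything except the three fields of the variant is hypothetical):

```rust
// Hypothetical stand-ins for decoder hooks; only the three fields mirror the
// enum variant above.
enum ResumePoint {
    BasicBlock { batch_index: usize, op_idx_in_batch: usize, needs_respan: bool },
}

fn emit_respan_row(batch_index: usize) {
    // write the RESPAN row that precedes `batch_index` into the fragment
    let _ = batch_index;
}

fn execute_batch_from(batch_index: usize, start_op_idx: usize) {
    // decode and execute the ops of `batch_index` starting at `start_op_idx`
    let _ = (batch_index, start_op_idx);
}

fn resume(point: ResumePoint) {
    match point {
        ResumePoint::BasicBlock { batch_index, op_idx_in_batch, needs_respan } => {
            if needs_respan {
                // per the invariant above: the fragment opens with the RESPAN
                // row, `batch_index` is the batch executed right after it, and
                // `op_idx_in_batch` must therefore be 0.
                debug_assert_eq!(op_idx_in_batch, 0);
                emit_respan_row(batch_index);
            }
            execute_batch_from(batch_index, op_idx_in_batch);
        }
    }
}
```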

Comment on lines 184 to 532
pub struct CoreTraceState {
pub system: SystemState,
pub stack: StackState,
pub block_stack: BlockStack,
pub traversal: ExecutionTraversal,
pub memory: MemoryReplay,
pub advice: AdviceReplay,
pub hasher: HasherReplay,
pub external_node_replay: ExternalNodeReplay,
}
Contributor

This is the struct that the fast processor needs to build for every fragment, right? If so, we should try to minimize its footprint as much as possible, which may include:

  • Simplifying or eliminating the need for AdviceReplay as mentioned in some other comments.
  • Potentially getting rid of the HasherReplay.
  • Maybe combining BlockStack, ExecutionTraversal, and ExternalNodeReplay.

Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of ExecutionTraversals, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?

Contributor Author

Simplifying or eliminating the need for AdviceReplay as mentioned in some other comments.

An AdviceReplay (or equivalently, an AdviceProvider with some "replay" functionality) is the only way I found to fully isolate the fragment generators from the Host/AdviceProvider and fully ensure determinism (i.e. that the fragment generators get the same result for every "query" as the fast processor did). Otherwise we have to push that "ensure determinism" complexity into the host/advice provider, which might make sense, but I didn't want to tackle it at this time.

Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of ExecutionTraversals, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?

Hmmm, looking back at it now, it does seem like I missed that; the traversal needs to encode any other MastForest used. But the basic/rough idea was:

  • we need the BlockStack to know how to populate END rows
  • we need the ExecutionTraversal to know how to "go back up" the MastForest
    • where we didn't need this in Process since that information is basically on the "Rust stack"
  • ExternalNodeReplay: similar to AdviceProviderReplay, we need this to ensure determinism (and to avoid needing to pass the Host to trace fragment generators)

This is also pre-cleanup, so yeah, they should probably end up together in a struct or something.

@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from 098baca to 8dd5d23 on June 24, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from 5508afe to cb0dd6f on July 9, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 3 times, most recently from d96cb43 to b7f5026 on July 24, 2025
Contributor

@Al-Kindi-0 left a comment

Did a second pass and left some questions and nits.

Contributor

@Al-Kindi-0 left a comment

I have more or less reviewed everything, except for some test modules, which I will do in a final pass.

Contributor

@Al-Kindi-0 left a comment

Great work again!
Left some additional questions and nits, which are non-blocking.

Comment on lines +56 to +58
// 5: SPLIT
// 6: BLOCK ADD END
// 9: END
Contributor

Is this correct? It should be the same as above, just with SPLIT instead of JOIN in the inner part.

Contributor Author

Yes, it's correct, since what I'm showing here is the execution, not the MAST. So since we have ONE on the stack, only the "true" branch is executed (i.e. BLOCK ADD END).

Comment on lines +65 to +67
// 5: SPLIT
// 6: BLOCK SWAP END
// 9: END
Contributor

Same here.

Contributor Author

Same idea, but the "false" branch is executed.

Contributor

@huitseeker left a comment

Not a full review, but leaving a few notes inline, as they may help later work.

Comment on lines +1149 to +1151
for (op_idx_in_batch, &op) in batch.ops().iter().enumerate().skip(start_op_idx) {
let (op_group_idx, op_idx_in_group) =
batch.op_idx_in_batch_to_group(op_idx_in_batch).expect("invalid op batch");
Contributor

Ah, I would recommend leaving a TODO to add an operation iterator that gives you those group indexes, since the complexity of performing that is O(1) per operation (as opposed to the current O(log N) where N is the batch length).

It's because that iterator is easy to make (and group indexes weren't really used elsewhere) that I didn't add a group-indexing API already. You can see an example with group_chunks, which can be tweaked to support your next_op_group_index approach as well.

This doesn't have to be done in this PR, of course.
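
For reference, such an iterator might look like the following over a simplified batch representation (group_end_indices and the flat op list are assumptions for the sketch, not the real OpBatch API):

```rust
/// Simplified stand-in for an op batch: operations stored flat, plus the
/// (exclusive) end index of each op group within the flat list.
struct Batch {
    ops: Vec<u8>, // stand-in for Operation
    group_end_indices: Vec<usize>,
}

impl Batch {
    /// Yields (group_idx, op_idx_in_group, op) in amortized O(1) per op,
    /// instead of an O(log N) index-to-group lookup for every operation.
    fn ops_with_group_indices(&self) -> impl Iterator<Item = (usize, usize, u8)> + '_ {
        let mut group_idx = 0;
        let mut op_idx_in_group = 0;
        self.ops.iter().enumerate().map(move |(i, &op)| {
            // advance to the group that contains op `i`
            while i >= self.group_end_indices[group_idx] {
                group_idx += 1;
                op_idx_in_group = 0;
            }
            let item = (group_idx, op_idx_in_group, op);
            op_idx_in_group += 1;
            item
        })
    }
}
```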

Contributor Author

Created #2170

// if we executed all operations in a group and haven't reached the end of the batch
// yet, set up the decoder for decoding the next operation group
if op_idx_in_batch + 1 == end_indices[op_group_idx]
&& let Some(next_op_group_idx) = batch.next_op_group_index(op_group_idx)
Contributor

This costs a linear scan per group; I'd precompute a "next op-group" table once, or use the above suggestion to refactor to iterators.
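
For example, assuming the scan exists to skip over groups that hold immediate values rather than operations, the table could be precomputed once per batch along these lines (names hypothetical):

```rust
/// One-time precomputation per batch: for each op group, the index of the
/// next group that actually contains operations, if any. The per-group check
/// then becomes a single O(1) lookup instead of a linear scan each time.
fn next_op_group_table(group_has_ops: &[bool]) -> Vec<Option<usize>> {
    let mut table = vec![None; group_has_ops.len()];
    let mut next = None;
    for g in (0..group_has_ops.len()).rev() {
        table[g] = next;
        if group_has_ops[g] {
            next = Some(g);
        }
    }
    table
}
```

With such a table, the batch.next_op_group_index(op_group_idx) call above becomes a table[op_group_idx] lookup.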

Contributor Author

Also referenced in #2170

Contributor

@adr1anh left a comment

Looks really good! Left a couple of nits, but otherwise it's good to go from my end.

Contributor Author

@plafer commented Sep 22, 2025

@bobbinth I think we were waiting for 0.18 to be released before merging this, right? Then it should be good to go.

Contributor Author

@plafer commented Sep 22, 2025

test_secp256k1_point_multiplication() and similar tests are getting SIGTERM'd, probably because they run out of memory: with this PR we generate the trace twice (once from Process, and once from build_trace()) and make sure that they match. Otherwise, the tests pass locally.

Is it possible to increase the amount of memory available to the CI? If not, I can try to think of another workaround.

@plafer force-pushed the plafer-1558-parallel-tracegen branch 3 times, most recently from e31d96b to 1f5bc8b on October 3, 2025
Contributor Author

@plafer commented Oct 3, 2025

In 1f5bc8b I ended up limiting the size of traces built by build_trace() that we'll check for correctness (CI only); the current max is 2^21.

plafer added 6 commits on October 7, 2025
Previously, we would only do it for external nodes. Accordingly, we adjusted the naming in the `Tracer` trait to reflect that we are no longer only recording external node resolutions.
These will be useful for managing batch/group/operation indices in trace generation.
This will be useful in parallel trace generation when using `Processor::execute_sync_op`, which requires a Host, but we are not actually getting any data from a host (we are getting it from replays instead).
@plafer force-pushed the plafer-1558-parallel-tracegen branch from 1f5bc8b to bba4e21 on October 7, 2025
@plafer merged commit b3fa2a3 into next on Oct 7, 2025
15 checks passed
@plafer deleted the plafer-1558-parallel-tracegen branch on October 7, 2025