feat: Introduce FastProcessor checkpoints #2023
Conversation
@bobbinth this is ready for a high-level review. I'm still debugging the end-to-end tests with parallel trace generation, which provides interesting insights for this PR. My plan is to […]

Specifically, there are interesting edge cases & potential off-by-one errors in the interface between the […]
Thank you! Looks good! I took a very high-level look (and still don't fully understand everything), but overall, the approach should work. I left a few small comments inline.

I think the main thing that is not too clear to me yet is whether the approach we've taken to record the execution of the program itself is optimal. For example, could we have something like `Vec<(RowIndex, Continuation)>` in `CoreTraceStateBuilder` instead of `BlockStack`? We'll probably need to track a bit more info with it, but it should be sufficient to create start/end rows for MAST nodes in the trace.
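The suggested alternative could be sketched roughly as follows. All names (`RowIndex`, `Continuation`, the builder's methods) are illustrative stand-ins, not the crate's actual types:

```rust
// Hypothetical sketch: instead of snapshotting a full `BlockStack`, record a
// flat log of (row, continuation) pairs as MAST nodes start and finish.

/// Row index within the execution trace (illustrative newtype).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowIndex(pub u32);

/// What happens to a MAST node at a given row (illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Continuation {
    /// A node (identified by its id) starts executing at this row.
    StartNode(u32),
    /// The node finishes executing at this row.
    FinishNode(u32),
}

/// Trace-state builder keeping a flat log instead of a block stack.
#[derive(Default)]
pub struct CoreTraceStateBuilder {
    pub node_log: Vec<(RowIndex, Continuation)>,
}

impl CoreTraceStateBuilder {
    pub fn record_start(&mut self, row: u32, node_id: u32) {
        self.node_log.push((RowIndex(row), Continuation::StartNode(node_id)));
    }

    pub fn record_finish(&mut self, row: u32, node_id: u32) {
        self.node_log.push((RowIndex(row), Continuation::FinishNode(node_id)));
    }
}
```

A trace generator could then scan `node_log` to emit start/end rows for each node without replaying a stack structure.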
```rust
pub overflow: OverflowTable,
pub block_stack: BlockStack,
```
Cloning these every 1024 cycles may get quite expensive (especially if the MAST tree or overflow tables are deep). Not sure if these will work, but here are potential alternatives:
- For the overflow table, we could probably keep a "replay" of values being moved back onto the stack top from the overflow table (similar to how we have memory, advice, etc. replays).
- For the block stack table, maybe we can keep a "transcript" of executed nodes - something similar to `NodeExecutionPhase` but with hasher addresses added to the internal variants.
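The "replay" idea for the overflow table might look roughly like this sketch (the struct and method names are assumptions for illustration; the crate's existing memory/advice replays would be the model to follow):

```rust
// Illustrative sketch: rather than cloning the overflow table at each
// checkpoint, record only the values that get moved back onto the stack top,
// so a trace generator can later replay them in order.
#[derive(Default)]
pub struct OverflowReplay {
    /// Values popped from the overflow table back onto the stack, in order.
    popped: Vec<u64>,
    /// Read cursor used during replay.
    cursor: usize,
}

impl OverflowReplay {
    /// Called during execution whenever a value leaves the overflow table.
    pub fn record_pop(&mut self, value: u64) {
        self.popped.push(value);
    }

    /// Called during trace generation to re-obtain the next popped value.
    pub fn replay_pop(&mut self) -> Option<u64> {
        let v = self.popped.get(self.cursor).copied();
        if v.is_some() {
            self.cursor += 1;
        }
        v
    }
}
```

The appeal is that the cost is proportional to the values actually moved, not to the table's depth at each checkpoint.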
```rust
// State stored at the start of a trace fragment
snapshot_start: Option<SnapshotStart>,
```
It feels like there is some redundancy between the data in `snapshot_start` and `block_stack` (especially between `block_stack` and `continuation_stack`) - though, I don't know yet what's the best way to reconcile them.

Also a question: why is `snapshot_start` optional? i.e., under what circumstances is it `None`?
`processor/src/fast/checkpoints.rs` (outdated)
```rust
BasicBlock {
    /// Node ID of the basic block being executed
    node_id: MastNodeId,
    /// Index of the operation batch within the basic block
    batch_index: usize,
    /// Index of the operation within the batch
    op_idx_in_batch: usize,
    /// Whether a RESPAN operation needs to be added before executing this batch. When true,
    /// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
    /// `op_idx_in_batch` MUST be set to 0.
    needs_respan: bool,
},
```
In theory, we should be able to reduce this to just a tuple of values `(node_id, op_index)`, where `op_index` is the first operation in this block to start executing from. But this would be much easier to implement after the batch execution refactoring.
We could combine `batch_index` and `op_idx_in_batch` into `op_index` (to the benefit of the fast processor after the refactoring), but I think `needs_respan` (or equivalent) needs to stay, because otherwise we can't distinguish between the 2 states:
- the first operation in batch `i`
- the RESPAN right before the first operation in batch `i`
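The folding itself is simple arithmetic, sketched below under the assumption of a fixed batch size (the constant and helper names are illustrative, not the crate's API):

```rust
// Illustrative: folding (batch_index, op_idx_in_batch) into a single op_index,
// assuming every batch holds a fixed number of operations.
const OPS_PER_BATCH: usize = 8; // assumed batch size for this sketch

fn to_op_index(batch_index: usize, op_idx_in_batch: usize) -> usize {
    batch_index * OPS_PER_BATCH + op_idx_in_batch
}

fn from_op_index(op_index: usize) -> (usize, usize) {
    (op_index / OPS_PER_BATCH, op_index % OPS_PER_BATCH)
}
```

Note that "the first operation in batch `i`" and "the RESPAN right before batch `i`" both map to `to_op_index(i, 0)`, which is exactly why a separate flag like `needs_respan` must survive the folding.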
`processor/src/fast/checkpoints.rs` (outdated)
```rust
/// Specifies the execution phase when starting fragment generation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeExecutionPhase {
```
As mentioned in one of the previous comments, this feels very similar to the `Continuation` enum, and maybe there is a way to combine them.
They're different in a subtle way. `ContinuationStack` works at the granularity of `MastNodeId`s in 2 states: start or finish. It encodes what comes after the current node (the current node itself is not explicitly encoded in the `ContinuationStack`). `NodeExecutionPhase` is there to encode what the `ContinuationStack` doesn't: exactly where we're at in the execution of the current node.

The reason we don't want to merge them is that we only want to create a `NodeExecutionPhase` at a fragment boundary - otherwise, for example, we'd be creating a new `NodeExecutionPhase` at each operation. Also, they're different at a fundamental level: `NodeExecutionPhase` doesn't encode a continuation - it encodes the "current state".

`CoreTraceFragmentGenerator::execute_fragment_generation()` in #1839 can provide more intuition for how the `NodeExecutionPhase` is used: we basically start a fragment by finishing the current node, and then keep going by looping over the `ContinuationStack` (in a way similar to the fast processor).
Maybe `NodeExecutionState` would be a better name.
Renamed to `NodeExecutionState`, and added more docs.
This is now in a reviewable state - I will stop force pushing, but intend to clean it up further. Some things still to be cleaned up (which I'd gladly take suggestions for): […]

@adr1anh since your last review, I removed […] Optionally, you can look at #1839 (stacked on top of this one) to see how […]
Looks great to me!
Left a few comments (mostly nits).
My main question relates to the branching in `check_extract_trace_state`, which is called quite frequently - can we do anything about it? I can't immediately see a global solution, but maybe it is worth thinking about some solution nevertheless.

Yes, we have a solution for this in a following PR (TLDR: it involves generics) - should have mentioned it earlier.
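One way a generics-based fix could work is to make the processor generic over an execution mode so the branch is resolved at compile time. Everything below (`Mode`, `Execute`, `BuildTrace`, the method body) is an assumption sketched for illustration, not the follow-up PR's actual design:

```rust
use std::marker::PhantomData;

/// Compile-time execution mode (illustrative).
pub trait Mode {
    const BUILDS_TRACE: bool;
}
pub struct Execute;
pub struct BuildTrace;
impl Mode for Execute { const BUILDS_TRACE: bool = false; }
impl Mode for BuildTrace { const BUILDS_TRACE: bool = true; }

pub struct Processor<M: Mode> {
    pub checkpoints_taken: usize,
    _mode: PhantomData<M>,
}

impl<M: Mode> Processor<M> {
    pub fn new() -> Self {
        Self { checkpoints_taken: 0, _mode: PhantomData }
    }

    /// With `M` known at compile time, this branch is a constant and the
    /// whole call monomorphizes to a no-op in the pure-execution mode.
    #[inline(always)]
    pub fn check_extract_trace_state(&mut self) {
        if M::BUILDS_TRACE {
            self.checkpoints_taken += 1;
        }
    }
}
```

The trade-off is the usual one for static dispatch: two monomorphized copies of the hot loop instead of one branchy version.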
This could be left for a future PR, but I noticed there are many instances where we use […]

The reason they're […]
Rebased on the latest […]
Discussed async, LGTM!
Looks great! Thank you! I left some comments inline - most of them are about improving code organization, and many could be done in follow-up PRs.

Also, it may have been somewhere in the comments, but I'm curious what the current performance looks like for:
- Pure execution (with the `NoopTracer`).
- Execution with trace building.
```rust
// Note: we pass in a `NoopTracer`, because the parallel trace generation skips the circuit
// evaluation completely
```
Is this temporary (i.e., until the chiplets trace is also being built)? Or do we not need to capture the memory read here at all?
This is not temporary - the parallel trace generators assume that the program was run correctly, and hence that the circuit evaluated to 0. So re-running the circuit evaluation would be wasteful, and not provide any useful data for trace generation purposes. Therefore they'll never make the memory queries, and so it's important that we don't record them here.
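The mechanism behind this can be sketched with a minimal tracer trait: an empty inlined implementation makes the record calls free, and nothing is captured for the trace generators to replay. The trait and method names below are illustrative assumptions, not the crate's actual `Tracer` API:

```rust
/// Illustrative tracer interface: execution code calls `record_memory_read`
/// unconditionally; the tracer decides whether anything is kept.
pub trait Tracer {
    fn record_memory_read(&mut self, addr: u64, value: u64);
}

/// Tracer that records nothing. Used where reads must NOT be captured,
/// e.g. a circuit evaluation that the parallel trace generators skip.
pub struct NoopTracer;

impl Tracer for NoopTracer {
    #[inline(always)]
    fn record_memory_read(&mut self, _addr: u64, _value: u64) {}
}

/// Recording tracer, for comparison: keeps every read for later replay.
#[derive(Default)]
pub struct RecordingTracer {
    pub reads: Vec<(u64, u64)>,
}

impl Tracer for RecordingTracer {
    fn record_memory_read(&mut self, addr: u64, value: u64) {
        self.reads.push((addr, value));
    }
}
```

Passing `NoopTracer` into the circuit-evaluation path thus guarantees those memory queries never enter the replay, matching the invariant that the generators assume the circuit already evaluated to 0.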
nit: I would probably have a separate "trace" module and move trace-related structs there. For example:
- This file could be under `src/fast/trace/state_builder.rs`.
- We could also have `src/fast/trace/state.rs`, and in the future probably more sub-modules.
```rust
    Ok((stack_outputs, self.advice))
}

async fn execute_impl(
```
Do we still need `execute_impl()`? It seems like now it could be replaced with `execute_with_tracer()`. If we do still need it, it may be a good idea to add a brief doc comment explaining why.
We need it so that tests can access the processor state after execution - added that to a docstring.
`processor/src/fast/mod.rs` (outdated)
```rust
current_forest: &Arc<MastForest>,
continuation_stack: &mut ContinuationStack,
host: &mut impl AsyncHost,
tracer: &mut impl Tracer,
```
Not for this PR, but I wonder if it would make sense to create a wrapper struct to group some of these parameters together - e.g., something like `ExecutionContext`. This way, we could make these method signatures (and their call-sites) a bit more concise.
Agreed - but let's leave this for a subsequent PR
Not for this PR, but we should try to split this file up into several smaller files to make it easier to follow.
Gave it a first pass - we can always refine later
To be used as the main checkpoint struct in the parallel processor
Partially closes #2022 (i.e. only the core trace checkpoints are generated).

Note that I didn't split the `FastProcessor`'s `mod.rs` file just yet, since this will cause pretty bad conflicts when merging the `async-experimental` branch. I would prefer to wait for that before we split the file up.

One of the challenges with this PR was accommodating the 2 modes that we want the `FastProcessor` to run in: pure execution, and execution with trace-state checkpoint generation. Mode 1 requires storing a subset of what is needed for mode 2. For example, for the overflow stack, mode 1 requires just the overflowed stack elements, whereas mode 2 requires the extra features provided by `OverflowTable`. Similarly, for execution context management, mode 1 requires the lightweight `ExecutionContextInfo` (defined in `FastProcessor`'s `mod.rs`), whereas mode 2 requires the beefier `BlockStack`. In order for mode 1 to be as fast as it can, I decided to include both data structures: in mode 1 we use the fastest version, while in mode 2 we use both versions "simultaneously" (and hence do some duplicate work). I found that to be the best of both worlds. Note that the data structures needed for mode 2 are all stored in `CoreTraceStateBuilder`.

Finally, `FastProcessor::execute_op_batch()` is quite complex now due to the 2 simultaneous definitions of "operation batch index": the `NodeExecutionPhase::BasicBlock` variant (stored in `CoreTraceState`) uses the new definition that takes the inserted NOOPs into account. This should be naturally cleaned up with #1815.
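The "both data structures simultaneously" approach described above can be sketched as follows. All names here are simplified stand-ins for the crate's actual types (the real `OverflowTable` tracks more than a clock value), and the `Option` gating is an assumed mechanism for switching between modes:

```rust
/// Lightweight overflow storage: just the overflowed elements (mode 1).
#[derive(Default)]
pub struct FastOverflow(pub Vec<u64>);

/// Richer overflow table (mode 2); simplified here to also remember the
/// clock cycle at which each value overflowed.
#[derive(Default)]
pub struct OverflowTable(pub Vec<(u64, u64)>);

#[derive(Default)]
pub struct Processor {
    pub fast_overflow: FastOverflow,
    /// Present only when building trace-state checkpoints (mode 2).
    pub overflow_table: Option<OverflowTable>,
}

impl Processor {
    pub fn push_overflow(&mut self, clk: u64, value: u64) {
        // Mode 1 path: always maintained, as cheap as possible.
        self.fast_overflow.0.push(value);
        // Mode 2 path: duplicate work, only when checkpointing is enabled.
        if let Some(table) = self.overflow_table.as_mut() {
            table.0.push((value, clk));
        }
    }
}
```

In mode 1 the `Option` stays `None`, so the only cost on the hot path is one predictable branch; in mode 2 both structures are updated in lockstep, accepting the duplicate work in exchange for keeping mode 1 unencumbered.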