Conversation

plafer
Contributor

@plafer plafer commented Jul 22, 2025

Partially closes #2022 (i.e. only the core trace checkpoints are generated)

Note that I didn't split the FastProcessor's mod.rs file just yet, since this will cause pretty bad conflicts when merging the async-experimental branch. I would prefer to wait for that before we split the file up.

One of the challenges with this PR was in accommodating the 2 modes that we want the FastProcessor to run in:

  1. no trace states (i.e. run as fast as possible), and
  2. generating trace states for future parallel trace generation.

Mode 1 requires storing a subset of what is needed for mode 2. For example, for the overflow stack, mode 1 requires just the overflowed stack elements, whereas mode 2 requires the extra features provided by OverflowTable. Similarly, for execution context management, mode 1 requires the lightweight ExecutionContextInfo (defined in FastProcessor's mod.rs), whereas mode 2 requires the beefier BlockStack.

For mode 1 to be as fast as it can, I decided to include both data structures: in mode 1 we use only the fastest version, whereas in mode 2 we use both versions "simultaneously" (and hence do some duplicate work). I found that to be the best of both worlds. Note that the data structures needed for mode 2 are all stored in CoreTraceStateBuilder.
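
The two modes above could be sketched roughly as follows (all names here are hypothetical stand-ins, not the actual miden-vm types): the lightweight structure is always maintained on the fast path, while the heavier trace-building structures live behind an Option and are only updated when trace states are requested.

```rust
struct ExecutionContextInfo {
    depth: usize, // lightweight mode-1 bookkeeping (stand-in)
}

struct BlockStack {
    frames: Vec<usize>, // heavier mode-2 bookkeeping (stand-in)
}

struct CoreTraceStateBuilder {
    block_stack: BlockStack,
}

struct Processor {
    ctx_info: ExecutionContextInfo,               // always updated (mode 1)
    trace_builder: Option<CoreTraceStateBuilder>, // updated only in mode 2
}

impl Processor {
    fn new(build_trace_states: bool) -> Self {
        Processor {
            ctx_info: ExecutionContextInfo { depth: 0 },
            trace_builder: build_trace_states.then(|| CoreTraceStateBuilder {
                block_stack: BlockStack { frames: Vec::new() },
            }),
        }
    }

    fn enter_block(&mut self, addr: usize) {
        // the mode-1 fast path always runs...
        self.ctx_info.depth += 1;
        // ...and mode 2 duplicates the work on its own structure when enabled.
        if let Some(builder) = self.trace_builder.as_mut() {
            builder.block_stack.frames.push(addr);
        }
    }
}

fn main() {
    let mut tracing = Processor::new(true);
    tracing.enter_block(7);
    println!("depth = {}", tracing.ctx_info.depth);
}
```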

Finally, FastProcessor::execute_op_batch() is quite complex now due to the 2 simultaneous definitions of "operation batch index":

  • the decorators still use the old definition that doesn't account for inserted NOOPs
  • the NodeExecutionPhase::BasicBlock variant (stored in CoreTraceState) uses the new definition that takes the inserted NOOPs into account

This should be naturally cleaned up with #1815.

@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch 6 times, most recently from b6fe3b0 to 94be920 Compare July 29, 2025 20:55
@plafer plafer marked this pull request as ready for review July 29, 2025 20:55
@plafer plafer requested review from bobbinth and adr1anh July 29, 2025 20:55
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch 5 times, most recently from 0d51839 to 1fb7d99 Compare July 31, 2025 16:57
@plafer plafer marked this pull request as draft July 31, 2025 20:52
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from 1fb7d99 to ec93d8b Compare August 1, 2025 18:36
@plafer
Contributor Author

plafer commented Aug 1, 2025

@bobbinth this is ready for a high-level review. I'm still debugging the end-to-end tests with parallel trace generation, which provides interesting insights for this PR. My plan is to

  1. Make all end-to-end trace generation tests pass (going well so far)
  2. Clean up this PR in light of any learnings, and add documentation

Specifically, there are interesting edge cases and potential off-by-one errors in the interface between the FastProcessor and the trace generators that need to be properly documented. But you can still look into it to make sure we're aligned on the high-level direction, and point out if anything is missing.

Contributor

@bobbinth bobbinth left a comment

Thank you! Looks good! I took a very high-level look (and still don't fully understand everything), but overall, the approach should work. I left a few small comments inline.

I think the main thing that is not yet clear to me is whether the approach we've taken to record the execution of the program itself is optimal. For example, could we have something like Vec<(RowIndex, Continuation)> in CoreTraceStateBuilder instead of BlockStack? We'll probably need to track a bit more info with it, but it should be sufficient to create start/end rows for MAST nodes in the trace.

Comment on lines 36 to 62
pub overflow: OverflowTable,
pub block_stack: BlockStack,
Contributor

Cloning these every 1024 cycles may get quite expensive (especially if the MAST tree or overflow tables are deep). Not sure if these will work, but here are potential alternatives:

  • For the overflow table, we could probably keep a "replay" of values being moved back onto the stack top from the overflow table (similar to how we have memory, advice etc. replays).
  • For the block stack table, maybe we can keep a "transcript" of executed nodes - something similar to NodeExecutionPhase but with hasher addresses added to the internal variants.
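
The replay idea in the first bullet could look something like this minimal sketch (names hypothetical, not the actual crate API): during fast execution we only record the values that move back onto the stack top, and a trace generator later consumes them in order instead of re-simulating the full OverflowTable.

```rust
use std::collections::VecDeque;

#[derive(Default)]
struct OverflowReplay {
    // values moved back onto the stack top, in execution order
    restored: VecDeque<u64>,
}

impl OverflowReplay {
    // Recorded by the fast processor whenever a value leaves the overflow table.
    fn record_restored(&mut self, value: u64) {
        self.restored.push_back(value);
    }

    // Consumed by a trace fragment generator, which therefore never needs the
    // full OverflowTable - only the values in the order they reappear.
    fn replay_restored(&mut self) -> Option<u64> {
        self.restored.pop_front()
    }
}

fn main() {
    let mut replay = OverflowReplay::default();
    replay.record_restored(3);
    replay.record_restored(5);
    println!("{:?}", replay.replay_restored());
}
```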

Comment on lines 45 to 49
// State stored at the start of a trace fragment
snapshot_start: Option<SnapshotStart>,
Contributor

It feels like there is some redundancy between the data in snapshot_start and block_stack (especially between block_stack and continuation_stack) - though, I don't know yet what's the best way to reconcile them.

Also question: why is snapshot_start optional? i.e., under what circumstances it is None?

Comment on lines 23 to 27
BasicBlock {
    /// Node ID of the basic block being executed
    node_id: MastNodeId,
    /// Index of the operation batch within the basic block
    batch_index: usize,
    /// Index of the operation within the batch
    op_idx_in_batch: usize,
    /// Whether a RESPAN operation needs to be added before executing this batch. When true,
    /// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
    /// `op_idx_in_batch` MUST be set to 0.
    needs_respan: bool,
},
Contributor

In theory, we should be able to reduce this just to a tuple of values (node_id, op_index) where op_index is the first operation in this block to start executing from. But this would be much easier to implement after the batch execution refactoring.

Contributor Author

We could combine batch_index and op_idx_in_batch into op_index (to the benefit of the fast processor after the refactoring), but I think needs_respan (or equivalent) needs to stay, because otherwise we can't distinguish between the 2 states

  1. the first operation in batch i
  2. the RESPAN right before the first operation in batch i
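
The ambiguity between those 2 states could be illustrated by a sketch like the following (hypothetical names, not the actual enum): with only an op_index, both resume points carry the same value, so a needs_respan-style distinction is what tells the generator which row to emit first.

```rust
#[derive(Debug, PartialEq)]
enum ResumePoint {
    // resume directly at the first operation of batch i
    FirstOpOfBatch { op_index: usize },
    // resume at the RESPAN preceding batch i; op_index still points at the
    // first operation of batch i, which is why a separate variant is needed
    RespanBeforeBatch { op_index: usize },
}

// What the trace generator emits first depends on the variant, not op_index.
fn first_row(p: &ResumePoint) -> &'static str {
    match p {
        ResumePoint::RespanBeforeBatch { .. } => "RESPAN",
        ResumePoint::FirstOpOfBatch { .. } => "first op of batch",
    }
}

fn main() {
    let a = ResumePoint::FirstOpOfBatch { op_index: 0 };
    let b = ResumePoint::RespanBeforeBatch { op_index: 0 };
    println!("{} vs {}", first_row(&a), first_row(&b));
}
```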

Comment on lines 18 to 17
/// Specifies the execution phase when starting fragment generation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeExecutionPhase {
Contributor

As mentioned in one of the previous comments, this feels very similar to the Continuation enum and maybe there is a way to combine them.

Contributor Author

They're different in a subtle way. ContinuationStack works at the granularity of MastNodeIds, in 2 states: start or finish. It encodes what comes after the current node (the current node itself is not explicitly encoded in the ContinuationStack). NodeExecutionPhase is there to encode what the ContinuationStack doesn't: exactly where we are in the execution of the current node.

The reason we don't want to merge them is that we only want to create a NodeExecutionPhase at a fragment boundary - otherwise, for example, we'd be creating a new NodeExecutionPhase at each operation. Also they're different at a fundamental level, where NodeExecutionPhase doesn't encode a continuation - it encodes the "current state".

CoreTraceFragmentGenerator::execute_fragment_generation() in #1839 can provide more intuition for how the NodeExecutionPhase is used: we basically start a fragment by finishing the current node, and then keep going by looping over the ContinuationStack (in a way similar to the fast processor).
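
The "finish the current node, then loop over the ContinuationStack" shape described above could be sketched like this (deliberately simplified types - the real ones carry much more state):

```rust
#[derive(Clone, Copy)]
enum Continuation {
    Start(u32),  // start executing this node id
    Finish(u32), // finish this node id (e.g. emit its END row)
}

enum NodeExecutionState {
    BasicBlock { node_id: u32, op_index: usize },
}

// Returns the order in which nodes are touched while generating one fragment.
fn generate_fragment(state: NodeExecutionState, mut continuations: Vec<Continuation>) -> Vec<u32> {
    let mut visited = Vec::new();
    // 1. finish the node we were in the middle of (starting from op_index)
    let NodeExecutionState::BasicBlock { node_id, op_index: _ } = state;
    visited.push(node_id);
    // 2. keep going by draining the continuation stack, as in the fast processor
    while let Some(c) = continuations.pop() {
        match c {
            Continuation::Start(id) | Continuation::Finish(id) => visited.push(id),
        }
    }
    visited
}

fn main() {
    let order = generate_fragment(
        NodeExecutionState::BasicBlock { node_id: 1, op_index: 4 },
        vec![Continuation::Finish(3), Continuation::Start(2)],
    );
    println!("{:?}", order);
}
```

Note how the NodeExecutionState is consumed exactly once, at the fragment boundary, while the ContinuationStack drives everything afterwards.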

Contributor Author

Maybe NodeExecutionState would be a better name

Contributor Author

Renamed to NodeExecutionState, and added more docs

@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch 9 times, most recently from fd22d69 to a0946f0 Compare August 9, 2025 16:37
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch 2 times, most recently from 3888d58 to 0d29227 Compare August 12, 2025 15:30
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch 2 times, most recently from 04562b3 to 73c98a1 Compare August 20, 2025 15:56
@plafer plafer marked this pull request as ready for review August 20, 2025 15:57
@plafer
Contributor Author

plafer commented Aug 20, 2025

This is now in a reviewable state - I will stop force-pushing, but intend to clean it up further. Some things still to be cleaned up (for which I'd gladly take suggestions):

  • reorganize CoreTraceStateBuilder (hasn't been cleaned up yet)
  • naming
  • docs
  • maybe a bit of moving things around

@adr1anh since your last review, I removed OverflowTable and BlockStack from CoreTraceState in favor of dedicated replays (in order to avoid expensive clones on the FastProcessor side, as well as passing just the data that the core trace fragment builders will need).

Optionally, you can look at #1839 (stacked on top of this one) to look at how CoreTraceState is being used to generate the trace - or you can just ask me if you don't feel like sifting through thousands of lines of code 🙂

Contributor

@Al-Kindi-0 Al-Kindi-0 left a comment

Looks great to me! I left a few comments (mostly nits).
My main question relates to the branching in check_extract_trace_state, which is called quite frequently, and whether we can do anything about it. I can't immediately see a global solution, but maybe it is worth thinking about one nevertheless.

@plafer
Contributor Author

plafer commented Aug 21, 2025

I can't immediately see a global solution but maybe it is worth thinking about some solution nevertheless

Yes, we have a solution for this in a following PR (TL;DR: it involves generics) - should have mentioned it earlier.

@adr1anh
Contributor

adr1anh commented Aug 22, 2025

This could be left for a future PR, but I noticed there are many instances where we use Felt to represent an address. We could instead use RowIndex which, as a u32, could be slightly more optimal.

@plafer
Contributor Author

plafer commented Aug 25, 2025

I noticed there are many instances where we use Felt to represent an address. We could instead use RowIndex which, as a u32, could be slightly more optimal.

The reason they're Felt (at least the ones in CoreTraceState) is that we rarely perform any operations on them (pretty much only on a RESPAN); otherwise they're stored directly in the trace. So having them already as Felt saves a Montgomery reduction during Felt::new() for each address on the trace generator side. The HasherChiplet is where we do computations over addresses, so they're stored as u32 there.
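
The trade-off described above could be sketched as follows (Felt here is a hypothetical stand-in, not the real field type): the conversion happens once, when the address is captured, so the trace generator writes addresses with a plain copy instead of converting each one.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Felt(u64);

impl Felt {
    fn new(value: u64) -> Self {
        // stand-in for the Montgomery reduction done by the real Felt::new()
        Felt(value % 0xFFFF_FFFF_0000_0001)
    }
}

// Recording side: convert once, when the address is captured.
fn record_addr(addr: u32) -> Felt {
    Felt::new(addr as u64)
}

// Trace generator side: no per-address conversion, just a copy into the column.
fn write_addr(column: &mut Vec<Felt>, addr: Felt) {
    column.push(addr);
}

fn main() {
    let mut column = Vec::new();
    write_addr(&mut column, record_addr(42));
    println!("{:?}", column);
}
```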

@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from f7881ba to 92bc273 Compare August 26, 2025 14:02
@plafer
Contributor Author

plafer commented Aug 26, 2025

Rebased on the latest next and squashed the "PR fix" commits

@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from 92bc273 to 08776c5 Compare August 26, 2025 14:09
Contributor

@adr1anh adr1anh left a comment

Discussed async, LGTM!

@plafer plafer mentioned this pull request Aug 27, 2025
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from ea20c25 to b3b9e33 Compare August 29, 2025 16:22
Contributor

@bobbinth bobbinth left a comment

Looks great! Thank you! I left some comments inline - most of them are about improving code organization and many could be done in follow-up PRs.

Also, it may have been somewhere in the comments, but I'm curious what does the current performance look like for:

  • Pure execution (with the NoopTracer).
  • Execution with trace building.

// Note: we pass in a `NoopTracer`, because the parallel trace generation skips the circuit
// evaluation completely
Contributor

Is this temporary (i.e., until the chiplets trace is also being built)? Or do we not need to capture the memory read here at all?

Contributor Author

This is not temporary - the parallel trace generators assume that the program was run correctly, and hence that the circuit evaluated to 0. So re-running the circuit evaluation would be wasteful, and not provide any useful data for trace generation purposes. Therefore they'll never make the memory queries, and so it's important that we don't record them here.
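
The pattern of selectively dropping such events could be sketched like this (names illustrative, not the actual crate API): a tracer trait with a no-op implementation that silently drops events which the parallel trace generators will never replay, such as the circuit-evaluation memory reads discussed above.

```rust
trait Tracer {
    fn record_memory_read(&mut self, addr: u32, value: u64);
}

// Passed in when events must not be recorded (or when running "as fast as possible").
struct NoopTracer;
impl Tracer for NoopTracer {
    fn record_memory_read(&mut self, _addr: u32, _value: u64) {} // dropped
}

// Passed in when events feed a later trace-generation phase.
#[derive(Default)]
struct RecordingTracer {
    reads: Vec<(u32, u64)>,
}
impl Tracer for RecordingTracer {
    fn record_memory_read(&mut self, addr: u32, value: u64) {
        self.reads.push((addr, value));
    }
}

fn main() {
    let mut noop = NoopTracer;
    noop.record_memory_read(10, 99); // goes nowhere
    let mut rec = RecordingTracer::default();
    rec.record_memory_read(10, 99); // kept for trace generation
    println!("recorded {} read(s)", rec.reads.len());
}
```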

Contributor

nit: I would probably have a separate "trace" module and move trace-related structs there. For example:

  • This file could be under src/fast/trace/state_builder.rs.
  • We could also have src/fast/trace/state.rs and in the future probably more sub-modules.

Ok((stack_outputs, self.advice))
}

async fn execute_impl(
Contributor

Do we still need execute_impl()? It seems like now it could be replaced with execute_with_tracer(). If we do still need it, may be a good idea to add a brief doc comment explaining why.

Contributor Author

We need it so that tests can access the processor state after execution - added that to a docstring.

Comment on lines 499 to 502
current_forest: &Arc<MastForest>,
continuation_stack: &mut ContinuationStack,
host: &mut impl AsyncHost,
tracer: &mut impl Tracer,
Contributor

Not for this PR, but I wonder if it would make sense to create a wrapper struct to group some of these parameters together - e.g., something like ExecutionContext. This way, we could make these method signatures (and their call sites) a bit more concise.

Contributor Author

Agreed - but let's leave this for a subsequent PR

Contributor

Not for this PR, but we should try to split this file up into several smaller files to make it easier to follow.

Contributor Author

Gave it a first pass - we can always refine later

To be used as the main checkpoint struct in the parallel processor
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from 6973f1a to 63a9c9a Compare September 4, 2025 15:28
@plafer plafer force-pushed the plafer-2022-fast-processor-checkpoints branch from 63a9c9a to 63bcf81 Compare September 4, 2025 15:29
@plafer plafer merged commit 46c1974 into next Sep 4, 2025
11 checks passed
@plafer plafer deleted the plafer-2022-fast-processor-checkpoints branch September 4, 2025 15:35
Successfully merging this pull request may close these issues.

FastProcessor::execute should return a list of checkpoints