Conversation

Contributor

@plafer commented May 28, 2025

Partially closes #1558

We now expose a new build_trace(Vec<TraceFragmentContext>), which takes the output of FastProcessor::execute_for_trace() and builds a trace in parallel (only the system, stack, and decoder columns so far).

Specifically, the strategy is to break the execution up into fixed-size fragments of n rows (e.g. 2048 or 4096), with one thread generating each corresponding fragment. The fragments are then copied back into a single MainTrace buffer. Hence, the additional complexity of parallel trace generation, compared to how Process does it, is that we need to be able to start generating the trace from any state, at any clock cycle.
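
Roughly, the shape of this is as follows (a minimal sketch only: build_fragment, the flat u64 buffer, and the struct contents are illustrative stand-ins, not the actual API):

```rust
use rayon::prelude::*;

const NUM_ROWS_PER_FRAGMENT: usize = 2048;

// Illustrative stand-ins for the real types.
struct TraceFragmentContext; // state captured at a fragment boundary
struct Fragment(Vec<u64>);   // a fragment's rows for a single column (the real fragment holds many columns)

// Hypothetical per-fragment builder: resumes execution from the state
// captured in `ctx` and fills in NUM_ROWS_PER_FRAGMENT rows.
fn build_fragment(_ctx: TraceFragmentContext) -> Fragment {
    Fragment(vec![0; NUM_ROWS_PER_FRAGMENT])
}

fn build_trace(contexts: Vec<TraceFragmentContext>) -> Vec<u64> {
    // One worker per fragment; fragments are independent because each context
    // carries everything needed to resume from its starting clock cycle.
    let fragments: Vec<Fragment> =
        contexts.into_par_iter().map(build_fragment).collect();

    // Copy back into a single buffer: fragment `i` lands at row offset
    // `i * NUM_ROWS_PER_FRAGMENT`, so destination ranges never overlap.
    let mut trace = vec![0u64; fragments.len() * NUM_ROWS_PER_FRAGMENT];
    for (i, fragment) in fragments.into_iter().enumerate() {
        let start = i * NUM_ROWS_PER_FRAGMENT;
        trace[start..start + NUM_ROWS_PER_FRAGMENT].copy_from_slice(&fragment.0);
    }
    trace
}
```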

This is tested exhaustively in the test_trace_generation_at_fragment_boundaries() test (currently 25 different rstest cases). We also now check, for every test run with Test::execute() (and similar), that the trace generated by Process is the same as the trace generated by build_trace().

Note to reviewers: in the interest of time, this PR leaves a number of TODO(plafer) comments, which I plan to address in a subsequent PR, after the entire trace is generated in parallel. I plan to open an issue for this before merging this one.

  • TODO: benchmark build_trace()

@plafer changed the title from WIP parallel processor to Implement parallel trace generation on May 28, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 6 times, most recently from 909d72d to 23ce6bf on June 3, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 7 times, most recently from cb9ad2a to e8ca5c9 on June 11, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from cf18eaf to d14734d on June 12, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch from d14734d to 58346f1 on June 22, 2025
Contributor

@bobbinth left a comment

Thank you! Looks good. Not a review yet, but I left some comments inline.

Comment on lines 117 to 122
// Map operations - store the values separately so we can return references
pub map_gets: VecDeque<(RpoDigest, bool)>, // (key, found)
pub map_get_values: VecDeque<Vec<Felt>>, // values for successful gets
pub map_inserts: VecDeque<(Word, Vec<Felt>)>,
Contributor

Do we actually need these? The only way to access advice map data is by putting it onto the advice stack first, and so we should be able to safely ignore any updates to the advice map (or reads directly from it).

Contributor Author

Ah yes, this doesn't make sense - I generated this (with Claude) as a placeholder a while back and didn't come back to clean it up. Basically all of AdviceReplay is placeholder right now.

Comment on lines 122 to 129
// Merkle store operations
pub tree_nodes: VecDeque<(Word, Felt, Felt, Word)>,
pub merkle_paths: VecDeque<(Word, Felt, Felt, MerklePath)>,
pub leaf_depths: VecDeque<(Word, Felt, Felt, u8)>,
pub node_updates: VecDeque<(Word, Felt, Felt, Word, (MerklePath, Word))>,
pub root_merges: VecDeque<(Word, Word, Word)>,
Contributor

Similarly, I think we may get away without keeping track of node_updates and root_merges. Also, do we actually use leaf_depths for anything? Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.

Contributor Author

Similarly, I think we may get away without keeping track of node_updates and root_merges.

Also generated placeholder code, so yes, it will be removed.

Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.

It wasn't used anywhere, so I created #1905.

Comment on lines 416 to 211
#[derive(Debug)]
pub struct HasherReplay {
Contributor

I wonder if we actually need this replay: the fragment should have all the data needed to compute the hashes - so, if I understood correctly, by having this replay we:

  • Save time in the fragment builder by avoiding computing each hash twice.
  • Make the fast processor a bit slower by requiring it to record the hasher replay.

If so, it may be an acceptable tradeoff to take, as being able to execute as fast as possible is probably more important than saving hash computations in a separate thread.

Contributor Author

if I understood correctly, by having this replay we:

Save time in the fragment builder by avoiding computing each hash twice.
Make the fast processor a bit slower by requiring it to record the hasher replay.

That's right. And ultimately we'll find the right balance of what to replay and what to recompute only through benchmarks. But my initial reasoning was that hash computations were orders of magnitude more expensive than a Vec::push(), so it was probably worth adding a Vec::push() to the FastProcessor so that trace generation can go faster (under the assumption that trace generation will be the bottleneck). But we can revisit when we have benchmark data.
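
For context, the record/replay pattern under discussion has roughly this shape (a minimal sketch with illustrative names and digest type, not the PR's actual API):

```rust
use std::collections::VecDeque;

type Digest = [u64; 4]; // illustrative stand-in for the real digest type

#[derive(Debug, Default)]
struct HasherReplay {
    results: VecDeque<Digest>,
}

impl HasherReplay {
    /// Called by the fast processor during execution: a cheap push, instead
    /// of making trace generation recompute the hash later.
    fn record(&mut self, digest: Digest) {
        self.results.push_back(digest);
    }

    /// Called by a fragment builder during trace generation: replays the
    /// recorded result, avoiding an expensive hash recomputation.
    fn replay(&mut self) -> Digest {
        self.results.pop_front().expect("hasher replay exhausted")
    }
}
```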

Comment on lines 10 to 15
if value != ONE {
panic!(
"Assertion failed: expected 1, got {} at clock {}",
value, self.state.system.clk
);
}
Contributor

nit: this could be:

assert!(value == ONE, "assertion failed: expected...");

But also, should we use debug_assert! instead? My understanding is that this shouldn't happen unless something in the fast processor is messed up.

Contributor Author

Yes indeed - switched to debug_assert.
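
For reference, the replacement looks something like this (a sketch; the exact message text may differ):

```rust
debug_assert!(
    value == ONE,
    "assertion failed: expected 1, got {} at clock {}",
    value, self.state.system.clk
);
```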

Comment on lines 69 to 73
BasicBlock {
/// Index of the operation batch within the basic block
batch_index: usize,
/// Index of the operation within the batch
op_idx_in_batch: usize,
/// Whether a RESPAN operation needs to be added before executing this batch. When true,
/// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
/// `op_index_in_batch` MUST be set to 0.
needs_respan: bool,
},
Contributor

Should we avoid starting fragments in the middle of an operation group? There are at most 72 operations per group (and usually much fewer), and so fragment length will never vary by more than 72 cycles.

We could even try to never split basic blocks, but a basic block could be pretty long (potentially thousands of cycles) - and so, I'm much less sure about that.

Contributor Author

Should we avoid starting fragments in the middle of an operation group?

Yeah, I went back and forth on how exactly to design this. The ultimate reason I went down this route was so that all fragments are exactly NUM_ROWS_PER_FRAGMENT long (except for the last one). The main benefit of this is that we can easily parallelize writing those fragments back to a single MainTrace buffer (i.e. the exact location in the buffer where each fragment is written is determined directly from the fragment index).

I think it was also solving another problem I was having, but I'm now forgetting what it was (I'll update this thread if/when I remember).

So I'll leave it as-is for now, but we can for sure revisit if the downsides turn out to outweigh the benefits.
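
To illustrate the needs_respan invariant documented on the variant above, a fragment builder resuming inside a basic block might look roughly like this (everything except the three fields of the variant is hypothetical):

```rust
// Hypothetical stand-ins for decoder hooks; only the three fields mirror the
// enum variant above.
enum ResumePoint {
    BasicBlock { batch_index: usize, op_idx_in_batch: usize, needs_respan: bool },
}

fn emit_respan_row(batch_index: usize) {
    // write the RESPAN row that precedes `batch_index` into the fragment
    let _ = batch_index;
}

fn execute_batch_from(batch_index: usize, start_op_idx: usize) {
    // decode and execute the ops of `batch_index` starting at `start_op_idx`
    let _ = (batch_index, start_op_idx);
}

fn resume(point: ResumePoint) {
    match point {
        ResumePoint::BasicBlock { batch_index, op_idx_in_batch, needs_respan } => {
            if needs_respan {
                // per the invariant above: the fragment opens with the RESPAN
                // row, `batch_index` is the batch executed right after it, and
                // `op_idx_in_batch` must therefore be 0.
                debug_assert_eq!(op_idx_in_batch, 0);
                emit_respan_row(batch_index);
            }
            execute_batch_from(batch_index, op_idx_in_batch);
        }
    }
}
```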

Comment on lines 184 to 532
pub struct CoreTraceState {
pub system: SystemState,
pub stack: StackState,
pub block_stack: BlockStack,
pub traversal: ExecutionTraversal,
pub memory: MemoryReplay,
pub advice: AdviceReplay,
pub hasher: HasherReplay,
pub external_node_replay: ExternalNodeReplay,
}
Contributor

This is the struct that the fast processor needs to build for every fragment, right? If so, we should try to minimize its footprint as much as possible, which may include:

  • Simplifying or eliminating the need for AdviceReplay as mentioned in some other comments.
  • Potentially getting rid of the HasherReplay.
  • Maybe combining BlockStack, ExecutionTraversal, and ExternalNodeReplay.

Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of ExecutionTraversals, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?

Contributor Author

Simplifying or eliminating the need for AdviceReplay as mentioned in some other comments.

An AdviceReplay (or equivalently, an AdviceProvider with some "replay" functionality) is the only way I found to fully isolate the fragment generators from the Host/AdviceProvider and fully ensure determinism (i.e. that the fragment generators get the same result for every "query" as the fast processor did). Otherwise we have to push that "ensure determinism" complexity into the host/advice provider, which might make sense, but I didn't want to tackle it at this time.

Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of ExecutionTraversals, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?

Hmmm, looking back at it now, it does seem like I missed that; the traversal needs to encode any other MastForest used. But the basic/rough idea was:

  • we need the BlockStack to know how to populate END rows
  • we need the ExecutionTraversal to know how to "go back up" the MastForest
    • where we didn't need this in Process since that information is basically on the "Rust stack"
  • ExternalNodeReplay: similar to AdviceProviderReplay, we need this to ensure determinism (and to avoid needing to pass the Host to trace fragment generators)

This is also pre-cleanup, so yeah, they should probably end up together in a struct or something.

@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from 098baca to 8dd5d23 on June 24, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 2 times, most recently from 5508afe to cb0dd6f on July 9, 2025
@plafer force-pushed the plafer-1558-parallel-tracegen branch 3 times, most recently from d96cb43 to b7f5026 on July 24, 2025
Contributor

@Al-Kindi-0 left a comment

Did a second pass and left some questions and nits.

Contributor

@Al-Kindi-0 left a comment

I have more or less reviewed everything, except for some test modules, which I will do in a final pass.

Contributor

@Al-Kindi-0 left a comment

Great work again!
Left some additional questions and nits, which are non-blocking.

Comment on lines +56 to +58
// 5: SPLIT
// 6: BLOCK ADD END
// 9: END
Contributor

Is this correct? It should be the same as above, just with SPLIT instead of JOIN in the inner part.

Contributor Author

Yes, it's correct, since what I'm showing here is the execution, not the MAST. So since we have ONE on the stack, only the "true" branch is executed (i.e. BLOCK ADD END).

Comment on lines +65 to +67
// 5: SPLIT
// 6: BLOCK SWAP END
// 9: END
Contributor

Same here.

Contributor Author

Same idea, but the "false" branch is executed.

Contributor

@huitseeker left a comment

Not a full review, but leaving a few notes inline, as they may help later work.

Comment on lines +1149 to +1151
for (op_idx_in_batch, &op) in batch.ops().iter().enumerate().skip(start_op_idx) {
let (op_group_idx, op_idx_in_group) =
batch.op_idx_in_batch_to_group(op_idx_in_batch).expect("invalid op batch");
Contributor

Ah, I would recommend leaving a TODO to add an operation iterator that gives you those group indexes, since the complexity of performing that is O(1) per operation (as opposed to the current O(log N) where N is the batch length).

It's because that iterator is easy to make (and group indexes weren't really used elsewhere) that I didn't add a group-indexing API already. You can see an example with group_chunks, which can be tweaked to support your next_op_group_index approach as well.

This doesn't have to be done in this PR, of course.
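
For reference, such an iterator might look like the following over a simplified batch representation (group_end_indices and the flat op list are assumptions for the sketch, not the real OpBatch API):

```rust
/// Simplified stand-in for an op batch: operations stored flat, plus the
/// (exclusive) end index of each op group within the flat list.
struct Batch {
    ops: Vec<u8>, // stand-in for Operation
    group_end_indices: Vec<usize>,
}

impl Batch {
    /// Yields (group_idx, op_idx_in_group, op) in amortized O(1) per op,
    /// instead of an O(log N) index-to-group lookup for every operation.
    fn ops_with_group_indices(&self) -> impl Iterator<Item = (usize, usize, u8)> + '_ {
        let mut group_idx = 0;
        let mut op_idx_in_group = 0;
        self.ops.iter().enumerate().map(move |(i, &op)| {
            // advance to the group that contains op `i`
            while i >= self.group_end_indices[group_idx] {
                group_idx += 1;
                op_idx_in_group = 0;
            }
            let item = (group_idx, op_idx_in_group, op);
            op_idx_in_group += 1;
            item
        })
    }
}
```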

Contributor Author

Created #2170

// if we executed all operations in a group and haven't reached the end of the batch
// yet, set up the decoder for decoding the next operation group
if op_idx_in_batch + 1 == end_indices[op_group_idx]
&& let Some(next_op_group_idx) = batch.next_op_group_index(op_group_idx)
Contributor

This costs a linear scan per group; I'd precompute a "next op-group" table once, or use the above suggestion to refactor to iterators.
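
For example, assuming the scan exists to skip over groups that hold immediate values rather than operations, the table could be precomputed once per batch along these lines (names hypothetical):

```rust
/// One-time precomputation per batch: for each op group, the index of the
/// next group that actually contains operations, if any. The per-group check
/// then becomes a single O(1) lookup instead of a linear scan each time.
fn next_op_group_table(group_has_ops: &[bool]) -> Vec<Option<usize>> {
    let mut table = vec![None; group_has_ops.len()];
    let mut next = None;
    for g in (0..group_has_ops.len()).rev() {
        table[g] = next;
        if group_has_ops[g] {
            next = Some(g);
        }
    }
    table
}
```

With such a table, the batch.next_op_group_index(op_group_idx) call above becomes a table[op_group_idx] lookup.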

Contributor Author

Also referenced in #2170

Contributor

@adr1anh left a comment

Looks really good! Left a couple of nits, but otherwise it's good to go from my end.

Contributor Author

@plafer commented Sep 22, 2025

@bobbinth I think we were waiting for 0.18 to be released before merging this, right? Then it should be good to go.

Contributor Author

@plafer commented Sep 22, 2025

test_secp256k1_point_multiplication() and similar tests are getting SIGTERM'd, probably because they run out of memory: with this PR we generate the trace twice (once from Process, and once from build_trace()) and make sure that they match. Otherwise, the tests pass locally.

Is it possible to increase the amount of memory available to the CI? If not, I can try to think of another workaround.

@plafer force-pushed the plafer-1558-parallel-tracegen branch 3 times, most recently from e31d96b to 1f5bc8b on October 3, 2025
Contributor Author

@plafer commented Oct 3, 2025

In 1f5bc8b I ended up limiting the size of traces built by build_trace() that we'll check for correctness (CI only); the current max is 2^21.

plafer added 6 commits on October 7, 2025
Previously, we would only do it for external nodes. Accordingly, we adjusted the naming in the `Tracer` trait to reflect that we are no longer only recording external node resolutions.
These will be useful for managing batch/group/operation indices in trace generation.
This will be useful in parallel trace generation when using `Processor::execute_sync_op`, which requires a Host, but we are not actually getting any data from a host (we are getting it from replays instead).
@plafer force-pushed the plafer-1558-parallel-tracegen branch from 1f5bc8b to bba4e21 on October 7, 2025
@plafer merged commit b3fa2a3 into next on Oct 7, 2025
15 checks passed
@plafer deleted the plafer-1558-parallel-tracegen branch on October 7, 2025