
Conversation

Contributor

@plafer plafer commented Sep 18, 2025

Closes #1558

Stacked on #1839

Implements the rest of parallel trace generation. Some notable changes include:

  • Similar to Implement parallel trace generation for the system, stack and decoder columns #1839, tests fail because the CI runs out of memory (at least I assume that's why the process gets SIGTERM'd). All tests pass locally though. The fix is probably just not to generate both traces (from Process and build_trace()) for large traces. This is a temporary problem, in that it will go away when we get rid of Process (and migrate its trace-building tests over to build_trace()).
  • We now record the memory reads in ExecutionTracer when executing the EvalCircuit operation. This is now required so that we can populate the Memory chiplet properly, as well as the RangeChecker (since all memory reads come with a range check).
    • We currently also store the CircuitEvaluation struct in AceReplay (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.
  • We currently have a naive strategy for basic blocks: Hasher::hash_basic_block() requires us to pass a &[OpBatch], i.e. all the batches of a basic block, every time we enter it. We currently naively clone all the operation batches of a basic block into the HasherReplayForChiplet (the replay that records the data needed to construct the Hasher chiplet trace). A better approach would recognize that the same basic block can be entered multiple times, and so would store its op batches at most once.
  • As identified in VM should ensure that the last operation is a HALT #1383, Process doesn't always include a HALT operation, specifically in the edge case where the last END (before the program halts) lands right before a power-of-2 boundary, e.g. at row 63. Inserting the random row then pushes the number of rows to 64, and therefore no padding is inserted (and padding is the only case where HALT is inserted). We mirror that bug in build_trace() so that tests pass - there's a comment to remind us to remove it once we get rid of Process.
  • We currently build the chiplets and range checker serially, after the core trace generation (from fragments). This was done for convenience when computing the final trace length. I suspect that filling the chiplets struct is fairly expensive (especially the Hasher chiplet), so a quick win would be to run this in parallel with the core trace generation.
  • The processor/src/parallel/mod.rs file is getting big and disorganized, but I left it as is for now to avoid blowing up the diff. This can be done with Parallel trace generation: pre-allocate main trace buffer #2160.

Benchmark results

The current benchmark results (blake3_1to1 benchmark) are on par with Process::execute():

  • build_trace(): 175ms (with the FastProcessor::execute_for_trace() taking roughly 5ms prior to that)
  • Process::execute(): 182ms

After further investigation, we spend 141ms of those 175ms in the (unoptimized) combine_fragments() copying data from the core trace fragments into the MainTrace column vectors serially; generating the core trace fragments in parallel takes about 18-20ms on 10 cores. We already have #2160 that will basically fix this issue.

So with #2160, assuming that we are able to remove that memory copy without adding additional overhead, we can expect generating the trace to go down from ~180ms to ~40ms (6ms for fast processor, and 34ms for build_trace()), a ~5x improvement. So this first version of our processor with trace generation would run at about 12 MHz, up from 2.7MHz.

And that's without any further possible optimizations, such as truly parallelizing everything that is parallelizable in build_trace() (a lot is sequential today for convenience). Moving to a row-major representation of the trace might also help.

@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch 2 times, most recently from ea4ca80 to ac44a2f Compare September 22, 2025 11:48
@plafer plafer force-pushed the plafer-1558-parallel-tracegen branch from ce1ccc1 to d34362b Compare September 22, 2025 11:48
@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch 4 times, most recently from 1958c15 to 5faea8e Compare September 24, 2025 20:56
@plafer plafer marked this pull request as ready for review September 24, 2025 21:19
Contributor

@Al-Kindi-0 Al-Kindi-0 left a comment


Did an initial pass and looks great!
Will do another one once the TODOs are addressed or converted

Comment on lines 129 to 137
let main_trace_len = {
// Get the trace length required to hold all execution trace steps
let max_len = range_table_len.max(core_trace_len).max(chiplets.trace_len());

// Pad the trace length to the next power of two and ensure that there is space for random
// rows
let trace_len = (max_len + NUM_RAND_ROWS).next_power_of_two();
core::cmp::max(trace_len, MIN_TRACE_LEN)
};
Contributor


nit: I would encapsulate this logic

Contributor Author


Done

// HELPERS
// ================================================================================================

// TODO(plafer): If we want to keep this strategy, then move the `op_eval_circuit()` method
Contributor


Q: Could you elaborate on this?

Contributor


We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

Contributor Author


This TODO is related to the second bullet point in the issue description:

We now record the memory reads in ExecutionTracer when executing the EvalCircuit operation. This is now required so that we can populate the Memory chiplet properly, as well as the RangeChecker (since all memory reads come with a range check).

We currently also store the CircuitEvaluation struct in AceReplay (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.

We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

This is currently what the fast processor does. To elaborate further, the current way we handle the EvalCircuit operation is two-fold:

  1. The FastProcessor records all memory reads that occur during the operation
     • This is necessary in order for the Memory chiplet to have all the reads that occurred during program execution.
  2. We store separately the CircuitEvaluation struct that is fed to the Ace chiplet

(1) is uncontroversial; we need to store that information one way or another. But (2) is not strictly necessary. As @adr1anh was alluding to, we could have the core trace generators build the CircuitEvaluation struct to be passed to the Ace chiplet. The downside of that approach though is that in order to build the Ace chiplet, we need to wait for the core trace generation to be done (so that they can return all the CircuitEvaluation structs that were built when re-executing the program in parallel).

So the upside of the current approach is that the ace chiplet can be built in parallel with the core trace fragments, at the cost of a larger TraceGenerationContext being sent from the fast processor. But this needs to be experimented with/benchmarked; thinking about it more, we expect the core trace generation to fully saturate the available cores, and so there shouldn't be a big cost (if any) to waiting for that to be done in order to start Ace chiplet trace generation.

Conclusion: For this PR, I think we should keep the current strategy just because it's already implemented, and we intend to improve on the current PR anyway. I will look into cleaning up that TODO with that in mind (by not having to copy/paste the eval_circuit_fast_() implementation between the fast processor and the core trace generators)

Contributor Author


Actually there are a few ways to fix this TODO, which require further investigation. I'll leave it for a subsequent PR (opened #2255).

Contributor

@adr1anh adr1anh left a comment


Looks great! Just left a couple of nits.

// HELPERS
// ================================================================================================

// TODO(plafer): If we want to keep this strategy, then move the `op_eval_circuit()` method
Contributor


We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

@plafer plafer force-pushed the plafer-1558-parallel-tracegen branch from e31d96b to 1f5bc8b Compare October 3, 2025 14:29
@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch from 72effde to 7c002a7 Compare October 3, 2025 14:49
@plafer plafer requested review from Al-Kindi-0 and adr1anh October 3, 2025 15:29
@plafer
Copy link
Contributor Author

plafer commented Oct 3, 2025

This is ready for final review.
