
Conversation

Contributor

@plafer plafer commented Sep 18, 2025

Closes #1558

Stacked on #1839

Implements the rest of parallel trace generation. Some notable changes include:

  • Similar to Implement parallel trace generation for the system, stack and decoder columns #1839, tests fail because the CI runs out of memory (at least I assume that's why the process gets SIGTERM'd). All tests pass locally though. The fix is probably just not to generate both traces (from Process and build_trace()) for large traces. This is a temporary problem, in that it will go away when we get rid of Process (and migrate its trace-building tests over to build_trace()).
  • We now record the memory reads in ExecutionTracer when executing the EvalCircuit operation. This is now required so that we can populate the Memory chiplet properly, as well as the RangeChecker (since all memory reads come with a range check).
    • We currently also store the CircuitEvaluation struct in AceReplay (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.
  • We currently have a naive strategy for basic blocks: Hasher::hash_basic_block() requires us to pass a &[OpBatch], i.e. all the batches of a basic block, every time we enter it. We currently naively clone all the operation batches of a basic block into the HasherReplayForChiplet (the replay that records the data needed to construct the Hasher chiplet trace). A better approach would recognize that the same basic block can be entered multiple times, and so would store its op batches at most once.
  • As identified in VM should ensure that the last operation is a HALT #1383, Process doesn't always include a HALT operation, specifically in the edge case where the last END (before the program halts) lands right before a power-of-2 boundary, e.g. at row 63. Inserting the random row then pushes the number of rows to 64, and therefore no padding is inserted (and padding is the only case where HALT is inserted). We mirror that bug in build_trace() so that tests pass - there's a comment to remind us to remove it once we get rid of Process.
  • We currently build the chiplets and range checker serially, after the core trace generation (from fragments). This was done for convenience when computing the final trace length. I suspect that filling the chiplets struct is fairly expensive (especially the Hasher chiplet), so a quick win would be to run this in parallel with the core trace generation.
  • The processor/src/parallel/mod.rs file is getting big and disorganized, but I left it as is for now to avoid blowing up the diff. This can be done with Parallel trace generation: pre-allocate main trace buffer #2160.

Benchmark results

The current benchmark results (blake3_1to1 benchmark) are on par with Process::execute():

  • build_trace(): 175ms (with the FastProcessor::execute_for_trace() taking roughly 5ms prior to that)
  • Process::execute(): 182ms

After further investigation, we spend 141ms of those 175ms in the (unoptimized) combine_fragments() copying data from the core trace fragments into the MainTrace column vectors serially; generating the core trace fragments in parallel takes about 18-20ms on 10 cores. We already have #2160 that will basically fix this issue.

So with #2160, assuming that we are able to remove that memory copy without adding additional overhead, we can expect generating the trace to go down from ~180ms to ~40ms (6ms for fast processor, and 34ms for build_trace()), a ~5x improvement. So this first version of our processor with trace generation would run at about 12 MHz, up from 2.7MHz.

And that's without any further possible optimizations, such as truly parallelizing everything that is parallelizable in build_trace() (a lot is sequential today for convenience). Moving to a row-major representation of the trace might also help.

@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch 2 times, most recently from ea4ca80 to ac44a2f Compare September 22, 2025 11:48
@plafer plafer force-pushed the plafer-1558-parallel-tracegen branch from ce1ccc1 to d34362b Compare September 22, 2025 11:48
@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch 4 times, most recently from 1958c15 to 5faea8e Compare September 24, 2025 20:56
@plafer plafer marked this pull request as ready for review September 24, 2025 21:19
Contributor

@Al-Kindi-0 Al-Kindi-0 left a comment


Did an initial pass and looks great!
Will do another one once the TODOs are addressed or converted

Comment on lines 129 to 137
let main_trace_len = {
// Get the trace length required to hold all execution trace steps
let max_len = range_table_len.max(core_trace_len).max(chiplets.trace_len());

// Pad the trace length to the next power of two and ensure that there is space for random
// rows
let trace_len = (max_len + NUM_RAND_ROWS).next_power_of_two();
core::cmp::max(trace_len, MIN_TRACE_LEN)
};
Contributor


nit: I would encapsulate this logic

Contributor Author


Done

// HELPERS
// ================================================================================================

// TODO(plafer): If we want to keep this strategy, then move the `op_eval_circuit()` method
Contributor


Q: Could you elaborate on this?

Contributor


We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

Contributor Author


This TODO is related to the second bullet point in the issue description:

We now record the memory reads in ExecutionTracer when executing the EvalCircuit operation. This is now required so that we can populate the Memory chiplet properly, as well as the RangeChecker (since all memory reads come with a range check).

We currently also store the CircuitEvaluation struct in AceReplay (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.

We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

This is currently what the fast processor does. To elaborate further, the current way we handle the EvalCircuit operation is two-fold:

  1. The FastProcessor records all memory reads that occur during the operation
     • This is necessary in order for the Memory chiplet to have all the reads that occurred during program execution.
  2. We store separately the CircuitEvaluation struct that is fed to the Ace chiplet

(1) is uncontroversial; we need to store that information one way or another. But (2) is not strictly necessary. As @adr1anh was alluding to, we could have the core trace generators build the CircuitEvaluation struct to be passed to the Ace chiplet. The downside of that approach though is that in order to build the Ace chiplet, we need to wait for the core trace generation to be done (so that they can return all the CircuitEvaluation structs that were built when re-executing the program in parallel).

So the upside of the current approach is that the ace chiplet can be built in parallel with the core trace fragments, at the cost of a larger TraceGenerationContext being sent from the fast processor. But this needs to be experimented with/benchmarked; thinking about it more, we expect the core trace generation to fully saturate the available cores, and so there shouldn't be a big cost (if any) to waiting for that to be done in order to start Ace chiplet trace generation.

Conclusion: For this PR, I think we should keep the current strategy just because it's already implemented, and we intend to improve on the current PR anyway. I will look into cleaning up that TODO with that in mind (by not having to copy/paste the eval_circuit_fast_() implementation between the fast processor and the core trace generators)

Contributor Author


Actually there are a few ways to fix this TODO, which require further investigation. I'll leave it for a subsequent PR (opened #2255).

Contributor

@adr1anh adr1anh left a comment


Looks great! Just left a couple of nits.

// HELPERS
// ================================================================================================

// TODO(plafer): If we want to keep this strategy, then move the `op_eval_circuit()` method
Contributor


We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation

@plafer plafer force-pushed the plafer-1558-parallel-tracegen branch from e31d96b to 1f5bc8b Compare October 3, 2025 14:29
@plafer plafer force-pushed the plafer-full-parallel-trace-gen branch from 72effde to 7c002a7 Compare October 3, 2025 14:49
@plafer plafer requested review from Al-Kindi-0 and adr1anh October 3, 2025 15:29
@plafer
Copy link
Contributor Author

plafer commented Oct 3, 2025

This is ready for final review.
