Plafer full parallel trace gen #2188
base: plafer-1558-parallel-tracegen
Conversation
Did an initial pass and it looks great!
Will do another one once the TODOs are addressed or converted.
`processor/src/parallel/mod.rs` (outdated)
```rust
let main_trace_len = {
    // Get the trace length required to hold all execution trace steps
    let max_len = range_table_len.max(core_trace_len).max(chiplets.trace_len());

    // Pad the trace length to the next power of two and ensure that there is space for random
    // rows
    let trace_len = (max_len + NUM_RAND_ROWS).next_power_of_two();
    core::cmp::max(trace_len, MIN_TRACE_LEN)
};
```
nit: I would encapsulate this logic
Done
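For reference, a minimal sketch of what such an encapsulated helper could look like; the function name, signature, and constant values are assumptions for illustration, not necessarily what was committed:

```rust
// Assumed values, for illustration only; the real constants live in the processor crate.
const NUM_RAND_ROWS: usize = 1;
const MIN_TRACE_LEN: usize = 64;

/// Hypothetical helper encapsulating the trace-length computation above.
fn compute_main_trace_len(
    range_table_len: usize,
    core_trace_len: usize,
    chiplets_trace_len: usize,
) -> usize {
    // Trace length required to hold all execution trace steps.
    let max_len = range_table_len.max(core_trace_len).max(chiplets_trace_len);

    // Pad to the next power of two, leaving room for the random rows, and
    // never go below the minimum supported trace length.
    let trace_len = (max_len + NUM_RAND_ROWS).next_power_of_two();
    trace_len.max(MIN_TRACE_LEN)
}
```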
```rust
// HELPERS
// ================================================================================================

// TODO(plafer): If we want to keep this strategy, then move the `op_eval_circuit()` method
```
Q: Could you elaborate on this?
We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation
This TODO is related to the second bullet point in the issue description:

> We now record the memory reads in `ExecutionTracer` when executing the `EvalCircuit` operation. This is now required so that we can populate the `Memory` chiplet properly, as well as the `RangeChecker` (since all memory reads come with a range check).
>
> We currently also store the `CircuitEvaluation` struct in `AceReplay` (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.

> We discussed offline that the fast processor could just record all the memory reads, and let the trace generator handle the evaluation
This is currently what the fast processor does. To elaborate further, the current way we handle the `EvalCircuit` operation is two-fold:

1. The `FastProcessor` records all memory reads that occur during the operation. This is necessary in order for the `Memory` chiplet to have all the reads that occurred during program execution.
2. We separately store the `CircuitEvaluation` struct that is fed to the `Ace` chiplet.

(1) is uncontroversial; we need to store that information one way or another. But (2) is not strictly necessary. As @adr1anh was alluding to, we could have the core trace generators build the `CircuitEvaluation` struct to be passed to the `Ace` chiplet. The downside of that approach, though, is that in order to build the `Ace` chiplet, we would need to wait for core trace generation to finish (so that the generators can return all the `CircuitEvaluation` structs built while re-executing the program in parallel).

So the upside of the current approach is that the `Ace` chiplet can be built in parallel with the core trace fragments, at the cost of a larger `TraceGenerationContext` being sent from the fast processor. But this needs to be experimented with/benchmarked; thinking about it more, we expect core trace generation to fully saturate the available cores, so there shouldn't be a big cost (if any) to waiting for it to finish before starting `Ace` chiplet trace generation.

Conclusion: for this PR, I think we should keep the current strategy simply because it's already implemented, and we intend to improve on it anyway. I will look into cleaning up that TODO with that in mind (by not having to copy/paste the `eval_circuit_fast_()` implementation between the fast processor and core trace generators).
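To make the tradeoff between (1) and (2) concrete, here is a rough sketch of the two data flows; all type and field definitions below are illustrative stand-ins for exposition, not the actual repo definitions:

```rust
// Stand-in types, for illustration only.
struct MemoryRead; // one recorded memory access (with its range check)
struct CircuitEvaluation; // the evaluated circuit fed to the Ace chiplet

/// (1) + (2), the current strategy: the fast processor ships both the raw
/// memory reads and the precomputed `CircuitEvaluation`, so the Ace chiplet
/// can be built in parallel with the core trace fragments, at the cost of a
/// larger `TraceGenerationContext` message.
struct AceReplayEager {
    memory_reads: Vec<MemoryRead>,
    circuit_evaluations: Vec<CircuitEvaluation>,
}

/// (1) only, the alternative: ship just the reads and let the core trace
/// generators rebuild each `CircuitEvaluation` while re-executing the program;
/// Ace chiplet construction then has to wait for core trace generation.
struct AceReplayLazy {
    memory_reads: Vec<MemoryRead>,
}
```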
Actually, there are a few ways to fix this TODO, which require further investigation. I'll leave it for a subsequent PR (opened #2255).
Looks great! Just left a couple of nits.
This is ready for final review.
Closes #1558
Stacked on #1839
Implements the rest of parallel trace generation. Some notable changes include:
- Some CI jobs currently fail with the test process getting `SIGTERM`'d. All tests pass locally though. The fix is probably just not to generate both traces (from `Process` and `build_trace()`) for large traces. This is a temporary problem, in that it will go away when we get rid of `Process` (and migrate its trace building tests to `build_trace()`).
- We now record the memory reads in `ExecutionTracer` when executing the `EvalCircuit` operation. This is now required so that we can populate the `Memory` chiplet properly, as well as the `RangeChecker` (since all memory reads come with a range check).
  - We currently also store the `CircuitEvaluation` struct in `AceReplay` (which could be computed during parallel trace generation instead). In principle this allows us to build the chiplets concurrently with the core trace, but at the cost of a larger message to send between the processor and trace generator. We can change this in a subsequent PR if needed.
- `Hasher::hash_basic_block()` requires us to pass a `&[OpBatch]`, i.e. all the batches of a basic block, every time we enter it. We currently naively clone all the operation batches of a basic block into the `HasherReplayForChiplet` (the replay that records the data needed to construct the `Hasher` chiplet trace). A better approach would recognize that the same basic block can be entered multiple times, and so would store its op batches at most once (see the dedup sketch after this list). Tracked in `HasherOp`: improve handling of basic blocks #2250.
- As tracked in `HALT` #1383, `Process` doesn't always include a `HALT` operation, specifically in the edge case where the last `END` (before the program halting) is right before a power-of-2 boundary, e.g. at row 63. Then inserting the random row pushes the number of rows to 64, and therefore no padding is inserted (and padding rows are the only place where `HALT` is inserted). We mirror that bug into `build_trace()` so that tests pass; there's a comment to remind us to remove that bug once we get rid of `Process`. A worked example follows this list.
- Building the `chiplets` struct is probably pretty expensive (especially the `Hasher` chiplet), so a quick win will be to run this in parallel with the core trace generation.
- The `processor/src/parallel/mod.rs` file is getting big and disorganized, but I left it as is for now to avoid blowing up the diff. This can be done with Parallel trace generation: pre-allocate main trace buffer #2160.
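A worked example of the `HALT` padding edge case from the list above; `NUM_RAND_ROWS = 1` is an assumed value for illustration:

```rust
const NUM_RAND_ROWS: usize = 1; // assumed value, for illustration only

/// Rows after appending the random row(s) and padding to a power of two.
fn padded_len(rows: usize) -> usize {
    (rows + NUM_RAND_ROWS).next_power_of_two()
}

fn main() {
    // Last END at row 63: the random row brings the count to exactly 64, so
    // zero padding rows are added; since HALT only appears in padding rows,
    // the trace ends up without a HALT operation.
    assert_eq!(padded_len(63) - 63 - NUM_RAND_ROWS, 0);

    // Last END at row 62: one padding row is added, and it carries HALT.
    assert_eq!(padded_len(62) - 62 - NUM_RAND_ROWS, 1);
}
```

And a rough sketch of the op-batch dedup idea from the `Hasher` bullet; all types and names below are hypothetical stand-ins, not the actual repo API:

```rust
use std::{collections::BTreeMap, sync::Arc};

struct OpBatch; // stand-in for the real OpBatch
type BlockId = u64; // hypothetical id, e.g. derived from the block's MAST node

#[derive(Default)]
struct HasherReplaySketch {
    /// Each basic block's op batches are stored at most once and shared.
    batches_by_block: BTreeMap<BlockId, Arc<[OpBatch]>>,
    /// One entry per block *entry*, referencing the shared batches by id.
    entries: Vec<BlockId>,
}

impl HasherReplaySketch {
    fn record_block_entry(&mut self, id: BlockId, batches: &Arc<[OpBatch]>) {
        self.batches_by_block.entry(id).or_insert_with(|| Arc::clone(batches));
        self.entries.push(id);
    }
}
```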
Benchmark results

The current benchmark results (`blake3_1to1` benchmark) are on par with `Process::execute()`:

- `build_trace()`: 175ms (with `FastProcessor::execute_for_trace()` taking roughly 5ms prior to that)
- `Process::execute()`: 182ms

After further investigation, we spend 141ms of those 175ms in the (unoptimized) `combine_fragments()`, copying data from the core trace fragments into the `MainTrace` column vectors serially; generating the core trace fragments in parallel takes about 18-20ms on 10 cores. We already have #2160, which will basically fix this issue.

So with #2160, assuming that we are able to remove that memory copy without adding additional overhead, we can expect generating the trace to go down from ~180ms to ~40ms (6ms for the fast processor, and 34ms for `build_trace()`), a ~5x improvement. So this first version of our processor with trace generation would run at about 12 MHz, up from 2.7 MHz.

And that's without any further possible optimizations, such as truly parallelizing everything that is parallelizable in `build_trace()` (a lot is sequential today for convenience). Moving to a row-major representation of the trace might also help.
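As a back-of-the-envelope check on those throughput figures (the cycle count below is inferred from the quoted numbers, not measured):

```rust
fn main() {
    // ~180 ms at ~2.7 MHz implies roughly 486k cycles for the benchmark.
    let cycles = 0.180 * 2.7e6;
    // The same cycle count in ~40 ms gives the projected throughput.
    let projected_mhz = cycles / 0.040 / 1e6;
    println!("projected: ~{projected_mhz:.1} MHz"); // ~12.2 MHz, consistent with the ~12 MHz estimate
}
```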