feat: Introduce FastProcessor checkpoints #2023
Conversation
@bobbinth this is ready for a high-level review. I'm still debugging the end-to-end tests with parallel trace generation, which provides interesting insights for this PR. My plan is to […]

Specifically, there are interesting edge cases & potential off-by-one errors in the interface between the […]
Thank you! Looks good! I took a very high-level look (and still don't fully understand everything), but overall, the approach should work. I left a few small comments inline.

I think the main thing that is not too clear to me yet is whether the approach we've taken to record the execution of the program itself is optimal. For example, could we have something like `Vec<(RowIndex, Continuation)>` in `CoreTraceStateBuilder` instead of `BlockStack`? We'll probably need to track a bit more info with it, but it should be sufficient to create start/end rows for MAST nodes in the trace.
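The suggested alternative could be sketched roughly as follows. All names (`RowIndex`, `Continuation`, the builder's methods) are illustrative stand-ins, not the crate's actual types:

```rust
// Hypothetical sketch: instead of snapshotting a full `BlockStack`, record a
// flat log of (row, continuation) pairs as MAST nodes start and finish.

/// Row index within the execution trace (illustrative newtype).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowIndex(pub u32);

/// What happens to a MAST node at a given row (illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Continuation {
    /// A node (identified by its id) starts executing at this row.
    StartNode(u32),
    /// The node finishes executing at this row.
    FinishNode(u32),
}

/// Trace-state builder keeping a flat log instead of a block stack.
#[derive(Default)]
pub struct CoreTraceStateBuilder {
    pub node_log: Vec<(RowIndex, Continuation)>,
}

impl CoreTraceStateBuilder {
    pub fn record_start(&mut self, row: u32, node_id: u32) {
        self.node_log.push((RowIndex(row), Continuation::StartNode(node_id)));
    }

    pub fn record_finish(&mut self, row: u32, node_id: u32) {
        self.node_log.push((RowIndex(row), Continuation::FinishNode(node_id)));
    }
}
```

A trace generator could then scan `node_log` to emit start/end rows for each node without replaying a stack structure.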
```rust
pub overflow: OverflowTable,
pub block_stack: BlockStack,
```
Cloning these every 1024 cycles may get quite expensive (especially if the MAST tree or overflow tables are deep). Not sure if these will work, but here are potential alternatives:
- For the overflow table, we could probably keep a "replay" of values being moved back onto the stack top from the overflow table (similar to how we have memory, advice, etc. replays).
- For the block stack table, maybe we can keep a "transcript" of executed nodes - something similar to `NodeExecutionPhase` but with hasher addresses added to the internal variants.
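The "replay" idea for the overflow table might look roughly like this sketch (the struct and method names are assumptions for illustration; the crate's existing memory/advice replays would be the model to follow):

```rust
// Illustrative sketch: rather than cloning the overflow table at each
// checkpoint, record only the values that get moved back onto the stack top,
// so a trace generator can later replay them in order.
#[derive(Default)]
pub struct OverflowReplay {
    /// Values popped from the overflow table back onto the stack, in order.
    popped: Vec<u64>,
    /// Read cursor used during replay.
    cursor: usize,
}

impl OverflowReplay {
    /// Called during execution whenever a value leaves the overflow table.
    pub fn record_pop(&mut self, value: u64) {
        self.popped.push(value);
    }

    /// Called during trace generation to re-obtain the next popped value.
    pub fn replay_pop(&mut self) -> Option<u64> {
        let v = self.popped.get(self.cursor).copied();
        if v.is_some() {
            self.cursor += 1;
        }
        v
    }
}
```

The appeal is that the cost is proportional to the values actually moved, not to the table's depth at each checkpoint.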
```rust
// State stored at the start of a trace fragment
snapshot_start: Option<SnapshotStart>,
```
It feels like there is some redundancy between the data in `snapshot_start` and `block_stack` (especially between `block_stack` and `continuation_stack`) - though, I don't know yet what's the best way to reconcile them.

Also a question: why is `snapshot_start` optional? i.e., under what circumstances is it `None`?
`processor/src/fast/checkpoints.rs` (outdated)
```rust
BasicBlock {
    /// Node ID of the basic block being executed
    node_id: MastNodeId,
    /// Index of the operation batch within the basic block
    batch_index: usize,
    /// Index of the operation within the batch
    op_idx_in_batch: usize,
    /// Whether a RESPAN operation needs to be added before executing this batch. When true,
    /// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
    /// `op_idx_in_batch` MUST be set to 0.
    needs_respan: bool,
},
```
In theory, we should be able to reduce this to just a tuple of values `(node_id, op_index)`, where `op_index` is the first operation in this block to start executing from. But this would be much easier to implement after the batch execution refactoring.
We could combine `batch_index` and `op_idx_in_batch` into `op_index` (to the benefit of the fast processor after the refactoring), but I think `needs_respan` (or equivalent) needs to stay, because otherwise we can't distinguish between the 2 states:
- the first operation in batch `i`
- the RESPAN right before the first operation in batch `i`
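The folding itself is simple arithmetic, sketched below under the assumption of a fixed batch size (the constant and helper names are illustrative, not the crate's API):

```rust
// Illustrative: folding (batch_index, op_idx_in_batch) into a single op_index,
// assuming every batch holds a fixed number of operations.
const OPS_PER_BATCH: usize = 8; // assumed batch size for this sketch

fn to_op_index(batch_index: usize, op_idx_in_batch: usize) -> usize {
    batch_index * OPS_PER_BATCH + op_idx_in_batch
}

fn from_op_index(op_index: usize) -> (usize, usize) {
    (op_index / OPS_PER_BATCH, op_index % OPS_PER_BATCH)
}
```

Note that "the first operation in batch `i`" and "the RESPAN right before batch `i`" both map to `to_op_index(i, 0)`, which is exactly why a separate flag like `needs_respan` must survive the folding.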
`processor/src/fast/checkpoints.rs` (outdated)
```rust
/// Specifies the execution phase when starting fragment generation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeExecutionPhase {
```
As mentioned in one of the previous comments, this feels very similar to the `Continuation` enum, and maybe there is a way to combine them.
They're different in a subtle way. `ContinuationStack` works at the granularity of `MastNodeId`s in 2 states: start or finish. It encodes what comes after the current node (the current node itself is not explicitly encoded in the `ContinuationStack`). `NodeExecutionPhase` is there to encode what the `ContinuationStack` doesn't: exactly where we're at in the execution of the current node.

The reason we don't want to merge them is that we only want to create a `NodeExecutionPhase` at a fragment boundary - otherwise, for example, we'd be creating a new `NodeExecutionPhase` at each operation. Also, they're different at a fundamental level: `NodeExecutionPhase` doesn't encode a continuation - it encodes the "current state".

`CoreTraceFragmentGenerator::execute_fragment_generation()` in #1839 can provide more intuition for how the `NodeExecutionPhase` is used: we basically start a fragment by finishing the current node, and then keep going by looping over the `ContinuationStack` (in a way similar to the fast processor).
Maybe `NodeExecutionState` would be a better name.
Renamed to `NodeExecutionState`, and added more docs.
This is now in a reviewable state - I will stop force pushing, but intend to clean it up further. Some things still to be cleaned up (which I'd gladly take suggestions for): […]

@adr1anh since your last review, I removed […] Optionally, you can look at #1839 (stacked on top of this one) to see how […]
Looks great to me!
Left a few comments (mostly nits).
My main question relates to the branching in `check_extract_trace_state`, which is called quite frequently - can we do anything about it? I can't immediately see a global solution, but maybe it is worth thinking about some solution nevertheless.

Yes, we have a solution for this in a following PR (TLDR: it involves generics) - should have mentioned it earlier.
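One way a generics-based fix could work is to make the processor generic over an execution mode so the branch is resolved at compile time. Everything below (`Mode`, `Execute`, `BuildTrace`, the method body) is an assumption sketched for illustration, not the follow-up PR's actual design:

```rust
use std::marker::PhantomData;

/// Compile-time execution mode (illustrative).
pub trait Mode {
    const BUILDS_TRACE: bool;
}
pub struct Execute;
pub struct BuildTrace;
impl Mode for Execute { const BUILDS_TRACE: bool = false; }
impl Mode for BuildTrace { const BUILDS_TRACE: bool = true; }

pub struct Processor<M: Mode> {
    pub checkpoints_taken: usize,
    _mode: PhantomData<M>,
}

impl<M: Mode> Processor<M> {
    pub fn new() -> Self {
        Self { checkpoints_taken: 0, _mode: PhantomData }
    }

    /// With `M` known at compile time, this branch is a constant and the
    /// whole call monomorphizes to a no-op in the pure-execution mode.
    #[inline(always)]
    pub fn check_extract_trace_state(&mut self) {
        if M::BUILDS_TRACE {
            self.checkpoints_taken += 1;
        }
    }
}
```

The trade-off is the usual one for static dispatch: two monomorphized copies of the hot loop instead of one branchy version.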
This could be left for a future PR, but I noticed there are many instances where we use […]

The reason they're […]
Rebased on the latest […]
Discussed async, LGTM!
Looks great! Thank you! I left some comments inline - most of them are about improving code organization, and many could be done in follow-up PRs.

Also, it may have been somewhere in the comments, but I'm curious what the current performance looks like for:
- Pure execution (with the `NoopTracer`).
- Execution with trace building.
```rust
// Note: we pass in a `NoopTracer`, because the parallel trace generation skips the circuit
// evaluation completely
```
Is this temporary (i.e., until the chiplets trace is also being built)? Or do we not need to capture the memory read here at all?
This is not temporary - the parallel trace generators assume that the program was run correctly, and hence that the circuit evaluated to 0. So re-running the circuit evaluation would be wasteful, and not provide any useful data for trace generation purposes. Therefore they'll never make the memory queries, and so it's important that we don't record them here.
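The mechanism behind this can be sketched with a minimal tracer trait: an empty inlined implementation makes the record calls free, and nothing is captured for the trace generators to replay. The trait and method names below are illustrative assumptions, not the crate's actual `Tracer` API:

```rust
/// Illustrative tracer interface: execution code calls `record_memory_read`
/// unconditionally; the tracer decides whether anything is kept.
pub trait Tracer {
    fn record_memory_read(&mut self, addr: u64, value: u64);
}

/// Tracer that records nothing. Used where reads must NOT be captured,
/// e.g. a circuit evaluation that the parallel trace generators skip.
pub struct NoopTracer;

impl Tracer for NoopTracer {
    #[inline(always)]
    fn record_memory_read(&mut self, _addr: u64, _value: u64) {}
}

/// Recording tracer, for comparison: keeps every read for later replay.
#[derive(Default)]
pub struct RecordingTracer {
    pub reads: Vec<(u64, u64)>,
}

impl Tracer for RecordingTracer {
    fn record_memory_read(&mut self, addr: u64, value: u64) {
        self.reads.push((addr, value));
    }
}
```

Passing `NoopTracer` into the circuit-evaluation path thus guarantees those memory queries never enter the replay, matching the invariant that the generators assume the circuit already evaluated to 0.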
nit: I would probably have a separate "trace" module and move trace-related structs there. For example:
- This file could be under `src/fast/trace/state_builder.rs`.
- We could also have `src/fast/trace/state.rs`, and in the future probably more sub-modules.
```rust
    Ok((stack_outputs, self.advice))
}

async fn execute_impl(
```
Do we still need `execute_impl()`? It seems like now it could be replaced with `execute_with_tracer()`. If we do still need it, it may be a good idea to add a brief doc comment explaining why.
We need it so that tests can access the processor state after execution - added that to a docstring.
`processor/src/fast/mod.rs` (outdated)
```rust
current_forest: &Arc<MastForest>,
continuation_stack: &mut ContinuationStack,
host: &mut impl AsyncHost,
tracer: &mut impl Tracer,
```
Not for this PR, but I wonder if it would make sense to create a wrapper struct to group some of these parameters together - e.g., something like `ExecutionContext`. This way, we could make these method signatures (and their call-sites) a bit more concise.
Agreed - but let's leave this for a subsequent PR
Not for this PR, but we should try to split this file up into several smaller files to make it easier to follow.
Gave it a first pass - we can always refine later
To be used as the main checkpoint struct in the parallel processor
Partially closes #2022 (i.e. only the core trace checkpoints are generated).

Note that I didn't split the `FastProcessor`'s `mod.rs` file just yet, since this will cause pretty bad conflicts when merging the `async-experimental` branch. I would prefer to wait for that before we split the file up.

One of the challenges with this PR was accommodating the 2 modes that we want the `FastProcessor` to run in: pure execution, and execution with trace-state checkpoint generation. Mode 1 requires storing a subset of what is needed for mode 2. For example, for the overflow stack, mode 1 requires just the overflowed stack elements, whereas mode 2 requires the extra features provided by `OverflowTable`. Similarly, for execution context management, mode 1 requires the lightweight `ExecutionContextInfo` (defined in `FastProcessor`'s `mod.rs`), whereas mode 2 requires the beefier `BlockStack`. In order for mode 1 to be as fast as it can, I decided to include both data structures: in mode 1 we use the fastest version, while in mode 2 we use both versions "simultaneously" (and hence do some duplicate work). I found that to be the best of both worlds. Note that the data structures needed for mode 2 are all stored in `CoreTraceStateBuilder`.

Finally, `FastProcessor::execute_op_batch()` is quite complex now due to the 2 simultaneous definitions of "operation batch index": the `NodeExecutionPhase::BasicBlock` variant (stored in `CoreTraceState`) uses the new definition that takes the inserted NOOPs into account. This should be naturally cleaned up with #1815.
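The "both data structures simultaneously" approach described above can be sketched as follows. All names here are simplified stand-ins for the crate's actual types (the real `OverflowTable` tracks more than a clock value), and the `Option` gating is an assumed mechanism for switching between modes:

```rust
/// Lightweight overflow storage: just the overflowed elements (mode 1).
#[derive(Default)]
pub struct FastOverflow(pub Vec<u64>);

/// Richer overflow table (mode 2); simplified here to also remember the
/// clock cycle at which each value overflowed.
#[derive(Default)]
pub struct OverflowTable(pub Vec<(u64, u64)>);

#[derive(Default)]
pub struct Processor {
    pub fast_overflow: FastOverflow,
    /// Present only when building trace-state checkpoints (mode 2).
    pub overflow_table: Option<OverflowTable>,
}

impl Processor {
    pub fn push_overflow(&mut self, clk: u64, value: u64) {
        // Mode 1 path: always maintained, as cheap as possible.
        self.fast_overflow.0.push(value);
        // Mode 2 path: duplicate work, only when checkpointing is enabled.
        if let Some(table) = self.overflow_table.as_mut() {
            table.0.push((value, clk));
        }
    }
}
```

In mode 1 the `Option` stays `None`, so the only cost on the hot path is one predictable branch; in mode 2 both structures are updated in lockstep, accepting the duplicate work in exchange for keeping mode 1 unencumbered.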