Implement parallel trace generation for the system, stack and decoder columns #1839
Conversation
Thank you! Looks good. Not a review yet, but I left some comments inline.
processor/src/parallel/replay.rs
Outdated
// Map operations - store the values separately so we can return references
pub map_gets: VecDeque<(RpoDigest, bool)>, // (key, found)
pub map_get_values: VecDeque<Vec<Felt>>,   // values for successful gets
pub map_inserts: VecDeque<(Word, Vec<Felt>)>,
Do we actually need these? The only way to access advice map data is by putting it onto the advice stack first, and so we should be able to safely ignore any updates to the advice map (or reads directly from it).
Ah yes, this doesn't make sense - I generated this (with Claude) as a placeholder a while back and didn't come back to clean it up. Basically all of `AdviceReplay` is placeholder right now.
processor/src/parallel/replay.rs
Outdated
// Merkle store operations
pub tree_nodes: VecDeque<(Word, Felt, Felt, Word)>,
pub merkle_paths: VecDeque<(Word, Felt, Felt, MerklePath)>,
pub leaf_depths: VecDeque<(Word, Felt, Felt, u8)>,
pub node_updates: VecDeque<(Word, Felt, Felt, Word, (MerklePath, Word))>,
pub root_merges: VecDeque<(Word, Word, Word)>,
Similarly, I think we may get away without keeping track of `node_updates` and `root_merges`. Also, do we actually use `leaf_depths` for anything? Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.
> Similarly, I think we may get away without keeping track of `node_updates` and `root_merges`.

This was also generated as a placeholder, so yes, it will be removed.

> Could you double check if it is being used either here or in miden-base somehow, and if not, we could probably get rid of this functionality from the advice provider altogether.

It wasn't used anywhere, so I created #1905.
processor/src/parallel/replay.rs
Outdated
#[derive(Debug)]
pub struct HasherReplay {
I wonder if we actually need this replay: the fragment should have all the data needed to compute the hashes. So, if I understood correctly, by having this replay we:
- Save time in the fragment builder by avoiding computing the hashes twice.
- Make the fast processor a bit slower by requiring it to record the hasher replay.

If so, it may be an acceptable tradeoff, as being able to execute as fast as possible is probably more important than saving hash computations in a separate thread.
> if I understood correctly, by having this replay we:
> - Save time in the fragment builder by avoiding computing the hashes twice.
> - Make the fast processor a bit slower by requiring it to record the hasher replay.

That's right. And ultimately we'll find the right balance of what to replay and what to recompute only through benchmarks. But my initial reasoning was that hash computations are orders of magnitude more expensive than a `Vec::push()`, so it was probably worth adding a `Vec::push()` to the `FastProcessor` so that trace generation can go faster (with the assumption that trace generation will be the bottleneck). But we can revisit when we have benchmark data.
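To make the tradeoff concrete, here is a minimal record/replay sketch of the idea being described. The field and method names are hypothetical (the actual `HasherReplay` in this PR may look different); the point is just that the fast processor pays a cheap push per hash, and the trace fragment builder pops the recorded result instead of re-hashing:

```rust
use std::collections::VecDeque;

/// Hypothetical digest type standing in for the VM's hash output.
type Digest = [u64; 4];

#[derive(Debug, Default)]
pub struct HasherReplay {
    /// Hash results recorded by the fast processor, in execution order.
    results: VecDeque<Digest>,
}

impl HasherReplay {
    /// Recording side (fast processor): the hash was already computed during
    /// execution, so recording it is just a cheap push.
    pub fn record(&mut self, digest: Digest) {
        self.results.push_back(digest);
    }

    /// Replay side (trace fragment builder): return the recorded result instead
    /// of recomputing the hash.
    pub fn replay(&mut self) -> Digest {
        self.results.pop_front().expect("hasher replay exhausted")
    }
}
```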
if value != ONE {
    panic!(
        "Assertion failed: expected 1, got {} at clock {}",
        value, self.state.system.clk
    );
}
nit: this could be:
assert!(value == ONE, "assertion failed: expected...");
But also, should we use `debug_assert!` instead? My understanding is that this shouldn't happen unless something in the fast processor is messed up.
Yes indeed, switched to debug_assert
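For reference, a minimal sketch of the switched form (assuming the same `value`, `ONE`, and `self.state.system.clk` are in scope as in the snippet above):

```rust
// only checked in debug builds; a failure here indicates a bug in the fast processor
debug_assert!(
    value == ONE,
    "assertion failed: expected 1, got {} at clock {}",
    value,
    self.state.system.clk
);
```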
processor/src/parallel/mod.rs
Outdated
BasicBlock {
    /// Index of the operation batch within the basic block
    batch_index: usize,
    /// Index of the operation within the batch
    op_idx_in_batch: usize,
    /// Whether a RESPAN operation needs to be added before executing this batch. When true,
    /// `batch_index` refers to the batch to be executed *after* the RESPAN operation, and
    /// `op_idx_in_batch` MUST be set to 0.
    needs_respan: bool,
},
Should we avoid starting fragments in the middle of an operation group? There are at most 72 operations per group (and usually much fewer), and so fragment length will never vary by more than 72 cycles.
We could even try to never split basic blocks, but a basic block could be pretty long (potentially thousands of cycles) - and so, I'm much less sure about that.
> Should we avoid starting fragments in the middle of an operation group?

Yeah, I went back and forth on how exactly to design this. The ultimate reason why I went down this route was so that all fragments are exactly `NUM_ROWS_PER_FRAGMENT` rows long (except for the last one). The main benefit of this is that we can easily parallelize writing those fragments back to a single `MainTrace` buffer (i.e. the exact location in the buffer where each fragment is written is determined directly from the fragment index).

I think it was also solving another problem I was having, but now I'm forgetting (I'll update this thread if/when I remember).

So I'll leave it as-is for now, but we can for sure revisit if the downsides turn out to outweigh the benefits.
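To illustrate why fixed-size fragments make the write-back easy to parallelize, here is a rough sketch; `NUM_ROWS_PER_FRAGMENT`, the `Fragment` type, and the column-major layout are assumptions made for the example rather than the PR's exact types. The destination range of each fragment in the final buffer follows directly from its index, so every copy is independent:

```rust
/// Assumed fragment length in rows (the PR mentions e.g. 2048 or 4096).
const NUM_ROWS_PER_FRAGMENT: usize = 2048;

/// Hypothetical fragment: one Vec per column, each `NUM_ROWS_PER_FRAGMENT` rows long
/// (except possibly the last fragment).
struct Fragment {
    columns: Vec<Vec<u64>>,
}

/// Copy fragments into a single column-major trace buffer. Because every fragment except
/// the last has exactly `NUM_ROWS_PER_FRAGMENT` rows, the destination offset is just
/// `fragment_index * NUM_ROWS_PER_FRAGMENT`; each iteration of the outer loop touches a
/// disjoint row range, so it could run on its own thread (e.g. with rayon).
fn write_back(fragments: &[Fragment], trace_columns: &mut [Vec<u64>]) {
    for (fragment_index, fragment) in fragments.iter().enumerate() {
        let row_offset = fragment_index * NUM_ROWS_PER_FRAGMENT;
        for (col_idx, column) in fragment.columns.iter().enumerate() {
            trace_columns[col_idx][row_offset..row_offset + column.len()]
                .copy_from_slice(column);
        }
    }
}
```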
processor/src/parallel/mod.rs
Outdated
pub struct CoreTraceState {
    pub system: SystemState,
    pub stack: StackState,
    pub block_stack: BlockStack,
    pub traversal: ExecutionTraversal,
    pub memory: MemoryReplay,
    pub advice: AdviceReplay,
    pub hasher: HasherReplay,
    pub external_node_replay: ExternalNodeReplay,
}
This is the struct that the fast processor needs to build for every fragment, right? If so, we should try to minimize its footprint as much as possible, which may include:
- Simplifying or eliminating the need for `AdviceReplay`, as mentioned in some other comments.
- Potentially getting rid of the `HasherReplay`.
- Maybe combining `BlockStack`, `ExecutionTraversal`, and `ExternalNodeReplay`.

Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of `ExecutionTraversal`s, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?
> Simplifying or eliminating the need for `AdviceReplay`, as mentioned in some other comments.

An `AdviceReplay` (or, equivalently, an `AdviceProvider` with some "replay" functionality) is the only way I found to fully isolate the fragment generators from the `Host`/`AdviceProvider` and fully ensure determinism (i.e. that the fragment generators get the same result to every "query" as the fast processor did). Otherwise we have to push that "ensure determinism" complexity into the host/advice provider, which might make sense, but I didn't want to tackle it at this time.

> Regarding the last point, I didn't actually understand how they work together. For example, don't we actually need a list of `ExecutionTraversal`s, because we can go in and out of various MAST forests during execution (and so keeping track of only a single MAST forest may not be enough)?

Hmmm, looking back at it now, it does seem like I missed that; the traversal needs to encode any other `MastForest` used. But the basic/rough idea was:
- we need the `BlockStack` to know how to populate `END` rows
- we need the `ExecutionTraversal` to know how to "go back up" the `MastForest`
  - we didn't need this in `Process`, since that information is basically on the "Rust stack"
- `ExternalNodeReplay`: similar to `AdviceProviderReplay`, we need this to ensure determinism (and to avoid needing to pass the `Host` to trace fragment generators)

This is also pre-cleanup, so yeah, indeed they should probably end up being together in a struct or something.
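The same record/replay pattern applies to `ExternalNodeReplay`. A rough sketch of the idea (the types and names below are stand-ins invented for illustration, not the PR's actual definitions): the fast processor records what each external node resolved to, so the fragment generators can reproduce the resolution deterministically without ever touching the `Host`:

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Hypothetical stand-ins for the real MAST types.
type MastNodeId = u32;
pub struct MastForest;

#[derive(Default)]
pub struct ExternalNodeReplay {
    /// For each external node, in execution order: the forest it resolved into and the
    /// id of the root node within that forest.
    resolutions: VecDeque<(Arc<MastForest>, MastNodeId)>,
}

impl ExternalNodeReplay {
    /// Recording side (fast processor): store the resolution returned by the Host.
    pub fn record(&mut self, forest: Arc<MastForest>, root: MastNodeId) {
        self.resolutions.push_back((forest, root));
    }

    /// Replay side (fragment generator): return the exact same resolution, with no Host
    /// access needed.
    pub fn replay(&mut self) -> (Arc<MastForest>, MastNodeId) {
        self.resolutions.pop_front().expect("external node replay exhausted")
    }
}
```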
Did a second pass and left some questions and nits
I have more or less reviewed everything, except for some test modules, which I will do in a final pass
Great work again!
Left some additional questions and nits which are non-blocking
// 5: SPLIT
// 6: BLOCK ADD END
// 9: END
Is this correct? Should be the same as above with SPLIT instead of JOIN in the inner part
Yes, it's correct, since what I'm showing here is the execution, not the MAST. Since we have `ONE` on the stack, only the "true" branch is executed (i.e. `BLOCK ADD END`).
// 5: SPLIT
// 6: BLOCK SWAP END
// 9: END
same here
Same idea, but the "false" branch is executed.
Not a full review, but leaving a few notes inline as they may help follow-up work.
for (op_idx_in_batch, &op) in batch.ops().iter().enumerate().skip(start_op_idx) {
    let (op_group_idx, op_idx_in_group) =
        batch.op_idx_in_batch_to_group(op_idx_in_batch).expect("invalid op batch");
Ah, I would recommend leaving a TODO to add an operation iterator that gives you those group indexes, since the complexity of doing that is O(1) per operation (as opposed to the current O(log N), where N is the batch length).
It's because that iterator is easy to make (and group indexes weren't really used elsewhere) that I didn't add a group-indexation API already. You can see an example with `group_chunks`, which can be tweaked to support your `next_op_group_index` approach as well.
This doesn't have to be done in this PR, of course.
Created #2170
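For illustration, an iterator along the lines suggested above could look roughly like the following; the flat `ops`/`end_indices` layout is an assumption made for the sketch (the real `OpBatch` API may differ), but it shows how group indices can be produced in amortized O(1) per operation by walking the group boundaries alongside the operations:

```rust
/// Yields `(op_idx_in_batch, group_idx, op_idx_in_group, op)`, assuming `end_indices[g]`
/// is one past the index of the last operation of group `g` in the flat `ops` slice.
fn op_positions<'a>(
    ops: &'a [u8],            // stand-in for the batch's operations
    end_indices: &'a [usize], // one entry per op group
) -> impl Iterator<Item = (usize, usize, usize, u8)> + 'a {
    let mut group_idx = 0;
    let mut group_start = 0;
    ops.iter().enumerate().map(move |(op_idx_in_batch, &op)| {
        // advance to the group containing this operation; amortized O(1) per op
        while op_idx_in_batch >= end_indices[group_idx] {
            group_start = end_indices[group_idx];
            group_idx += 1;
        }
        (op_idx_in_batch, group_idx, op_idx_in_batch - group_start, op)
    })
}
```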
// if we executed all operations in a group and haven't reached the end of the batch
// yet, set up the decoder for decoding the next operation group
if op_idx_in_batch + 1 == end_indices[op_group_idx]
    && let Some(next_op_group_idx) = batch.next_op_group_index(op_group_idx)
This costs a linear scan per group; I'd precompute a "next op-group" table once, or use the above suggestion and refactor to iterators.
Also referenced in #2170
Looks really good! Left a couple of nits, but otherwise it's good to go from my end
@bobbinth I think we were waiting for 0.18 to be released before merging this, right? Then it should be good to go.
Is it possible to increase the amount of memory accessible to the CI? If not, I can try to think of another workaround.
In 1f5bc8b I ended up limiting the size of traces built by …
Previously, we would only do it for external nodes. Accordingly, we adjusted the naming in the `Tracer` trait to reflect that we are no longer only recording external node resolutions.
These will be useful for managing batch/group/operation indices in trace generation.
This will be useful in parallel trace generation, when using `Processor::execute_sync_op`, which requires a Host, but we are not actually getting any data from a host (we are getting it from replays instead).
Partially closes #1558

We now expose a new `build_trace(Vec<TraceFragmentContext>)`, which takes the output of `FastProcessor::execute_for_trace()` and builds a trace in parallel (only the system, stack and decoder columns so far).

Specifically, the strategy is to break up the execution into fixed-size fragments of `n` rows (e.g. 2048 or 4096), with one thread generating each corresponding fragment. The fragments are then copied back into a single `MainTrace` buffer. Hence, the additional complexity of parallel trace generation compared to how `Process` does it is that we need to be able to start generating the trace from any state at any clock cycle.

This is tested exhaustively in the `test_trace_generation_at_fragment_boundaries()` test (currently 25 different `rstest` cases). We also now check, for every test run with `Test::execute()` (and similar), that the trace generated by `Process` is the same as the trace generated by `build_trace()`.

Note to reviewers: in the interest of time, this PR leaves a number of `TODO(plafer)` comments, the plan being to address them in a subsequent PR, after the entire trace is generated in parallel. I plan to open an issue for this before merging this one.
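Schematically, the flow described above could be pictured like this; the types and function bodies below are simplified stand-ins for illustration (the real `TraceFragmentContext`, `FastProcessor::execute_for_trace()`, and `build_trace()` live in the processor crate and their signatures may differ):

```rust
use std::thread;

/// Stand-in for the per-fragment context produced by the fast execution pass
/// (system/stack/decoder state at the fragment boundary, plus the various replays).
struct TraceFragmentContext {
    clk_start: u32,
}

/// Stand-in for one fragment's worth of trace rows.
struct TraceFragment(Vec<u64>);

fn build_fragment(ctx: &TraceFragmentContext) -> TraceFragment {
    // expand the recorded state and replays into this fragment's trace rows
    TraceFragment(vec![u64::from(ctx.clk_start)])
}

/// Sketch of the parallel step: one thread per fragment context, then stitch the
/// results together (in the PR, into a single MainTrace buffer).
fn build_trace(contexts: Vec<TraceFragmentContext>) -> Vec<TraceFragment> {
    thread::scope(|s| {
        let handles: Vec<_> =
            contexts.iter().map(|ctx| s.spawn(move || build_fragment(ctx))).collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("fragment builder panicked"))
            .collect()
    })
}
```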