
Add end-to-end replay slot metrics #25752

Merged: carllin merged 10 commits into solana-labs:master from the FixMetrics branch on Jul 5, 2022

Conversation

@carllin (Contributor) commented Jun 3, 2022

Problem

Currently, replay-slot-stats returns the sum of all execution metrics across all replay threads, which is not that useful for debugging.

Summary of Changes

On each iteration of replay, take the longest-running thread and accumulate its execution metrics into a new end-to-end ThreadExecuteTimings. Report the metrics from that final ThreadExecuteTimings as the end-to-end metrics for the slot (a simplified sketch of the selection step follows below).

Fixes #
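
A minimal sketch of the longest-thread selection described in the summary above (simplified; `ThreadExecuteTimings` here is a stand-in for the PR's type, and the real logic is in the diff excerpts later in this thread):

```rust
// Simplified stand-in for the PR's per-thread timing struct.
struct ThreadExecuteTimings {
    total_thread_us: u64,
    // ... per-thread ExecuteTimings and other metrics
}

// Pick the longest-running thread for this replay iteration; its metrics are
// what get folded into the end-to-end slot timings.
fn longest_thread(per_thread: Vec<ThreadExecuteTimings>) -> Option<ThreadExecuteTimings> {
    per_thread
        .into_iter()
        .max_by_key(|timings| timings.total_thread_us)
}
```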

@carllin carllin added the v1.10 label Jun 3, 2022
@carllin carllin force-pushed the FixMetrics branch 2 times, most recently from 4d8b214 to aa10a4a on June 3, 2022 06:52
@carllin (Contributor, Author) commented Jun 3, 2022

@t-nelson this should be close to what we discussed.

"execute_batch",
);

let thread_index = PAR_THREAD_POOL.current_thread_index().unwrap();
Member

How will you know whether rayon just scheduled a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) executed on one thread? Also, temporally, did that create the longest timings, or did that thread start sooner than another thread and actually cause the batch to take longer? And the rayon scheduling could vary across machines, no?

@carllin (Contributor, Author) commented Jun 3, 2022

> if rayon just scheduled a bunch of small jobs on one thread vs. a single (or multiple) large batch(es) which was executed on one thread

Yeah, we won't be able to distinguish this; we'll just take the longest-running thread and hope that captures the bottleneck. Right now we're only handing the thread pool 2-3 batches at a time, so hopefully the longest thread does reflect the actual end-to-end timings.

> Also temporally if that created the longest timings or maybe that thread started sooner than another thread and actually caused the batch to take longer. And the rayon scheduling could vary across machines, no?

I was thinking that if the longest thread's total execution time doesn't add up to the total replay time, then we know there's some rayon shenanigans going on.

Contributor

I'm wondering if recording either the start or end time for threads, and snapshotting it for the longest-running thread, could help us understand cases where the longest thread time is shorter than the total execution time. It could shed some light on whether that thread got started late, for example.
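
A hypothetical sketch of that suggestion (not part of this PR; `first_batch_started_at` is an invented field name):

```rust
use std::time::Instant;

// Hypothetical: stamp when each thread picks up its first batch, so a longest
// thread that is shorter than the total execution time can be attributed to a
// late start rather than slow work.
struct ThreadTimingSnapshot {
    first_batch_started_at: Instant,
    total_thread_us: u64,
}

impl ThreadTimingSnapshot {
    fn start() -> Self {
        Self {
            first_batch_started_at: Instant::now(),
            total_thread_us: 0,
        }
    }
}
```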

@carllin (Contributor, Author) commented Jun 4, 2022

[screenshot]

Example of this metric on a 5k TPS GCE cluster.

The newly added execute_batches_us wraps the thread pool's execute-batches logic and measures the total time spent executing batches.

The total_thread_us measures the longest thread within that thread pool.

These two match up reasonably well.

The rest of the gap relative to the replay_time metric is all the work in process_entries_with_callback() before batches are handed to the thread pool.
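
As a rough illustration of how these three numbers relate (metric names follow the comment above; this is not code from the PR):

```rust
// Rough accounting of where replay time goes, per the description above.
fn replay_time_breakdown(replay_time_us: u64, execute_batches_us: u64, total_thread_us: u64) {
    // Work in process_entries_with_callback() before batches reach the pool.
    let pre_pool_us = replay_time_us.saturating_sub(execute_batches_us);
    // Pool time not covered by the longest thread (rayon scheduling, idle waits).
    let scheduling_gap_us = execute_batches_us.saturating_sub(total_thread_us);
    println!("pre-pool: {pre_pool_us}us, scheduling gap: {scheduling_gap_us}us");
}
```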

@carllin (Contributor, Author) commented Jun 4, 2022

And here's an example of breaking down the thread total time into load/execute/store/other misc. It looks like there's some unaccounted-for work here:

[screenshot]

@carllin (Contributor, Author) commented Jun 7, 2022

[screenshot]

The gap was made up by `update_transaction_statuses`, which updates the status cache.

@carllin carllin marked this pull request as ready for review June 7, 2022 18:22
}

impl Default for ConfirmationTiming {
    fn default() -> Self {
        Self {
            started: Instant::now(),
            replay_elapsed: 0,
            execute_batches_us: 0,
            poh_verify_elapsed: 0,
            transaction_verify_elapsed: 0,
            fetch_elapsed: 0,
            fetch_fail_elapsed: 0,
            execute_timings: ExecuteTimings::default(),
Contributor

should we rename this to `cumulative_execute_timings`?

Comment on lines 334 to 369
cumulative_execute_timings
    .saturating_add_in_place(ExecuteTimingType::TotalBatchesLen, batches.len() as u64);
cumulative_execute_timings.saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
saturating_add_assign!(
    confirmation_timing.execute_batches_us,
    execute_batches_elapsed.as_us()
);

let mut current_max_thread_execution_time: Option<ThreadExecuteTimings> = None;
for (_, thread_execution_time) in execution_timings_per_thread
    .into_inner()
    .unwrap()
    .into_iter()
{
    let ThreadExecuteTimings {
        total_thread_us,
        execute_timings,
        ..
    } = &thread_execution_time;
    cumulative_execute_timings.accumulate(execute_timings);
    if *total_thread_us
        > current_max_thread_execution_time
            .as_ref()
            .map(|thread_execution_time| thread_execution_time.total_thread_us)
            .unwrap_or(0)
    {
        current_max_thread_execution_time = Some(thread_execution_time);
    }
}

if let Some(current_max_thread_execution_time) = current_max_thread_execution_time {
    end_to_end_execute_timings.accumulate(&current_max_thread_execution_time);
    end_to_end_execute_timings
        .execute_timings
        .saturating_add_in_place(ExecuteTimingType::NumExecuteBatches, 1);
};
Contributor

Not sure what our typical convention is, but it seems like over half of the code in execute_batches_internal now deals with the collection/recording/processing of metrics. The collection part obviously needs to live here, but I'm wondering if the processing/recording piece could be moved into its own function (either called from here or one level up in execute_batches).

Maybe create the hashmap up there and pass a reference to be filled out by the threads. I think we could even avoid passing down confirmation_timing.

@carllin (Contributor, Author)

Makes sense. I refactored the code so that execute_batches_internal now returns all the metrics it collects in a struct ExecuteBatchesInternalMetrics, which is then aggregated in execute_batches() via a separate function, process_execute_batches_internal_metrics(), as you suggested: ef9608d
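
A hedged sketch of the shape of that refactor, using the names from the comment above and assuming the ThreadExecuteTimings and ConfirmationTiming types from this PR; the exact fields and signatures are assumptions rather than the code in ef9608d:

```rust
use std::collections::HashMap;

// Raw numbers collected while executing batches; returned to the caller
// instead of being recorded inside execute_batches_internal.
struct ExecuteBatchesInternalMetrics {
    execution_timings_per_thread: HashMap<usize, ThreadExecuteTimings>,
    execute_batches_us: u64,
}

// Aggregation moved out of execute_batches_internal: fold per-thread timings
// into the cumulative totals and the end-to-end (longest-thread) timings.
fn process_execute_batches_internal_metrics(
    metrics: ExecuteBatchesInternalMetrics,
    confirmation_timing: &mut ConfirmationTiming,
) {
    // ... accumulate cumulative_execute_timings, pick the longest thread,
    // and add it into end_to_end_execute_timings.
}
```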

bw-solana previously approved these changes Jun 14, 2022

@bw-solana (Contributor) left a comment

LGTM - Left a comment about separating thread execution & metrics collection from the recording/processing of the metrics, but this is more cosmetic. Functionally, it looks good and will be a helpful addition!

@mergify mergify bot dismissed bw-solana’s stale review June 22, 2022 05:06

Pull request has been modified.

@carllin carllin force-pushed the FixMetrics branch 2 times, most recently from e631729 to 4b597c2 on June 22, 2022 15:57
@t-nelson (Contributor) left a comment
can you add some comments to explain how these macros work?
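
For reference, the `saturating_add_assign!` used in the diff above is essentially a saturating `+=`; a sketch of how such a macro works (the upstream definition may differ in detail):

```rust
// Saturating += : clamps at the integer's max value instead of overflowing.
macro_rules! saturating_add_assign {
    ($lhs:expr, $rhs:expr) => {{
        $lhs = $lhs.saturating_add($rhs)
    }};
}

fn main() {
    let mut total_us: u64 = u64::MAX - 5;
    saturating_add_assign!(total_us, 10);
    assert_eq!(total_us, u64::MAX); // clamped rather than wrapped
}
```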

for timing in new_timings {
    timings.accumulate(&timing);
}
let execution_timings_per_thread: RwLock<HashMap<usize, ThreadExecuteTimings>> =
Contributor

why HashMap? Could use a Vec with the same size as the thread pool; the map's metadata is the only thing that makes us need a lock.
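
A sketch of the Vec-based alternative being suggested (illustration only, not the PR's code): pre-size a Vec to the pool size and index it with rayon's thread index, so there is no shared map metadata to lock; a per-slot Mutex stays uncontended because each thread only ever touches its own slot.

```rust
use rayon::prelude::*;
use std::sync::Mutex;

fn per_thread_totals(pool: &rayon::ThreadPool, batch_work_us: &[u64]) -> Vec<u64> {
    // One slot per pool thread; no HashMap metadata, so no global lock needed.
    let slots: Vec<Mutex<u64>> = (0..pool.current_num_threads())
        .map(|_| Mutex::new(0))
        .collect();

    pool.install(|| {
        batch_work_us.par_iter().for_each(|work_us| {
            // Inside the pool this is always Some(idx), and idx < pool size.
            let idx = rayon::current_thread_index().unwrap();
            // Uncontended: only the thread at `idx` ever locks slot `idx`.
            *slots[idx].lock().unwrap() += *work_us;
        });
    });

    slots.into_iter().map(|m| m.into_inner().unwrap()).collect()
}
```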

@carllin (Contributor, Author) commented Jul 1, 2022

> can you add some comments to explain how these macros work?

Done: d8552c5

@carllin carllin merged commit ce39c14 into solana-labs:master Jul 5, 2022
mergify bot pushed a commit that referenced this pull request Jul 5, 2022
(cherry picked from commit ce39c14)

# Conflicts:
#	Cargo.lock
#	core/src/progress_map.rs
#	ledger/src/blockstore_processor.rs
#	program-runtime/Cargo.toml
#	programs/bpf/Cargo.lock