-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Per-pipeline-invocation profiling #8153
Conversation
This should give better results when multiple Halide pipelines are running at the same time.
- Don't profile bounds queries - Simplify layout calculation - Bill time after decrementing main thread as overhead, not waiting on parallel tasks - Change waiting on parallel tasks label
May I suggest to add the ability to do microsecond level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5, or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-time measurement would be nice (if available on the system). Thoughts on this? |
src/runtime/profiler_common.cpp
Outdated
// Unfortunately a user_context is not available or meaningful if | ||
// there are zero or multiple instances running. | ||
halide_error(nullptr, "Cannot use device profiling while multiple profiled pipelines are running on host. Shutting down profiler."); | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never worked with this Hexagon thing, but perhaps shutting down the profiler is an overkill? Might be not user-friendly. You still might want to see the profiler results for another pipeline that is running alone on a later point in the program? Instead, omitting the profile report for all pipelines that have run simultaneously might be a less extreme alternative? As I said: no clue what I'm talking about, but trying to spot potential annoyances for anyone using this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this situation means that the profiler state is already corrupt. This should already have thrown an error in hexagon_host when trying to launch the remote kernel. If you try to launch a profiled hexagon kernel, and there's already an instance in flight, it will refuse to even launch the kernel. I'll change it to a hard assert because it's supposed to be unreachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to a debug assert in this place, and instead put in a hard error message if you either try to launch a profiled pipeline while a profiled pipeline is running on device, or try to launch a profiled pipeline on device while another profiled pipeline is running.
|
||
// Tell the instance the pipeline to which it belongs. | ||
instance->pipeline_stats = p; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here might be a good point to just take the halide_current_time_ns()
to take the start time, to produce accurate wall times, as opposed to summing samples (which I demonstrated to be very noise/inaccurate in #5796).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in next push, though I won't resolve this because the way I did it might need discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed. There was a question about whether or not pipeline start and end should count as sampling events, which could be used to ensure the sum of the billed times for each func equalled the wall clock time of the whole pipeline, but I decided that would actually introduce some biases, so there's an awkward adjustment step at pipeline exit to rescale the billed times to add up to the wall clock time.
halide_profiler_state *s = halide_profiler_get_state(); | ||
if (!s->sampling_thread) { | ||
return; | ||
} | ||
|
||
s->current_func = halide_profiler_please_stop; | ||
int one = 1; | ||
atomic_store_relaxed(&(s->shutdown), &one); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of having a relaxed "atomic" operation and this auxiliary function? Doesn't the C++ memory consistency model guarantee the same behavior even without this relaxed atomic store on all platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of atomic_store_relaxed is just *a = *b, so this is really just documentation that this is an intentional unlocked communication of a value across threads.
I changed things to microseconds (except on Windows, which now just does a yield if the requested sleep time is under a millisecond). Regarding rdtsc, I think that would be a separate change that provides a different implementation for one of the *_clock.cpp runtime modules to use on x86. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pristine! 😄
@abadams I'm testing the PR for hvx. Will update you soon. |
@abadams @mcourteaux I tested the PR on simulator (host-hvx) and device (arm-hvx). Profiling worked with a few minor changes on the simulator but I see a crash on device. I'm trying to debug the device crash. I see the crash on device with the |
Ready to land? |
Waiting on the hvx issue. |
Can be merged. The HVX issue is unrelated to this, it seems. |
It looks like there's other stuff in #8187 which is needed though. Only the missing sampling token issue was broken on main. |
* Get abadams/per_instance_profiling working on hvx * More changes * Add Hexagon libraries * Fix multiple instances of profiler_state * Update hexagon libraries
Ready to land (pending green)? |
This is a new sampling profiler as discussed in #5796. When it samples, it samples all simultaneously running Halide pipelines, and all simultaneously running instances of the same pipeline, tracking their stats separately (they have separate sampling tokens, instead of fighting over a single global one). At pipeline exit these stats are accumulated into global pipeline stats and static per-pipeline stats.
The big differences with the current sampling profiler are that:
pipeline invocations are measured as total wall-clock time, rather than wall-clock time minus time spent in any other Halide pipeline running at the same time. The old behavior didn't really work when pipelines could use different numbers of threads, or if simultaneously running pipelines set the sampling token at different rates.
In the per-pipeline results, time with the thread pool busy spent doing work on some other Halide pipeline is tracked as its own entry. This conveniently also measures time spend in the ragged end of a too-coarse parallel for loop, or time spent trying to grab thread pool locks in a too-fine parallel for loop, so it's also useful for standalone micro-benchmarking.
Pipeline invocations that are just bounds queries get ignored. Previously these could misleadingly bring the average runtime way way down.
I'm going to need some help testing (and probably fixing) the hvx changes. At the very least the remote runtime needs to be rebuilt.
Fixes #5796