[RFC] Fine-grained metrics using Spans
#7619
base: main
Conversation
```python
self.fast_metrics.log_read(nbytes)
with meter("memory-read"):
    nbytes = cast(int, self.fast.weights[key])
    self.fast_metrics.log_read(nbytes)
```
fast_metrics / slow_metrics is a big chunk of ad-hoc code that I could remove in #7586. The key feature I leveraged (which I don't think is possible in this design) was being able to capture the metrics multiple times.
In other words: if, downstream of this PR, I want to have a holistic view of all activity of the SpillBuffer, I need to

- make 100% sure that I'm doing something with the metrics after all points of access to the SpillBuffer. I suppose we could add an `assert_root: bool = False` parameter to `meter()` to make this easier. For example, in #7586 (Fine performance metrics for execute, gather_dep, etc.), `SpillBuffer.cumulative_metrics` (which is conveniently posted to the worker's Prometheus API) also captures scatter activity, while I don't bother in the worker state machine.
- perform a full scan of all elements in `Worker.digests_total`, search the tuples for the keywords relevant to the SpillBuffer, and extrapolate from there. This assumes that all components that access the SpillBuffer (WorkerStateMachine, WorkerMemoryManager, `Worker.get_data`) post the metrics in the same way to `Worker.digests_total`.

Both of the above are feasible, but also quite inconvenient.
Another issue here is that this design has no support for non-time metrics. `with meter("memory-read"):` doesn't do anything above (reading from `SpillBuffer.fast` takes nanoseconds). #7586 uses the `memory-read` tag to post bytes and count metrics to `Worker.digests_total`, which can in turn be used to calculate cache hit ratios. In this design, you were forced to keep a secondary ad-hoc metric storage system (`SpillBuffer.fast_metrics` / `SpillBuffer.slow.metrics`).
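As a rough illustration of what those bytes/count digests enable (the key layout below is a hypothetical stand-in, not the actual #7586 schema), a cache hit ratio can be derived from the count entries:

```python
from collections import defaultdict

# Hypothetical digest layout: keys are tuples ending in (activity, unit),
# values are cumulative totals (seconds, bytes, or counts).
digests_total = defaultdict(
    float,
    {
        ("execute", "x", "memory-read", "count"): 980.0,
        ("execute", "x", "disk-read", "count"): 20.0,
    },
)


def cache_hit_ratio(digests: dict) -> float:
    """Share of SpillBuffer reads served from fast (in-memory) storage."""
    memory_reads = sum(
        v for k, v in digests.items() if k[-2:] == ("memory-read", "count")
    )
    disk_reads = sum(
        v for k, v in digests.items() if k[-2:] == ("disk-read", "count")
    )
    total = memory_reads + disk_reads
    return memory_reads / total if total else 1.0


print(cache_hit_ratio(digests_total))  # 0.98
```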
> fast_metrics / slow_metrics is a big chunk of ad-hoc code that I could remove in #7586
I don't personally mind leaving these existing ad-hoc metrics in place. As I said, `t1 - t0` is extremely cheap. If it were a choice between two systems that were each easier to understand, versus one that did everything but was harder to understand, my preference would be for the two simpler ones.
For SpillBuffer specifically (I haven't looked closely), aren't `fast_metrics`/`slow_metrics` basically just an aggregation of the fine-grained spans we'd be collecting? We'd be tracing read/write times/counts/bytes broken out by activity and key prefix. `fast_metrics`/`slow_metrics` to me look like that same data, just summed over activity and prefix.
Intuitively to me, if we want coarser metrics that are just an aggregation of finer ones, that seems pretty tractable. I haven't thought about exactly how we'd implement it, but either doing an aggregation during span processing, or just letting the metrics system (Prometheus) do the sum, seems reasonable.
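For the "let Prometheus do the sum" option, a hedged sketch using `prometheus_client` (the metric and label names here are made up for illustration): export the fine-grained data with labels and aggregate in PromQL.

```python
from prometheus_client import Counter

# Hypothetical metric: one counter, labelled by activity and key prefix.
spill_bytes = Counter(
    "dask_worker_spill_bytes_total",
    "Bytes read from or written to the SpillBuffer",
    ["activity", "prefix"],
)

# During span processing, each fine-grained span increments its own label set...
spill_bytes.labels(activity="disk-write", prefix="x").inc(2**20)
spill_bytes.labels(activity="disk-read", prefix="y").inc(2**19)

# ...and the coarse totals are a PromQL query away, e.g.:
#   sum by (activity) (rate(dask_worker_spill_bytes_total[5m]))
```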
Here's a way we can get aggregated metrics, like we're currently doing with SpillBuffer. I haven't actually removed the fast/slow metrics yet, but I imagine this would allow us to: 8f9fafe
Basically:

- spans can hold arbitrary metadata, like OTel
- we set the metadata `aggregate=True` on spans that we want aggregate metrics from
- our span-processing logic on the worker knows that when a span has `aggregate=True`, it exports a metric both for the fully-qualified name `("transition", "x", "released->memory", "disk-write")` and the non-qualified name `disk-write`
- now we have both a cumulative disk-write metric across all tasks and operations, as well as disk-write broken down by task and operation. This generalizes to anything else we might want to trace.
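A minimal sketch of that span-processing rule (the function and digest layout below are illustrative assumptions, not the code in 8f9fafe):

```python
from collections import defaultdict

digests_total = defaultdict(float)  # keyed by qualified tuple or bare activity name


def export_span(qualified_name: tuple, duration: float, metadata: dict) -> None:
    """Worker-side span processing (illustrative)."""
    # Always record under the fully-qualified name, e.g.
    # ("transition", "x", "released->memory", "disk-write")
    digests_total[qualified_name] += duration
    # If the span asked for it, also record under the bare activity name,
    # giving a cumulative metric across all tasks and operations.
    if metadata.get("aggregate"):
        digests_total[qualified_name[-1]] += duration


export_span(("transition", "x", "released->memory", "disk-write"), 0.5, {"aggregate": True})
export_span(("get-data", "y", "disk-write"), 0.25, {"aggregate": True})
assert digests_total["disk-write"] == 0.75
```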
One of the biggest critiques of #7586 was that there are too many …
Ultimately, #7586 in its current incarnation says: if you want to meter anything, you need at top level a … Whereas this PR says: all metrics are recorded in a local list; after you finish recording them, you must do something with this list. Which, to me, feels a bit like a chicken-and-egg situation.
Sorry, I didn't add this because this PR already went far further into implementation than I'd intended. But this is pretty straightforward, borrowing the idea of attributes on spans: you just add non-time metrics (like disk bytes read) as attributes on spans. Then in …
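For example, something along these lines: a fragment that uses the PR's `meter` context manager, where `set_attribute` and `slow_read` are hypothetical names borrowed from the OTel attribute idea, not an API that exists here.

```python
# Hypothetical sketch: a SpillBuffer read path that attaches non-time
# metrics (bytes, count) to the span, alongside the timing it already records.
def read_from_disk(buffer, key):
    with meter("disk-read") as span:
        value, nbytes = buffer.slow_read(key)  # assumed helper
        span.set_attribute("bytes", nbytes)    # non-time metric
        span.set_attribute("count", 1)
    return value
```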
To be clear, this PR is just meant to be an illustration of an architecture. Not everything from #7586 is traced here; it's meant to show just enough to get a feel for the API and confirm it actually works.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files +1, 26 suites +1, 14h 55m 25s ⏱️ +3h 27m 17s

For more details on these failures and errors, see this check.

Results for commit c0fe05f. ± Comparison against base commit 310fc95.

This pull request skips 1 test.
this needs to be cleaned up
lol without `direct=False` `gather` breaks across multiple workers cause it thinks it's running on a worker
I'm hoping having an explicit … Mostly I think having the data structure of spans (and an automatic way to keep track of the currently-active ones) is what makes this a pretty simple change.
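A hedged sketch of what "an automatic way to keep track of the currently-active ones" could look like, using a contextvar (the names here are illustrative, not necessarily the PR's implementation):

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from contextvars import ContextVar


class Span:
    def __init__(self, name: str, parent: Span | None = None):
        self.name = name
        self.parent = parent
        self.children: list[Span] = []
        self.duration = 0.0


# The currently-active span; nested meter() calls attach children to it.
current_span: ContextVar[Span | None] = ContextVar("current_span", default=None)


@contextmanager
def meter(name: str):
    parent = current_span.get()
    span = Span(name, parent)
    if parent is not None:
        parent.children.append(span)
    token = current_span.set(span)
    start = time.perf_counter()
    try:
        yield span
    finally:
        span.duration = time.perf_counter() - start
        current_span.reset(token)
```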
An idea for an alternative architecture for #7586. See #7586 (comment) for motivation.
Core ideas:
- `Span` data structure that forms a tree of spans. This could hopefully be swapped for the OTel API itself in the future.
- call `start` on the span before launching the task (so we can track event loop overhead)
- Events carry a `span` field (like `stimulus_id`, as @fjetter said). Async callbacks like `execute` explicitly fill in the `span` field on the Event they return.
- call `stop` on the span (again, so we can track event loop overhead)
- `span.flat()` flattens the span's tree, giving us a simple breakdown of time
- `span.flat()` -> `DigestEvent` to digest the metrics via the normal Instructions infrastructure

In pseudocode (referring to @crusaderky's comment on #7586 that I now can't find, maybe deleted?):
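A self-contained sketch of that flow (the `child` helper and the hard-coded sleeps are illustrative assumptions; only `Span`, `start`/`stop`, `flat`, and the digest step come from the bullets above):

```python
from __future__ import annotations

import time
from collections import defaultdict


class Span:
    """Illustrative stand-in for the Span described above: a node in a tree of timings."""

    def __init__(self, name: str):
        self.name = name
        self.children: list[Span] = []
        self._start = 0.0
        self.duration = 0.0

    def start(self) -> None:
        self._start = time.perf_counter()

    def stop(self) -> None:
        self.duration = time.perf_counter() - self._start

    def child(self, name: str) -> Span:
        c = Span(name)
        self.children.append(c)
        return c

    def flat(self) -> dict[tuple[str, ...], float]:
        """Flatten the tree into {qualified name: seconds}."""
        out = {(self.name,): self.duration}
        for c in self.children:
            for names, t in c.flat().items():
                out[(self.name, *names)] = t
        return out


# The state machine starts the span before launching the task...
span = Span("execute")
span.start()

# ...the async callback does its work under child spans...
deser = span.child("deserialize")
deser.start()
time.sleep(0.01)
deser.stop()

run = span.child("thread")
run.start()
time.sleep(0.02)
run.stop()

# ...and returns an Event carrying the span; the state machine stops it and
# digests span.flat() via a DigestEvent-like instruction.
span.stop()

digests_total = defaultdict(float)
for name, seconds in span.flat().items():
    digests_total[name] += seconds
```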
cc @hendrikmakait @fjetter @crusaderky