
Conversation

@gjoseph92
Collaborator

An idea for an alternative architecture for #7586. See #7586 (comment) for motivation.

Core ideas:

  • Rip off the OTel tracing API with a recursive Span data structure that forms a tree of spans (a minimal sketch follows this list). This could hopefully be swapped for the OTel API itself in the future.
  • Spans are associated with asyncio Tasks via contextvars. We call start on the span before launching the task (so we can track event loop overhead)
  • State machine events have a span field (like stimulus_id, as @fjetter said). Async callbacks like execute explicitly fill in the span field on the Event they return.
  • When the ExecuteSuccessEvent, etc. is processed, we call stop on the span (again, so we can track event loop overhead)
  • span.flat() flattens the span's tree, giving us a simple breakdown of time
  • Convert span.flat() -> DigestEvent to digest the metrics via the normal Instructions infrastructure
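
Concretely, the Span piece might look something like the minimal sketch below. This is purely illustrative (field names, get_span, as_current, and flat are assumptions matching the pseudocode further down, not a finished API):

```python
from __future__ import annotations

import contextvars
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

# The span currently associated with the running asyncio task.
_current_span: contextvars.ContextVar[Span | None] = contextvars.ContextVar(
    "current_span", default=None
)


@dataclass
class Span:
    name: tuple[str, ...]
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)
    start_time: float | None = None
    stop_time: float | None = None

    def start(self) -> None:
        self.start_time = time.perf_counter()

    def stop(self) -> None:
        self.stop_time = time.perf_counter()

    @property
    def elapsed(self) -> float:
        assert self.start_time is not None and self.stop_time is not None
        return self.stop_time - self.start_time

    @contextmanager
    def as_current(self):
        # Make this span the parent of any spans created while it's active.
        token = _current_span.set(self)
        try:
            yield self
        finally:
            _current_span.reset(token)

    def flat(self) -> dict[tuple[str, ...], float]:
        """Flatten the tree into {qualified name: seconds}. The entry for the
        span itself holds time not accounted for by children, i.e. overhead."""
        out: dict[tuple[str, ...], float] = {}
        children_total = 0.0
        for child in self.children:
            for name, secs in child.flat().items():
                out[self.name + name] = out.get(self.name + name, 0.0) + secs
            children_total += child.elapsed
        out[self.name] = self.elapsed - children_total
        return out


def get_span(name: str | tuple[str, ...]) -> Span:
    """Create a Span, attached as a child of the currently-active span."""
    span = Span(name if isinstance(name, tuple) else (name,))
    if (parent := _current_span.get()) is not None:
        parent.children.append(span)
    return span
```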

In pseudocode (referring to @crusaderky's comment on #7586 that I now can't find, maybe deleted?):

```python
def handle_stimulus(self, *stims):
    instructions = self.state.handle_stimulus(*stims)

    for inst in instructions:
        if isinstance(inst, GatherDep):
            # Start the span *before* creating the task, so time spent waiting
            # on the event loop shows up as overhead on the span.
            span = get_span("gather-dep")
            span.start()
            with span.as_current():
                task = asyncio.create_task(
                    self.gather_dep(
                        inst.worker,
                        ...,
                        span=span,
                    ),
                )
        elif isinstance(inst, Execute):
            span = get_span(("execute", key_split(inst.key)))
            span.start()
            with span.as_current():
                task = asyncio.create_task(
                    self.execute(
                        inst.key,
                        ...,
                        span=span,
                    ),
                )


def _handle_execute_success(self, ev: ExecuteSuccessEvent) -> RecsInstrs:
    ...
    # This is where we'd call ev.span.stop() (again, so event loop overhead is
    # captured), then convert the flattened span tree into DigestMetric
    # instructions:
    instructions.extend(
        convert_span_to_digest_metric_instrs(ev.span)
    )
    # ->
    # [
    #     DigestMetric(("execute", "task-prefix", "deserialize"), 0.5),
    #     DigestMetric(("execute", "task-prefix", "disk-read"), 2.0),
    #     DigestMetric(("execute", "task-prefix", "cpu-thread"), 4.0),
    #     DigestMetric(("execute", "task-prefix", "cpu"), 1.0),
    #     ^ non-thread CPU time (GIL?)
    #     DigestMetric(("execute", "task-prefix"), 1.0),
    #     ^ this last one is "overhead", aka event loop: the difference
    #       between start and stop of the overall span and the total time
    #       spent in sub-spans.
    # ]
    return recs, instructions


async def execute(..., span):
    ...
    await run_in_executor(
        apply_function_simple
    )
    ...
    return ExecuteSuccessEvent(span=span)


def apply_function_simple(...):
    # Runs in the executor thread; meter() records a child span under the
    # current execute span.
    with meter("cpu"):
        func()
```
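
And one possible shape for the conversion helper used in _handle_execute_success above (again just a sketch; DigestMetric is assumed to be an instruction carrying a name tuple and a float):

```python
def convert_span_to_digest_metric_instrs(span: Span) -> list[DigestMetric]:
    # span.flat() already charges each span's unaccounted time to the span
    # itself, so the root entry naturally becomes the "overhead" / event loop
    # metric shown in the comment above.
    return [DigestMetric(name, seconds) for name, seconds in span.flat().items()]
```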

cc @hendrikmakait @fjetter @crusaderky

The review comment below is on this excerpt from the PR diff:

```python
self.fast_metrics.log_read(nbytes)
with meter("memory-read"):
    nbytes = cast(int, self.fast.weights[key])
    self.fast_metrics.log_read(nbytes)
```
@crusaderky
Collaborator (Mar 7, 2023)

fast_metrics / slow_metrics is a big chunk of ad-hoc code that I could remove in #7586. The key feature I leveraged (which I don't think is possible in this design) was being able to capture the metrics multiple times.

In other words: if, downstream of this PR, I want to have a holistic view of all activity of the SpillBuffer, I need to

  1. I must make 100% sure that I'm doing something with the metrics after all points of access to the SpillBuffer. I suppose we could add an assert_root: bool = False parameter to meter() to make this easier.
    For example, in #7586 (Fine performance metrics for execute, gather_dep, etc.), SpillBuffer.cumulative_metrics (which is conveniently posted to the worker's Prometheus API) also captures scatter activity, which I don't bother with in the worker state machine.
  2. I must perform a full scan of all elements in Worker.digests_total, search the tuples for the keywords relevant to the SpillBuffer, and extrapolate from there (see the sketch at the end of this comment). This assumes that all components that access the SpillBuffer (WorkerStateMachine, WorkerMemoryManager, Worker.get_data) post their metrics to Worker.digests_total in the same way.

Both of the above are feasible - but also quite inconvenient.

Another issue here is that this design has no support for non-time metrics. with meter("memory-read"): doesn't do anything above (reading from SpillBuffer.fast takes nanoseconds). #7586 uses the memory-read tag to post bytes and count metrics to Worker.digests_total, which can in turn be used to calculate cache hit ratios. In this design, you were forced to keep a secondary ad-hoc metric storage system (SpillBuffer.fast_metrics / SpillBuffer.slow_metrics).
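
For concreteness, the kind of scan described in point 2 might look like this (hypothetical code; it assumes all components post tuple-shaped names containing the relevant keyword):

```python
def spillbuffer_disk_write_seconds(worker) -> float:
    # Recover a SpillBuffer-wide total by scanning every entry of
    # Worker.digests_total for the activity keyword, regardless of whether it
    # was posted by execute, gather_dep, get_data, or the memory manager.
    return sum(
        value
        for name, value in worker.digests_total.items()
        if isinstance(name, tuple) and "disk-write" in name
    )
```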

@gjoseph92
Collaborator Author

> fast_metrics / slow_metrics is a big chunk of ad-hoc code that I could remove in #7586

I don't personally mind leaving these existing ad-hoc metrics in place. As I said, t1 - t0 is extremely cheap. If it were a choice between two systems that were each easier to understand and one that did everything but was harder to understand, my preference would be for the two simpler ones.

For SpillBuffer specifically (I haven't looked closely), aren't fast_metrics/slow_metrics basically just an aggregation of the fine-grained spans we'd be collecting? We'd be tracing read/write times/counts/bytes broken out by activity and key prefix. fast_metrics/slow_metrics to me look like that same data, just summed over activity and prefix.

Intuitively to me, if we want coarser metrics that are just an aggregation of finer ones, that seems pretty tractable. I haven't thought about exactly how we'd implement it, but either doing an aggregation during span processing, or just letting the metrics system (Prometheus) do the sum, seems reasonable.

@gjoseph92
Collaborator Author

Here's a way we can get aggregated metrics, like we're currently doing with SpillBuffer. I haven't actually removed the fast/slow metrics yet, but I imagine this would allow us to: 8f9fafe

Basically

  1. spans can hold arbitrary metadata, like OTel
  2. we set the metadata aggregate=True on spans that we want aggregate metrics from
  3. our span-processing logic on the worker knows that when a span has aggregate=True, it exports a metric both for the fully-qualified name ("transition", "x", "released->memory", "disk-write") and the non-qualified name disk-write (see the sketch after this list).
  4. Now we have both a cumulative disk-write metric across all tasks and operations, as well as disk-write broken down by task and operation. This generalizes to anything else we might want to trace.
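
A sketch of what that span-processing step could look like, reusing the hypothetical Span and DigestMetric names from above (the aggregate attribute and the walk are assumptions, not the actual commit):

```python
def export_metrics(span: Span, prefix: tuple = ()) -> list[DigestMetric]:
    # Walk the span tree. Each span's "own" time (not covered by children) is
    # exported under its fully-qualified name; spans flagged aggregate=True
    # are exported a second time under their bare name (e.g. ("disk-write",)),
    # which gives the cumulative, component-wide metric for free.
    qualified = prefix + span.name
    own = span.elapsed - sum(child.elapsed for child in span.children)
    metrics = [DigestMetric(qualified, own)]
    if span.attributes.get("aggregate"):
        metrics.append(DigestMetric(span.name, own))
    for child in span.children:
        metrics.extend(export_metrics(child, prefix=qualified))
    return metrics
```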

@crusaderky
Collaborator

One of the biggest critiques of #7586 was that there are too many .add_callback() wrappers around the code.
I don't think this PR makes it any better - it just does less.

@crusaderky
Collaborator

Ultimately, #7586 in the current incarnation says:

If you want to meter anything, you need at top level a .add_callback() context manager to tell it what to do when it acquires the metrics. There's also an optional system, which builds on top of it, where metrics are recorded in a temporary local list so that you can post-process them later on instead of immediately publishing them.

Whereas this PR says:

All metrics are recorded in a local list. After you finish recording them, you must do something with this list.

Which, to me, feels like a bit of a chicken-and-egg situation.

@gjoseph92
Collaborator Author

> Another issue here is that this design has no support for non-time metrics

Sorry, I didn't add this because this PR already went far further into implementation than I'd intended. But it's pretty straightforward, borrowing the idea of attributes on spans: you just add non-time metrics (like disk bytes read) as attributes on spans. Then in to_digest_metrics, you aggregate them and convert them to DigestMetrics, just like the timings (see the sketch below).
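
A minimal sketch of what that could look like, again reusing the hypothetical Span from above (function and attribute names are made up):

```python
def metered_disk_read(path: str) -> bytes:
    # Record a non-time metric by stashing it as an attribute on the span.
    span = get_span("disk-read")  # becomes a child of the current span
    span.start()
    with open(path, "rb") as f:
        data = f.read()
    span.stop()
    span.attributes["bytes"] = len(data)
    span.attributes["count"] = 1
    return data


def attributes_to_digest_metrics(span: Span, prefix: tuple = ()) -> list[DigestMetric]:
    # Convert span attributes into DigestMetric instructions alongside the
    # timings, e.g. ("execute", "task-prefix", "disk-read", "bytes").
    qualified = prefix + span.name
    metrics = [
        DigestMetric(qualified + (attr,), value)
        for attr, value in span.attributes.items()
    ]
    for child in span.children:
        metrics.extend(attributes_to_digest_metrics(child, prefix=qualified))
    return metrics
```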

> it just does less

To be clear, this PR is just meant to be an illustration of an architecture. Not everything from #7586 is traced here; it's meant to show just enough to get a feel for the API and confirm it actually works.

@github-actions
Contributor

github-actions bot commented Mar 8, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files +1   26 suites +1   14h 55m 25s ⏱️ +3h 27m 17s
3 502 tests +11   3 156 ✔️ -229   104 💤 +1   163 ❌ +160   79 🔥 +79
44 266 runs +2 300   38 306 ✔️ -1 662   2 078 💤 +84   2 855 ❌ +2 851   1 027 🔥 +1 027

For more details on these failures and errors, see this check.

Results for commit c0fe05f. ± Comparison against base commit 310fc95.

This pull request skips 1 test.
distributed.tests.test_worker_client ‑ test_submit_different_names

@gjoseph92
Collaborator Author

I'm hoping that having an explicit Span data structure like this might also give us a nice way to handle #7601:

  1. Keep track of all currently-running spans (global contextvar, etc.). Probably just currently-running root spans, but generally have a way to ask "which spans that we care about are active right now?"
  2. On every metrics scrape, ingest the elapsed-so-far time of the currently-active spans, then emit metrics as usual
  3. Do something to ensure we don't double-count time when the span ends (e.g. move its start up to "now"?). I haven't decided on the details of how we'd represent this, but it's pretty straightforward (rough sketch below).
  4. When in-progress spans actually end, ingest them as usual; any time we've already ingested due to a metrics scrape is not double-counted.

Mostly I think having the data structure of spans (and an automatic way to keep track of the currently-active ones) is what makes this a pretty simple change.
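
A rough sketch of what steps 2-3 might look like at scrape time (everything here, including the registry of running spans, is hypothetical):

```python
import time

# Step 1: some global registry of currently-running root spans.
_active_root_spans = set()  # of Span objects


def ingest_partial_spans(digests_total: dict) -> None:
    # Step 2: on each metrics scrape, fold the elapsed-so-far time of every
    # in-progress root span into the digests.
    now = time.perf_counter()
    for span in _active_root_spans:
        if span.start_time is None:
            continue
        digests_total[span.name] = (
            digests_total.get(span.name, 0.0) + now - span.start_time
        )
        # Step 3: move the span's start up to "now", so that when it finally
        # stops (step 4) only the remaining, not-yet-ingested time is added.
        span.start_time = now
```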
