Skip to content

WIP refactor prometheus metrics #7411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

crusaderky
Copy link
Collaborator

Closes #7364
This for now is just a POC to showcase the look&feel suggested in the parent issue.

@@ -34,20 +34,6 @@ def __init__(self, server: Worker):
def collect(self) -> Iterator[Metric]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gutted most of this class away; eventually it will be deleted completely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the metrics you've refactored here actually showing up in a prometheus scrape? Is some magic making that happen?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they're showing up.

The magic is that they are registered automatically on prometheus_client.REGISTRY.
In the final thing we'll actually have to use a local registry so that the scheduler doesn't publish the worker metrics and vice versa.

["state"],
)

compute = Summary(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary metrics replace pairs of _bytes_total and _count_total Counters

unit="seconds",
)

latency = Histogram(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Histogram metrics also embed a Summary; they replace the _max metrics and all crick stuff.

"Current number of bytes worth of data transfers with peer workers",
["direction"],
unit="bytes",
)
Copy link
Collaborator Author

@crusaderky crusaderky Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly there's no single metric for count+bytes for current metrics, unlike Summary which does the job for total metrics. We could (should?) create a class for it though.

@@ -0,0 +1,80 @@
from __future__ import annotations

from prometheus_client import Counter, Gauge, Histogram, Summary
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: from distributed import prometheus; there create stub classes when prometheus_client is not available

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to evaluate switching to opentelemetry as brought up by @jacobtomlinson. From what I understand, this would allow us to introduce a hard dependency on the opentelemetry-api only (which contains a bunch of NoOpMetrics). I think this should solve your problem here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusaderky I assume the reason all these metrics are defined in one file (as opposed to being defined in the place they're used) is to have one place to deal with the potential import error?

I was a little surprised to see the metrics created in one place and used in another, when they're only used in exactly one place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's because many metrics are contributed to by different modules.

For example for dask_worker_event_loop_blocked:

  • The disk-write-target label is set by worker_state_machine
  • The disk-write-spill label is set by worker_memory
  • The disk-read-execute label is set by worker
  • The disk-read-get-data label is set by worker

Besides that, it also felt nice to have everything in a nice, self-contained module and call the metrics through a metrics.XXX domain.

@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       22 files  ±  0         22 suites  ±0   9h 52m 49s ⏱️ - 15m 8s
  3 268 tests  -   1    3 174 ✔️  -   5       85 💤 ±0    9 +  4 
35 879 runs   - 11  34 297 ✔️  - 65  1 519 💤  - 2  63 +56 

For more details on these failures, see this check.

Results for commit 2634014. ± Comparison against base commit f830259.

@mrocklin
Copy link
Member

cc @ntabris

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve configuration/composition of Prometheus metric collection
4 participants