-
-
Notifications
You must be signed in to change notification settings - Fork 734
WIP refactor prometheus metrics #7411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@@ -34,20 +34,6 @@ def __init__(self, server: Worker): | |||
def collect(self) -> Iterator[Metric]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gutted most of this class away; eventually it will be deleted completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the metrics you've refactored here actually showing up in a prometheus scrape? Is some magic making that happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they're showing up.
The magic is that they are registered automatically on prometheus_client.REGISTRY
.
In the final thing we'll actually have to use a local registry so that the scheduler doesn't publish the worker metrics and vice versa.
["state"], | ||
) | ||
|
||
compute = Summary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary
metrics replace pairs of _bytes_total
and _count_total
Counters
unit="seconds", | ||
) | ||
|
||
latency = Histogram( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Histogram
metrics also embed a Summary; they replace the _max
metrics and all crick stuff.
"Current number of bytes worth of data transfers with peer workers", | ||
["direction"], | ||
unit="bytes", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly there's no single metric for count+bytes for current metrics, unlike Summary which does the job for total metrics. We could (should?) create a class for it though.
@@ -0,0 +1,80 @@ | |||
from __future__ import annotations | |||
|
|||
from prometheus_client import Counter, Gauge, Histogram, Summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: from distributed import prometheus
; there create stub classes when prometheus_client
is not available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to evaluate switching to opentelemetry
as brought up by @jacobtomlinson. From what I understand, this would allow us to introduce a hard dependency on the opentelemetry-api
only (which contains a bunch of NoOpMetrics). I think this should solve your problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crusaderky I assume the reason all these metrics are defined in one file (as opposed to being defined in the place they're used) is to have one place to deal with the potential import error?
I was a little surprised to see the metrics created in one place and used in another, when they're only used in exactly one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's because many metrics are contributed to by different modules.
For example for dask_worker_event_loop_blocked
:
- The
disk-write-target
label is set byworker_state_machine
- The
disk-write-spill
label is set byworker_memory
- The
disk-read-execute
label is set byworker
- The
disk-read-get-data
label is set byworker
Besides that, it also felt nice to have everything in a nice, self-contained module and call the metrics through a metrics.XXX
domain.
94f8b74
to
2634014
Compare
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 22 files ± 0 22 suites ±0 9h 52m 49s ⏱️ - 15m 8s For more details on these failures, see this check. Results for commit 2634014. ± Comparison against base commit f830259. |
cc @ntabris |
Closes #7364
This for now is just a POC to showcase the look&feel suggested in the parent issue.