Closed
Description
In Prometheus, we export the number of tasks on the worker using the code at
distributed/distributed/worker_state_machine.py, lines 3299 to 3324 in f830259.
Erred tasks can be an important indicator that something is going very wrong. I just went through a real use case where "other" was very large, and I can't tell for sure what I'm looking at: a huge number of erred tasks, released tasks, or a bug in the code that hides a third state?
Solution 1: add a new set to the WorkerStateMachine, containing all erred tasks, just for the purpose of this counting
Solution 2: rewrite the task count as a counter that is incremented/decremented whenever there is a transition
Solution 3: #7411 (which is the same as 2, but natively with prometheus_client)
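For illustration, Solution 2 could look roughly like the sketch below. It is not code from distributed: the class `TaskStateCounts` and its methods `transition` and `snapshot` are hypothetical names, and the state strings are only examples. The idea is that every transition decrements the count of the state being left and increments the count of the state being entered, so erred tasks get their own count instead of being derived by scanning all tasks.

```python
from collections import Counter
from typing import Dict, Optional


class TaskStateCounts:
    """Hypothetical helper: per-state task counts updated on each transition."""

    def __init__(self) -> None:
        self._counts = Counter()

    def transition(self, start: Optional[str], finish: Optional[str]) -> None:
        # start=None means the task was just created;
        # finish=None means the task was forgotten.
        if start is not None:
            self._counts[start] -= 1
            if self._counts[start] == 0:
                # Drop empty states so the snapshot only lists live ones.
                del self._counts[start]
        if finish is not None:
            self._counts[finish] += 1

    def snapshot(self) -> Dict[str, int]:
        # Copy, so that callers (e.g. a metrics exporter) cannot mutate state.
        return dict(self._counts)


counts = TaskStateCounts()
counts.transition(None, "executing")
counts.transition(None, "executing")
counts.transition("executing", "error")
counts.transition("executing", "memory")
counts.transition("memory", None)
print(counts.snapshot())  # {'error': 1}
```

Reading the counts at scrape time is then a dictionary copy rather than a pass over every task. Solution 3 would follow the same inc/dec pattern but keep the numbers in prometheus_client metrics directly.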