
Prometheus: "other" tasks count is confusing #7412

Closed
@crusaderky

In Prometheus, we export the number of tasks on the worker, broken down by state, using:

    @property
    def task_counts(self) -> dict[TaskStateState | Literal["other"], int]:
        # Actors can be in any state other than {fetch, flight, missing}
        n_actors_in_memory = sum(
            self.tasks[key].state == "memory" for key in self.actors
        )
        out: dict[TaskStateState | Literal["other"], int] = {
            # Key measure for occupancy.
            # Also includes cancelled(executing) and resumed(executing->fetch)
            "executing": len(self.executing),
            # Also includes cancelled(long-running) and resumed(long-running->fetch)
            "long-running": len(self.long_running),
            "memory": len(self.data) + n_actors_in_memory,
            "ready": len(self.ready),
            "constrained": len(self.constrained),
            "waiting": len(self.waiting),
            "fetch": self.fetch_count,
            "missing": len(self.missing_dep_flight),
            # Also includes cancelled(flight) and resumed(flight->waiting)
            "flight": len(self.in_flight_tasks),
        }
        # released | error
        out["other"] = other = len(self.tasks) - sum(out.values())
        assert other >= 0
        return out

Erred tasks can be an important indicator that something is going very wrong. I just went through a real use case where "other" was very large, and I don't know for sure what I'm looking at: a huge number of erred tasks, released tasks, or a bug in the code that hides a third state?

Solution 1: add a new set to the WorkerStateMachine, containing all erred tasks, just for the purpose of this counting
Solution 2: rewrite the task count as an inc/dec counter whenever there is a transition
Solution 3: #7411 (which is the same as 2, but natively with prometheus_client)
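
As a rough illustration of Solution 2, the sketch below keeps one counter per state and adjusts it on every transition, so "error" and "released" are reported separately instead of being derived by subtraction. The names TaskStateCounter, on_transition and snapshot are hypothetical and are not part of distributed; the sketch also assumes that every state change passes through a single hook, which the real worker state machine would have to guarantee.

```python
from __future__ import annotations

from collections import Counter


class TaskStateCounter:
    """Hypothetical helper: per-state task counts, updated on each transition."""

    def __init__(self) -> None:
        self._counts: Counter[str] = Counter()

    def on_transition(self, start: str | None, finish: str | None) -> None:
        # start is None when a task is first created on the worker;
        # finish is None when the task is forgotten.
        if start is not None:
            self._counts[start] -= 1
            # A negative count means a transition bypassed this hook.
            assert self._counts[start] >= 0, start
        if finish is not None:
            self._counts[finish] += 1

    def snapshot(self) -> dict[str, int]:
        # Report only states that currently hold at least one task.
        return {state: n for state, n in self._counts.items() if n}


counter = TaskStateCounter()
# First task: created, scheduled, executed, and erred.
counter.on_transition(None, "released")
counter.on_transition("released", "waiting")
counter.on_transition("waiting", "ready")
counter.on_transition("ready", "executing")
counter.on_transition("executing", "error")
# Second task: created and left in the released state.
counter.on_transition(None, "released")
print(counter.snapshot())
```

With this approach there is no "other" bucket at all: a state that is missing from the expected set would show up under its own name, which would also expose the "hidden third state" case.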
