Fine performance metrics: Break down idle time on the Scheduler #7672

- Part of #7665
- Blocked by #7666 
- Blocked by #7671

With #7671 done, we know how much time workers spend idle because they are not receiving enough Compute messages from the scheduler.
On the scheduler side, this idle time can be broken down further: the scheduler applies negative corrections to `Scheduler.cumulative_worker_metrics["execute", "n/a", "idle", "seconds"]` and credits the same amount to more specific categories, so the overall total is unchanged.
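A minimal sketch of the reclassification, assuming `cumulative_worker_metrics` behaves like a plain mapping from `(context, prefix, activity, unit)` tuples to floats, as the key above suggests. The helper `reclassify_idle` and the activity labels are hypothetical and not part of `distributed`. The point it illustrates is that the grand total stays constant: every second credited to a more specific bucket is debited from the generic idle key.

```python
from collections import defaultdict

# Key taken from the issue text; the other activity labels are hypothetical.
IDLE_KEY = ("execute", "n/a", "idle", "seconds")


def reclassify_idle(metrics, breakdown):
    """Move idle seconds into more specific buckets (hypothetical helper).

    ``metrics`` stands in for Scheduler.cumulative_worker_metrics.
    ``breakdown`` maps a new activity label to the seconds to reassign.
    """
    total = 0.0
    for activity, seconds in breakdown.items():
        key = ("execute", "n/a", activity, "seconds")
        metrics[key] = metrics.get(key, 0.0) + seconds
        total += seconds
    # Negative correction: the sum across all keys is unchanged
    metrics[IDLE_KEY] = metrics.get(IDLE_KEY, 0.0) - total
    return metrics


# Example: workers reported 10 s of idle time; the scheduler attributes 5.5 s of it
metrics = defaultdict(float)
metrics[IDLE_KEY] = 10.0
reclassify_idle(
    metrics,
    {"idle: no tasks on cluster": 4.0, "idle: not parallelisable": 1.5},
)
```

After the call, the generic idle bucket holds only the 4.5 s the scheduler could not explain.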

On the scheduler, we know for each worker:

- time spent with tasks in processing state. The difference between this and the sum of all worker metrics other than 'idle' shows, for example, time lost to imperfectly pipelined round trips (RTTs) between worker and scheduler; it should increase when `distributed.scheduler.worker-saturation` is set too low.
- time spent with too few tasks in processing state on the worker, while at least one task is processing somewhere on the cluster, e.g. because the workload is not fully parallelisable.
- time spent with zero tasks in processing state anywhere on the cluster, e.g. while waiting for the Client. This should include the initial decision time, from the moment the scheduler receives `update_graph` until it releases the event loop.
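The three categories above could be accumulated on the scheduler with an event-driven counter like the one below. This is a sketch under several assumptions not stated in the issue: time is counted in thread-seconds, "too few tasks" means fewer processing tasks than the worker has threads, and `update` is called whenever a worker's count of processing tasks changes. `IdleBreakdown` and its bucket labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IdleBreakdown:
    """Hypothetical scheduler-side accumulator (not part of distributed).

    Elapsed time between two updates is attributed using the state that
    held during that interval, in thread-seconds per worker.
    """

    nthreads: dict  # worker address -> number of threads (assumed known)
    processing: dict = field(default_factory=dict)  # worker -> tasks processing
    last_time: Optional[float] = None
    seconds: dict = field(default_factory=lambda: defaultdict(float))

    def _attribute(self, now):
        if self.last_time is not None:
            dt = now - self.last_time
            cluster_total = sum(self.processing.values())
            for worker, nthreads in self.nthreads.items():
                busy = min(self.processing.get(worker, 0), nthreads)
                idle = nthreads - busy
                # Threads with a task in processing state on this worker
                self.seconds[worker, "processing"] += busy * dt
                if cluster_total == 0:
                    # Nothing processing anywhere, e.g. waiting for the Client
                    self.seconds[worker, "cluster idle"] += idle * dt
                else:
                    # Work exists on the cluster, but not enough to fill this worker
                    self.seconds[worker, "not parallelisable"] += idle * dt
        self.last_time = now

    def update(self, now, worker, n_processing):
        """Record that ``worker`` has ``n_processing`` tasks from time ``now``."""
        self._attribute(now)
        self.processing[worker] = n_processing


# Two workers with two threads each
b = IdleBreakdown(nthreads={"a": 2, "b": 2})
b.update(0.0, "a", 2)  # t=0: worker a starts two tasks
b.update(3.0, "a", 0)  # t=3: a finishes; b sat idle while a worked
b.update(5.0, "b", 1)  # t=5: new work after 2 s with nothing processing
```

Comparing `seconds[w, "processing"]` with the sum of worker `w`'s non-idle metrics would give the delta described in the first bullet, while the other two buckets are the amounts to subtract from the idle key.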

