Fine performance metrics: Meter task re-execution after losing a worker #7676

@crusaderky

When a worker dies, any tasks it held in the memory state are lost: they transition back to released on the scheduler and are recomputed somewhere else.

We would like to know how much time is spent recomputing tasks after a worker dies. This could prompt the user to, for example, call replicate() on important data.

Add a boolean flag to the Compute message, stating that the task was in memory at some earlier point and is now being recomputed.
When the task ends successfully on the worker, instead of logging its granular metrics we will log a lump sum under the ("execute", <prefix>, "recompute", "seconds") label. This mirrors the treatment of failed tasks, for which we log a lump sum under the ("execute", <prefix>, "failed", "seconds") label, introduced in #7586.
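To make the proposed labelling concrete, here is a minimal sketch of the collapsing logic. It is not the actual worker code: the function name `collapse_metrics`, the shape of the `granular` dict, the choice to sum only the seconds-denominated entries, and the rule that `failed` takes precedence over `recompute` are all assumptions made for illustration.

```python
from __future__ import annotations


def collapse_metrics(
    prefix: str,
    granular: dict[tuple[str, str], float],
    *,
    recompute: bool = False,
    failed: bool = False,
) -> dict[tuple[str, str, str, str], float]:
    """Return the metrics to log for one task execution (hypothetical helper).

    ``granular`` maps (activity, unit) to a value, e.g.
    ("thread-cpu", "seconds") -> 1.5.
    """
    if not (failed or recompute):
        # Normal case: keep every granular metric, namespaced by prefix.
        return {
            ("execute", prefix, activity, unit): value
            for (activity, unit), value in granular.items()
        }

    # Lump-sum case: discard the breakdown and keep only the total time.
    # Assumption: only entries measured in seconds contribute to the total.
    total = sum(
        value for (_, unit), value in granular.items() if unit == "seconds"
    )
    # Assumption: a failed recomputation is reported as "failed".
    label = "failed" if failed else "recompute"
    return {("execute", prefix, label, "seconds"): total}


granular = {
    ("thread-cpu", "seconds"): 1.5,
    ("thread-noncpu", "seconds"): 0.25,
    ("disk-read", "bytes"): 1024,
}
print(collapse_metrics("inc", granular))
print(collapse_metrics("inc", granular, recompute=True))
```

With the flag set, the three granular entries are replaced by a single ("execute", "inc", "recompute", "seconds") entry, so time lost to recomputation can be summed separately from first-time execution.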
