Description
- Part of Fine performance metrics meta-issue #7665
- Complements Fine performance metrics: Meter wasted partial compute time after losing a worker #7678
When a worker dies, any tasks it held in the memory state are lost; they transition back to released on the scheduler and are recomputed on another worker.
We would like to know how much time is spent recomputing tasks after a worker dies. This could inform the user, e.g. prompt them to call replicate() on important data.
Add a boolean flag to the Compute message, stating that the task was in memory at some earlier point and is now being recomputed.
When the task ends successfully on the worker, instead of logging its granular metrics we will log a lump sum under the ("execute", <prefix>, "recompute", "seconds") label. This is an equivalent treatment to when a task fails and we log a lump sum under the ("execute", <prefix>, "failed", "seconds") label, which was introduced in #7586.
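A minimal sketch of the proposed labelling, to make the lump-sum behaviour concrete. It is not the actual worker code: the function name `digest_task_metrics`, the `recompute` and `failed` parameters, and the shape of the `granular` mapping are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def digest_task_metrics(
    prefix: str,
    granular: dict[tuple[str, str], float],
    *,
    failed: bool = False,
    recompute: bool = False,
) -> dict[tuple[str, str, str, str], float]:
    """Return the metrics to log for one finished task (hypothetical helper).

    ``granular`` maps (activity, unit) -> value, e.g. ("thread-cpu", "seconds").
    ``recompute`` stands in for the proposed boolean flag on the Compute message.
    """
    if failed or recompute:
        # Collapse everything measured in seconds into a single lump sum
        # and drop the granular breakdown.
        outcome = "failed" if failed else "recompute"
        total = sum(v for (_, unit), v in granular.items() if unit == "seconds")
        return {("execute", prefix, outcome, "seconds"): total}

    # Normal successful run: keep the granular metrics as they are.
    return {
        ("execute", prefix, activity, unit): value
        for (activity, unit), value in granular.items()
    }


# Example: a recomputed task reports only its total time.
example = {
    ("thread-cpu", "seconds"): 1.5,
    ("disk-read", "seconds"): 0.5,
    ("disk-read", "bytes"): 1024.0,
}
lumped = digest_task_metrics("inc", example, recompute=True)
```

In this sketch a failed task takes precedence over a recomputed one, on the assumption that a recomputation that fails is still best counted as a failure; the issue itself does not specify this.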