Description
We made an early attempt at root-task withholding to address the problem of root-task overproduction. Below are a few links with additional information (non-exhaustive):
- Ease memory pressure by deprioritizing root tasks? #6360
- Workers run twice as many root tasks as they should, causing memory pressure #5223
- Distributed scheduler does not obey dask.order.order for num_workers=1, num_threads=1 #5555
We started experimenting with withholding worker assignment for root tasks, i.e. delaying worker assignment on the scheduler side; see #6560.
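To illustrate the idea, here is a minimal toy sketch of scheduler-side withholding. It is not the code from #6560 or #6614; the `Worker` and `ToyScheduler` classes and all of their methods are hypothetical names invented for this example. The assumption is that root tasks arrive in priority order and that a worker should only hold as many root tasks as it has threads, with the rest staying queued on the scheduler.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a worker; not a distributed class."""
    name: str
    nthreads: int
    processing: set = field(default_factory=set)

    def has_open_slot(self) -> bool:
        # A worker is saturated once it holds one task per thread.
        return len(self.processing) < self.nthreads


class ToyScheduler:
    """Illustration only: withholds root tasks until a thread slot opens."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.queued = deque()  # root tasks withheld on the scheduler side

    def submit_root_tasks(self, keys):
        # Assumption: keys are already sorted by priority.
        self.queued.extend(keys)
        self._fill_open_slots()

    def task_finished(self, worker, key):
        worker.processing.discard(key)
        self._fill_open_slots()

    def _fill_open_slots(self):
        # Assign queued tasks only to workers with a free thread.
        for worker in self.workers:
            while self.queued and worker.has_open_slot():
                worker.processing.add(self.queued.popleft())


workers = [Worker("a", nthreads=2), Worker("b", nthreads=2)]
scheduler = ToyScheduler(workers)
scheduler.submit_root_tasks([f"root-{i}" for i in range(10)])

assigned = sum(len(w.processing) for w in workers)
print(assigned, len(scheduler.queued))  # 4 assigned, 6 still withheld
```

In this sketch, no worker is ever handed more root tasks than it can run at once, so it cannot produce root-task output far ahead of what downstream tasks consume. The trade-off is that the scheduler decides placement late, which is why this approach does not co-assign related root tasks to the same worker.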
Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at #6614 and should be ready for curious users to try.
The current co-assignment logic has some significant shortcomings (e.g. #6597), and withholding root tasks appears to be sufficient to control our memory footprint (some experimentation with configuration is still required). We should therefore get the root-task withholding logic into a production-ready, i.e. mergeable, state and remove the current co-assignment logic.
This should be verified by thorough performance benchmarks; see coiled/benchmarks#191 for work on automated benchmarks.
Once this is solid, we may consider adding more robust co-assignment logic in a follow-up step, if necessary.
AC
- The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
- The feature toggle is disabled by default
- There is a CI job with an experimental flag, running on Ubuntu with a single Python version, that has this feature toggle enabled. All failing tests are specifically marked and may be skipped on this job.
- A follow-up ticket with an overview of all skipped tests is created
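The criteria above could look roughly like the following sketch. It assumes the toggle is read from an environment variable; `EXAMPLE_ROOT_TASK_WITHHOLDING`, `withholding_enabled` and `fails_with_withholding` are hypothetical names made up for this example, since the actual toggle name and mechanism are not specified here. It uses the standard library `unittest` only to stay self-contained; the same pattern applies to other test runners.

```python
import os
import unittest

# Hypothetical environment variable; the real toggle name is not defined here.
TOGGLE_ENV_VAR = "EXAMPLE_ROOT_TASK_WITHHOLDING"


def withholding_enabled(environ=None) -> bool:
    """Return True only when the toggle is explicitly switched on.

    A missing variable means disabled, so the feature is off by default.
    """
    environ = os.environ if environ is None else environ
    return environ.get(TOGGLE_ENV_VAR, "0").lower() in {"1", "true", "yes"}


def fails_with_withholding(reason):
    """Mark a test as a known failure that is skipped when the toggle is on."""
    return unittest.skipIf(
        withholding_enabled(),
        f"known failure with root-task withholding: {reason}",
    )


class ExampleTests(unittest.TestCase):
    @fails_with_withholding("relies on the current co-assignment behaviour")
    def test_co_assignment(self):
        # Placeholder body; runs normally when the toggle is disabled.
        self.assertTrue(True)
```

Giving every skip an explicit reason string makes it straightforward to collect the skipped tests into the follow-up ticket.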