Description
When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks.
This means a linear search over the set of idle workers, which could, at worst, contain all workers.
On really large clusters (thousands to tens of thousands of workers), this could get expensive, because `decide_worker` is called for every task.
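Concretely, the current selection amounts to something like this (a minimal sketch; `idle_workers` and `ws.processing` mirror the scheduler's attributes, but this is not the exact implementation):

```python
def worker_for_queued_task(idle_workers):
    # O(len(idle_workers)) scan, executed once per task via decide_worker,
    # so with t tasks and n idle workers this is O(t * n) overall.
    return min(idle_workers, key=lambda ws: len(ws.processing))
```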
Previously, `idle` was a `SortedSet`, and when there were >20 workers (arbitrary cutoff), we'd just pick one round-robin style, which cost O(log n):
`distributed/scheduler.py`, line 2204 at commit `02b9430`
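For context, that round-robin pick over a sorted structure can be sketched like so (a rough illustration, not the actual code from that line; positional indexing into a `sortedcontainers.SortedSet` is O(log n), vs. the O(n) `min()` scan):

```python
from itertools import count

from sortedcontainers import SortedSet

# Hypothetical idle set, sorted by worker address.
idle = SortedSet(["tcp://a:8786", "tcp://b:8786", "tcp://c:8786"])
_counter = count()

def pick_round_robin():
    # O(log n) positional lookup; the counter rotates through workers.
    return idle[next(_counter) % len(idle)]
```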
We don't have data showing performance is a problem here, but we also haven't benchmarked a really large cluster yet. I don't want to prematurely optimize, but given that we intend to turn queuing on by default #7213, it would also be bad if the default were slow for large-cluster users.
Do we want to preemptively change this logic? Options I could imagine (simple -> complex):

- Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than `decide_worker` that we should improve first. Plus, `idle` will only be large at the beginning and end of the computation; most of the time it should be quite small.
- If `len(idle)` > some arbitrary cutoff (maybe 20 again), just pick `next(iter(self.idle))`. (I'd like to make `idle` no longer sorted, since sorting is rather expensive and we're only sorting by name, not something useful: Remove sortedcontainers #7245.) We could do something simple with CPython set iteration order (or use a `dict[WorkerState, None]`) to make this properly round-robin; see the first sketch after this list.
- Maintain a structure binning idle workers by the number of tasks processing. This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(log n). Taking a bucket-sort approach by assuming thread counts are small seems smarter.) The second sketch after this list illustrates the idea.
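For the round-robin option, here's a minimal sketch of the `dict[WorkerState, None]` trick (the `RoundRobinIdle` name and interface are hypothetical, just to illustrate the mechanism):

```python
class RoundRobinIdle:
    """Round-robin over idle workers using dict insertion order.

    CPython dicts preserve insertion order, so popping the oldest
    entry and re-inserting it gives O(1) round-robin with no sorting.
    """

    def __init__(self):
        self._idle: dict = {}  # used as an insertion-ordered set

    def add(self, ws) -> None:
        self._idle[ws] = None

    def discard(self, ws) -> None:
        self._idle.pop(ws, None)

    def pick(self):
        if not self._idle:
            return None
        # Take the least-recently-picked worker and move it to the back.
        ws = next(iter(self._idle))
        del self._idle[ws]
        self._idle[ws] = None
        return ws
```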
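And for the binning option, a sketch of the bucket structure (again hypothetical names; assumes occupancy counts are bounded by a small maximum thread count, so bucket indices stay small):

```python
class OccupancyBuckets:
    """Bin idle workers by how many tasks they're processing.

    Bucket i holds workers with i tasks processing. Moving a worker
    between adjacent buckets when a task is added/removed is O(1);
    finding the least-occupied worker costs at most a scan over empty
    buckets, bounded by the max thread count (assumed small).
    """

    def __init__(self, max_threads: int = 4096):
        self._buckets: list[set] = [set() for _ in range(max_threads + 1)]
        self._lowest = max_threads + 1  # index of lowest non-empty bucket

    def add(self, ws, n_processing: int) -> None:
        self._buckets[n_processing].add(ws)
        if n_processing < self._lowest:
            self._lowest = n_processing

    def remove(self, ws, n_processing: int) -> None:
        self._buckets[n_processing].discard(ws)

    def task_added(self, ws, old: int) -> None:
        # ws picked up a task: move it one bucket up, O(1).
        self._buckets[old].discard(ws)
        self.add(ws, old + 1)

    def least_occupied(self):
        # Skip empty buckets; the scan length is bounded by the
        # (assumed small) maximum thread count.
        while self._lowest < len(self._buckets) and not self._buckets[self._lowest]:
            self._lowest += 1
        if self._lowest == len(self._buckets):
            return None
        return next(iter(self._buckets[self._lowest]))
```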