Description
When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks.
This means a linear search over the set of idle workers, which could, at worst, contain all workers.
On really large clusters (thousands to tens of thousands of workers), this could get expensive, because `decide_worker` is called for every task.
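Concretely, the current selection amounts to something like this (a minimal sketch; `idle_workers` and `ws.processing` mirror the scheduler's attributes, but this is not the exact implementation):

```python
def worker_for_queued_task(idle_workers):
    # O(len(idle_workers)) scan, executed once per task via decide_worker,
    # so with t tasks and n idle workers this is O(t * n) overall.
    return min(idle_workers, key=lambda ws: len(ws.processing))
```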
Previously, `idle` was a `SortedSet`, and when there were >20 workers (arbitrary cutoff), we'd just pick one round-robin style, which cost O(log n):
`distributed/scheduler.py`, line 2204 at commit `02b9430`
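For context, that round-robin pick over a sorted structure can be sketched like so (a rough illustration, not the actual code from that line; positional indexing into a `sortedcontainers.SortedSet` is O(log n), vs. the O(n) `min()` scan):

```python
from itertools import count

from sortedcontainers import SortedSet

# Hypothetical idle set, sorted by worker address.
idle = SortedSet(["tcp://a:8786", "tcp://b:8786", "tcp://c:8786"])
_counter = count()

def pick_round_robin():
    # O(log n) positional lookup; the counter rotates through workers.
    return idle[next(_counter) % len(idle)]
```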
We don't have data showing performance is a problem here, but we also haven't benchmarked a really large cluster yet. I don't want to prematurely optimize, but given that we intend to turn queuing on by default #7213, it would also be bad if the default were slow for large-cluster users.
Do we want to preemptively change this logic? Options I could imagine (simple -> complex):

- Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than `decide_worker` that we should improve first. Plus, `idle` will only be large at the beginning and end of the computation; most of the time it should be quite small.
- If `len(idle)` > some arbitrary cutoff (maybe 20 again), just pick `next(iter(self.idle))`. (I'd like to make `idle` no longer sorted, since sorting is rather expensive and we're only sorting by name, not something useful: Remove sortedcontainers #7245.) We could do something simple with CPython set iteration order (or use a `dict[WorkerState, None]`) to make this properly round-robin; see the first sketch after this list.
- Maintain a structure binning idle workers by the number of tasks processing. This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(log n). Taking a bucket-sort approach by assuming thread counts are small seems smarter.) The second sketch after this list illustrates the idea.
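For the round-robin option, here's a minimal sketch of the `dict[WorkerState, None]` trick (the `RoundRobinIdle` name and interface are hypothetical, just to illustrate the mechanism):

```python
class RoundRobinIdle:
    """Round-robin over idle workers using dict insertion order.

    CPython dicts preserve insertion order, so popping the oldest
    entry and re-inserting it gives O(1) round-robin with no sorting.
    """

    def __init__(self):
        self._idle: dict = {}  # used as an insertion-ordered set

    def add(self, ws) -> None:
        self._idle[ws] = None

    def discard(self, ws) -> None:
        self._idle.pop(ws, None)

    def pick(self):
        if not self._idle:
            return None
        # Take the least-recently-picked worker and move it to the back.
        ws = next(iter(self._idle))
        del self._idle[ws]
        self._idle[ws] = None
        return ws
```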
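And for the binning option, a sketch of the bucket structure (again hypothetical names; assumes occupancy counts are bounded by a small maximum thread count, so bucket indices stay small):

```python
class OccupancyBuckets:
    """Bin idle workers by how many tasks they're processing.

    Bucket i holds workers with i tasks processing. Moving a worker
    between adjacent buckets when a task is added/removed is O(1);
    finding the least-occupied worker costs at most a scan over empty
    buckets, bounded by the max thread count (assumed small).
    """

    def __init__(self, max_threads: int = 4096):
        self._buckets: list[set] = [set() for _ in range(max_threads + 1)]
        self._lowest = max_threads + 1  # index of lowest non-empty bucket

    def add(self, ws, n_processing: int) -> None:
        self._buckets[n_processing].add(ws)
        if n_processing < self._lowest:
            self._lowest = n_processing

    def remove(self, ws, n_processing: int) -> None:
        self._buckets[n_processing].discard(ws)

    def task_added(self, ws, old: int) -> None:
        # ws picked up a task: move it one bucket up, O(1).
        self._buckets[old].discard(ws)
        self.add(ws, old + 1)

    def least_occupied(self):
        # Skip empty buckets; the scan length is bounded by the
        # (assumed small) maximum thread count.
        while self._lowest < len(self._buckets) and not self._buckets[self._lowest]:
            self._lowest += 1
        if self._lowest == len(self._buckets):
            return None
        return next(iter(self._buckets[self._lowest]))
```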