Skip to content

decide_worker could be expensive on large clusters with queuing enabled #7246

Closed
@gjoseph92

Description

@gjoseph92

When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks.

This means a linear search over the set of idle workers, which could, at worst, contain all workers.

So on really large clusters (thousands/tens of thousands of workers), this could possibly get expensive, because decide_worker is called for every task.

Previously, idle was a SortedSet, and when there were >20 workers (arbitrary cutoff), we'd just pick one round-robin style, which cost O(logn):

ws = wp_vals[self.n_tasks % n_workers]

We don't have data showing performance is a problem here, but we also haven't benchmarked a really large cluster yet. I don't want to prematurely optimize, but given that we intend to turn queuing on by default #7213, it would also be bad if the default were slow for large-cluster users.

Do we want to preemptively change this logic? Options I could imagine (simple->complex):

  1. Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than decide_worker that we should improve first. Plus, idle will only be large at the beginning and end of the computation; most of the time it should be quite small.

  2. If len(idle) > some arbitrary cutoff (maybe 20 again), just pick next(iter(self.idle)). (I'd like to make idle no longer sorted since it's rather expensive and we're only sorting by name, not something useful Remove sortedcontainers #7245.)

    We could do something simple with CPython set iteration order (or use a dict[WorkerState, None]) to make this properly round-robin.

  3. Maintain a structure binning idle workers by the number of tasks processing. This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(logn). Taking a bucket-sort approach by assuming thread counts are small seems smarter.)

cc @fjetter @crusaderky

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions