Description
test_wait_first_completed is failing in #7191, with the worker-saturation value set to 1.1:
distributed/distributed/tests/test_client.py
Lines 732 to 746 in 0983731
It works fine with 1.0, but fails with 1.1 because the round-up logic from #7116 allows workers to be oversaturated.
It blocks forever because the worker with 1 thread gets assigned [block_on_event, inc], and the worker with 2 threads gets assigned [block_on_event]. It should be the other way around.
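To make the round-up concrete, here is a minimal sketch. It assumes the per-worker limit is the ceiling of worker-saturation times the thread count, which is my reading of #7116. task_slots is a hypothetical helper for illustration, not the scheduler's actual function.

```python
import math


def task_slots(nthreads: int, saturation: float) -> int:
    """Hypothetical helper: how many tasks a worker may hold at once.

    Assumes the limit is ceil(saturation * nthreads), with a floor of 1.
    """
    return max(math.ceil(saturation * nthreads), 1)


# Compare the two saturation values on the 1-thread and 2-thread workers.
for saturation in (1.0, 1.1):
    slots = {nthreads: task_slots(nthreads, saturation) for nthreads in (1, 2)}
    print(saturation, slots)
```

Under that assumption, 1.0 gives each worker exactly as many slots as it has threads, while 1.1 gives the 1-thread worker 2 slots, which is what lets inc end up stuck behind block_on_event on it.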
The culprit has something to do with the round-robin logic, which only applies in rare situations like this one, where the cluster is small but still larger than the TaskGroup being assigned:
distributed/distributed/scheduler.py
Lines 2210 to 2236 in 0983731
If I update is_rootish like so:
diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index cf240240..802df12d 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3043,6 +3043,8 @@ class SchedulerState:
         """
         if ts.resource_restrictions or ts.worker_restrictions or ts.host_restrictions:
             return False
+        if not ts.dependencies:
+            return True
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (
the test passes.
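The effect of the short-circuit can be sketched in isolation. This is only an illustration: FakeTask is a hypothetical stand-in for the scheduler's TaskState, and because the diff cuts off at the return expression, the remaining group-size heuristic is passed in as a hypothetical callback rather than reproduced.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeTask:
    """Hypothetical stand-in for TaskState; only the fields the check reads."""

    dependencies: set = field(default_factory=set)
    resource_restrictions: dict = field(default_factory=dict)
    worker_restrictions: set = field(default_factory=set)
    host_restrictions: set = field(default_factory=set)


def is_rootish(ts: FakeTask, group_heuristic: Callable[[FakeTask], bool]) -> bool:
    # Restricted tasks are never root-ish (unchanged by the patch).
    if ts.resource_restrictions or ts.worker_restrictions or ts.host_restrictions:
        return False
    # The patched short-circuit: a task with no dependencies counts as root-ish.
    if not ts.dependencies:
        return True
    # Otherwise fall through to the group-based check (elided in the diff).
    return group_heuristic(ts)


def never_rootish(ts: FakeTask) -> bool:
    """Placeholder for the elided heuristic; always answers False."""
    return False


print(is_rootish(FakeTask(), never_rootish))
print(is_rootish(FakeTask(dependencies={"x"}), never_rootish))
print(is_rootish(FakeTask(worker_restrictions={"a"}), never_rootish))
```

With the patch, block_on_event and inc (neither has dependencies) would both be treated as root-ish regardless of how small their TaskGroups are, so they would skip the non-root-ish path where the round-robin logic lives.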