@fjetter fjetter commented Apr 16, 2025

See #9043

The scheduler_info payload is used for this view:

[screenshot: scheduler info HTML repr listing all workers]

I am introducing a limit to the number of workers returned.

I cannot imagine that somebody is using this for actual diagnostics (but it's cute and we should leave it), so I hard-coded the limit to 5, somewhat arbitrarily. With this change the total numbers in the upper right corner still reflect the full cluster and indicate that the view below is only a subset. I'm open to suggestions for making this prettier, but that has only low priority.

[screenshot: HTML view showing a subset of workers, with cluster totals in the upper right]

cc @jacobtomlinson, I think you introduced this HTML view and might have an opinion about the details, the hard-coded number, etc.
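
For illustration, a minimal sketch of what this looks like from the client side, assuming a running cluster and the hard-coded limit of 5 discussed above:

```python
from distributed import Client

client = Client()  # or Client("tcp://scheduler:8786") for an existing cluster

# The default call now returns a truncated "workers" mapping, while the
# aggregate counters still describe the whole cluster.
info = client.scheduler_info()
print(len(info["workers"]))      # at most 5 with the hard-coded default

# n_workers=-1 restores the previous behaviour and lists every worker.
full = client.scheduler_info(n_workers=-1)
print(len(full["workers"]))      # all connected workers
```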

self.task_groups = {}
self.task_prefixes = {}
self.task_metadata = {}
self.total_memory = 0
fjetter (Member, Author):

It's almost overkill to maintain additional state for this, but if identity is called often I want it to be constant time.

fjetter (Member, Author):

Also, I'm surprised we are not already tracking this. It seems like a useful metric for adaptive scaling or something like that 🤷
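
A rough sketch of the kind of bookkeeping this refers to, so identity() can report cluster totals in constant time; the class and method names below are illustrative, not the actual scheduler code:

```python
class AggregateWorkerState:
    """Hypothetical sketch: keep running totals updated on worker churn."""

    def __init__(self) -> None:
        self.total_memory = 0
        self.total_threads = 0
        self.n_workers = 0

    def add_worker(self, memory_limit: int, nthreads: int) -> None:
        # O(1) update when a worker joins
        self.total_memory += memory_limit
        self.total_threads += nthreads
        self.n_workers += 1

    def remove_worker(self, memory_limit: int, nthreads: int) -> None:
        # O(1) update when a worker leaves
        self.total_memory -= memory_limit
        self.total_threads -= nthreads
        self.n_workers -= 1
```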

)

def identity(self) -> dict[str, Any]:
def identity(self, n_workers: int = -1) -> dict[str, Any]:
fjetter (Member, Author):

The default of -1 maintains backwards compatibility (no idea who is calling this in the wild); all callers in this code base set it explicitly.
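
Roughly what the parameter implies, as an illustrative helper rather than the actual implementation: -1 keeps the old behaviour, while any non-negative value truncates the serialized workers mapping.

```python
from itertools import islice
from typing import Any


def limit_workers(workers: dict[str, Any], n_workers: int = -1) -> dict[str, Any]:
    """Return all workers for n_workers=-1, else only the first n_workers."""
    if n_workers < 0:
        return dict(workers)
    return dict(islice(workers.items(), n_workers))
```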

Copilot AI left a comment:

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • distributed/widgets/templates/scheduler_info.html.j2: Language not supported
Comments suppressed due to low confidence (1)

distributed/client.py:4412

  • [nitpick] There are varying conventions for n_workers across client calls (e.g. 0 in dashboard_link, 5 in scheduler_info, -1 for full info). Consider clarifying/documenting the intent behind each value to avoid confusion.
def scheduler_info(self, n_workers: int = 5, **kwargs: Any) -> SchedulerInfo:

return self.cluster.dashboard_link
except AttributeError:
scheduler, info = self._get_scheduler_info()
scheduler, info = self._get_scheduler_info(n_workers=0)
Copilot AI commented Apr 16, 2025:

Using n_workers=0 in this context results in an empty workers list. If the intention is to provide a default view for diagnostics, consider using a value such as -1 (to fetch all) or the same default limit (e.g. 5) used elsewhere.

Suggested change
scheduler, info = self._get_scheduler_info(n_workers=0)
scheduler, info = self._get_scheduler_info(n_workers=-1)

fjetter (Member, Author):

That's intentional; the workers info is not even accessed here.
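
To spell out why no workers are needed here, a hedged sketch of what a dashboard-link helper consumes; the function name and dictionary keys are assumptions based on the shape of the scheduler_info payload, not the exact client code:

```python
def dashboard_link_from_info(scheduler_address: str, info: dict) -> str:
    # Only scheduler-level fields are read; info["workers"] is never touched,
    # which is why requesting zero workers keeps the payload minimal.
    port = info["services"]["dashboard"]
    host = scheduler_address.split("://")[-1].split(":")[0]
    return f"http://{host}:{port}/status"
```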

@fjetter fjetter force-pushed the reduce_size_scheduler_info branch from b0e37f3 to 665431f on April 16, 2025 12:01

github-actions bot commented Apr 16, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0       27 suites ±0      11h 17m 19s ⏱️ +1m 9s
 4 100 tests ±0    3 980 ✅ −2       112 💤 ±0    7 ❌ +1    1 🔥 +1
51 401 runs +1    49 085 ✅ ±0    2 308 💤 −1    7 ❌ +1    1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 41ce938. ± Comparison against base commit 16aa189.

♻️ This comment has been updated with latest results.

fjetter commented Apr 16, 2025

I'll move forward with merging. The test failures appear to be unrelated, and I think the benefits of this outweigh possible minor UX concerns. I'm happy to follow up if there is feedback, but I'd like to get the functional fix into today's release.

@fjetter fjetter merged commit 48fcf48 into dask:main Apr 16, 2025
25 of 33 checks passed

rcheatham-q commented Apr 28, 2025

For the record, this broke my automated alert that would notify me when I didn't have all of my hundreds of expected workers connected to the scheduler. If there is a better method of doing this, I'm happy to hear it, but this method seemed to be exactly what I needed.

fjetter commented Apr 29, 2025

> For the record, this broke my automated alert that would notify me when I didn't have all of my hundreds of expected workers connected to the scheduler

There are now new fields:

  • "n_workers"
  • "total_threads"
  • "total_memory"

Alternatively, you can also query this yourself with client.scheduler_info(n_workers=-1)
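
For example, an alert along these lines could lean on the aggregate fields instead of counting the (now truncated) workers listing; EXPECTED_WORKERS and the connected client are placeholders:

```python
EXPECTED_WORKERS = 300  # whatever the deployment is supposed to have

info = client.scheduler_info()  # aggregates are present even with the truncated listing
if info["n_workers"] < EXPECTED_WORKERS:
    raise RuntimeError(
        f"Only {info['n_workers']} of {EXPECTED_WORKERS} workers are connected"
    )

# Or fetch the full per-worker listing explicitly:
all_workers = client.scheduler_info(n_workers=-1)["workers"]
```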

@fjetter fjetter deleted the reduce_size_scheduler_info branch April 29, 2025 08:01
TomAugspurger added a commit to TomAugspurger/cuml that referenced this pull request May 5, 2025
This changed in dask/distributed#9045. We always
want everything.
@rcheatham-q commented:

> There are now new fields:
>
>   • "n_workers"
>   • "total_threads"
>   • "total_memory"
>
> Alternatively, you can also query this yourself with client.scheduler_info(n_workers=-1)

The existence of a workaround doesn't change the fact that this was a backwards-incompatible change that broke a number of things. We can already see three of them referenced in this PR, and I have hit another issue caused by this change that forces me to pin distributed to the last version without it.

The proper implementation would have been to keep the default value of n_workers at -1 and have the view specify the number of workers to pull.

At the very least, this should have been called out as a breaking change in the changelog.
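
In other words, the suggested shape would look roughly like this (a hypothetical sketch, not distributed's actual code): the public default stays at -1 and only the HTML repr opts into the truncation.

```python
from typing import Any


class ClientSketch:
    """Hypothetical sketch of the suggested API shape."""

    def _fetch_scheduler_info(self, n_workers: int) -> dict[str, Any]:
        # stand-in for the real RPC call to the scheduler
        return {"workers": {}, "n_workers": 0}

    def scheduler_info(self, n_workers: int = -1) -> dict[str, Any]:
        # Public default unchanged: callers get every worker unless they opt in.
        return self._fetch_scheduler_info(n_workers=n_workers)

    def _repr_html_(self) -> str:
        # Only the pretty-printed view asks for a truncated listing.
        info = self.scheduler_info(n_workers=5)
        return f"<p>{info['n_workers']} workers (showing up to 5)</p>"
```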

pentschev added a commit to pentschev/dask-cuda that referenced this pull request Jul 1, 2025
For no good reason other than making Jupyter pretty printing nicer,
dask/distributed#9045 broke the API of
`client.scheduler_info()` by changing the default number of workers that
are retrieved to `5`.

We use this information in various places to ensure the correct number
of workers have connected to the scheduler, so this change now
specifies `n_workers=-1` to get all workers instead of just the first
`5`. This went unnoticed in CI because we don't run multi-GPU tests
there, let alone with more than `5` workers. Running various tests from
`test_dask_cuda_worker.py` on a system with 8 GPUs causes them to time
out while waiting for the workers to connect.
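
The fix described above boils down to something like the following when a test waits for workers (a sketch with assumed names; `expected` is whatever count the test needs):

```python
import time


def wait_for_all_workers(client, expected: int, timeout: float = 60.0) -> None:
    """Poll the full worker listing until `expected` workers have connected."""
    deadline = time.monotonic() + timeout
    while len(client.scheduler_info(n_workers=-1)["workers"]) < expected:
        if time.monotonic() > deadline:
            raise TimeoutError(f"fewer than {expected} workers connected")
        time.sleep(0.1)
```

(In application code, distributed's own Client.wait_for_workers covers the common case; the loop above just mirrors the scheduler_info-based check the commit message describes.)
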
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jul 1, 2025
For no good reason other than making Jupyter pretty printing nicer, dask/distributed#9045 broke the API of `client.scheduler_info()` by changing the default number of workers that are retrieved to `5`.

We use this information in various places to ensure the correct number of workers have connected to the scheduler, and thus this change now specifies `n_workers=-1` to get all workers instead of just the first `5`. This happened not to be seen in CI previously because we don't run multi-GPU tests, let alone with more than `5` workers. Running various tests from `test_dask_cuda_worker.py` on a system with 8 GPUs will cause them to timeout while waiting for the workers to connect.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jacob Tomlinson (https://github.com/jacobtomlinson)

URL: #1514
TomAugspurger added a commit to TomAugspurger/cugraph that referenced this pull request Jul 7, 2025
This fixes hangs when there are more than 5 workers in the dask cluster
(as a consequence of dask/distributed#9045;
previously that returned all the worker addresses in 'workers', now it's
limited to the top 5 by default).
@uellue uellue mentioned this pull request Jul 9, 2025
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Jul 15, 2025
This fixes hangs when there are more than 5 workers in the dask cluster (as a consequence of dask/distributed#9045; previously that returned all the worker addresses in 'workers', now it's limited to the top 5 by default).

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #5147