@fjetter fjetter commented Apr 16, 2025

See #9043

The scheduler_info payload is used for this view:

[screenshot: scheduler info HTML repr listing all workers]

I am introducing a limit to the number of workers returned.

I cannot imagine that somebody is using this for actual diagnostics (but it's cute and we should leave it), so I hard-coded the limit to 5, somewhat arbitrarily. With this change the total numbers in the upper right corner still reflect the full cluster and indicate that the view below is only a subset. I'm open to suggestions for making this prettier, but that has only low priority.

[screenshot: HTML view showing a subset of workers, with cluster totals in the upper right]

cc @jacobtomlinson, I think you introduced this HTML view and might have an opinion about the details, the hard-coded number, etc.
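
For illustration, a minimal sketch of what this looks like from the client side, assuming a running cluster and the hard-coded limit of 5 discussed above:

```python
from distributed import Client

client = Client()  # or Client("tcp://scheduler:8786") for an existing cluster

# The default call now returns a truncated "workers" mapping, while the
# aggregate counters still describe the whole cluster.
info = client.scheduler_info()
print(len(info["workers"]))      # at most 5 with the hard-coded default

# n_workers=-1 restores the previous behaviour and lists every worker.
full = client.scheduler_info(n_workers=-1)
print(len(full["workers"]))      # all connected workers
```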

self.task_groups = {}
self.task_prefixes = {}
self.task_metadata = {}
self.total_memory = 0
fjetter (Member, Author):

It's almost overkill to maintain additional state for this, but if identity is called often I want it to be constant time.

fjetter (Member, Author):

Also, I'm surprised we are not already tracking this. It seems like a useful metric for adaptive scaling or something like that 🤷
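
A rough sketch of the kind of bookkeeping this refers to, so identity() can report cluster totals in constant time; the class and method names below are illustrative, not the actual scheduler code:

```python
class AggregateWorkerState:
    """Hypothetical sketch: keep running totals updated on worker churn."""

    def __init__(self) -> None:
        self.total_memory = 0
        self.total_threads = 0
        self.n_workers = 0

    def add_worker(self, memory_limit: int, nthreads: int) -> None:
        # O(1) update when a worker joins
        self.total_memory += memory_limit
        self.total_threads += nthreads
        self.n_workers += 1

    def remove_worker(self, memory_limit: int, nthreads: int) -> None:
        # O(1) update when a worker leaves
        self.total_memory -= memory_limit
        self.total_threads -= nthreads
        self.n_workers -= 1
```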

)

def identity(self) -> dict[str, Any]:
def identity(self, n_workers: int = -1) -> dict[str, Any]:
fjetter (Member, Author):

The default of -1 maintains backwards compatibility (no idea who is calling this in the wild); all callers in this code base set it explicitly.
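
Roughly what the parameter implies, as an illustrative helper rather than the actual implementation: -1 keeps the old behaviour, while any non-negative value truncates the serialized workers mapping.

```python
from itertools import islice
from typing import Any


def limit_workers(workers: dict[str, Any], n_workers: int = -1) -> dict[str, Any]:
    """Return all workers for n_workers=-1, else only the first n_workers."""
    if n_workers < 0:
        return dict(workers)
    return dict(islice(workers.items(), n_workers))
```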

Copilot AI left a comment:

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • distributed/widgets/templates/scheduler_info.html.j2: Language not supported
Comments suppressed due to low confidence (1)

distributed/client.py:4412

  • [nitpick] There are varying conventions for n_workers across client calls (e.g. 0 in dashboard_link, 5 in scheduler_info, -1 for full info). Consider clarifying/documenting the intent behind each value to avoid confusion.
def scheduler_info(self, n_workers: int = 5, **kwargs: Any) -> SchedulerInfo:

return self.cluster.dashboard_link
except AttributeError:
scheduler, info = self._get_scheduler_info()
scheduler, info = self._get_scheduler_info(n_workers=0)
Copilot AI commented Apr 16, 2025:

Using n_workers=0 in this context results in an empty workers list. If the intention is to provide a default view for diagnostics, consider using a value such as -1 (to fetch all) or the same default limit (e.g. 5) used elsewhere.

Suggested change
scheduler, info = self._get_scheduler_info(n_workers=0)
scheduler, info = self._get_scheduler_info(n_workers=-1)

fjetter (Member, Author):

That's intentional; the workers info is not even accessed here.
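
To spell out why no workers are needed here, a hedged sketch of what a dashboard-link helper consumes; the function name and dictionary keys are assumptions based on the shape of the scheduler_info payload, not the exact client code:

```python
def dashboard_link_from_info(scheduler_address: str, info: dict) -> str:
    # Only scheduler-level fields are read; info["workers"] is never touched,
    # which is why requesting zero workers keeps the payload minimal.
    port = info["services"]["dashboard"]
    host = scheduler_address.split("://")[-1].split(":")[0]
    return f"http://{host}:{port}/status"
```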

@fjetter fjetter force-pushed the reduce_size_scheduler_info branch from b0e37f3 to 665431f on April 16, 2025 12:01

github-actions bot commented Apr 16, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0       27 suites ±0      11h 17m 19s ⏱️ +1m 9s
 4 100 tests ±0    3 980 ✅ −2       112 💤 ±0    7 ❌ +1    1 🔥 +1
51 401 runs +1    49 085 ✅ ±0    2 308 💤 −1    7 ❌ +1    1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 41ce938. ± Comparison against base commit 16aa189.

♻️ This comment has been updated with latest results.

fjetter commented Apr 16, 2025

I'll move forward with merging. The test failures appear to be unrelated, and I think the benefits of this outweigh possible minor UX concerns. I'm happy to follow up if there is feedback, but I'd like to get the functional fix into today's release.

@fjetter fjetter merged commit 48fcf48 into dask:main Apr 16, 2025
25 of 33 checks passed

rcheatham-q commented Apr 28, 2025

For the record, this broke my automated alert that would notify me when I didn't have all of my hundreds of expected workers connected to the scheduler. If there is a better method of doing this, I'm happy to hear it, but this method seemed to be exactly what I needed.

fjetter commented Apr 29, 2025

> For the record, this broke my automated alert that would notify me when I didn't have all of my hundreds of expected workers connected to the scheduler

There are now new fields:

  • "n_workers"
  • "total_threads"
  • "total_memory"

Alternatively, you can also query this yourself with client.scheduler_info(n_workers=-1)
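
For example, an alert along these lines could lean on the aggregate fields instead of counting the (now truncated) workers listing; EXPECTED_WORKERS and the connected client are placeholders:

```python
EXPECTED_WORKERS = 300  # whatever the deployment is supposed to have

info = client.scheduler_info()  # aggregates are present even with the truncated listing
if info["n_workers"] < EXPECTED_WORKERS:
    raise RuntimeError(
        f"Only {info['n_workers']} of {EXPECTED_WORKERS} workers are connected"
    )

# Or fetch the full per-worker listing explicitly:
all_workers = client.scheduler_info(n_workers=-1)["workers"]
```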

@fjetter fjetter deleted the reduce_size_scheduler_info branch April 29, 2025 08:01
TomAugspurger added a commit to TomAugspurger/cuml that referenced this pull request May 5, 2025
This changed in dask/distributed#9045. We always
want everything.
@rcheatham-q commented:

> There are now new fields:
>
>   • "n_workers"
>   • "total_threads"
>   • "total_memory"
>
> Alternatively, you can also query this yourself with client.scheduler_info(n_workers=-1)

The existence of a workaround doesn't change the fact that this was a backwards-incompatible change that broke a number of things. We can already see three of them referenced in this PR, and I have hit another issue caused by this change that forces me to pin distributed to the last version without it.

The proper implementation would have been to keep the default value of n_workers at -1 and have the view specify the number of workers to pull.

At the very least, this should have been called out as a breaking change in the changelog.
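
In other words, the suggested shape would look roughly like this (a hypothetical sketch, not distributed's actual code): the public default stays at -1 and only the HTML repr opts into the truncation.

```python
from typing import Any


class ClientSketch:
    """Hypothetical sketch of the suggested API shape."""

    def _fetch_scheduler_info(self, n_workers: int) -> dict[str, Any]:
        # stand-in for the real RPC call to the scheduler
        return {"workers": {}, "n_workers": 0}

    def scheduler_info(self, n_workers: int = -1) -> dict[str, Any]:
        # Public default unchanged: callers get every worker unless they opt in.
        return self._fetch_scheduler_info(n_workers=n_workers)

    def _repr_html_(self) -> str:
        # Only the pretty-printed view asks for a truncated listing.
        info = self.scheduler_info(n_workers=5)
        return f"<p>{info['n_workers']} workers (showing up to 5)</p>"
```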

pentschev added a commit to pentschev/dask-cuda that referenced this pull request Jul 1, 2025
For no good reason other than making Jupyter pretty printing nicer,
dask/distributed#9045 broke the API of
`client.scheduler_info()` by changing the default number of workers that
are retrieved to `5`.

We use this information in various places to ensure the correct number
of workers have connected to the scheduler, so this change now
specifies `n_workers=-1` to get all workers instead of just the first
`5`. This went unnoticed in CI because we don't run multi-GPU tests
there, let alone with more than `5` workers. Running various tests from
`test_dask_cuda_worker.py` on a system with 8 GPUs causes them to time
out while waiting for the workers to connect.
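
The fix described above boils down to something like the following when a test waits for workers (a sketch with assumed names; `expected` is whatever count the test needs):

```python
import time


def wait_for_all_workers(client, expected: int, timeout: float = 60.0) -> None:
    """Poll the full worker listing until `expected` workers have connected."""
    deadline = time.monotonic() + timeout
    while len(client.scheduler_info(n_workers=-1)["workers"]) < expected:
        if time.monotonic() > deadline:
            raise TimeoutError(f"fewer than {expected} workers connected")
        time.sleep(0.1)
```

(In application code, distributed's own Client.wait_for_workers covers the common case; the loop above just mirrors the scheduler_info-based check the commit message describes.)
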
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jul 1, 2025
For no good reason other than making Jupyter pretty printing nicer, dask/distributed#9045 broke the API of `client.scheduler_info()` by changing the default number of workers that are retrieved to `5`.

We use this information in various places to ensure the correct number of workers have connected to the scheduler, and thus this change now specifies `n_workers=-1` to get all workers instead of just the first `5`. This happened not to be seen in CI previously because we don't run multi-GPU tests, let alone with more than `5` workers. Running various tests from `test_dask_cuda_worker.py` on a system with 8 GPUs will cause them to timeout while waiting for the workers to connect.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jacob Tomlinson (https://github.com/jacobtomlinson)

URL: #1514
TomAugspurger added a commit to TomAugspurger/cugraph that referenced this pull request Jul 7, 2025
This fixes hangs when there are more than 5 workers in the dask cluster
(as a consequence of dask/distributed#9045;
previously that returned all the worker addresses in 'workers', now it's
limited to the top 5 by default).
@uellue uellue mentioned this pull request Jul 9, 2025
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Jul 15, 2025
This fixes hangs when there are more than 5 workers in the dask cluster (as a consequence of dask/distributed#9045; previously that returned all the worker addresses in 'workers', now it's limited to the top 5 by default).

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #5147