Reduce size of scheduler_info #9045
Conversation
    self.task_groups = {}
    self.task_prefixes = {}
    self.task_metadata = {}
    self.total_memory = 0
It's almost overkill to maintain additional state for this, but if we have to call identity often I want it to be constant time.
Also, I'm surprised we are not tracking this already. Seems like a useful metric for adaptive scaling or something like that 🤷
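For illustration, a minimal sketch of the constant-time bookkeeping idea being discussed, assuming the running total is updated on worker add/remove rather than recomputed inside `identity` (class and method names are simplified, not the actual Scheduler internals):

```python
# Illustrative sketch only, not the real Scheduler code: keep a running
# total so identity() never has to sum over all workers.
class SchedulerStateSketch:
    def __init__(self) -> None:
        self.workers: dict[str, int] = {}  # address -> memory_limit
        self.total_memory = 0

    def add_worker(self, address: str, memory_limit: int) -> None:
        self.workers[address] = memory_limit
        self.total_memory += memory_limit  # O(1) update on join

    def remove_worker(self, address: str) -> None:
        self.total_memory -= self.workers.pop(address)  # O(1) update on leave
```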
      )

    - def identity(self) -> dict[str, Any]:
    + def identity(self, n_workers: int = -1) -> dict[str, Any]:
The default of -1 is to maintain backwards compatibility (no idea who is calling this in the wild). All callers in this codebase set it explicitly.
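To make the backwards-compatibility point concrete, a hedged sketch of how the argument could be interpreted: `-1` keeps the old "return every worker" behaviour, any other value truncates the mapping. The function below is illustrative, not the code in this PR:

```python
from itertools import islice
from typing import Any


def identity_sketch(workers: dict[str, Any], n_workers: int = -1) -> dict[str, Any]:
    # -1 preserves the old behaviour: include every worker.
    if n_workers == -1:
        selected = dict(workers)
    else:
        selected = dict(islice(workers.items(), max(n_workers, 0)))
    # Report the full count separately so callers can tell the list is truncated.
    return {"type": "Scheduler", "n_workers": len(workers), "workers": selected}
```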
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- distributed/widgets/templates/scheduler_info.html.j2: Language not supported
Comments suppressed due to low confidence (1)
distributed/client.py:4412
- [nitpick] There are varying conventions for n_workers across client calls (e.g. 0 in dashboard_link, 5 in scheduler_info, -1 for full info). Consider clarifying/documenting the intent behind each value to avoid confusion.
def scheduler_info(self, n_workers: int = 5, **kwargs: Any) -> SchedulerInfo:
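To spell out the conventions Copilot is pointing at, a hedged summary; the constant names below are made up for clarity and do not exist in the codebase:

```python
# Hypothetical constants summarising the n_workers conventions in this PR.
ALL_WORKERS = -1      # full info, the backwards-compatible identity() default
NO_WORKERS = 0        # worker details not needed, e.g. dashboard_link
PREVIEW_WORKERS = 5   # truncated preview, the scheduler_info()/HTML repr default
```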
          return self.cluster.dashboard_link
      except AttributeError:
    -     scheduler, info = self._get_scheduler_info()
    +     scheduler, info = self._get_scheduler_info(n_workers=0)
Copilot AI · Apr 16, 2025
Using n_workers=0 in this context results in an empty workers list. If the intention is to provide a default view for diagnostics, consider using a value such as -1 (to fetch all) or the same default limit (e.g. 5) used elsewhere.
Suggested change:

    - scheduler, info = self._get_scheduler_info(n_workers=0)
    + scheduler, info = self._get_scheduler_info(n_workers=-1)
that's intentional since the workers info is not even accessed here
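A hedged illustration of that point: building the dashboard link only touches scheduler-level fields, so requesting zero workers loses nothing. The key layout below follows the usual scheduler identity shape but is an assumption here:

```python
def dashboard_link_sketch(info: dict) -> str:
    # Only scheduler-level fields are read; the per-worker entries are never used.
    host = info["address"].split("://")[-1].split(":")[0]
    port = info["services"]["dashboard"]
    return f"http://{host}:{port}/status"
```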
Force-pushed from b0e37f3 to 665431f
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0 · 27 suites ±0 · 11h 17m 19s ⏱️ +1m 9s

For more details on these failures and errors, see this check.

Results for commit 41ce938. ± Comparison against base commit 16aa189.

♻️ This comment has been updated with latest results.
I'll move forward with merging. Test failures appear to be unrelated, and I think the benefits of this outweigh possible minor UX concerns. I'm happy to follow up on this if there is feedback, but I'd like to get the functional fix into the release today.
For the record, this broke my automated alert that would notify me when I didn't have all of my hundreds of expected workers connected to the scheduler. If there is a better method of doing this, I'm happy to hear it, but this method seemed to be exactly what I needed.
There are now new fields.

Alternatively, you can also query this yourself with `client.scheduler_info(n_workers=-1)`.
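A minimal sketch of rebuilding such an alert on top of that option, assuming the full worker list is fetched explicitly; the scheduler address and expected count are placeholders:

```python
from distributed import Client

EXPECTED_WORKERS = 400                   # placeholder
client = Client("tcp://scheduler:8786")  # placeholder address

# n_workers=-1 restores the pre-#9045 behaviour of returning every worker.
info = client.scheduler_info(n_workers=-1)
connected = len(info["workers"])
if connected < EXPECTED_WORKERS:
    print(f"ALERT: only {connected}/{EXPECTED_WORKERS} workers connected")
```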
This changed in dask/distributed#9045. We always want everything.
The fact that a workaround exists doesn't change the fact that this was a backwards incompatible change that broke a bunch of things. We can already see three of them listed in this PR, and I have encountered another issue caused by this change that means I have to pin distributed to the latest version without this change. The proper implementation for this change would have been to set the default value of `n_workers` to `-1`.

At the very least this should have been mentioned as a breaking change in the changelog.
For no good reason other than making Jupyter pretty printing nicer, dask/distributed#9045 broke the API of `client.scheduler_info()` by changing the default number of workers that are retrieved to `5`. We use this information in various places to ensure the correct number of workers have connected to the scheduler, and thus this change now specifies `n_workers=-1` to get all workers instead of just the first `5`. This happened not to be seen in CI previously because we don't run multi-GPU tests, let alone with more than `5` workers. Running various tests from `test_dask_cuda_worker.py` on a system with 8 GPUs will cause them to timeout while waiting for the workers to connect.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Jacob Tomlinson (https://github.com/jacobtomlinson)

URL: #1514
This fixes hangs when there are more than 5 workers in the dask cluster (as a consequence of dask/distributed#9045; previously that returned all the worker addresses in 'workers', now it's limited to the top 5 by default).

Authors:
- Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)

URL: #5147
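A hedged sketch of the downstream fix pattern those commit messages describe: poll `scheduler_info(n_workers=-1)` until the expected number of workers has connected, so the check is no longer capped at the first 5 entries (timeout and count handling are illustrative):

```python
import time

from distributed import Client


def wait_for_all_workers(client: Client, expected: int, timeout: float = 60.0) -> None:
    # Explicitly request all workers so the length check sees the full cluster.
    deadline = time.monotonic() + timeout
    while len(client.scheduler_info(n_workers=-1)["workers"]) < expected:
        if time.monotonic() > deadline:
            raise TimeoutError(f"fewer than {expected} workers connected")
        time.sleep(0.5)
```

In practice `Client.wait_for_workers` may be the simpler tool for this; the loop above just mirrors the scheduler_info-based checks mentioned in the commits.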
See #9043

The scheduler_info is used for this view.

I am introducing a limit to the number of workers being returned. I cannot imagine that somebody is using this for actual diagnostics (but it's cute and we should leave it), and I just randomly hard-coded this to 5. With this change the total number in the upper right corner is still the full count and indicates that the view below is only a subset. I'm open to suggestions for making this prettier, but that has only low priority.

cc @jacobtomlinson I think you introduced this HTML view and might have an opinion about details or the hard-coded number, etc.
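A hedged sketch of what the description amounts to for the widget: only a capped slice of the workers mapping is rendered as rows, while the headline count still reflects the full total. Field and function names are illustrative, not the actual template code:

```python
from itertools import islice


def summary_sketch(info: dict, limit: int = 5) -> dict:
    workers = info["workers"]
    return {
        # Shown in the upper right corner: the real total, not the truncated list.
        "total_workers": info.get("n_workers", len(workers)),
        # Only a handful of rows are rendered below the headline.
        "shown_workers": dict(islice(workers.items(), limit)),
    }
```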