feat: list all registered schedulers (#1009) #1050
base: main
Conversation
Could you provide more context on what you want to achieve with this?
All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.
Hi @clumsy, thanks for the pointer. Could you describe your use-case in wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.
Sure, @kiukchung. Take NeMo for example: NVIDIA ships it with all dependencies, including nemo-run. Unfortunately, nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48. Thus we cannot use the built-in schedulers. It makes sense to have a feature to restrict the available schedulers, but does it have to be the default behavior?
@clumsy ah, that's an interesting edge-case. What you basically want is for the built-in schedulers to remain available alongside the ones NeMo-Run registers. We could do something like merging the built-ins with the entrypoint-registered schedulers. If you're open to it, you can add support for prefixes and make the change. There are some interesting cases regarding name conflicts and ordering (e.g. if nemo registers a scheduler with the same name as one registered somewhere else, what do you do?).
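To make the discussion concrete, here is a minimal Python sketch of the merge semantics being proposed; the function and type names below are illustrative assumptions, not the actual torchx API:

from typing import Callable, Dict

SchedulerFactory = Callable[..., object]  # stand-in for the real torchx factory type

def merge_scheduler_factories(
    builtins: Dict[str, SchedulerFactory],
    registered: Dict[str, SchedulerFactory],
) -> Dict[str, SchedulerFactory]:
    # Dict unpacking keeps insertion order; a key in `registered` shadows a
    # same-named key in `builtins`, which is exactly the name-conflict and
    # ordering question raised above.
    return {**builtins, **registered}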
Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.
Please check this implementation, @kiukchung, @andywag, @d4l3k, @tonykao8080. The existing commands continue to work as before.
@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The failure seems unrelated to the PR.
Hi @kiukchung, should I do a bogus push to trigger a new test run? The last failure seems to have been unrelated.
Hi @kiukchung, does this PR need more work, or is there anything I can do to help integrate the change?
Sorry for bugging, but could you please let me know if I can help get this one merged, @kiukchung, @d4l3k, @tonykao8080, @andywag? Thanks!
Force-pushed 57ec248 to 8f0c6b0
I just realized this was never merged. Does it need more work, @kiukchung? |
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##              main    #1050      +/-  ##
==========================================
- Coverage    91.66%   91.66%   -0.01%
==========================================
  Files           85       85
  Lines         6623     6620        -3
==========================================
- Hits          6071     6068        -3
  Misses         552      552
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed 8f0c6b0 to 36e77f7
Hm, not sure why that happened; I wonder if it's related to the new permissions. I rebased on top of upstream main and pushed to my fork like I always do :/
Is there anything I can do here, @kiukchung?
Please have a look when you get a chance, @kiukchung.
@clumsy sorry I didn't get to this sooner! Will take a closer look tomorrow morning. Just want to make sure everything is BC! |
Appreciate your help, @kiukchung!
Force-pushed cfa412b to 05924eb
Fixed the BC issues in the new revision and added more unit tests, @kiukchung.
@kiukchung has imported this pull request. If you are a Meta employee, you can view this in D74743585.
tonykao8080 left a comment:
Review automatically exported from Phabricator review in Meta.
Please let me know what you think about the latest revision, @kiukchung |
Force-pushed 05924eb to 2d45b47
Could you please import the recent revision, @kiukchung, @tonykao8080?
Will take a closer look at this tomorrow when I'm back from PTO.
This fails internal tests because of the non-BC issue we've discussed. There are call-sites where users iterate through the loaded scheduler factories and actually create scheduler instances. This PR changes which schedulers end up being loaded and instantiated at those call-sites.
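For context, a hedged sketch of the kind of call-site described above; the import path, the factory signature, and the function names are assumptions based on this thread, not a verified excerpt of the internal code:

def instantiate_all_schedulers(session_name: str) -> dict:
    # Hypothetical call-site: some callers iterate the loaded factory map and
    # eagerly create every scheduler instance. If the set of loaded factories
    # changes, or a newly-registered factory fails on construction, such
    # call-sites break, hence the BC concern.
    from torchx.schedulers import get_scheduler_factories  # assumed import path

    schedulers = {}
    for name, factory in get_scheduler_factories().items():
        schedulers[name] = factory(session_name)
    return schedulers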
Force-pushed 2d45b47 to d99cb1e
Understood, @kiukchung. Made sure we preserve the existing behavior when the torchx.schedulers entrypoint group is provided. If not provided, we use the defaults: built-in (as before, but renamed as you've suggested) + extra (explicitly registered under the new torchx.schedulers.extra entrypoint, allowing the defaults to be overridden). Libraries that already extend TorchX, e.g. NeMo-Run, can now transition to the new mechanics.
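A sketch of the fallback behavior described above, using the entry-point group names from this thread; the helper names and structure are assumptions, not the PR's actual code:

from importlib.metadata import entry_points  # Python 3.10+ API

# Stand-in for the map of built-in scheduler factories shipped with torchx.
BUILTIN_SCHEDULER_FACTORIES: dict = {}

def load_scheduler_factories() -> dict:
    # If a `torchx.schedulers` entry-point group is defined, honor it as-is,
    # preserving the existing behavior.
    explicit = {ep.name: ep.load() for ep in entry_points(group="torchx.schedulers")}
    if explicit:
        return explicit
    # Otherwise fall back to the defaults: built-ins plus anything registered
    # under the new `torchx.schedulers.extra` group, where extras may override
    # same-named built-ins.
    extras = {ep.name: ep.load() for ep in entry_points(group="torchx.schedulers.extra")}
    return {**BUILTIN_SCHEDULER_FACTORIES, **extras}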
Force-pushed 68b6131 to 5f82b9b
Does it look good now, @kiukchung?
Thanks for checking, @tonykao8080. Looks like I have approvals and permissions to merge this one; any objections, @kiukchung?
taking a look |
kiukchung left a comment:
See my comment. This change may not be necessary to achieve what you want.
def default_schedulers() -> dict[str, SchedulerFactory]:
    """Build default schedulers (built-in + extras from torchx.schedulers.extra)."""
    return {
        **{s: _defer_load_scheduler(p) for s, p in BUILTIN_SCHEDULER_MODULES.items()},
        **load_group("torchx.schedulers.extra", default={}),
    }
I took a closer look at this and realized that we don't need this change to get what you want.
Suppose what you want is NeMo's [torchx.schedulers] as well as other ones that you use. Python entrypoint groups are compounding (but the keys in the groups are not). So as long as you don't have a conflict in the scheduler names, you can define your own entrypoint group as:
[torchx.schedulers]
local_cwd = torchx.schedulers.local_scheduler_fb:create_cwd_scheduler
aws_batch = torchx.schedulers.aws_batch_scheduler:create_scheduler
... others you want ...
and NeMo defines their set as:
[torchx.schedulers]
nemo_sched_1 = ...
nemo_sched_2 = ...
the schedulers torchx ends up loading would be:
local_cwd
aws_batch
nemo_sched_1
nemo_sched_2
However, in the absence of your [torchx.schedulers] entrypoint, torchx would only load the ones defined in NeMo (nemo_sched_1 and nemo_sched_2).
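A quick way to see the compounding behavior described above (Python 3.10+, using only importlib.metadata; nothing here is torchx-specific):

from importlib.metadata import entry_points

# Every installed distribution that declares a [torchx.schedulers] entry point
# contributes to the same group; the loop below lists them all.
for ep in entry_points(group="torchx.schedulers"):
    print(f"{ep.name} -> {ep.value}")  # ep.load() would import the factory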
I'm aware of this effect, @kiukchung, but it means there must be some package re-registering the (ideally all) built-in schedulers and keeping up with that list (I remember you mentioned there's no plan to extend the list though; I was not aware of that when this ticket was cut).
E.g. inside a NeMo container where only the NeMo-Run and TorchX dependencies are installed, we only get NeMo-Run schedulers (like you mentioned), i.e. no local_cwd. As a user I would like to get the NeMo ones, the built-ins, and maybe add my own (just like components), but not at the cost of re-registering, if that makes sense.
Are you running on a pre-built/published NeMo docker image (e.g. nvcr.io/nvidia/nemo) directly? You don't have your own workspace/project to install?
We don't include torchx in runtime dependencies to have a nice separation in dependency closures, so no, @kiukchung.
The fact that you want to use local_cwd means that you have a direct dep on torchx, no? I'm having a hard time understanding the exact use-case to be able to offer you solutions/help. Perhaps a quick catch-up over Slack or VC would help?
I realize this is a long-standing ask, so I want to help unblock you.
I want to use the vanilla one that is already there :) So for me to be able to use local_cwd from within the NeMo container, I'd either have to install a custom package that re-registers the built-ins (I don't know of such a package, nor does it look tempting to create one), or comment out the entrypoints for NeMo-Run, @kiukchung.
A simple merge for the list of all registered schedulers.
Test plan:
[x] all existing tests should pass
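As a rough illustration of what such a test could assert (a sketch under the assumptions discussed above, not the PR's actual test code): built-ins stay available, extras are added, and a same-named extra wins.

from unittest import TestCase

class DefaultSchedulersSketchTest(TestCase):
    def test_extras_merge_over_builtins(self) -> None:
        builtins = {"local_cwd": "builtin", "aws_batch": "builtin"}
        extras = {"nemo_sched_1": "extra", "local_cwd": "override"}
        merged = {**builtins, **extras}
        self.assertIn("aws_batch", merged)                  # built-ins remain
        self.assertIn("nemo_sched_1", merged)               # extras are added
        self.assertEqual(merged["local_cwd"], "override")   # extras override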