Conversation

@clumsy
Collaborator

@clumsy clumsy commented Apr 23, 2025

A simple merge for the list of all registered schedulers.

Test plan:
[x] all existing tests should pass

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2025
@kiukchung
Contributor

could you provide more context to what you want to achieve with this?

@clumsy
Collaborator Author

clumsy commented Apr 23, 2025

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

@kiukchung
Contributor

Hi @clumsy, thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API and this change is not backwards-compatible for the get_scheduler_factories(skip_defaults=False) case where there exist registered entrypoint schedulers: users will now get their configured + default schedulers instead of just their configured ones.

Could you describe your use-case in wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.
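The backwards-compatibility concern can be sketched like this (a minimal illustration with made-up factory names, not TorchX's actual implementation):

```python
# User's own [torchx.schedulers] entrypoint registrations (hypothetical name).
registered = {"my_sched": "my_pkg.sched:create_scheduler"}
# TorchX's built-in defaults (abbreviated to one entry).
defaults = {"local_cwd": "torchx.schedulers.local_scheduler:create_cwd_scheduler"}

# Before the change: registered entrypoints replace the defaults entirely.
before = registered if registered else defaults
# After the change: the two sets are merged, so the defaults reappear
# alongside the user's registrations -- the behavioral difference flagged above.
after = {**defaults, **registered}

print(sorted(before))  # ['my_sched']
print(sorted(after))   # ['local_cwd', 'my_sched']
```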

@clumsy
Collaborator Author

clumsy commented May 1, 2025

Sure, @kiukchung

Take NeMo for example: NVIDIA ships it with all dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18)

Unfortunately nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Thus we cannot use local_cwd from within the container, for example, or anywhere nemo-run is installed.

It makes sense to have a feature to restrict the available schedulers, but does it have to be the default one?

@kiukchung
Contributor

kiukchung commented May 1, 2025

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)

We could do something like: {org_name}.torchx.schedulers and at load time select *.torchx.schedulers entrypoint groups. For BC we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torchx.util.entrypoints.load() (https://github.com/pytorch/torchx/blob/9120355959fef8eb438226e9521a07a00e02078e/torchx/util/entrypoints.py#L54)

and make a change in torchx.schedulers.get_schedulers() to call the load() fn appropriately.

There are some interesting cases regarding name conflicts and ordering (e.g. if NeMo registers a scheduler with the same name as one registered elsewhere, what do you do?)
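A possible shape for the prefix-aware entrypoint merge (the function name, merge policy, and example group names are assumptions for illustration, not an agreed design):

```python
def merge_scheduler_groups(
    groups: dict[str, dict[str, str]],
    suffix: str = ".torchx.schedulers",
) -> dict[str, str]:
    """Merge every ``{org_name}.torchx.schedulers`` entrypoint group with the
    legacy ``torchx.schedulers`` group for backwards compatibility.

    ``groups`` maps entrypoint group name -> {scheduler name: factory path},
    roughly what importlib.metadata would report. On a scheduler-name conflict
    the group that sorts last wins; the ordering question above stays open.
    """
    selected = sorted(
        g for g in groups if g == "torchx.schedulers" or g.endswith(suffix)
    )
    merged: dict[str, str] = {}
    for group in selected:
        merged.update(groups[group])
    return merged


# Example: one legacy group, one org-prefixed group, one unrelated group.
installed = {
    "torchx.schedulers": {"local_cwd": "torchx.schedulers.local_scheduler:create_cwd_scheduler"},
    "nemo.torchx.schedulers": {"nemo_sched": "nemo_run.sched:create_scheduler"},  # hypothetical path
    "console_scripts": {"torchx": "torchx.cli.main:main"},  # ignored: not a scheduler group
}
print(sorted(merge_scheduler_groups(installed)))  # ['local_cwd', 'nemo_sched']
```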

@clumsy
Collaborator Author

clumsy commented May 13, 2025

Thanks, @kiukchung! The proposed solution works for me and I can provide the implementation shortly.

clumsy pushed a commit to clumsy/torchx that referenced this pull request May 13, 2025
@clumsy
Collaborator Author

clumsy commented May 13, 2025

Please check this implementation, @kiukchung , @andywag, @d4l3k, @tonykao8080

The existing commands continue to work, e.g.:

  • torchx runopts
my_package.custom_local_docker:
    usage:
        [copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[debug=DEBUG],[image_repo=IMAGE_REPO],[quiet=QUIET]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        env=ENV (typing.Dict[str, str], None)
        environment variables to be passed to the run. The separator sign can be either comma or semicolon
            (e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
            of the ones from copy_env
        privileged=PRIVILEGED (bool, False)
            If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
        debug=DEBUG (bool, False)
            run a container with noop entrypoint, useful for debugging environment
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.
  • torchx run -s my_package.custom_local_cwd
usage: torchx run [-h]
                  [-s {local_docker,local_cwd,slurm,kubernetes,kubernetes_mcad,aws_batch,aws_sagemaker,gcp_batch,ray,lsf,my_package.custom_aws_batch,my_package.aws_sagemaker,my_package.custom_gcp_batch,my_package.kubernetes,my_package.kubernetes_mcad,my_package.custom_local_cwd,my_package.local_docker,my_package.lsf,my_package.custom_aws_batch,my_package.custom_local_docker,}]
  • torchx run -s my_package.custom_local_cwd utils.echo --msg "test"
torchx 2025-05-13 12:52:15 INFO     Tracker configurations: {}
torchx 2025-05-13 12:52:15 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2025-05-13 12:52:15 INFO     Log directory is: /var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg
my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     Launched app: my_package.custom_local_cwd://torchx/echo-gww0rzcc0z72tc
torchx 2025-05-13 12:52:15 INFO     AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles:
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///var/folders/8b/nbn0wcb93m710myrqx8r_clh0000gq/T/torchx_m0wx4hxg/torchx/echo-gww0rzcc0z72tc

@facebook-github-bot
Contributor

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@clumsy
Collaborator Author

clumsy commented Jun 6, 2025

The failure seems unrelated to the PR: `E RuntimeError: operator torchvision::nms does not exist`
@kiukchung Should we rerun the tests?

@clumsy
Collaborator Author

clumsy commented Jun 18, 2025

Hi @kiukchung, should I do a bogus push to trigger a new test run? The last failure seems to have been unrelated.

@clumsy
Collaborator Author

clumsy commented Aug 22, 2025

Hi @kiukchung does this PR need more work or is there anything I can do to help integrate the change?

@clumsy
Collaborator Author

clumsy commented Oct 3, 2025

Sorry for bugging, but could you please let me know if I can help get this one merged, @kiukchung, @d4l3k, @tonykao8080, @andywag? Thanks!

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 57ec248 to 8f0c6b0 Compare October 24, 2025 23:18
@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

I just realized this was never merged. Does it need more work, @kiukchung?

@codecov-commenter

codecov-commenter commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.66%. Comparing base (9016924) to head (a5aa88f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1050      +/-   ##
==========================================
- Coverage   91.66%   91.66%   -0.01%     
==========================================
  Files          85       85              
  Lines        6623     6620       -3     
==========================================
- Hits         6071     6068       -3     
  Misses        552      552              
Flag        Coverage Δ
unittests   91.66% <100.00%> (-0.01%) ⬇️


@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 8f0c6b0 to 36e77f7 Compare October 25, 2025 01:26
@clumsy
Collaborator Author

clumsy commented Oct 25, 2025

Hm, not sure why the "The Diff and Pull Request are not in sync!" check seems to be failing here, yet it also reports "All checks passed, this Pull Request can be imported!", @kiukchung

I wonder if it's related to the new permissions. I rebased on top of upstream main and pushed to my fork like I always do :/
Perhaps it's because it's not approved?

@clumsy
Collaborator Author

clumsy commented Oct 27, 2025

Is there anything I can do here, @kiukchung?

@clumsy
Collaborator Author

clumsy commented Oct 29, 2025

Please have a look when you get a chance, @kiukchung
Thanks!

@kiukchung
Contributor

Please have a look when you get a chance, @kiukchung Thanks!

@clumsy sorry I didn't get to this sooner! Will take a closer look tomorrow morning. Just want to make sure everything is BC!

@clumsy
Collaborator Author

clumsy commented Oct 30, 2025

Appreciate your help, @kiukchung!

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch 2 times, most recently from cfa412b to 05924eb Compare October 30, 2025 19:31
@clumsy
Collaborator Author

clumsy commented Oct 30, 2025

Fixed the BC issues in the new revision and added more unit tests @kiukchung

@meta-codesync

meta-codesync bot commented Oct 31, 2025

@kiukchung has imported this pull request. If you are a Meta employee, you can view this in D74743585.

Contributor

@tonykao8080 tonykao8080 left a comment


Review automatically exported from Phabricator review in Meta.

@clumsy
Collaborator Author

clumsy commented Nov 10, 2025

Please let me know what you think about the latest revision, @kiukchung

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 05924eb to 2d45b47 Compare November 10, 2025 14:20
@clumsy
Collaborator Author

clumsy commented Nov 11, 2025

Could you please import the recent revision, @kiukchung @tonykao8080?

@kiukchung
Contributor

will take a closer look at this tomorrow when I'm back from PTO.

@kiukchung
Contributor

Could you please import the recent revision, @kiukchung @tonykao8080?

this fails internal tests because of the non-BC issue we've discussed. get_scheduler_factories() returns either the schedulers defined in the [torchx.schedulers] entrypoint or, if no such entrypoint group exists, the set of DEFAULT_SCHEDULERS.

Since we have the [torchx.schedulers] entrypoint set internally, we don't expect the default set of schedulers to get loaded, especially since we don't actually include dependencies for schedulers like "kubernetes" (which we don't use internally).

There are call-sites where users iterate through the loaded scheduler factories and actually create scheduler instances. This PR makes it so that schedulers such as kubernetes and kubernetes_mcad are attempted to be created, which fails since we don't include those modules in our internal deployment of torchx.

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 2d45b47 to d99cb1e Compare December 1, 2025 14:14
@clumsy
Collaborator Author

clumsy commented Dec 1, 2025

Understood, @kiukchung. Made sure we preserve the existing behavior: when torchx.schedulers is provided, it overrides everything.

If it is not provided we use the defaults: built-in (as before, but renamed as you suggested) + extra (explicitly registered under the new entrypoint, allowing them to override the defaults).

Libraries that already extend TorchX, e.g. NeMo-Run, can now transition to the new mechanics.
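The resolution order described here can be sketched as follows (function and variable names are assumptions for illustration, not the PR's actual code):

```python
def resolve_factories(user_group, builtins, extras):
    """Hypothetical sketch of the described resolution order.

    user_group: entries from [torchx.schedulers] (overrides everything, as before)
    builtins:   TorchX's built-in scheduler modules
    extras:     entries from the new extras entrypoint, able to override built-ins
    """
    if user_group:
        # Existing behavior preserved: an explicit registration wins outright.
        return dict(user_group)
    # Otherwise the defaults apply: built-ins, with extras overriding by name.
    return {**builtins, **extras}


print(resolve_factories({}, {"local_cwd": "builtin"}, {"nemo_sched": "extra"}))
# {'local_cwd': 'builtin', 'nemo_sched': 'extra'}
print(resolve_factories({"only_mine": "f"}, {"local_cwd": "builtin"}, {}))
# {'only_mine': 'f'}
```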

@clumsy clumsy force-pushed the feat/extend_schedulers_list branch from 68b6131 to 5f82b9b Compare December 1, 2025 23:32
@clumsy
Collaborator Author

clumsy commented Dec 3, 2025

Does it look good now, @kiukchung ?

@meta-pytorch meta-pytorch deleted a comment from azzhipa Dec 3, 2025
@clumsy
Collaborator Author

clumsy commented Dec 3, 2025

Thanks for checking @tonykao8080. Looks like I have approvals and permissions to merge this one, any objections @kiukchung?

@kiukchung
Contributor

Thanks for checking @tonykao8080. Looks like I have approvals and permissions to merge this one, any objections @kiukchung?

taking a look

Contributor

@kiukchung kiukchung left a comment


See my comment. This change may not be necessary to achieve what you want.

Comment on lines +42 to +47
def default_schedulers() -> dict[str, SchedulerFactory]:
    """Build default schedulers (built-in + extras from torchx.schedulers.extra)."""
    return {
        **{s: _defer_load_scheduler(p) for s, p in BUILTIN_SCHEDULER_MODULES.items()},
        **load_group("torchx.schedulers.extra", default={}),
    }
Contributor

I took a closer look at this, and realized that we don't need this change to get what you want.

Suppose what you want is NeMo's [torchx.schedulers] as well as other ones that you use. Python entrypoint groups are compounding (but the keys within a group are not). So as long as you don't have a conflict in the scheduler names, you can define your own entrypoint as

[torchx.schedulers]
local_cwd = torchx.schedulers.local_scheduler_fb:create_cwd_scheduler
aws_batch = torchx.schedulers.aws_batch_scheduler:create_scheduler
... others you want ...

and NeMo defines their set as:

[torchx.schedulers]
nemo_sched_1 = ...
nemo_sched_2 = ...

the schedulers torchx ends up loading would be:

local_cwd
aws_batch
nemo_sched_1
nemo_sched_2

However, in the absence of your [torchx.schedulers] entrypoint, torchx would only load the ones defined in NeMo (nemo_sched_1 and nemo_sched_2).
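The compounding behavior can be illustrated with the names from this example (the factory paths on the NeMo side are placeholders):

```python
# What each distribution contributes to the SAME [torchx.schedulers] group.
# importlib.metadata reports the union of a group across all installed
# distributions; keys are only forced unique within one distribution.
your_entries = {
    "local_cwd": "torchx.schedulers.local_scheduler_fb:create_cwd_scheduler",
    "aws_batch": "torchx.schedulers.aws_batch_scheduler:create_scheduler",
}
nemo_entries = {
    "nemo_sched_1": "nemo_run.core:create_scheduler_1",  # placeholder path
    "nemo_sched_2": "nemo_run.core:create_scheduler_2",  # placeholder path
}

# The set torchx ends up loading is the union of the two:
loaded = {**your_entries, **nemo_entries}
print(sorted(loaded))  # ['aws_batch', 'local_cwd', 'nemo_sched_1', 'nemo_sched_2']
```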

Collaborator Author

I'm aware of this effect, @kiukchung, but it means there must be some package re-registering (ideally all of) the built-in schedulers and keeping up with the list (I remember you mentioned there's no plan to extend the list though - I was not aware of that when this ticket was cut).
E.g. inside a NeMo container where only the NeMo-Run and TorchX dependencies are installed, we only get NeMo-Run schedulers (like you mentioned), i.e. no local_cwd. As a user I would like to get the NeMo ones, the built-ins, and maybe add my own (just like components), but not at the cost of re-registering, if that makes sense.

Contributor

Are you running on a pre-built/published NeMo docker image (e.g. nvcr.io/nvidia/nemo) directly? You don't have your own workspace/project to install?

Collaborator Author

We don't include torchx in runtime dependencies, to keep a nice separation in dependency closures - so no, @kiukchung

Contributor

The fact that you want to use local_cwd means that you have a direct dep on torchx, no? I'm having a hard time understanding the exact use-case well enough to offer you solutions/help. Perhaps a quick catch-up over Slack or VC would help?

I realize this is a long standing ask so want to help unblock you.

Collaborator Author

I want to use the vanilla one that is already there :) So in order for me to be able to use local_cwd from within the NeMo container, I have to either install a custom package that re-registers the built-ins (I don't know of such a package, nor does it look tempting to create one), or comment out the entrypoints for NeMo-Run, @kiukchung
