chore: Updating Pytorch-Launcher component to work with pipelines v2 #11273

Merged
merged 1 commit into from
Nov 12, 2024

Conversation

Fiona-Waters
Contributor

@Fiona-Waters Fiona-Waters commented Oct 7, 2024

Description of your changes:
This PR resolves kubeflow/training-operator#2068.
I have updated the PyTorch launcher component to use Pipelines v2 constructs.
I have also updated it to use the Kubeflow training-operator TrainingClient.


Hi @Fiona-Waters. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Fiona-Waters
Contributor Author

@terrytangyuan maybe you could help with the training-operator related error I am getting here. Any advice would be really appreciated. Thank you.

@terrytangyuan
Member

Can you print out the job object? It's supposed to be a Job/CRD object with fields like kind.

@Fiona-Waters
Contributor Author

Can you print out the job object? It's supposed to be a Job/CRD object with fields like kind.

Sure. Running it quickly locally it looks like this:

job {'api_version': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'annotations': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': None,
              'managed_fields': None,
              'name': 'pytorchjob',
              'namespace': 'kubeflow',
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': {'elastic_policy': None,
          'nproc_per_node': None,
          'pytorch_replica_specs': {'Master': {}, 'Worker': {}},
          'run_policy': {'active_deadline_seconds': None,
                         'backoff_limit': None,
                         'clean_pod_policy': 'Running',
                         'scheduling_policy': None,
                         'suspend': None,
                         'ttl_seconds_after_finished': None}},
 'status': None}

I can try to log it out when running on KinD later today.

@Fiona-Waters
Contributor Author

This is what the serialized job looks like. Kind is present.

{'apiVersion': 'kubeflow.org/v1', 'kind': 'PyTorchJob', 'metadata': {'name': 'pytorchjob', 'namespace': 'kubeflow'}, 'spec': {'pytorchReplicaSpecs': {'Master': {}, 'Worker': {}}, 'runPolicy': {'cleanPodPolicy': 'Running'}}}.
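
For readers comparing the two dumps above, here is a minimal sketch of how the printed object relates to the serialized one: fields set to None are dropped and snake_case field names become camelCase. This is not the launcher's or the SDK's code; `to_camel` and `serialize` are hypothetical helpers written only for illustration, and the `job` dict is an abbreviated copy of the output above.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase (hypothetical helper)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def serialize(value):
    """Recursively drop None fields and camelCase dict keys (illustrative only)."""
    if isinstance(value, dict):
        return {to_camel(k): serialize(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [serialize(v) for v in value]
    return value


# Abbreviated copy of the job object printed earlier in this thread.
job = {
    "api_version": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"annotations": None, "name": "pytorchjob", "namespace": "kubeflow", "uid": None},
    "spec": {
        "elastic_policy": None,
        "nproc_per_node": None,
        "pytorch_replica_specs": {"Master": {}, "Worker": {}},
        "run_policy": {"backoff_limit": None, "clean_pod_policy": "Running", "suspend": None},
    },
    "status": None,
}

serialized = serialize(job)
print(serialized)
```

Under these assumptions the result keeps `kind` at the top level, which matches the serialized job shown above.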

@Fiona-Waters Fiona-Waters marked this pull request as ready for review October 17, 2024 22:29
@Fiona-Waters Fiona-Waters changed the title [WIP] chore: Updating Pytorch-Launcher component to work with pipelines v2 chore: Updating Pytorch-Launcher component to work with pipelines v2 Oct 17, 2024
@Fiona-Waters
Contributor Author

@HumairAK would really appreciate if you could review this if/when you have time, please.

@terrytangyuan
Member

Thank you for your work on this! I left some comments.

@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Oct 23, 2024
@rimolive
Member

Thanks for your contribution, @Fiona-Waters!

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Oct 25, 2024
@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Oct 25, 2024
@HumairAK
Collaborator

some nits above, otherwise happy to approve once addressed

@google-oss-prow google-oss-prow bot removed the lgtm label Oct 31, 2024
@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Oct 31, 2024
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Nov 4, 2024
@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 4, 2024

@Shreyanand Shreyanand left a comment

/lgtm

The latest commits address all the requested changes.


@Shreyanand: changing LGTM is restricted to collaborators

In response to this:

/lgtm

The latest commits address all the requested changes.

Signed-off by: Fiona-Waters fiwaters6@gmail.com

Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 12, 2024
@HumairAK
Collaborator

Thanks @Fiona-Waters! This is great!

/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK, Shreyanand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit cac3739 into kubeflow:master Nov 12, 2024
4 checks passed
Labels
approved ci-passed All CI tests on a pull request have passed lgtm needs-ok-to-test size/XL
Development

Successfully merging this pull request may close these issues.

Update pytorch launcher component in Kubeflow Pipelines repository
5 participants