-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Pytorch job in Katib #283
Conversation
/retest |
1 similar comment
/retest |
examples/pytorchjob-example.yaml
Outdated
spec: | ||
containers: | ||
- name: pytorch | ||
image: johnugeorge/pytorch-mnist-with-summary:0.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check in the source code for this? Also this should be on gcr.io.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised PR : kubeflow/pytorch-operator#112
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardsliu Can you help uploading the image to gcr.io ? I don't have access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge Done. Pushed as gcr.io/kubeflow-ci/pytorch-mnist-with-summary:0.4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
serviceAccountName: metrics-collector | ||
containers: | ||
- name: {{.WorkerID}} | ||
image: johnugeorge/metrics-collector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New metric-collector image has to be created once this current PR is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change the image name once the new image is created after this PR merge.
/hold /assign @richardsliu |
- "{{.Name}}={{.Value}}" | ||
{{- end}} | ||
{{- end}} | ||
metricsCollectorSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also need to configure PV here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardsliu Currently, this example uses the default metric collector which parses the stdout logs. It doesn't need PV. I will add one more example that uses the tf event metric collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM.
examples/pytorchjob-example.yaml
Outdated
spec: | ||
containers: | ||
- name: pytorch | ||
image: johnugeorge/pytorch-mnist-with-summary:0.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the image?
pkg/controller/studyjob/const.go
Outdated
@@ -13,5 +13,6 @@ package studyjob | |||
|
|||
const ( | |||
DefaultJobWorker = "Job" | |||
TFJobWorker = "TFJob" | |||
TFJobWorker = "TFJob" | |||
PytorchJobWorker = "PyTorchJob" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyTorchJobWorker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
/retest |
/test all |
/retest |
/retest |
} else { | ||
labelName = "job-name" | ||
} | ||
pl, _ := d.clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: labelName + "=" + wID, IncludeUninitialized: true}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whitch pod will be watched in Pytorch and TFJob Job by pytorch_job_name = wID
and tf_job_name = wID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pods spawn by the pytorch job and tf job will have this label key whose value is set to job name https://github.com/kubeflow/tf-operator/blob/master/pkg/common/jobcontroller/jobcontroller.go#L190
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my question is pytorch job and tf job will create several pods, and metrics collector will get logs from one pod.
Does metrics collector get logs from which one of the pods created by a pytorch job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it uses only 1 worker. If there are more workers, Master can take responsibility to emit logs. We need to separately discuss better ways to tackle distributed job. @richardsliu
@johnugeorge Need to rebase the PR. |
/retest |
@richardsliu rebased |
/lgtm |
/hold cancel |
I wrote a comment to metrics collector. |
@YujiOshima Shall we merge this and start a separate discussion for distributed jobs? I will create one issue |
@johnugeorge OK, Thanks. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: YujiOshima The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR tracks
Related: #39
This change is