Add information how to run TFjob and Pytorch examples in Katib #321
/retest
- [Web UI](#web-ui)
- [API Documentation](#api-documentation)
- [Quickstart to run tfjob and pytorch operator jobs in Katib](#quickstart-to-run-tfjob-and-pytorch-operator-jobs-in-katib)
- [TFjob operator](#tfjob-operator)
Since you already updated the Katib documentation in the Kubeflow repo (https://github.com/kubeflow/kubeflow/tree/master/kubeflow/katib), shall we just point to the Kubeflow page instead of duplicating it in this section? That would make the pages easier to manage in future updates.
I think we can merge this entire section into the "Getting Started" section.
"For running TFJobs and PyTorchJobs in Katib, Install job operators given in "
But in #313 (comment) it was decided to copy the information into the Katib README as well. And what should we do with the information about running the TFJob and PyTorch examples?
I see. @richardsliu WDYT?
Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @andreyvelich and @hougangliu)
README.md, line 95 at r1 (raw file):
## Quickstart to run tfjob and pytorch operator jobs in Katib For running tfjob and pytorch operator jobs in Katib you have to install their packages.
For running tfjob and pytorch operator jobs in Katib, you have to install their packages.
README.md, line 116 at r1 (raw file):
After this you have to install volume for tfjob operator.
After this, you have to install volume for tfjob operator.
BTW, the tfjob operator doesn't depend on a pv/pvc; what you create here is used by the tfjob in the Katib example. So I suggest moving this part to #running-examples
README.md, line 135 at r1 (raw file):
requests: storage: 10Gi
I suggest using kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml to replace the YAML content.
README.md, line 142 at r1 (raw file):
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
kubectl create -f https://raw.githubusercontent.com/andreyvelich/katib/example-doc-pytorch-tfjob-313/examples/tfevent-volume/tfevent-pv.yaml
use kubeflow/katib/master instead of andreyvelich/katib/example-doc-pytorch-tfjob-313
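Taken together, the two suggestions above would reduce the volume setup to two commands against the upstream branch. The sketch below is one way to write that. It assumes tfevent-pv.yaml sits next to tfevent-pvc.yaml on kubeflow/katib master (the path is inferred by swapping the branch in the quoted line, not confirmed in this thread), and the helper name is hypothetical.

```shell
# Base URL of the upstream branch. The PV path used below is an assumption
# obtained by replacing the fork/branch with kubeflow/katib/master.
KATIB_RAW="https://raw.githubusercontent.com/kubeflow/katib/master"

# Hypothetical helper: create the PVC and PV used by the TFJob example in Katib.
create_tfevent_volume() {
  kubectl create -f "${KATIB_RAW}/examples/tfevent-volume/tfevent-pvc.yaml"
  kubectl create -f "${KATIB_RAW}/examples/tfevent-volume/tfevent-pv.yaml"
}
```

Calling create_tfevent_volume from a shell with cluster access would run both commands in order.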
README.md, line 189 at r1 (raw file):
kubectl create -f katib-mysql-pv.yaml
please use https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml instead
README.md, line 193 at r1 (raw file):
### Running examples After deploy everything you can run examples.
After deploy everything, you can run examples.
README.md, line 217 at r1 (raw file):
If you create pv for Katib delete it
If you create pv for Katib, delete it
README.md, line 220 at r1 (raw file):
kubectl delete -f katib-mysql-pv.yaml
please use https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml instead
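Both comments about katib-mysql-pv.yaml point at the same upstream manifest, so creation and cleanup can share one URL. A minimal sketch, with hypothetical helper names:

```shell
# Upstream PV manifest named in the review comments.
KATIB_PV_URL="https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml"

# Hypothetical helper: create the PV for Katib from the upstream manifest.
create_katib_pv() {
  kubectl create -f "${KATIB_PV_URL}"
}

# Hypothetical helper: delete the same PV during cleanup.
delete_katib_pv() {
  kubectl delete -f "${KATIB_PV_URL}"
}
```

Using the remote manifest for both steps means readers do not need a local copy of the file.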
@hougangliu
/lgtm
```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
```
Please describe how to check the status of the jobs with kubectl get studyjob and how to view them in the UI.
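The status check requested here could look like the sketch below. The kubeflow namespace and the pytorchjob-example name are taken from commands quoted in this review; the helper name and its defaults are hypothetical.

```shell
# Hypothetical helper: list StudyJobs, then show details for one of them.
# The default job name and namespace are assumptions taken from this review.
check_studyjob() {
  local name="${1:-pytorchjob-example}"
  local ns="${2:-kubeflow}"
  kubectl get studyjob -n "${ns}"
  kubectl describe studyjob "${name}" -n "${ns}"
}
```

Running check_studyjob with no arguments inspects the PyTorch example; passing another name and namespace inspects a different StudyJob.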
@YujiOshima Done.
Reviewed 1 of 1 files at r1.
Reviewable status: 0 of 1 files reviewed, 14 unresolved discussions (waiting on @hougangliu, @andreyvelich, and @YujiOshima)
README.md, line 210 at r3 (raw file):
```yaml
kubectl describe studyjob pytorchjob-example -n kubeflow
```
$ kubectl describe studyjob pytorchjob-example -n kubeflow
README.md, line 330 at r3 (raw file):
When the spec.Status.Condition becomes ```Completed```, the StudyJob is finished. You can monitor your results in Katib UI. For accessing to Katib UI you have to install Ambassador.
For accessing to Katib UI, you have to install Ambassador.
README.md, line 339 at r3 (raw file):
After deploy Ambassador, you can access to Katib UI using /katib/ path.
Here, we had better show the full URL of the Ambassador service with the /katib/ path, too (just in case a user has no idea what Ambassador is).
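One way to give readers a concrete URL is to port-forward the Ambassador service and open /katib/ on localhost. In the sketch below, the service name ambassador, its port 80, the kubeflow namespace, and local port 8080 are all assumptions for illustration; this thread does not state them. The helper names are hypothetical.

```shell
# Assumed values, not confirmed by this thread.
AMBASSADOR_SVC="ambassador"
AMBASSADOR_NS="kubeflow"
LOCAL_PORT="8080"

# Print the URL a reader would open once the port-forward is running.
katib_ui_url() {
  echo "http://localhost:${LOCAL_PORT}/katib/"
}

# Forward the local port to the Ambassador service (blocks until interrupted).
forward_katib_ui() {
  kubectl port-forward "svc/${AMBASSADOR_SVC}" -n "${AMBASSADOR_NS}" "${LOCAL_PORT}:80"
}
```

A reader would run forward_katib_ui in one terminal and open the address printed by katib_ui_url in a browser.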
README.md, line 357 at r3 (raw file):
If you deploy Ambassador delete it
If you deploy Ambassador, delete it
@hougangliu
I prefer the first one (you had better show the
/lgtm
@andreyvelich Great thank you!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, YujiOshima
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Fixes: #313.