Support Pytorch job in Katib #283
Changes from all commits
ce4bde5
896c4d0
147a846
34d535d
455dc9e
16a0e41
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
apiVersion: "kubeflow.org/v1alpha1" | ||
kind: StudyJob | ||
metadata: | ||
namespace: kubeflow | ||
labels: | ||
controller-tools.k8s.io: "1.0" | ||
name: pytorchjob-example | ||
spec: | ||
studyName: pytorchjob-example | ||
owner: crd | ||
optimizationtype: maximize | ||
objectivevaluename: accuracy | ||
optimizationgoal: 0.99 | ||
requestcount: 4 | ||
metricsnames: | ||
- accuracy | ||
parameterconfigs: | ||
- name: --lr | ||
parametertype: double | ||
feasible: | ||
min: "0.01" | ||
max: "0.05" | ||
- name: --momentum | ||
parametertype: double | ||
feasible: | ||
min: "0.5" | ||
max: "0.9" | ||
workerSpec: | ||
retain: true | ||
goTemplate: | ||
rawTemplate: |- | ||
apiVersion: "kubeflow.org/v1beta1" | ||
kind: PyTorchJob | ||
metadata: | ||
name: {{.WorkerID}} | ||
namespace: kubeflow | ||
spec: | ||
pytorchReplicaSpecs: | ||
Master: | ||
replicas: 1 | ||
restartPolicy: Never | ||
template: | ||
spec: | ||
containers: | ||
- name: pytorch | ||
image: gcr.io/kubeflow-ci/pytorch-mnist-with-summary:0.4 | ||
imagePullPolicy: Always | ||
command: | ||
- "python" | ||
- "/opt/pytorch_dist_mnist/mnist_with_summary.py" | ||
{{- with .HyperParameters}} | ||
{{- range .}} | ||
- "{{.Name}}={{.Value}}" | ||
{{- end}} | ||
{{- end}} | ||
metricsCollectorSpec: | ||
retain: true | ||
goTemplate: | ||
rawTemplate: |- | ||
apiVersion: batch/v1beta1 | ||
kind: CronJob | ||
metadata: | ||
name: {{.WorkerID}} | ||
namespace: kubeflow | ||
spec: | ||
schedule: "*/1 * * * *" | ||
successfulJobsHistoryLimit: 15 | ||
failedJobsHistoryLimit: 15 | ||
jobTemplate: | ||
spec: | ||
template: | ||
spec: | ||
serviceAccountName: metrics-collector | ||
containers: | ||
- name: {{.WorkerID}} | ||
image: johnugeorge/metrics-collector | ||
Review comment: Same with this.
Review comment: A new metric-collector image has to be created once this PR is merged.
Review comment: I will change the image name once the new image is created after this PR is merged.
||
args: | ||
- "./metricscollector" | ||
- "-s" | ||
- "{{.StudyID}}" | ||
- "-t" | ||
- "{{.TrialID}}" | ||
- "-w" | ||
- "{{.WorkerID}}" | ||
- "-k" | ||
- "{{.WorkerKind}}" | ||
- "-n" | ||
- "{{.NameSpace}}" | ||
restartPolicy: Never | ||
|
||
suggestionSpec: | ||
suggestionAlgorithm: "random" | ||
requestNumber: 3 |
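The worker `rawTemplate` above relies on Go template trim markers (`{{-`) to append one command argument per suggested hyperparameter. The following is a minimal sketch of that expansion using the standard `text/template` package. The `workerInstance` and `hyperParameter` types are hypothetical stand-ins, since the PR does not show the struct Katib passes to the template; only the field names `.WorkerID`, `.HyperParameters`, `.Name` and `.Value` come from the manifest, and indentation is dropped for brevity.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// hyperParameter and workerInstance are hypothetical types for illustration.
// Only the field names used by the manifest's template are taken from the PR.
type hyperParameter struct {
	Name  string
	Value string
}

type workerInstance struct {
	WorkerID        string
	HyperParameters []hyperParameter
}

// commandTemplate mirrors the command section of the worker rawTemplate.
const commandTemplate = `name: {{.WorkerID}}
command:
- "python"
- "/opt/pytorch_dist_mnist/mnist_with_summary.py"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
`

// renderWorker executes the template against one worker instance.
func renderWorker(w workerInstance) (string, error) {
	tmpl, err := template.New("worker").Parse(commandTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, w); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderWorker(workerInstance{
		WorkerID: "w1",
		HyperParameters: []hyperParameter{
			{Name: "--lr", Value: "0.03"},
			{Name: "--momentum", Value: "0.7"},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

With the two sample values, the rendered command list gains the entries `"--lr=0.03"` and `"--momentum=0.7"`. This is why the parameter names in `parameterconfigs` carry the leading `--`: they are passed to the training script verbatim.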
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
apiVersion: apiextensions.k8s.io/v1beta1 | ||
kind: CustomResourceDefinition | ||
metadata: | ||
name: pytorchjobs.kubeflow.org | ||
spec: | ||
group: kubeflow.org | ||
version: v1beta1 | ||
scope: Namespaced | ||
names: | ||
kind: PyTorchJob | ||
singular: pytorchjob | ||
plural: pytorchjobs |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,6 +42,7 @@ rules: | |
- kubeflow.org | ||
resources: | ||
- tfjobs | ||
- pytorchjobs | ||
verbs: | ||
- "*" | ||
--- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
/* | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package apis | ||
|
||
import ( | ||
"github.com/kubeflow/pytorch-operator/pkg/apis/pytorch/v1beta1" | ||
) | ||
|
||
func init() { | ||
// Register the types with the Scheme so the components can map objects to GroupVersionKinds and back | ||
AddToSchemes = append(AddToSchemes, v1beta1.SchemeBuilder.AddToScheme) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ import ( | |
restclient "k8s.io/client-go/rest" | ||
|
||
"github.com/kubeflow/katib/pkg/api" | ||
"github.com/kubeflow/katib/pkg/controller/studyjob" | ||
) | ||
|
||
type MetricsCollector struct { | ||
|
@@ -34,8 +35,16 @@ func NewMetricsCollector() (*MetricsCollector, error) { | |
|
||
} | ||
|
||
- func (d *MetricsCollector) CollectWorkerLog(wID string, objectiveValueName string, metrics []string, namespace string) (*api.MetricsLogSet, error) { | ||
- pl, _ := d.clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: "job-name=" + wID, IncludeUninitialized: true}) | ||
+ func (d *MetricsCollector) CollectWorkerLog(wID string, wkind string, objectiveValueName string, metrics []string, namespace string) (*api.MetricsLogSet, error) { | ||
+ var labelName string | ||
+ if wkind == studyjob.TFJobWorker { | ||
+ labelName = "tf_job_name" | ||
+ } else if wkind == studyjob.PyTorchJobWorker { | ||
+ labelName = "pytorch_job_name" | ||
+ } else { | ||
+ labelName = "job-name" | ||
+ } | ||
+ pl, _ := d.clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: labelName + "=" + wID, IncludeUninitialized: true}) | ||
Review comment: Which pod will be watched in PyTorchJob and TFJob by
Review comment: Pods spawned by the PyTorch job and the TF job carry this label key, whose value is set to the job name: https://github.com/kubeflow/tf-operator/blob/master/pkg/common/jobcontroller/jobcontroller.go#L190
Review comment: Yes. My question is that a PyTorch job or TF job creates several pods, but the metrics collector gets logs from only one pod.
Review comment: Currently, it uses only 1 worker. If there are more workers, the Master can take responsibility for emitting logs. We need to discuss better ways to handle distributed jobs separately. @richardsliu
||
if len(pl.Items) == 0 { | ||
return nil, errors.New(fmt.Sprintf("No Pods are found in Job %v", wID)) | ||
} | ||
|
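The label selection in the hunk above can be exercised in isolation. This is a minimal sketch, not the PR's code: the constant values `"TFJob"` and `"PyTorchJob"` are assumptions, because the diff references `studyjob.TFJobWorker` and `studyjob.PyTorchJobWorker` without showing their values. The label keys themselves are taken from the diff.

```go
package main

import "fmt"

// Hypothetical worker-kind values; the PR only shows the constant names
// studyjob.TFJobWorker and studyjob.PyTorchJobWorker, not their contents.
const (
	tfJobWorker      = "TFJob"
	pyTorchJobWorker = "PyTorchJob"
)

// labelSelectorFor builds the pod label selector for a worker, using the
// label keys from the diff: operator-managed jobs use their own key, and
// anything else falls back to the batch Job label "job-name".
func labelSelectorFor(workerKind, workerID string) string {
	var labelName string
	switch workerKind {
	case tfJobWorker:
		labelName = "tf_job_name"
	case pyTorchJobWorker:
		labelName = "pytorch_job_name"
	default:
		labelName = "job-name"
	}
	return labelName + "=" + workerID
}

func main() {
	for _, kind := range []string{tfJobWorker, pyTorchJobWorker, "Job"} {
		fmt.Println(kind, "->", labelSelectorFor(kind, "worker-abc"))
	}
}
```

A `switch` with a default branch keeps the fallback explicit, so an unrecognized worker kind still resolves to the plain `job-name` label, matching the `else` branch in the diff.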
Review comment: Do you also need to configure PV here?
Review comment: @richardsliu Currently, this example uses the default metric collector, which parses the stdout logs. It doesn't need a PV. I will add one more example that uses the TF event metric collector.
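To illustrate what parsing stdout logs for metrics can look like, here is a rough sketch. The `name=value` token format is an assumption made for illustration; the PR does not show the log format the default metric collector expects, and `parseMetrics` is a hypothetical helper, not a function from Katib.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics scans log text line by line and collects every numeric value
// reported as a whitespace-separated "name=value" token for the requested
// metric names. The token format is an assumption, not Katib's documented one.
func parseMetrics(logs string, metricNames []string) map[string][]float64 {
	wanted := make(map[string]bool, len(metricNames))
	for _, name := range metricNames {
		wanted[name] = true
	}

	found := make(map[string][]float64)
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		for _, field := range strings.Fields(scanner.Text()) {
			parts := strings.SplitN(field, "=", 2)
			if len(parts) != 2 || !wanted[parts[0]] {
				continue
			}
			value, err := strconv.ParseFloat(parts[1], 64)
			if err != nil {
				continue // skip tokens whose value is not numeric
			}
			found[parts[0]] = append(found[parts[0]], value)
		}
	}
	return found
}

func main() {
	logs := "epoch 1 loss=0.35 accuracy=0.91\nepoch 2 loss=0.20 accuracy=0.97\n"
	fmt.Println(parseMetrics(logs, []string{"accuracy"}))
}
```

Because this approach only reads the pod's log stream, it needs no shared volume, which is consistent with the reply above that the example works without a PV.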