-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor the controller logic #234
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
|
Hi @wackxu. Thanks for your PR. I'm waiting for a kubernetes or tensorflow member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The work is going on. Then it is ok to review, I will call yo @jlewi |
cool, will help review when it ready |
Awesome, plz @ me if the PR is ready since I implemented a similar controller before and I am glad to help to review it :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just submit the last one commit refactor logic
, i think the other commits are the same as #215
This PR is almost done, I need to wait #215 (almost get merge) get merged. Then I will done the following work. |
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need this header? @jlewi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.
cmd/tf_operator/app/server.go
Outdated
Client: leaderElectionClient.CoreV1(), | ||
LockConfig: resourcelock.ResourceLockConfig{ | ||
Identity: id, | ||
EventRecorder: &record.FakeRecorder{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use real event broadcast? we can get some events in tf-operator endpoint when tf-operator failover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i created a PR #260 fix this.
cmd/tf_operator/app/server.go
Outdated
|
||
id, err := os.Hostname() | ||
if err != nil { | ||
return fmt.Errorf("failed to get hostname: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failed ... would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix.
fs.IntVar(&s.ChaosLevel, "chaos-level", -1, "DO NOT USE IN PRODUCTION - level of chaos injected into the TfJob created by the operator.") | ||
fs.BoolVar(&s.PrintVersion, "version", false, "Show version and quit") | ||
fs.DurationVar(&s.GCInterval, "gc-interval", 10*time.Minute, "GC interval") | ||
fs.StringVar(&s.ControllerConfigFile, "controller_config_file", "", "Path to file containing the controller config.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's fix controller_config_file
to controller-config-file
in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/tf_operator/app/server.go
Outdated
name := os.Getenv("MY_POD_NAME") | ||
if len(name) == 0 { | ||
glog.Fatalf("must set env MY_POD_NAME") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this in validate func?
@@ -1,147 +0,0 @@ | |||
// Licensed under the Apache License, Version 2.0 (the "License"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suspect remove this will break dashboard...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref #258
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I think #215 has been merged, could me mind remove the repeated commits please?
@ScorpioCPH Done, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
import ( | ||
"github.com/spf13/pflag" | ||
"time" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use gofmt
to format your source code.
Move built-in packages to the beginning of import {} maybe better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review, Will address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScorpioCPH I have run go fmt
and it is the right order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line between built-in package time
and github.com/spf13/pflag
, and retry.
cmd/tf_operator/app/server.go
Outdated
"time" | ||
"k8s.io/client-go/rest" | ||
clientset "k8s.io/client-go/kubernetes" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please format this import{} statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
cmd/tf_operator/app/server.go
Outdated
} | ||
name := os.Getenv("MY_POD_NAME") | ||
if len(name) == 0 { | ||
glog.Fatalf("must set env MY_POD_NAME") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why must we set a pod name? I think this is no matter for users, in another word, Kubeflow users don't know what is Pod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I would like to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove it. I think it's a legacy of the original etcd code.
@@ -1,194 +1,22 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add copyright.
cmd/tf_operator/app/server.go
Outdated
return fmt.Errorf("failed to get hostname: %v", err) | ||
} | ||
|
||
rl := &resourcelock.EndpointsLock{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this resourcelock
used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is for high availability when we run more than one tensorflow operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resoucelock have endpoint and configmap type. tf_operator use endpoint type. tf_operator endpoint use annotation store kubernetes leader election information as a optimistic lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some discussion in #263.
cmd/tf_operator/app/server.go
Outdated
}, | ||
} | ||
|
||
election.RunOrDie(election.LeaderElectionConfig{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we have more than one Kubeflow controller in a cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can run more than one tensorflow operator for high availability as kubernetes controller manage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only a replecas work. so it is a Cold Backup
@wackxu agree with @ScorpioCPH Can you explain how using informers works and how it compares to the current model? High level question. Does this refactor make the controller stateless? The initial design of the controller was stateful such that we maintained state about each TfJob in memory. This was convenient because it meant that every time the job controller first loaded a job we could perform some expensive operation to determine the state of the job. Then when we periodically (every O(seconds) check the state of the job) we could take advantage of the known state. If the controller is stateless then this could be more problematic because either we'd need to 1) store this state in the ApiServer or 2) recompute the state on each update which could be prohibitively expensive. |
@jlewi You can get details information about how informers works from https://engineering.bitnami.com/articles/a-deep-dive-into-kubernetes-controllers.html |
@jlewi In this refactor, the controller is still stateful. If we want to keep style with |
@wackxu Thanks for the info its very helpful. Can you update the initial comment in this PR to explain what refactoring this PR is doing and why? |
@jlewi @zjj2wry @ScorpioCPH I think we can first merge it and It just modify the structure for controller and next step @gaocegege will focus on refactor the logic about reconcile |
/ok-to-test |
@wackxu Thank you I think my high level questions have been addressed. Your plan seems reasonable. @zjj2wry @ScorpioCPH I'll let you take the lead on reviewing this. |
@jlewi LGTM, let's move forward :) |
@wackxu looks like there are some lint issues. Can you please fix? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit but otherwise this is ready to go once lint issues are fixed and travis is passing.
cmd/tf_operator/app/server.go
Outdated
} | ||
name := os.Getenv("MY_POD_NAME") | ||
if len(name) == 0 { | ||
glog.Fatalf("must set env MY_POD_NAME") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove it. I think it's a legacy of the original etcd code.
cmd/tf_operator/app/server.go
Outdated
return fmt.Errorf("failed to get hostname: %v", err) | ||
} | ||
|
||
rl := &resourcelock.EndpointsLock{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some discussion in #263.
cmd/tf_operator/app/server.go
Outdated
|
||
id, err := os.Hostname() | ||
if err != nil { | ||
return fmt.Errorf("failed to get hostname: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix.
/retest |
@jlewi Done, PTAL |
Fixed some merge conflicts; waiting for the tests to rerun before merging. |
…orker. * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
…roken (#308) * In syncTfJob when checking whether a work queue item corresponds to a TrainingJob already in the map we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and then a new job is recreated with the same name. * We need to make sure that the Replicas field in TrainingJob is always properly set; * We were only initializing replicas in setup which was problematic in the case where the TfJob controller gets restarted because on restarted setup won't be invoked because the job is past that phase and as a result the replicas won't be reinitialized. * test_runner needs to ignore case when checking whether the job succeeded otherwise we conclude that successful jobs failed * The controller should only forget about job after the job has been cleaned up; not when it is marked as succeeded or failed. * Add back code to support termination policies use the worker and not the master as the chief *This was added in #221 and accidentally removed in the refactor in #234.
This PR is mainly about refactoring the structure of controller keep style with kubernetes controller. And the main logic of package
trainer
is not modified. We can continue do refactor next step to make the controller stateless.This change is