Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor the controller logic #234

Merged
merged 3 commits into from
Jan 8, 2018
Merged

refactor the controller logic #234

merged 3 commits into from
Jan 8, 2018

Conversation

wackxu
Copy link
Contributor

@wackxu wackxu commented Dec 19, 2017

This PR is mainly about refactoring the structure of controller keep style with kubernetes controller. And the main logic of package trainer is not modified. We can continue do refactor next step to make the controller stateless.


This change is Reviewable

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot. The email used to register you as an authorized contributor must be the email used for the Git commit.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@k8s-ci-robot
Copy link

Hi @wackxu. Thanks for your PR.

I'm waiting for a kubernetes or tensorflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wackxu
Copy link
Contributor Author

wackxu commented Dec 19, 2017

The work is going on. Then it is ok to review, I will call yo @jlewi

@zjj2wry
Copy link
Member

zjj2wry commented Dec 20, 2017

cool, will help review when it ready

@gaocegege
Copy link
Member

gaocegege commented Dec 24, 2017

Awesome, plz @ me if the PR is ready since I implemented a similar controller before and I am glad to help to review it :-)

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just submit the last one commit refactor logic, i think the other commits are the same as #215

@wackxu wackxu changed the title [WIP]refactor the controller logic refactor the controller logic Dec 29, 2017
@wackxu
Copy link
Contributor Author

wackxu commented Dec 29, 2017

This PR is almost done, I need to wait #215 (almost get merge) get merged. Then I will done the following work.

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
Copy link
Member

@zjj2wry zjj2wry Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need this header? @jlewi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

Client: leaderElectionClient.CoreV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: id,
EventRecorder: &record.FakeRecorder{},
Copy link
Member

@zjj2wry zjj2wry Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use real event broadcast? we can get some events in tf-operator endpoint when tf-operator failover.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i created a PR #260 fix this.


id, err := os.Hostname()
if err != nil {
return fmt.Errorf("failed to get hostname: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed ... would be better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix.

fs.IntVar(&s.ChaosLevel, "chaos-level", -1, "DO NOT USE IN PRODUCTION - level of chaos injected into the TfJob created by the operator.")
fs.BoolVar(&s.PrintVersion, "version", false, "Show version and quit")
fs.DurationVar(&s.GCInterval, "gc-interval", 10*time.Minute, "GC interval")
fs.StringVar(&s.ControllerConfigFile, "controller_config_file", "", "Path to file containing the controller config.")
Copy link
Member

@zjj2wry zjj2wry Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's fix controller_config_file to controller-config-file in this PR?

Copy link
Member

@zjj2wry zjj2wry Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wackxu i created a PR #259 fix this. let this PR focus on controller. what you think?

name := os.Getenv("MY_POD_NAME")
if len(name) == 0 {
glog.Fatalf("must set env MY_POD_NAME")
}
Copy link
Member

@zjj2wry zjj2wry Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this in validate func?

@zjj2wry
Copy link
Member

zjj2wry commented Dec 29, 2017

@wackxu will review controller logic when #215 get merged

@@ -1,147 +0,0 @@
// Licensed under the Apache License, Version 2.0 (the "License");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suspect remove this will break dashboard...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref #258

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I think #215 has been merged, could me mind remove the repeated commits please?

@wackxu
Copy link
Contributor Author

wackxu commented Jan 2, 2018

@ScorpioCPH Done, PTAL

@coveralls
Copy link

coveralls commented Jan 2, 2018

Coverage Status

Coverage increased (+2.0%) to 30.515% when pulling 5a2e58f on wackxu:refcon into ed7cae7 on tensorflow:master.

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wackxu Thanks very much for your refactor work! Would you please add more description about what this PR do and why we need this PR, Thanks!

@jlewi Hi, I suggested that we need more design doc/discussion before the following PR review. It will be helpful to reduce duplicating efforts.


import (
"github.com/spf13/pflag"
"time"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use gofmt to format your source code.

Move built-in packages to the beginning of import {} maybe better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, Will address it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScorpioCPH I have run go fmt and it is the right order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line between built-in package time and github.com/spf13/pflag, and retry.

"time"
"k8s.io/client-go/rest"
clientset "k8s.io/client-go/kubernetes"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format this import{} statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

}
name := os.Getenv("MY_POD_NAME")
if len(name) == 0 {
glog.Fatalf("must set env MY_POD_NAME")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why must we set a pod name? I think this is no matter for users, in another word, Kubeflow users don't know what is Pod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I would like to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it. I think it's a legacy of the original etcd code.

@@ -1,194 +1,22 @@
package main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add copyright.

return fmt.Errorf("failed to get hostname: %v", err)
}

rl := &resourcelock.EndpointsLock{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this resourcelock used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is for high availability when we run more than one tensorflow operator.

Copy link
Member

@zjj2wry zjj2wry Jan 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resoucelock have endpoint and configmap type. tf_operator use endpoint type. tf_operator endpoint use annotation store kubernetes leader election information as a optimistic lock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some discussion in #263.

},
}

election.RunOrDie(election.LeaderElectionConfig{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we have more than one Kubeflow controller in a cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can run more than one tensorflow operator for high availability as kubernetes controller manage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only a replecas work. so it is a Cold Backup

@jlewi
Copy link
Contributor

jlewi commented Jan 3, 2018

@wackxu agree with @ScorpioCPH

Can you explain how using informers works and how it compares to the current model?

High level question. Does this refactor make the controller stateless?

The initial design of the controller was stateful such that we maintained state about each TfJob in memory.

This was convenient because it meant that every time the job controller first loaded a job we could perform some expensive operation to determine the state of the job. Then when we periodically (every O(seconds) check the state of the job) we could take advantage of the known state. If the controller is stateless then this could be more problematic because either we'd need to 1) store this state in the ApiServer or 2) recompute the state on each update which could be prohibitively expensive.

@wackxu
Copy link
Contributor Author

wackxu commented Jan 3, 2018

@jlewi You can get details information about how informers works from https://engineering.bitnami.com/articles/a-deep-dive-into-kubernetes-controllers.html

@wackxu
Copy link
Contributor Author

wackxu commented Jan 3, 2018

@jlewi In this refactor, the controller is still stateful. If we want to keep style with kubernetes controller, I think the change is so big that we can do it next step. I just modify the structure of controller. And the main logic of package trainer is not modified

@jlewi
Copy link
Contributor

jlewi commented Jan 3, 2018

@wackxu Thanks for the info its very helpful. Can you update the initial comment in this PR to explain what refactoring this PR is doing and why?

@wackxu
Copy link
Contributor Author

wackxu commented Jan 5, 2018

@jlewi @zjj2wry @ScorpioCPH I think we can first merge it and It just modify the structure for controller and next step @gaocegege will focus on refactor the logic about reconcile

@coveralls
Copy link

coveralls commented Jan 5, 2018

Coverage Status

Coverage increased (+2.0%) to 30.336% when pulling 4a050fc on wackxu:refcon into f7803f1 on tensorflow:master.

@coveralls
Copy link

coveralls commented Jan 5, 2018

Coverage Status

Coverage increased (+2.0%) to 30.336% when pulling 4a050fc on wackxu:refcon into f7803f1 on tensorflow:master.

@zjj2wry
Copy link
Member

zjj2wry commented Jan 5, 2018

/ok-to-test

@jlewi
Copy link
Contributor

jlewi commented Jan 5, 2018

@wackxu Thank you I think my high level questions have been addressed. Your plan seems reasonable.

@zjj2wry @ScorpioCPH I'll let you take the lead on reviewing this.

@ScorpioCPH
Copy link
Member

@jlewi LGTM, let's move forward :)

@jlewi
Copy link
Contributor

jlewi commented Jan 5, 2018

@wackxu looks like there are some lint issues. Can you please fix?

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit but otherwise this is ready to go once lint issues are fixed and travis is passing.

}
name := os.Getenv("MY_POD_NAME")
if len(name) == 0 {
glog.Fatalf("must set env MY_POD_NAME")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it. I think it's a legacy of the original etcd code.

return fmt.Errorf("failed to get hostname: %v", err)
}

rl := &resourcelock.EndpointsLock{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some discussion in #263.


id, err := os.Hostname()
if err != nil {
return fmt.Errorf("failed to get hostname: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix.

@wackxu
Copy link
Contributor Author

wackxu commented Jan 8, 2018

/retest

@coveralls
Copy link

coveralls commented Jan 8, 2018

Coverage Status

Coverage increased (+2.09%) to 30.462% when pulling 43e3fb3 on wackxu:refcon into 190394d on tensorflow:master.

@coveralls
Copy link

coveralls commented Jan 8, 2018

Coverage Status

Coverage increased (+2.2%) to 30.567% when pulling 0bded69 on wackxu:refcon into 190394d on tensorflow:master.

@wackxu
Copy link
Contributor Author

wackxu commented Jan 8, 2018

@jlewi Done, PTAL

@jlewi
Copy link
Contributor

jlewi commented Jan 8, 2018

Fixed some merge conflicts; waiting for the tests to rerun before merging.

@coveralls
Copy link

coveralls commented Jan 8, 2018

Coverage Status

Coverage increased (+2.1%) to 30.551% when pulling 3f8c5f8 on wackxu:refcon into 1f85e46 on tensorflow:master.

@jlewi jlewi merged commit da1a861 into kubeflow:master Jan 8, 2018
jlewi added a commit to jlewi/k8s that referenced this pull request Jan 14, 2018
…orker.

  * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
jlewi added a commit that referenced this pull request Jan 16, 2018
…roken (#308)

* In syncTfJob when checking whether a work queue item corresponds to a TrainingJob already in the map we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and then a new job is recreated with the same name.

* We need to make sure that the Replicas field in TrainingJob is always properly set;

* We were only initializing replicas in setup which was problematic in the case where the TfJob controller gets restarted because on restarted setup won't be invoked because the job is past that phase and as a result the replicas won't be reinitialized.

* test_runner needs to ignore case when checking whether the job succeeded otherwise we conclude
that successful jobs failed

* The controller should only forget about job after the job has been cleaned up; not when it is marked as succeeded or failed.

* Add back code to support termination policies use the worker and not the master as the chief
    *This was added in #221 and accidentally removed in the refactor in #234.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants