refactor the TfJob to use Informer and Controller #206
In general, I'd much prefer to reuse existing, well-maintained libraries than custom code, so I'm all for replacing custom code with a more established library. Are there specific changes you are thinking of making? |
Hi, we (caicloud) are implementing a controller for Kubeflow which is similar to kubernetes/sample-controller. We would be glad to file a proposal for the new design soon. |
@gaocegege I am sorry, but I have almost finished it already and will submit a PR ASAP. |
@jlewi We can make the change step by step. The first phase mainly includes the following: |
@wackxu nice work! Are you targeting only a refactoring of the code base, or do you also plan to change the API? |
That's awesome. Are you planning on submitting a single PR or splitting it up into multiple PRs?
|
@wackxu It doesn't matter, it is awesome 😄 |
Fixes #206.
1. Refactor the spec package to the standard Kubernetes API layout. All definitions of our CRD live in pkg/apis/tensorflow, along with validation, defaulting for the CRD, and some helper functions.
2. To use the Kubernetes code generator, we import the project https://github.com/kubernetes/code-generator and add some shell scripts in the hack folder. You can see how to use it at https://blog.openshift.com/kubernetes-deep-dive-code-generation-customresources/
3. All files in pkg/client are auto-generated and should not be modified manually.
4. I do not plan to change the logic to use the SharedInformer and clientset in this PR. This PR is big enough; I can do that in a follow-up PR.
5. SharedInformer is a helpful tool with a cache that we can use to list and watch our resources. For more details, see http://borismattijssen.github.io/articles/kubernetes-informers-controllers-reflectors-stores
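For illustration, here is a minimal sketch of what versioned CRD types under pkg/apis/tensorflow might look like with the code-generator markers. The group/version, field names, and file path below are assumptions for the example, not the actual tensorflow/k8s definitions.

```go
// pkg/apis/tensorflow/v1alpha1/types.go (hypothetical path for this sketch)
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// TFJob is an illustrative versioned CRD type; code-generator reads the
// markers above and emits the clientset, informers, and listers for it.
type TFJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   TFJobSpec   `json:"spec,omitempty"`
	Status TFJobStatus `json:"status,omitempty"`
}

// TFJobSpec describes the desired state; the field here is a placeholder.
type TFJobSpec struct {
	Replicas int32 `json:"replicas,omitempty"`
}

// TFJobStatus describes the observed state; the field here is a placeholder.
type TFJobStatus struct {
	State string `json:"state,omitempty"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// TFJobList is the list type the generated clientset returns.
type TFJobList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []TFJob `json:"items"`
}
```

Defaulting, validation, and other helpers would live alongside these types, and running the shell scripts in hack/ regenerates everything under pkg/client.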
Reopening this because it was autoclosed but I think there are PRs still pending related to this. |
#234 refactors the logic, but there are some other things that should be done to migrate to the new controller design. We implemented a simple controller at https://github.com/caicloud/kubeflow-controller and are glad to help the upstream do the rest of the work. |
@gaocegege Merging the caicloud work seems like a great idea; I think the caicloud controller has many advantages over the current implementation. A couple of questions/comments:
|
@jlewi @gaocegege The caicloud kubeflow-controller is similar to the Kubernetes job controller, and it is a better implementation. I think we can do it in the next PR, and I am very glad to help review it. That change is quite big, so I think we should first merge #234, which mainly changes the structure of the code base. |
Thanks for your review. I will write a design doc ASAP and file an issue to discuss the differences between the caicloud controller and the upstream controller. And I agree with @wackxu that we should focus on #234 and keep moving.
There are some changes in the caicloud TFJob CRD; @DjangoPeng is summarising them and will file an issue to discuss them.
Right now it is simple: the controller checks the statuses of the workers, and if all workers have succeeded, we consider the TFJob succeeded.
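As a rough illustration of that success rule (the helper name and the use of pod phases here are assumptions for the sketch, not the actual controller code), the aggregation could look like:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
)

// allWorkersSucceeded is a hypothetical helper: the TFJob is treated as
// succeeded only when every worker pod has reached the Succeeded phase.
func allWorkersSucceeded(workers []corev1.Pod) bool {
	if len(workers) == 0 {
		return false
	}
	for _, w := range workers {
		if w.Status.Phase != corev1.PodSucceeded {
			return false
		}
	}
	return true
}
```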
Yeah, I will write a design doc and post it in this issue.
We have an internal discussion, too, in https://github.com/caicloud/kubeflow-controller/issues/71. We have not reached a conclusion yet and welcome comments :-) Personally, I think that if most of the use cases can be handled by Job, then Job is better, since we would not have to manage the whole lifecycle of the pods, which is already implemented in the job controller.
Yeah, and I think @zjj2wry is implementing the feature for tensorflow/k8s: #260. Right now we have limited support in the caicloud controller: the recorder records the creation of pods and services. For example, these are the events when we create 4 workers and 2 PS:
And we could support more events if we need. |
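For context, this is roughly how a controller wires up client-go's EventRecorder to emit such creation events; the component name, reason string, and helper names below are made up for the sketch.

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder builds an EventRecorder that broadcasts events to the API
// server; "tfjob-controller" is an illustrative component name.
func newRecorder(kubeClient kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: kubeClient.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "tfjob-controller"})
}

// recordPodCreated emits a Normal event against the owning TFJob object
// whenever the controller creates one of its pods.
func recordPodCreated(recorder record.EventRecorder, owner runtime.Object, podName string) {
	recorder.Eventf(owner, corev1.EventTypeNormal, "SuccessfulCreate", "Created pod: %s", podName)
}
```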
@jlewi Thanks for your reply. I'd like to add a few details.
In the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. As for the open source work, we'll eventually combine our own TFJob CRD with the upstream one. It's necessary to file an issue to discuss the details, and I'll do that after my company work.
The design and lifecycle of a local job are well defined and easy to understand, because a local job is similar to a single batch job. But a distributed job is more complicated and its status is up to the user, so we have to discuss the strategy and logic.
We're discussing that choice. Personally speaking, I prefer Pod. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365 |
I added some explanation about the implementation, and as @wackxu said it is similar to the job controller in Kubernetes. Before #234 is merged, I will fix some issues in our own implementation, and I will file a PR to contribute the implementation upstream after #234 if you think it is generally acceptable :-) |
Are you planning on using a single CRD or having different CRDs for different types of jobs? @DjangoPeng so is your plan to upstream to tensorflow/k8s any changes/improvements needed by Caicloud and then just use the TfJob controller in tensorflow/k8s? Do you foresee a need to maintain a downstream fork in caicloud/kubeflow-controller long term? |
Sorry for the late reply. We hope we can use the upstream controller eventually. We will discuss the changes and try to merge the CRD changes into the upstream, or we will rethink our own CRD spec design if we cannot come to an agreement. As the README in caicloud/kubeflow-controller says, it is just temporary, and we hope the controller can help accelerate the development of tensorflow/k8s. |
Sounds great thanks. |
Is there more work to be done to use Informer and Controller? |
I think the work is done, and other things are in other issues, e.g. #314 |
I am closing the issue since we have already finished it. Thanks again for your awesome contribution, @wackxu |
Kubernetes provides many useful tools that we can use to develop our project, such as SharedInformer, clientset, lister, and so on. Using these makes our project more readable and improves performance. We also need a versioned API, which makes it easier for us to stay backwards compatible and keep the code closer to the Kubernetes conventions.
We can refactor the project along the lines of the example at https://github.com/kubernetes/sample-controller.
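Roughly, the shape sample-controller uses is a shared informer feeding a rate-limited workqueue that reconcile workers drain. A minimal sketch of that loop is below; it watches Pods with the core clientset because the generated TFJob clientset and informers are assumed rather than shown, and the kubeconfig path is a placeholder.

```go
package main

import (
	"time"

	kubeinformers "k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Placeholder kubeconfig path; in-cluster config would be used in practice.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	kubeClient := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer factory caches and serves list/watch results.
	factory := kubeinformers.NewSharedInformerFactory(kubeClient, 30*time.Second)
	podInformer := factory.Core().V1().Pods()

	// A rate-limited workqueue decouples event handlers from reconciliation.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	if !cache.WaitForCacheSync(stopCh, podInformer.Informer().HasSynced) {
		panic("failed to sync informer caches")
	}

	// A real controller runs several workers; each pops a key, looks the
	// object up via the lister, and reconciles it. Reconciliation is omitted.
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		_ = key // reconcile(key) would go here
		queue.Done(key)
	}
}
```

The TFJob informer and lister generated by code-generator would slot in the same way as the Pod informer does here.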
I have done some work on this already. What do you think about it? @jlewi