
refactor the TfJob to use Informer and Controller #206

Closed

wackxu opened this issue Dec 8, 2017 · 20 comments · Fixed by #215

@wackxu
Contributor

wackxu commented Dec 8, 2017

Kubernetes provides many useful building blocks that we can use in this project, such as SharedInformer, clientset, and lister. Using them would make the code more readable and give better performance. We should also version the API, which makes it easier to stay backwards compatible and keeps the code closer to the Kubernetes conventions.
We could refactor the project along the lines of https://github.com/kubernetes/sample-controller.
I have already done some of this work. What do you think? @jlewi

@jlewi
Contributor

jlewi commented Dec 8, 2017

In general, I'd much prefer to reuse existing, well-maintained libraries than custom code. So I'm all for replacing custom code with a more established library.

Are there specific changes you are thinking of making?

@gaocegege
Member

Hi, we (Caicloud) are implementing a controller for kubeflow similar to kubernetes/sample-controller, and we would be glad to file a proposal for the new design soon.

@wackxu
Contributor Author

wackxu commented Dec 9, 2017

@gaocegege I'm sorry, but I have almost finished this already and will submit a PR ASAP.

@wackxu
Contributor Author

wackxu commented Dec 9, 2017

@jlewi We can make the change step by step. The first phase mainly includes the following:
1. Refactor the spec package to follow the Kubernetes standard API, making the CRD definition versioned so it is easier to maintain and keep backwards compatible.
2. Import code-generator to help us generate the SharedInformer, lister, clientset, and so on. We will then use these to refactor the controller.
3. Use the SharedInformer to list and watch CRDs. It caches the CRD data, so it is more efficient, reduces the amount of code, and makes it more readable (see the sketch after this list).
4. Replace the existing client with the generated clientset, and so on.
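
A minimal sketch of what steps 2 and 3 could look like, modeled on kubernetes/sample-controller. The generated import paths (pkg/client/...) and the kubeconfig path are assumptions for illustration, not the project's actual layout:

```go
package main

import (
	"time"

	kubeinformers "k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"

	// Assumed import paths for the code-generator output; the real layout may differ.
	tfjobclientset "github.com/tensorflow/k8s/pkg/client/clientset/versioned"
	tfjobinformers "github.com/tensorflow/k8s/pkg/client/informers/externalversions"
)

func main() {
	// Build a rest.Config from a kubeconfig; in-cluster config would also work.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Typed clients: one for built-in resources, one generated for the TFJob CRD.
	kubeClient := kubernetes.NewForConfigOrDie(cfg)
	tfJobClient := tfjobclientset.NewForConfigOrDie(cfg)

	// Shared informer factories list/watch each resource once and serve all
	// consumers from an in-memory cache instead of hitting the API server per read.
	kubeInformerFactory := kubeinformers.NewSharedInformerFactory(kubeClient, 30*time.Second)
	tfJobInformerFactory := tfjobinformers.NewSharedInformerFactory(tfJobClient, 30*time.Second)

	stopCh := make(chan struct{})
	defer close(stopCh)

	// The controller would obtain its TFJob informer and lister from the factory
	// here, register event handlers, and then run its workers.

	kubeInformerFactory.Start(stopCh)
	tfJobInformerFactory.Start(stopCh)
}
```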

@ddysher
Member

ddysher commented Dec 9, 2017

@wackxu nice work!

Are you only targeting a refactor of the code base, or do you also plan to change the API?

@jlewi
Contributor

jlewi commented Dec 9, 2017 via email

@gaocegege
Member

@wackxu No problem at all, it's awesome 😄

@jlewi jlewi changed the title refactor the k8s project refactor the TfJob to use K8s libraries Dec 12, 2017
@jlewi jlewi changed the title refactor the TfJob to use K8s libraries refactor the TfJob to use Informer and Controller Dec 12, 2017
jlewi pushed a commit that referenced this issue Dec 29, 2017
fix #206

1. Refactor the spec package to follow the Kubernetes standard API. The full definition of our CRD now lives in pkg/apis/tensorflow, together with the validation, defaulting for the CRD, and some helper functions.

2. To use the K8s code generator, we import the project https://github.com/kubernetes/code-generator and add some shell scripts in the hack folder. You can see how to use it at https://blog.openshift.com/kubernetes-deep-dive-code-generation-customresources/

3. All the files in pkg/client are auto-generated and should not be modified manually.

4. I do not plan to change the logic to use the SharedInformer and clientset in this PR. This PR is big enough; I can do that in a follow-up PR (a rough sketch of that controller pattern is below).

5. SharedInformer is a helpful tool with a cache that we can use to list and watch our resources. For more details, see http://borismattijssen.github.io/articles/kubernetes-informers-controllers-reflectors-stores
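
For the follow-up work in point 4, a rough sketch of the usual sample-controller pattern: event handlers on the TFJob informer push object keys onto a rate-limited workqueue, and a worker loop pops keys and reconciles them against the cached state. The names here (runController, syncTFJob, the informer argument) are illustrative assumptions, not the project's actual code.

```go
package controller

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// runController wires a SharedIndexInformer for the TFJob CRD to a rate-limited
// workqueue and runs a single worker loop. The informer and the syncTFJob
// reconcile function are passed in; both stand in for the real generated types.
func runController(tfJobInformer cache.SharedIndexInformer, syncTFJob func(key string) error, stopCh <-chan struct{}) {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "tfjob")
	defer queue.ShutDown()

	// Every add/update/delete of a TFJob is reduced to a namespace/name key.
	enqueue := func(obj interface{}) {
		// DeletionHandlingMetaNamespaceKeyFunc also handles tombstone objects on delete.
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	tfJobInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})

	go tfJobInformer.Run(stopCh)
	if !cache.WaitForCacheSync(stopCh, tfJobInformer.HasSynced) {
		return
	}

	// Worker loop: pop keys and reconcile from the informer's cache.
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := syncTFJob(key); err != nil {
			queue.AddRateLimited(key) // retry with backoff on failure
		} else {
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```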
@jlewi jlewi reopened this Jan 3, 2018
@jlewi
Contributor

jlewi commented Jan 3, 2018

Reopening this because it was autoclosed but I think there are PRs still pending related to this.

@gaocegege
Member

gaocegege commented Jan 3, 2018

#234 refactors the logic, but there are some other things that should be done to migrate to the new controller design.

We implemented a simple controller, https://github.com/caicloud/kubeflow-controller, and would be glad to help upstream with the rest of the work.

@jlewi
Contributor

jlewi commented Jan 3, 2018

@gaocegege Merging the caicloud work seems like a great idea; I think the caicloud controller has many advantages over the current implementation. A couple of questions/comments:

  1. https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non local job. What's the difference?

  2. How does the caicloud controller determine when a job is done?

  3. Could you explain the proposed design? Here are some questions, but I think it might be better if you could write up an explanation that could eventually be included in our docs, as opposed to answering the questions directly.

    • It looks like it's event driven. Can you explain this? For example, is the controller handling events for the sub resources (e.g. pods and services) and not just TfJob events?

    • It looks like events are created by comparing current status to expectations. Can you explain how this works in terms of deciding whether a pod needs to be created?

    • How do you deal with the fact that workers and pods are stateful?

  4. Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see Consider how we manage replicas (stateful sets, managing pods directly) #45).

  5. Looks like the caicloud controller is publishing Kubernetes events, which is definitely something we should do.

@wackxu
Contributor Author

wackxu commented Jan 4, 2018

@jlewi @gaocegege The caicloud kubeflow-controller is similar to the Kubernetes job controller, and it is a better implementation. I think we can adopt it in the next PR, and I would be glad to help review it. That change is quite big, so I think we should first merge #234, which mainly restructures the controller and cmd packages to match the style of Kubernetes controllers. In the next PR we can work on the reconcile logic, like the caicloud kubeflow-controller.

@ghost

ghost commented Jan 4, 2018

@jlewi @wackxu

Thanks for your review, I will write a design doc ASAP and file an issue to discuss the differences between caicloud controller and the upstream controller. And I agree with @wackxu that we should focus on #234 and keep moving.

https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non-local job. What's the difference?

There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.

How does the caicloud controller determine when a job is done?

Right now it is simple: the controller checks the statuses of the workers, and if all the workers have succeeded, we consider the TFJob succeeded.
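
A trivial sketch of that rule (the type and phase names are placeholders, not the actual caicloud controller types):

```go
package controller

// WorkerPhase stands in for whatever per-replica status the controller tracks.
type WorkerPhase string

const WorkerSucceeded WorkerPhase = "Succeeded"

// isTFJobSucceeded encodes the rule described above: the TFJob is considered
// succeeded once every worker replica has succeeded.
func isTFJobSucceeded(workerPhases []WorkerPhase) bool {
	for _, phase := range workerPhases {
		if phase != WorkerSucceeded {
			return false
		}
	}
	return true
}
```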

Could you explain the proposed design? Here are some questions, but I think it might be better if you could write up an explanation that could eventually be included in our docs, as opposed to answering the questions directly.

Yeah, I will write a design doc and post in this issue.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We have an internal discussion about this too, in https://github.com/caicloud/kubeflow-controller/issues/71. We have not reached a conclusion yet and welcome comments :-)

Personally, I think that if most of the use cases can be handled by Job, then Job is better, since we would not have to manage the whole lifecycle of the pods, which is already implemented in the job controller.

Looks like caicloud controller is publishing Kubernetes events which is definitely something we should do

Yeah, and I think @zjj2wry is implementing that feature for tensorflow/k8s: #260.

For now the caicloud controller has limited support: the recorder records the creation of pods and services.

For example, events when we create 4 workers and 2 PS:

Events:
  Type    Reason            Age                From                 Message
  ----    ------            ----               ----                 -------
  Normal  SuccessfulCreate  33s                kubeflow-controller  Created service: dist-training-job-worker-0-d88wg
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-1-hwl26
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-2-jx7g7
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-3-gdb66
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-0-nlfdp
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-1-jw7c4
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-2j9br
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-pm85l
  Normal  SuccessfulCreate  31s                kubeflow-controller  Created pod: dist-training-job-5rgvh
  Normal  SuccessfulCreate  29s (x3 over 31s)  kubeflow-controller  (combined from similar events): Created pod: dist-training-job-n48d7

And we could support more events if needed.
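
For reference, a small sketch of how such a recorder can be wired up with client-go (the component name and helper function names are assumptions, and the TFJob CRD type must also be registered in the scheme for the object reference to resolve; the caicloud controller's actual code may differ):

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires an EventRecorder that writes events (like the
// SuccessfulCreate entries above) to the API server under the given component name.
func newRecorder(kubeClient kubernetes.Interface, component string) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: kubeClient.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: component})
}

// recordPodCreated emits a Normal/SuccessfulCreate event on the owning TFJob;
// tfJob is assumed to be the CRD object (registered in the scheme) that owns the pod.
func recordPodCreated(recorder record.EventRecorder, tfJob runtime.Object, podName string) {
	recorder.Eventf(tfJob, corev1.EventTypeNormal, "SuccessfulCreate", "Created pod: %v", podName)
}
```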

@DjangoPeng
Member

@jlewi Thanks for your reply. I'd like to add a few details.

There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.

On the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. For the open-source work, we'll eventually merge our own TFJob CRD with the upstream one. It's necessary to file an issue to discuss the details, and I'll do that after my company work.

How does the caicloud controller determine when a job is done?

The design and lifecycle of a local job are deterministic and easy to understand, because a local job is similar to a single batch job. But a distributed job is more complicated, and its status depends on the user, so we have to discuss the strategy and logic.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We're still discussing that choice. Personally speaking, I prefer managing Pods. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365

@gaocegege
Member

gaocegege commented Jan 5, 2018

https://github.com/gaocegege/kubeflow-controller/blob/a54021d250c6e532f3a6e32be32e37e1249c0780/docs/design_doc.md#implementation

I added some explanation about the implementation, and as @wackxu said, it is similar to the job controller in Kubernetes.

Before #234 is merged, I will fix some issues in our own implementation, and I will file a PR to contribute the implementation upstream after #234 if you think it is generally acceptable :-)

@jlewi
Contributor

jlewi commented Jan 5, 2018

On the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. For the open-source work, we'll eventually merge our own TFJob CRD with the upstream one. It's necessary to file an issue to discuss the details, and I'll do that after my company work.

Are you planning on using a single CRD or having different CRDs for different types of jobs?

@DjangoPeng So is your plan to upstream any changes/improvements needed by Caicloud to tensorflow/k8s and then just use the TfJob controller in tensorflow/k8s? Do you foresee a need to maintain a downstream fork in caicloud/kubeflow-controller long term?

@gaocegege
Member

gaocegege commented Jan 10, 2018

@jlewi

Sorry for the late reply.

We hope we can use the upstream controller eventually. We will discuss the changes and try to merge the CRD changes into the upstream, or we will rethink our own CRD spec design if we cannot come to an agreement.

As the README in caicloud/kubeflow-controller says, it is just temporary, and we hope the controller can help accelerate the development of tensorflow/k8s.

@jlewi
Contributor

jlewi commented Jan 10, 2018

Sounds great, thanks.

@jlewi
Contributor

jlewi commented Jan 25, 2018

Is there more work to be done to use Informer and Controller?

@gaocegege
Member

I think the work is done; the remaining items are tracked in other issues, e.g. #314.

@gaocegege
Member

I am closing this issue since we have already finished it. Thanks again for your awesome contribution, @wackxu!
