-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resume Experiment from Volume #1275
Resume Experiment from Volume #1275
Conversation
I tested example, everything was working. Since we are using PV with local host path after deletion Suggestion PV and PVC, directory on cluster will be not deleted. |
Hi @andreyvelich , after a careful review, I am still confused about how we persist/retrieve suggestion instance to/from PV. Since in restartSuggestion function, we fetch suggestion instance via kubernetes client directly. |
@sperlingxx Thank you for the review. When user submits Experiment, we deploy Suggestion service and deployment with attached PVC on it. katib/pkg/controller.v1beta1/suggestion/suggestion_controller.go Lines 147 to 157 in 44b0ae6
We don't delete PVC and it saves information from the Suggestion deployment. Ones user wants to restart Experiment, we change Suggestion status:
Then it automatically runs ReconcileSuggestion
PV/PVC is not created again because we didn't delete it. If user deletes Experiment, it automatically clean-up all Suggestion resources (deployment/service/pvc/pv) since we have owner reference for each object. |
reason := "Experiment is succeeded" | ||
// If ResumePolicy = Never, mark suggestion status succeeded, can't be restarted | ||
if instance.Spec.ResumePolicy == experimentsv1beta1.NeverResume { | ||
msg := "Suggestion is succeeded, can't be restarted" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we add restart message when suggestion is terminated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for additional information after Suggestion is succeeded for the user.
@johnugeorge You think it is not necessary?
|
@andreyvelich Thanks for detailed explanation. So, we assume suggestion programs can recover from disk (or maybe they are stateless) in |
Yes, later add functionality to set StorageClassName from the |
@johnugeorge Not sure if it is needed for the Volume, since we have Owner Reference for the volume: https://github.com/andreyvelich/katib/blob/issue-1250-resume-from-volume/pkg/controller.v1beta1/suggestion/composer/composer.go#L288. |
I renamed PVC to -, PV to -- to be consistent with Suggestion's deployment and service. |
Cool! |
f8753eb
to
0f67032
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
Fixes: #1250.
This is implementation for resuming experiment from the Volume.
Steps:
I added new
ResumePolicy=FromVolume
which represents resuming experiment from the volume.@gaocegege @johnugeorge @sperlingxx what do you think about the name? Should we rename it to
ResumePolicy=VolumeSource
or think about better name, maybe renameLongRunning
also?I added
ResumePolicy
to Suggestion API to understand if controller needs to deploy volume for the Suggestion.Do we need to add some volume specification to Suggestion API also.
For example: StorageClassName, PVC name ?
If
ResumePolicy=VolumeSource
, PVC is created. IfStorageClassName=DefaultSuggestionStorageClass
, PV with local cluster path is created also.I didn’t attach Experiment labels to PVC and PV, like we did here: https://github.com/andreyvelich/katib/blob/issue-1250-resume-from-volume/pkg/controller.v1beta1/suggestion/composer/composer.go#L74 (it was done to connect service with deployment).
Do we want to see these labels in the PVC/PV?
Currently, for the PV default cluster local volume path is:
prefix + <suggestion name>-<suggestion namespace>
. We need to add name and namespace to not conflict with other submitted Experiments in various namespaces.Prefix can be changed in
katib-config
later.Currently, for suggestion container default path is
/opt/katib/data
. Also can be changed inkatib-config
.PVC name =
<suggestion name>
PV name =
<suggestion name>-<suggestion namespace>
I added new validation for the webhook. That is needed to prevent submitting Experiment with the same name, if corresponding PV was not deleted after deleting Experiment.
We can't
Watch
for the PV, like for thePVC
. And suggestion controller can't be triggered and create PV again, once PV is deleted from the cluster.I had to change suggestion status workflow to allow resuming experiment from the volume.
This change doesn't affect current workflow for
LongRunning
Suggestion.For the
ResumePolicy=VolumeSource
:When User submits Experiment, Suggestion status:
Created (status = True) -> DeploymentReady (status = False) -> DeploymentReady (status = True) -> Running (status = True) -> (experiment is finished) -> Running (status = False) -> DeploymentReady (status = False) -> Succeeded (status = True) .
When User restarts Experiment, Suggestion status:
Running (status = True, reason="Experiment is restarting") -> (remove succeeded condition) -> DeploymentReady (status = True) -> Running (status = True, reason="SuggestionRunning")) -> (experiment is finished) -> Running (status = False) -> DeploymentReady (status = False) -> Succeeded (status = True).
This approach can avoid adding additional condition for Suggestion:
Restarting
.What do you think @gaocegege @johnugeorge @sperlingxx ?
I still need to make some testing, but this PR is ready for review.
I will add unit test and e2e test in the future PR.
/assign @gaocegege @johnugeorge @sperlingxx
/cc @jlewi