Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix caching issues #2847

Merged
merged 3 commits into from
Jun 4, 2018
Merged

Fix caching issues #2847

merged 3 commits into from
Jun 4, 2018

Conversation

mlgibbons
Copy link
Contributor

See #2844

…hich resulted in unreliable image loading and prevented offline execution of "minikube start".
…che in VM which were previously being ignored and leading to unreliable "minikube start" especially in offline mode.
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mlgibbons
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: dlorenc

Assign the PR to them by writing /assign @dlorenc in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 26, 2018
@minikube-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@mlgibbons
Copy link
Contributor Author

/assign @dlorenc

@@ -101,7 +101,7 @@ func runStart(cmd *cobra.Command, args []string) {
clusterBootstrapper := viper.GetString(cmdcfg.Bootstrapper)

if shouldCacheImages {
go machine.CacheImagesForBootstrapper(k8sVersion, clusterBootstrapper)
machine.CacheImagesForBootstrapper(k8sVersion, clusterBootstrapper)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the race condition here, but I'm not quite sure that this is the right fix. We implemented it this way to avoid slowing things down at all for non-offline mode users.

The cache was basically designed to be "best effort", and making it synchronous like this could introduce flakiness and delays for the standard use-case.

Any ideas here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take your point about the standard use case. I've actually found that it makes the start operation more reliable (particularly if a previous one had been terminated) as well as also making debugging much easier as the log is now predictable and deterministic. If the cache loading has not completed before the "docker load" then we end up often with incomplete loads into the cache and - in online mode - re-downloading of the images with I would think a reduction in performance due to the duplication.

One improvement would be to allow the image download to be async but then block on it completing before the call to UpdateCluster(). This would allow it to run in parallel with the VM creation / startup while preventing the incomplete load problem.

Another change would be to change the semantics such that the load is not best effort i.e. if enabled then it either succeeds or fails (as per the patch) and if this behaviour is not desired then the cache-images flag should be set to false and thus the standard use case would not be impacted. This would seems more more logical to me i.e. if caching is enabled then it should complete successfully rather than failing silently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to add a flag such as "mustCacheImages" or such like but I'd prefer to avoid adding more flags of possible and think the other two options are a better approach. Let me know what you think. Cheers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A further thought, and I think the better option, is to make False the default for --cache-images and have the behaviour as coded in the PR. This then makes things very clear and consistent i.e. don't cache images by default but if you do request caching then exit if the caching fails. Possibly also add the async wait before UpdateCluster() which I mentioned earlier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work and thoughtful discussion here! I think I agree with your last proposal. Then we can easily change the default to true if we feel it does make things more stable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A change in behavior from --cache-images default true, to --cache-images default false?
"Feels" like the wrong direction.

I believe the binaries for iso/kubeadm/kubelet are all downloaded to the host, cached & imported/used later. Is it too much of a departure to do the same for images? Caching & blocking at UpdateCluster by default for k8s images seems the "best" option (immediate stability risks ignored).

Sounds like "delay UpdateCluster until cached images are ready" would be preferred but is too risky a default right now?

With switches & defaults - Setting the tone for a future where caching is optimal, reliable & default seems best. (Even if it takes a little while to get there).

My use case includes many lower-bandwidth customers, and "frequent" destroy/starts.
I'd appreciate a config value akin to "kubernetes-version" for "cacheWaitOnKubernetesImages=true" and then in the future once reliable image caching is established - make blocking the default.

Copy link
Contributor Author

@mlgibbons mlgibbons Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. I can understand where you're coming from. I think we can separate this into two distinct items: 1) when used caching needs to be reliable and so the race condition needs to be addressed; 2) whether the image caching should default to true or false.

Item one needs fixed and I think everyone is in agreement one that. The commit I just made addresses that issue. Item two depends on your perspective and I could argue for both cases. The key difference in my mind is that the ISO, kubeadm and kubectl MUST be downloaded to be able to spin up the cluster. The images MAY be downloaded, cached and loaded but don't need to be as the caching is an optimisation. Given that, having them treated differently to the other items makes sense.

I would personally go with having image caching disabled by default for a number of reasons including that it's one less thing to worry about if you don't need it and I've spent too much time deleting the incomplete images in the cache dir created after doing Ctrl-C during cache population. The commit I made has the default set to False - I read your comment after the commit was made - but it could be either depending on what you value most and your use case i.e. there is no "right" default setting.

…to cluster. Default imageing caching to false
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 1, 2018
Copy link
Contributor

@dlorenc dlorenc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like the way this turned out. One last nit/question.

if shouldCacheImages {
fmt.Println("Waiting for image caching to complete...")
if err := groupCacheImages.Wait(); err != nil {
glog.Errorln("Error caching images: ", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning instead of an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. I would suggest that it should be an error: if I asked it to cache images and it failed then that is an error which I would like to investigate as lack of the cached images in subsequent calls to "start" could be end up with the daemon downloading the images without any indication that it is doing that. Also, if there is a problem with caching and an error is returned I can easily disable the caching. With a warning it's too easy to miss this all of this happening.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. SGTM.

@dlorenc
Copy link
Contributor

dlorenc commented Jun 1, 2018

@minikube-bot OK to test

@mlgibbons
Copy link
Contributor Author

/cc dlorenc

@k8s-ci-robot k8s-ci-robot requested a review from dlorenc June 3, 2018 03:24
if shouldCacheImages {
fmt.Println("Waiting for image caching to complete...")
if err := groupCacheImages.Wait(); err != nil {
glog.Errorln("Error caching images: ", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. SGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants