Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addons auto-pause: Remove memory leak & add configurable interval #17936

Merged
merged 5 commits into from
Feb 20, 2024

Conversation

spowelljr
Copy link
Member

@spowelljr spowelljr commented Jan 12, 2024

What this PR accomplishes:

Memory Leak

For details on the memory leak see the issue #10596.

A fix for the memory leak was originally implemented by @vishjain in #11909. The PR successful resolved the memory leak but introduced a bug where once the pods were paused kubectl commands failed with Unable to connect to the server: EOF, the PR was reverted in #17866.

I reimplemented the fix following @vishjain's main concept of using a Ticker in place of multiple Timers. Once the addon is enabled the ticker is started with a one minute interval, if any kubectl commands are executed before the one minute countdown concludes the ticker is reset and starts again. Once the one minute countdown concludes the ticker is indefinitely stopped and the pods are paused. If a kubectl command is executed while the pods are paused, the pods are unpaused, and then the ticker is reset and the cycle continues.

Configurable Interval

A PR to implement a configurable interval for auto-pause was originally implemented by @wzslr321 in #17070. The PR had some issues related to loading configs that resulted in the auto-pause binary terminating and the addon not working, but all the logic around storing and configuring the interval value was well done and able to be reused. The PR was reverted in #17866 where more details on the config loading bug can be found.

For background, the auto-pause binary, which is where the ticker is, is its own separate binary from minikube. The binary exists inside the Kicbase/ISO itself and is ran via systemd. So in order the pass the interval value in I had to add a flag value to the auto-pause binary, and then when the auto-pause addon is enabled the interval flag is passed into the systemd unit file.

ExecStart=/bin/auto-pause --container-runtime={{.ContainerRuntime}} --interval={{.AutoPauseInterval}}

The interval can be configured in one of two ways, --auto-pause-interval during minikube start or minikube addons configure auto-pause. Both ways have validation to ensure a valid duration string is provided, and that the duration is greater than 0s, a negative or 0s duration will cause a panic in the Ticker.

Follow up work

There's an issue with running minikube addons configure auto-pause, the value is successfully saved, but the value is not immediately updated in the auto-pause addon itself. The systemd unit file is updated, but then a systemctl daemon-reload & systemctl restart auto-pause.service both have to be run for the update to be reflected in the binary. Due to the already large size of this PR I'll leave it for a follow up.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 12, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spowelljr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 12, 2024
@spowelljr spowelljr changed the title WIP: Reimplement reverted auto-pause PRs with fixes Reimplement reverted auto-pause PRs with fixes Jan 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2024
@spowelljr spowelljr changed the title Reimplement reverted auto-pause PRs with fixes Addons auto-pause: Remove memory leak & add configurable interval Jan 12, 2024
@spowelljr
Copy link
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 12, 2024
@spowelljr
Copy link
Member Author

ok-to-build-iso

@spowelljr spowelljr closed this Jan 12, 2024
@spowelljr spowelljr reopened this Jan 12, 2024
@spowelljr
Copy link
Member Author

ok-to-build-image

@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

This comment has been minimized.

@minikube-pr-bot

This comment has been minimized.

@spowelljr spowelljr added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Feb 15, 2024
@spowelljr
Copy link
Member Author

ok-to-build-image

@spowelljr
Copy link
Member Author

ok-to-build-iso

@minikube-bot
Copy link
Collaborator

Hi @spowelljr, we have updated your PR with the reference to newly built kicbase image. Pull the changes locally if you want to test with them or update your PR further.

@minikube-bot
Copy link
Collaborator

Hi @spowelljr, we have updated your PR with the reference to newly built ISO. Pull the changes locally if you want to test with them or update your PR further.

@spowelljr
Copy link
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 16, 2024
@minikube-pr-bot
Copy link

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17936) |
+----------------+----------+---------------------+
| minikube start | 50.9s    | 50.9s               |
| enable ingress | 27.6s    | 32.6s               |
+----------------+----------+---------------------+

Times for minikube start: 51.1s 51.4s 48.5s 53.6s 50.0s
Times for minikube (PR 17936) start: 51.5s 49.1s 55.0s 49.3s 49.7s

Times for minikube (PR 17936) ingress: 27.7s 30.1s 49.7s 28.6s 27.1s
Times for minikube ingress: 27.6s 28.7s 28.1s 27.1s 26.7s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17936) |
+----------------+----------+---------------------+
| minikube start | 23.8s    | 22.8s               |
| enable ingress | 21.0s    | 21.6s               |
+----------------+----------+---------------------+

Times for minikube start: 25.1s 21.6s 25.1s 22.1s 25.0s
Times for minikube (PR 17936) start: 22.0s 22.6s 22.3s 21.8s 25.3s

Times for minikube ingress: 20.8s 20.8s 20.8s 20.8s 21.8s
Times for minikube (PR 17936) ingress: 20.8s 21.8s 22.8s 20.8s 21.8s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17936) |
+----------------+----------+---------------------+
| minikube start | 22.1s    | 21.8s               |
| enable ingress | 35.1s    | 35.1s               |
+----------------+----------+---------------------+

Times for minikube start: 23.4s 20.6s 23.1s 23.3s 20.4s
Times for minikube (PR 17936) start: 23.5s 20.5s 20.6s 20.7s 23.6s

Times for minikube ingress: 31.3s 32.4s 47.3s 32.3s 32.3s
Times for minikube (PR 17936) ingress: 31.3s 32.3s 32.3s 48.3s 31.3s

@minikube-pr-bot
Copy link

These are the flake rates of all failed tests.

Environment Failed Tests Flake Rate (%)
Docker_Linux_docker_arm64 TestIngressAddonLegacy/serial/ValidateIngressAddonActivation (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestIngressAddonLegacy/StartLegacyK8sCluster (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestKubernetesUpgrade (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestStartStop/group/old-k8s-version/serial/DeployApp (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestStartStop/group/old-k8s-version/serial/EnableAddonWhileActive (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestStartStop/group/old-k8s-version/serial/FirstStart (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestStartStop/group/old-k8s-version/serial/SecondStart (gopogh) 0.00 (chart)
Docker_Linux_docker_arm64 TestStartStop/group/old-k8s-version/serial/UserAppExistsAfterStop (gopogh) 0.00 (chart)
Docker_Linux TestIngressAddonLegacy/serial/ValidateIngressAddonActivation (gopogh) 0.00 (chart)
Docker_Linux TestIngressAddonLegacy/serial/ValidateIngressDNSAddonActivation (gopogh) 0.00 (chart)
Docker_Linux TestIngressAddonLegacy/StartLegacyK8sCluster (gopogh) 0.00 (chart)
Docker_Linux TestKubernetesUpgrade (gopogh) 0.00 (chart)
Docker_Linux TestStartStop/group/old-k8s-version/serial/DeployApp (gopogh) 0.00 (chart)
Docker_Linux TestStartStop/group/old-k8s-version/serial/EnableAddonWhileActive (gopogh) 0.00 (chart)
Docker_Linux TestStartStop/group/old-k8s-version/serial/FirstStart (gopogh) 0.00 (chart)
Docker_Linux TestStartStop/group/old-k8s-version/serial/SecondStart (gopogh) 0.00 (chart)
Docker_Linux TestStartStop/group/old-k8s-version/serial/UserAppExistsAfterStop (gopogh) 0.00 (chart)
KVM_Linux TestFunctional/parallel/DashboardCmd (gopogh) 0.00 (chart)
KVM_Linux TestFunctional/parallel/MountCmd/specific-port (gopogh) 0.00 (chart)
Docker_Linux_crio_arm64 TestFunctional/parallel/PersistentVolumeClaim (gopogh) 6.45 (chart)
Docker_Linux_crio_arm64 TestIngressAddonLegacy/serial/ValidateIngressAddonActivation (gopogh) 6.45 (chart)
Docker_Linux_crio TestFunctional/parallel/ImageCommands/ImageLoadDaemon (gopogh) 12.90 (chart)
Hyper-V_Windows TestMultiNode/serial/DeleteNode (gopogh) 33.33 (chart)
Docker_macOS TestSkaffold (gopogh) 41.38 (chart)

To see the flake rates of all tests by environment, click here.

@medyagh medyagh merged commit f3d93fe into kubernetes:master Feb 20, 2024
27 of 42 checks passed
@spowelljr spowelljr deleted the reimplementAutoPause branch February 20, 2024 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants