
Cache and singleflight requests to kubelet #5408

Merged Aug 26, 2024 (1 commit)

Conversation

@knisbet (Contributor) commented on Aug 20, 2024

Add microcaching and merging of parallel requests to kubelet in the k8s workload attestor.

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

Reduces memory usage of the agent plugin for k8s workload attestation. #5111

Description of change

  • Add a micro-cache lasting half the retry interval to reduce requests to kubelet and parsing of the JSON response (see the sketch after this list).
  • Use x/sync/singleflight to merge parallel kubelet requests into a single request.
  • Scope creep: add a context with a basic 10s timeout on HTTP requests to kubelet.
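
For illustration, here is a minimal sketch of how the micro-cache and singleflight fit together. This is not SPIRE's actual code: the type names, the "pod-list" key, and the 500ms retry interval are assumptions for the example.

```go
// Sketch only; identifiers and the retry interval are illustrative.
package sketch

import (
	"context"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type podList struct{} // stands in for the parsed kubelet response

type plugin struct {
	mu            sync.Mutex
	cachedPodList *podList
	group         singleflight.Group
}

// The cache TTL is half the attestor's retry interval, so a subsequent
// retry never observes the same cached response.
const retryInterval = 500 * time.Millisecond

func (p *plugin) getPodList(ctx context.Context) (*podList, error) {
	// Serve from the micro-cache when a recent response is available.
	p.mu.Lock()
	cached := p.cachedPodList
	p.mu.Unlock()
	if cached != nil {
		return cached, nil
	}

	// singleflight merges concurrent callers into one kubelet request;
	// every waiting caller receives the same result.
	v, err, _ := p.group.Do("pod-list", func() (any, error) {
		list, err := p.fetchFromKubelet(ctx)
		if err != nil {
			return nil, err
		}
		p.mu.Lock()
		p.cachedPodList = list
		p.mu.Unlock()
		// Expire the cached response after half the retry interval.
		go func() {
			time.Sleep(retryInterval / 2)
			p.mu.Lock()
			p.cachedPodList = nil
			p.mu.Unlock()
		}()
		return list, nil
	})
	if err != nil {
		return nil, err
	}
	return v.(*podList), nil
}

func (p *plugin) fetchFromKubelet(ctx context.Context) (*podList, error) {
	// Placeholder for the HTTP GET against the kubelet /pods endpoint.
	return &podList{}, nil
}
```

Because each cached response is dropped after half the retry interval, two consecutive retries can never be served by the same cache entry, which is the reasoning given in the notes further down.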

Which issue this PR fixes

Fixes #5111

@knisbet (Contributor, Author) commented on Aug 20, 2024

I hope no one minds. I wanted to play around with SPIRE over the weekend, and I saw that the issue about bursts in memory usage hadn't had any activity since May, so I thought I'd hack on it a bit and see what turned out.

Testing was done by scaling the deployment in https://github.com/spiffe/spire-tutorials/blob/main/k8s/quickstart/client-deployment.yaml to 100 replicas and then deleting all pods in the namespace a couple of times.

On my homelab:

  • Peak memory usage went from 513 MiB to 57 MiB.
  • Requests to kubelet went from 3028 to 103.

Notes:

  • I suspect this may also help with Replace JSON library for targeted parsing of the Kubelet response #5109, as the JSON parsing library isn't exercised as heavily with this change.
  • By caching for half the retry interval, I don't see a reason that more complicated expiry behaviour would be needed, as suggested in Consider caching the Kubelet response #5111, since subsequent retries should never use the same cached response. But please check in case I missed a reason.
  • Last, I couldn't find a context or timeout on the kubelet HTTP client. I might've missed it somewhere, so I threw one in with a fairly generous 10-second timeout to ensure a timeout is applied (a sketch follows these notes).
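
A minimal sketch of bounding the kubelet request with that 10-second timeout. The function name, the client setup (TLS, auth), and the URL handling are assumptions for the example, not SPIRE's actual code.

```go
// Sketch only; SPIRE's real client configuration is omitted.
package sketch

import (
	"context"
	"io"
	"net/http"
	"time"
)

func fetchPods(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	// Apply a generous 10s ceiling so a hung kubelet cannot stall
	// attestation indefinitely.
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```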


```go
p.cachedPodList = podList

go func() {
```
@azdagron (Member) commented on this diff:

In practice, kicking off this goroutine without any sort of lifetime management should be OK given its current responsibilities, but in general we push for goroutine hygiene as much as possible. If we can, it would be good to make sure this goroutine does not outlive the plugin, e.g. when the plugin is being unloaded.

One way to accomplish this is to:

  1. Implement io.Closer on *Plugin. The plugin framework will invoke Close() when the plugin is unloaded.
  2. Add a wait group to the plugin that tracks the lifetime of this goroutine, and have Close() wait on it.

It might be overkill, considering this goroutine will only live for at most ~250ms at the time of close, but bonus points if you want to add a context that can be cancelled inside Close() to bring it down immediately.
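
A minimal sketch of this suggestion, with hypothetical names (the Plugin fields and expireCacheAfter) standing in for the PR's actual identifiers:

```go
// Sketch only; field and method names are assumptions.
package sketch

import (
	"context"
	"sync"
	"time"
)

type Plugin struct {
	wg     sync.WaitGroup
	ctx    context.Context // cancelled by Close
	cancel context.CancelFunc

	mu            sync.Mutex
	cachedPodList any
}

func New() *Plugin {
	ctx, cancel := context.WithCancel(context.Background())
	return &Plugin{ctx: ctx, cancel: cancel}
}

// expireCacheAfter clears the cache after d, but returns early when the
// plugin is closed, so the goroutine never outlives the plugin.
func (p *Plugin) expireCacheAfter(d time.Duration) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		select {
		case <-time.After(d):
		case <-p.ctx.Done(): // Close was called; exit immediately
		}
		p.mu.Lock()
		p.cachedPodList = nil
		p.mu.Unlock()
	}()
}

// Close implements io.Closer; the plugin framework invokes it when the
// plugin is unloaded. Cancelling the context brings the goroutine down
// right away, and the wait group guarantees it has exited before return.
func (p *Plugin) Close() error {
	p.cancel()
	p.wg.Wait()
	return nil
}
```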

@knisbet (Contributor, Author) replied:

Agreed, I should be able to post a fix this afternoon. I didn't bother for the same reasons you indicated; it shouldn't take long to add and test.

@knisbet (Contributor, Author) replied:

I added the io.Closer and a context. I added the context to the struct, but I understand that storing a context in a struct can be debatable across projects; happy to swap it out if there is a preference not to use a context within a struct in this project.

Also, I didn't make the Close function wait for any in-progress Attest calls to complete. I didn't see that behaviour in any other plugins, and I didn't deeply investigate whether the service cancels and waits for in-progress Attest calls before unloading plugins. If an Attest call can still be in progress, there is a small window where the cache will be saved and then immediately released. I'm happy to address this if you'd like.

@azdagron (Member) commented:

Thanks very much for this, @knisbet! There are several people who will be very excited about this change. I dropped one comment on goroutine lifetimes, but other than that this is looking great.

@knisbet requested a review from @azdagron on August 25, 2024
@knisbet (Contributor, Author) commented on Aug 25, 2024

@azdagron I think I've addressed your concern. Also, I don't recall seeing anything in the contributor docs about whether there is a preference to squash my changes into a single commit. Let me know.

@azdagron (Member) left a review:

Looks great! Thanks again, @knisbet.

@azdagron (Member) commented:

As far as squashing, the maintainers typically do that when we merge.

SPIRE requires DCO sign-off on all commits. Would you mind amending your commits and adding that?

…8s workload attestor

Adds a short-lived cache for the responses from Kubelet, reducing memory and CPU usage of the k8s workload attestor plugin.

Signed-off-by: Kevin Nisbet <kevin.nisbet+github@xybyte.com>
@knisbet force-pushed the kevin/5111-cache-kubelet-requests branch from a9a0d73 to d0d1228 on August 25, 2024
@knisbet (Contributor, Author) commented on Aug 25, 2024

> SPIRE requires DCO sign-off on all commits. Would you mind amending your commits and adding that?

Oops, sorry, my bad. Should be fixed now.

@azdagron merged commit 48397e9 into spiffe:main on Aug 26, 2024 (34 checks passed)
@knisbet deleted the kevin/5111-cache-kubelet-requests branch on August 26, 2024
@MarcosDY added this to the 1.11.0 milestone on Aug 27, 2024