Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete non-LRU cache in SPIRE Agent #5383

Merged
merged 22 commits into from
Sep 27, 2024

Conversation

amoore877
Copy link
Member

@amoore877 amoore877 commented Aug 13, 2024

Pull Request check list

[X] Commit conforms to CONTRIBUTING.md?
[X] Proper tests/regressions included?
[X] Documentation updated?

Affected functionality

Remove the original caching implementation in SPIRE Agent, as well as size configuration. Redo of #5184

Description of change

Per plan in #4224 , perform final update actions by making LRU Cache with size 1000 the default, non-configurable implementation.

Also per #4224 , not for release until at least 1.11 based on #5150 , which deprecates the options but does not delete them, and went out in 1.10.

Which issue this PR fixes

Fixes #4224

@amoore877
Copy link
Member Author

hmm little confused how https://github.com/spiffe/spire/actions/runs/10378374004/job/28734731515?pr=5383#step:9:1301 is failing:

[2024-08-13T23:06:18Z] executing 05-fetch-x509-svids...
[2024-08-13T23:06:18Z] Expected 10 X.509-SVIDs and received 10 for uid 1001

we want 10 and got 10. so should be good?

@amoore877
Copy link
Member Author

hmm little confused how https://github.com/spiffe/spire/actions/runs/10378374004/job/28734731515?pr=5383#step:9:1301 is failing:

[2024-08-13T23:06:18Z] executing 05-fetch-x509-svids...
[2024-08-13T23:06:18Z] Expected 10 X.509-SVIDs and received 10 for uid 1001

we want 10 and got 10. so should be good?

ahh the test is setting cache size 8 and expecting a drop

@amoore877
Copy link
Member Author

[2024-08-14T14:39:55Z] executing 04-ban-agent...
[2024-08-14T14:39:55Z] banning agent...
Error: rpc error: code = NotFound desc = agent not found
[2024-08-14T14:39:55Z] step 04-ban-agent failed

hmm separate from this PR I think this integration test might be flaky; we may not be doing enough checking that the agent is fully up and in the SPIRE DB

@azdagron azdagron added this to the 1.11.0 milestone Aug 15, 2024
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
@amoore877 amoore877 force-pushed the delete_non_lru_for_reals branch from c531e41 to b148c8f Compare August 19, 2024 19:51
@@ -120,10 +120,6 @@ type experimentalConfig struct {
UseSyncAuthorizedEntries bool `hcl:"use_sync_authorized_entries"`

Flags fflag.RawConfig `hcl:"feature_flags"`

UnusedKeyPositions map[string][]token.Pos `hcl:",unusedKeyPositions"`
X509SVIDCacheMaxSize int `hcl:"x509_svid_cache_max_size"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't keep this setting but promote it out of experimental. It has some implications for existing deployments, namely that the initial fetch of a SVID may become slower now if there are more active workloads than 1000. It can become noticeable for the case where you run lots of short lived workloads.

Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?

so this is actually what is being executed. 1.10 already has a merged PR #5150 for deprecating this feature (via warnings in logs). This PR is planned for 1.11 to finally remove it entirely.

I'm sure if there are raised complaints wrt to the deprecation and removal, SPIRE maintainers and the community would be happy to discuss. You can raise this in SPIFFE slack or in contributor sync

It can become noticeable for the case where you run lots of short lived workloads.

fwiw, and this of course would vary significantly based on operator environment, my organization's own load testing showed on the version this feature was introduced at most 1-3s of fetchx509svid latency experienced by consumers when the cache size was set to 1k and the agent was being cycled through 26k registrations.

I'd be curious to know for your own environment how registrations are being managed for short-lived- are they always statically registered? if not, are they aggressively or lazily culled? from a security perspective it's best to have an agent only be assigned precisely the workload identities that are expected to be running with it in that exact moment, though of course there are windows where there would be excess dependent on strategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just something I wanted to raise as a potential issue as I've ran into a similar issue with other pieces of software. 1-3 seconds for something that is supposed to be started lots of times, for example batch jobs, can add up to a lot. With the probably more common use case of longer lived services I don't think a couple of seconds of start up delay is going to matter that much.

We don't mandate short-lived vs long-lived, but registration entries would usually be towards the longer lived end so I don't expect to have any issues. There are some plans for using it in a place with shorter lived workloads, so I'll see at that point if I'll actually have any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't keep this setting but promote it out of experimental

The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating x509_svid_cache_max_size in v1.10.0 and remove it in v1.11.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that catches some of the more subtle way this changes deployments. For a while people have been used to thinking that the agent caches X509-SVIDs for all workloads. This means that if spire-server becomes unavailable for some time, you still have up to X509-SVID TTL/2 to fix it since the agent can serve svids from the cache.

This now changes it so that the agent caches up to 1000 X509-SVIDs, but no more until the workloads actually start and start requesting SVIDs. This means that the time to fix things can be shortened drastically. 1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

I don't want to hard block this, but I think it's worth having the maintainers consider this a bit more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sorindumitru totally understand :) I definitely appreciate wanting to protect global operator experience and the benefit of open source is being able to have these sorts of conversations and concerns raised.

1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.

iiuc , there is generally a desire from maintainers to limit just how many levers and knobs SPIRE has. It can already be difficult for some groups to adopt it, and each control is a potential point of confusion or breakage as well as a new dimension and permutation of deployment that maintainers need to support and receive issue/feature requests about. so that is a key motivator to removing this configuration. there is also definitely a split in operator groups- one that takes the binaries / source code as-is, and one that internally forks it to make adjustments (such as on hard-coded values as the most trivial example).

fwd: @amartinezfayo - if there's more to add not covered here or in #5383 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sorindumitru and @amoore877 for all the feedback and thoughts on this.
I think that the last comment from @amoore877 reflects pretty well what the maintainer's group analyzes when a new setting is b being introduced.
We have discussed this again in our last maintainer's call, and after reevaluating this, we feel that it would be prudent to keep the x509_svid_cache_max_size setting. The concerns pointed by @sorindumitru are some valid concerns. At this point, I personally think that being in a situation where you would need to adjust the cache size seems to be a lot more problematic than the fact that there is a new setting that can be tweaked, mostly considering that there will be a default value and users will not need to be aware of this setting if they don't need to adjust it. Having a proper documentation explaining when you may need to use this setting (e.g. when the agent handles more than 1000 active workloads) will help.
I think that we can promote it as a stable setting. @amoore877 Could you update the PR to reflect that?

Copy link
Member Author

@amoore877 amoore877 Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

yet another PR (to reduce how much has to be reviewed) would then promote the setting out of experimental.

would we still be removing the non-LRU cache code?

would be good to update #4224 with the full goal state from maintainers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?

Since the x509_svid_cache_max_size setting will not belong to the experimental section anymore, I think that the deprecation notice was OK, because the setting will not be recognized there anymore. In terms of closing this PR, I was thinking that it could just be updated to move the x509_svid_cache_max_size setting out of the experimental section, being now a stable configurable. But if you prefer to handle that in a separate PR, that would work also.

would we still be removing the non-LRU cache code?

Yes, no changes with that, we are removing the non-LRU cache code in v1.11.0.

would be good to update #4224 with the full goal state from maintainers

I've updated #4224 accordingly.

pkg/agent/manager/cache/lru_cache.go Outdated Show resolved Hide resolved
pkg/agent/manager/cache/lru_cache.go Outdated Show resolved Hide resolved
pkg/agent/manager/cache/lru_cache.go Outdated Show resolved Hide resolved
pkg/agent/manager/cache/workload.go Outdated Show resolved Hide resolved
@@ -120,10 +120,6 @@ type experimentalConfig struct {
UseSyncAuthorizedEntries bool `hcl:"use_sync_authorized_entries"`

Flags fflag.RawConfig `hcl:"feature_flags"`

UnusedKeyPositions map[string][]token.Pos `hcl:",unusedKeyPositions"`
X509SVIDCacheMaxSize int `hcl:"x509_svid_cache_max_size"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we shouldn't keep this setting but promote it out of experimental

The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating x509_svid_cache_max_size in v1.10.0 and remove it in v1.11.0.

amoore877 and others added 4 commits September 11, 2024 11:04
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
amoore877 and others added 4 commits September 11, 2024 11:30
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
@amoore877
Copy link
Member Author

another PR will follow adjusted plan

@amoore877 amoore877 closed this Sep 25, 2024
@amoore877 amoore877 reopened this Sep 27, 2024
@amoore877
Copy link
Member Author

amoore877 commented Sep 27, 2024

adjusted plan for expediency on release from talking to @amartinezfayo :

  • this PR will be merged
  • maintainers will add config for cache size in a new PR

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @amoore877 for this!
We will be opening a PR adding the x509_svid_cache_max_size setting.

@amartinezfayo amartinezfayo merged commit 182b594 into spiffe:main Sep 27, 2024
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable SPIRE Agent LRU Cache
4 participants