-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete non-LRU cache in SPIRE Agent #5383
Delete non-LRU cache in SPIRE Agent #5383
Conversation
hmm little confused how https://github.com/spiffe/spire/actions/runs/10378374004/job/28734731515?pr=5383#step:9:1301 is failing:
we want 10 and got 10. so should be good? |
ahh the test is setting cache size 8 and expecting a drop |
hmm separate from this PR I think this integration test might be flaky; we may not be doing enough checking that the agent is fully up and in the SPIRE DB |
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
c531e41
to
b148c8f
Compare
@@ -120,10 +120,6 @@ type experimentalConfig struct { | |||
UseSyncAuthorizedEntries bool `hcl:"use_sync_authorized_entries"` | |||
|
|||
Flags fflag.RawConfig `hcl:"feature_flags"` | |||
|
|||
UnusedKeyPositions map[string][]token.Pos `hcl:",unusedKeyPositions"` | |||
X509SVIDCacheMaxSize int `hcl:"x509_svid_cache_max_size"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we shouldn't keep this setting but promote it out of experimental. It has some implications for existing deployments, namely that the initial fetch of a SVID may become slower now if there are more active workloads than 1000. It can become noticeable for the case where you run lots of short lived workloads.
Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe keep it as experimental in case someone complains and remove after it a few releases if nobody complains?
so this is actually what is being executed. 1.10 already has a merged PR #5150 for deprecating this feature (via warnings in logs). This PR is planned for 1.11 to finally remove it entirely.
I'm sure if there are raised complaints wrt to the deprecation and removal, SPIRE maintainers and the community would be happy to discuss. You can raise this in SPIFFE slack or in contributor sync
It can become noticeable for the case where you run lots of short lived workloads.
fwiw, and this of course would vary significantly based on operator environment, my organization's own load testing showed on the version this feature was introduced at most 1-3s of fetchx509svid latency experienced by consumers when the cache size was set to 1k and the agent was being cycled through 26k registrations.
I'd be curious to know for your own environment how registrations are being managed for short-lived- are they always statically registered? if not, are they aggressively or lazily culled? from a security perspective it's best to have an agent only be assigned precisely the workload identities that are expected to be running with it in that exact moment, though of course there are windows where there would be excess dependent on strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just something I wanted to raise as a potential issue as I've ran into a similar issue with other pieces of software. 1-3 seconds for something that is supposed to be started lots of times, for example batch jobs, can add up to a lot. With the probably more common use case of longer lived services I don't think a couple of seconds of start up delay is going to matter that much.
We don't mandate short-lived vs long-lived, but registration entries would usually be towards the longer lived end so I don't expect to have any issues. There are some plans for using it in a place with shorter lived workloads, so I'll see at that point if I'll actually have any issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we shouldn't keep this setting but promote it out of experimental
The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating x509_svid_cache_max_size
in v1.10.0 and remove it in v1.11.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that catches some of the more subtle way this changes deployments. For a while people have been used to thinking that the agent caches X509-SVIDs for all workloads. This means that if spire-server becomes unavailable for some time, you still have up to X509-SVID TTL/2 to fix it since the agent can serve svids from the cache.
This now changes it so that the agent caches up to 1000 X509-SVIDs, but no more until the workloads actually start and start requesting SVIDs. This means that the time to fix things can be shortened drastically. 1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.
I don't want to hard block this, but I think it's worth having the maintainers consider this a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sorindumitru totally understand :) I definitely appreciate wanting to protect global operator experience and the benefit of open source is being able to have these sorts of conversations and concerns raised.
1000 is an arbitrary number, it would still be nice to have the ability to specify how many workloads you want to have in the cache by default.
iiuc , there is generally a desire from maintainers to limit just how many levers and knobs SPIRE has. It can already be difficult for some groups to adopt it, and each control is a potential point of confusion or breakage as well as a new dimension and permutation of deployment that maintainers need to support and receive issue/feature requests about. so that is a key motivator to removing this configuration. there is also definitely a split in operator groups- one that takes the binaries / source code as-is, and one that internally forks it to make adjustments (such as on hard-coded values as the most trivial example).
fwd: @amartinezfayo - if there's more to add not covered here or in #5383 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sorindumitru and @amoore877 for all the feedback and thoughts on this.
I think that the last comment from @amoore877 reflects pretty well what the maintainer's group analyzes when a new setting is b being introduced.
We have discussed this again in our last maintainer's call, and after reevaluating this, we feel that it would be prudent to keep the x509_svid_cache_max_size
setting. The concerns pointed by @sorindumitru are some valid concerns. At this point, I personally think that being in a situation where you would need to adjust the cache size seems to be a lot more problematic than the fact that there is a new setting that can be tweaked, mostly considering that there will be a default value and users will not need to be aware of this setting if they don't need to adjust it. Having a proper documentation explaining when you may need to use this setting (e.g. when the agent handles more than 1000 active workloads) will help.
I think that we can promote it as a stable setting. @amoore877 Could you update the PR to reflect that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?
yet another PR (to reduce how much has to be reviewed) would then promote the setting out of experimental.
would we still be removing the non-LRU cache code?
would be good to update #4224 with the full goal state from maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on that plan, would it not be better to close/abandon this PR and then another PR reverts #5150 which marked the setting as deprecated?
Since the x509_svid_cache_max_size
setting will not belong to the experimental section anymore, I think that the deprecation notice was OK, because the setting will not be recognized there anymore. In terms of closing this PR, I was thinking that it could just be updated to move the x509_svid_cache_max_size
setting out of the experimental section, being now a stable configurable. But if you prefer to handle that in a separate PR, that would work also.
would we still be removing the non-LRU cache code?
Yes, no changes with that, we are removing the non-LRU cache code in v1.11.0.
would be good to update #4224 with the full goal state from maintainers
I've updated #4224 accordingly.
@@ -120,10 +120,6 @@ type experimentalConfig struct { | |||
UseSyncAuthorizedEntries bool `hcl:"use_sync_authorized_entries"` | |||
|
|||
Flags fflag.RawConfig `hcl:"feature_flags"` | |||
|
|||
UnusedKeyPositions map[string][]token.Pos `hcl:",unusedKeyPositions"` | |||
X509SVIDCacheMaxSize int `hcl:"x509_svid_cache_max_size"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we shouldn't keep this setting but promote it out of experimental
The LRU SVID cache has been enabled by default since v1.9.0, using the cache size of 1000. Since we haven't heard from users the need of tuning it with a different value, we went ahead with the plan to deprecating x509_svid_cache_max_size
in v1.10.0 and remove it in v1.11.0.
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
Signed-off-by: amoore877 <andrew.s.moore@uber.com>
another PR will follow adjusted plan |
adjusted plan for expediency on release from talking to @amartinezfayo :
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @amoore877 for this!
We will be opening a PR adding the x509_svid_cache_max_size
setting.
Pull Request check list
[X] Commit conforms to CONTRIBUTING.md?
[X] Proper tests/regressions included?
[X] Documentation updated?
Affected functionality
Remove the original caching implementation in SPIRE Agent, as well as size configuration. Redo of #5184
Description of change
Per plan in #4224 , perform final update actions by making LRU Cache with size 1000 the default, non-configurable implementation.
Also per #4224 , not for release until at least 1.11 based on #5150 , which deprecates the options but does not delete them, and went out in 1.10.
Which issue this PR fixes
Fixes #4224