-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expired Agent data left in SPIRE database indefinitely #1836
Expired Agent data left in SPIRE database indefinitely #1836
Comments
Thanks for bring this up @rturner3! This is a very tricky problem, especially in the face of TOFU node attestation. A solution will likely be fairly involved, possibly querying node attestor plugins about whether or not a particular attested agent is still "alive" in order to determine if the associated data can be safely cleared. Unfortunately, a "dirty" datastore with lots of stale information does put some extra burden on certain solutions for entry authorization crawl performance that we've been pursuing, like the full in-memory cache. There was some discussion about a short term bandaid that would prevent an attested node with an expired SVID from being loaded in the cache, which would solve the memory usage issue but wouldn't help with the time it takes to hydrate the cache. I think this issue warrants deeper thought. Certainly we need to come up with something for this. |
Can we start with the non-TOFU agents? add to SIG-SPIRE agenda? |
Core doesn't currently distinguish or know the difference between tofu and non-tofu attestors... but, it may need to in the future: #2203 Teaching core about this somehow is probably a pre-requisite for starting this cleanup? |
@evan2645 can tofu/non-tofu attestation be inferred from AttestationDataType? aws_iid, azure_msi, and gcp_iit are explicitly called out in the documentation as tofu. If/when we no longer require trust on first use on cloud nodes I suspect we would deliver that functionality as new node attestation plugins with new AttestationDataType? |
@bri365 Hmm yes I think that could work for builtins, but we wouldn't be able to clean up after non-builtins. I also think there are likely to be cases where a plugin may have configurable TOFU based on how its deployed. I do think it makes sense for core to know some stuff about a plugin, in general. @azdagron @amartinezfayo any thoughts here on how to accomplish this? Perhaps a TOFU bool on the node attestor response? We can track this in attested node entries, and it will be very useful if we decide to tackle #2203 |
Do we expect a given node attestor to be configured for TOFU or not for all agents, or is this something that would ever be determined on a per agent basis? If the former (which I believe feels safer), adding a TOFU bool to nodeattestor.NodeAttestor and retrieving that from GetNodeAttestorNamed() should work well, and we would not need to store it in every agent record. If the latter, then we do need to store it for every agent. |
We've talked about TOFU being configurable on some node attestors (i.e. x509pop), so I don't think we could reasonably rely on a runtime property of the current "TOFU-ness" of a nodeattestor to evaluate the "TOFU-ness" of agents previously attested with that node attestor (since it could have changed). |
Re-opening this because it was unintentionally closed by GitHub when the linked spire-plugin-sdk PR was merged. GitHub tried to be a little too clever and closed this issue because it picked up the phrase "resolve #1836" in the PR. |
Just an update, here's what's left to do on this task:
|
update: we now have all the parts of this except an actual command line for deleting the old entries |
Hey @dfeldman, are you working on this? |
Hey there, I'm interested in resolving this issue. Can someone provide me with a bit of context about the work that's left to be done? is it just the new command line for deleting stale attested node entries? |
I have a question about what was the consensus that we got here; from what I understood, we will proceed by creating a CLI command that will enable the operator to trigger the non-TOFU stale node agent entries cleanup, right? @rturner3 mentioned a periodic purging process, but from what I can see here, this periodic purge logic would be delegated to the operator. Is this correct? |
So, we discussed this issue in the last contributor sync (2/14/23), and here is what we have: This issue will continue with the closed scope of implementing the action that purges non-TOFU stale agent entries. The periodic process for purging and the stale TOFU node attestation will be treated separately (in other issues). |
Data in the
attested_node_entries
andnode_resolver_map_entries
tables is left in the database indefinitely. There are cases when this data may no longer be valuable to retain, such as when the rows correspond to an expired Agent that attested using aNodeAttestor
plugin that does not have a trust on first use security model, such assshpop
orx509pop
. In those cases, the Agents corresponding to thespiffe_id
in those rows can safely re-attest with SPIRE Server even if this data is deleted from the database.In long-running environments, this can result in a lot of stale data in the database that ultimately hurts SPIRE Server query performance. When the overall size of these tables expands beyond the maximum message size in gRPC, the only way to clean up this old data seems to be manually purging data in the database with SQL
DELETE FROM
statements. It is not necessarily intuitive to an operator of SPIRE that they might need to manually clean up this data from the database.It would be nice if SPIRE Server did some periodic purging of this old data for expired Agents that attested with a non-TOFU
NodeAttestor
plugin to simplify the operational burden of monitoring the table sizes and manually purging the data on an ongoing basis.The text was updated successfully, but these errors were encountered: