Agent manager retries sync during init #4479
Conversation
Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization. Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Hi @nweisenauer-sap, just in case you didn't know, there is some work currently in flight to reduce the overall propagation delay of entries and the network traffic required to keep the agent and server in sync (#3496, #2182). Since you mentioned that the problem you're seeing is with an agent configured with a higher sync interval, I'm curious whether the solutions to the issues I referenced might address any concerns you have with using the default agent sync interval of 5 seconds. Long term, we hope to remove the [...]

It would also be helpful if you could share how much of an issue this is for you in your deployment and why you decided to change the agent's sync interval.
Hi @rturner3, here's some more info about our setup and why we came to the conclusion that we need to increase the sync_interval: Our target is to support many thousands of workloads/containers (each workload represents a container; this is not a K8s environment) where each container contains a SPIRE Agent and the actual workload. The sync_interval parameter seems to be a good match for our use case. Configuring a value like 24h, or even 7–30 days, would work for this setup and achieve the goals of (1) reducing the load on the server (coming from getBundle and getAuthorizedEntries) and (2) making sure that the agent is still able to get a new trust bundle if required. The upcoming solutions (#3496, #2182) will not address the goal of reducing the request rate. We would also be open to considering further alternatives for reducing the request rate of getBundle and getAuthorizedEntries, in case you have other suggestions. Cheers
Thanks for explaining your use case a bit more, and sorry for the delayed response.

One thing I would say first is that running the agent as a sidecar process is not really endorsed as a supported deployment pattern at the moment. There has been some recent preliminary discussion about possible solutions for cases where there are limiting factors that prevent running a host agent: #4459 (comment). I would encourage you to bring some of your requirements and environment limitations into that discussion to see how we can find a more well-supported path for this in the project (not sure if the issue I referenced is related to your deployment).

Going back to this PR specifically though, I have one concern with the changes proposed here. As currently implemented in this PR, the agent's API endpoints will not be available while the agent is retrying its sync due to not receiving SVIDs on init. This can lead to other issues upstream, such as health checks failing (which may result in restart or rollback of the agent) or workloads reporting extended errors while the agent is unavailable. I think it will be safer to do this retry after the agent has initialized its endpoints to avoid these kinds of problems.

In general, it should be valid for an agent to be up and healthy without having any authorized identities cached yet. This can be the case in several scenarios, such as during bootstrap of a node/SPIRE deployment, or when no workloads requiring identity are running on a host. If we move the retries to after the endpoints are started, we can still guarantee that an agent can report itself as healthy while waiting to receive a non-empty update from the server for authorized entries/SVIDs.
Agent manager retries synchronization every 5 seconds (defaultSyncInterval) if 0 SVIDs are present in the agent's cache after the last sync. Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Hi @rturner3, I added a commit to address the problem of delaying the initialization of the agent. With the newest changes the agent's initialization remains unchanged, but the synchronization loop now runs every 5 seconds (or every sync_interval seconds, whichever is lower) while the number of currently cached SVIDs in the agent is zero. This makes configuring very large sync_intervals viable and should not change the behavior of the agent when default or low sync_intervals are configured.

Regarding your comment about running the agent as a sidecar: while we are running in Kubernetes environments as well, the scenario I described above (which benefits from high sync_intervals) is not a Kubernetes deployment. There is actually only one agent per "node", even though my description may have been a bit confusing. Nevertheless, for our K8s environment we are interested in exploring a sidecar approach as mentioned in #4459 and will join the discussion there. Cheers
pkg/agent/manager/manager.go
Outdated
default:
    m.synchronizeBackoff.Reset()
}

if m.cache.CountSVIDs() == 0 {
    syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
    syncInterval = m.synchronizeBackoff.NextBackOff()
}
Suggested change:

Before:

default:
    m.synchronizeBackoff.Reset()
}
if m.cache.CountSVIDs() == 0 {
    syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
    syncInterval = m.synchronizeBackoff.NextBackOff()
}

After:

default:
    m.synchronizeBackoff.Reset()
    syncInterval = m.synchronizeBackoff.NextBackOff()
    // Clamp the sync interval to the default value when the agent doesn't have any SVIDs cached
    // AND the previous sync request succeeded
    if m.cache.CountSVIDs() == 0 {
        syncInterval = min(syncInterval, defaultSyncInterval)
    }
}
This will bypass backoffs when the agent doesn't have any SVIDs cached and the previous sync failed. We need to be pretty careful to not introduce a potential DDoS scenario here in cases where the server is degraded or down for some period of time. See #4490 for some context of what can happen when the agent doesn't backoff properly.
I proposed an alternative way of doing this clamping that will still allow backoffs to work as intended.
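To make the distinction concrete, here is a small standalone Go sketch of the behavior described above. It is not SPIRE's actual manager code: toyBackoff and nextSyncInterval are simplified stand-ins, and the min builtin assumes Go 1.21+. It only illustrates why clamping on a successful-but-empty sync keeps the backoff intact when the server is degraded:

// Standalone illustration of the proposed clamping (not SPIRE code).
package main

import (
    "errors"
    "fmt"
    "time"
)

const defaultSyncInterval = 5 * time.Second

// toyBackoff doubles the interval on every call; the real backoff also caps
// at a maximum interval, which is omitted here for brevity.
type toyBackoff struct {
    base, next time.Duration
}

func (b *toyBackoff) Reset() { b.next = b.base }

func (b *toyBackoff) NextBackOff() time.Duration {
    d := b.next
    b.next *= 2
    return d
}

// nextSyncInterval mirrors the shape of the suggested "default" branch:
// failures keep backing off, successes reset the backoff, and the interval
// is clamped only while the agent has zero SVIDs cached.
func nextSyncInterval(b *toyBackoff, syncErr error, cachedSVIDs int) time.Duration {
    if syncErr != nil {
        return b.NextBackOff() // degraded/unreachable server: keep growing
    }
    b.Reset()
    syncInterval := b.NextBackOff()
    if cachedSVIDs == 0 {
        syncInterval = min(syncInterval, defaultSyncInterval)
    }
    return syncInterval
}

func main() {
    configured := time.Hour // e.g. a deployment with sync_interval = 1h
    b := &toyBackoff{base: configured, next: configured}

    fmt.Println(nextSyncInterval(b, nil, 0))                       // 5s: success but cache still empty
    fmt.Println(nextSyncInterval(b, errors.New("unreachable"), 0)) // 2h0m0s: backoff keeps growing despite empty cache
    fmt.Println(nextSyncInterval(b, nil, 3))                       // 1h0m0s: success with SVIDs cached, no clamp
}

The second call is the DDoS concern from the comment above: with the clamping applied before the error check, an agent with an empty cache would hammer a down server every 5 seconds instead of backing off.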
pkg/agent/manager/manager_test.go
Outdated
// set sync interval to a high value to prove that the synchronizer retries sync
// with the lower default interval in case 0 entries are returned
m.c.SyncInterval = 100 * defaultSyncInterval
This can be removed in favor of setting this property when the Config struct is first initialized, which is already being done in the test on line 1010.
pkg/agent/manager/manager_test.go
Outdated
require.Equal(t, initTime, m.GetLastSync())

// keep clk moving so that synchronizer retries sync
go func(ctx context.Context) {
This is a bit inefficient because it will be continuously executed and may have some timing races with the main goroutine. A safer and more efficient way to coordinate between the test goroutine and the goroutine inside the manager calling synchronize() would be to pass a test hook function, e.g.

Inside manager.go:

type manager struct {
    ...
    synchronizeCh chan<- struct{}
}

...

func (m *manager) runSynchronizer(ctx context.Context) error {
    for {
        select {
        ...
        }
        err := m.synchronize(ctx)
        m.synchronizeCh <- struct{}{}
        ...
    }
}
In this test:

m := newManager(c)
syncCh := make(chan struct{}, 1)
m.synchronizeCh = syncCh
...
m.runSynchronizer()
...
// Force a synchronization
clk.Add(defaultSyncInterval)
// Block on synchronization completing
<-syncCh
// Repeat as necessary...
I agree with changing the test so it no longer has races and submitted a new variant in my most recent commit; however, I did not adopt the proposed solution with the sync channel. The issue I saw with it is that it would block the production code unless something drained the channel, as the test does. Additionally, it would not really help in getting the runSynchronizer function past the select statement.
Instead, I decided to reduce complexity and just use a real-time clock paired with a very low defaultSyncInterval setting for the test.
* restored manager.go initialize, removed handylSyncError func
* clamp the sync interval to the default value when the agent doesn't have any SVIDs cached AND the previous sync request succeeded
* moved syncInterval assignment in manager_test
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>

* improved TestSyncRetriesWithDefaultIntervalOnZeroSVIDSReturned
* use realtime clock instead of fake clock to remove race condition
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@@ -294,6 +297,13 @@ func (m *manager) runSynchronizer(ctx context.Context) error {
    m.c.Log.WithError(err).Error("Synchronize failed")
default:
    m.synchronizeBackoff.Reset()
    syncInterval = m.synchronizeBackoff.NextBackOff()
Isn't this going to result in the backoff never changing? syncInterval will not change when err != nil, and we are getting NextBackOff() right after Reset(), so syncInterval will never change.
Good catch @MarcosDY, I missed adding a call to NextBackOff() in the event of an error.
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
pkg/agent/manager/manager.go
Outdated
@@ -34,6 +34,9 @@ const (
    synchronizeMaxInterval = 8 * time.Minute
)

// default sync interval is used between retries of initial sync
var defaultSyncInterval = 5 * time.Second
This can be a const
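For reference, a minimal sketch of the suggestion, assuming the value simply joins the existing const block shown in the diff above (illustrative only, not the merged code):

package manager

import "time"

const (
    synchronizeMaxInterval = 8 * time.Minute

    // defaultSyncInterval is used between sync retries while the agent
    // has no SVIDs cached yet.
    defaultSyncInterval = 5 * time.Second
)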
pkg/agent/manager/manager_test.go
Outdated
dir := spiretest.TempDir(t)
km := fakeagentkeymanager.New(t, dir)

clk := realClock.New()
So as I mentioned in my previous review, we need to come up with a solution here that doesn't involve a real clock. We have run into a lot of flakiness with tests that rely on real timing, which is problematic for several reasons (it's unclear whether the behavior actually works, and it can require multiple re-runs of CI, which is time-consuming and incurs unnecessary cost).
I proposed a solution for this in the previous review; did you give that a try or run into any problems with it?
Hi @rturner3,
yes, unfortunately the proposed solution did not work. The channel would block the production code unless something drained it, as the test does. Additionally, it would not really help in getting the runSynchronizer function past the select statement.
Nevertheless, I can see the issue with using a real clock in this test, and we found another way of testing our scenario by injecting a hook into the mock clock. Please take a look at the proposed test in the most recent commit.
Cheers
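For readers curious what a "hook into the mock clock" can look like, here is a hypothetical, self-contained Go sketch; none of these names come from the SPIRE codebase or its test clock package. The idea is a fake clock whose After() reports each armed timer to the test, so the test can observe the chosen interval and advance time deterministically instead of sleeping:

// Toy fake clock with a hook on After (illustrative only).
package clocktest

import (
    "sync"
    "time"
)

// hookedClock reports every After call on armed, letting a test see which
// interval the code under test chose before advancing time with Add.
type hookedClock struct {
    mu      sync.Mutex
    now     time.Time
    armed   chan time.Duration
    pending []pendingTimer
}

type pendingTimer struct {
    at time.Time
    ch chan time.Time
}

func newHookedClock() *hookedClock {
    return &hookedClock{
        now:   time.Now(),
        armed: make(chan time.Duration, 16),
    }
}

// After records the requested wait and returns a channel controlled by Add.
func (c *hookedClock) After(d time.Duration) <-chan time.Time {
    c.mu.Lock()
    defer c.mu.Unlock()
    ch := make(chan time.Time, 1)
    c.pending = append(c.pending, pendingTimer{at: c.now.Add(d), ch: ch})
    c.armed <- d // the hook: the test learns the chosen interval
    return ch
}

// Add advances the fake time and fires any timers that have come due.
func (c *hookedClock) Add(d time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.now = c.now.Add(d)
    remaining := c.pending[:0]
    for _, t := range c.pending {
        if !t.at.After(c.now) {
            t.ch <- c.now
        } else {
            remaining = append(remaining, t)
        }
    }
    c.pending = remaining
}

A test built on something like this could assert on the values read from armed (for example, that the synchronizer armed the clamped 5-second interval rather than the configured sync_interval) before calling Add to release the timer.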
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Thanks for the patience, @nweisenauer-sap, this is looking good to me.
* Agent manager retries sync during init

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization.
Pull Request check list
Affected functionality
When an agent finishes node attestation, it creates a manager object by calling newManager, where a synchronization is triggered during initialization and authorized entries are fetched from the server's full in-memory cache. The agent then syncs entries from the server every 5 seconds by default (the sync interval can be configured in the experimental config section).
However, if it is the first time this agent performs node attestation, the synchronization in the manager's initialization will return 0 entries due to a cache miss on the server. The agent then has to wait for a later synchronization to receive its entries.
If the sync_interval is set to a large duration, like several minutes, the agent has to wait a long time until it receives its entries and is basically useless until then.
Description of change
The spire-agent retries the initial synchronization with the server every 5 seconds for up to 1 minute, until it receives more than 0 SVIDs in a response.
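As a rough illustration of the retry pattern summarized above, here is a hypothetical standalone Go sketch. It is not the code merged in this PR (the merged change applies the interval clamping discussed earlier in the thread); syncOnce is an invented stand-in for a sync with the server that reports how many SVIDs are now cached:

// Hypothetical sketch: retry the initial sync every 5 seconds for up to
// 1 minute until at least one SVID is cached (illustrative only).
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

func waitForSVIDs(ctx context.Context, syncOnce func(context.Context) (int, error)) error {
    const (
        retryInterval = 5 * time.Second
        maxWait       = time.Minute
    )
    deadline := time.Now().Add(maxWait)
    for {
        n, err := syncOnce(ctx)
        if err == nil && n > 0 {
            return nil // at least one SVID cached, initialization can proceed
        }
        if time.Now().After(deadline) {
            return errors.New("no SVIDs received within 1 minute")
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(retryInterval):
        }
    }
}

func main() {
    attempts := 0
    err := waitForSVIDs(context.Background(), func(context.Context) (int, error) {
        attempts++
        if attempts < 3 {
            return 0, nil // server has no entries for this brand-new agent yet
        }
        return 1, nil
    })
    fmt.Printf("done after %d attempts, err=%v\n", attempts, err)
}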
Which issue this PR fixes
fixes #2287