Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent manager retries sync during init #4479

Conversation

nweisenauer-sap
Copy link
Contributor

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initalization.

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated? not sure if this needs to be added somewhere

Affected functionality
When an agent finishes the node attestation, it will create a manager object by calling newManager, where a synchronization will be triggered in the initialization and authorized entries will be fetched from the server's full in-memory cache. Then the agent will sync entries from the server every 5 seconds as default sync interval (this can be configured with the experimental config section).
However, if it is the first time this agents performs the node attestation, the synchronization in the manager's initialization will return 0 entries, due to the cache miss in the server. The agent has to wait for a while to get entries in some synchronization thereafter.
If the sync_interval is set to a large duration, like several minutes, the agent has to wait a long time until it receives its entries and is basically useless until then.

Description of change
The spire-agent retries the initial synchronisation with the server every 5 seconds for up to 1 minute until it gets more than 0 SVIDs as a response.

Which issue this PR fixes
fixes #2287

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initalization.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@rturner3
Copy link
Collaborator

rturner3 commented Sep 7, 2023

Hi @nweisenauer-sap, just in case you didn't know, there is some current work in flight to reduce overall propagation delay of entries and the network traffic required to keep the agent and server in sync (#3496, #2182). Since you mentioned that the problem you're seeing is with an agent configured with a higher sync interval, I'm curious if the solutions to the issues I referenced might address any concerns you have with using the default agent sync interval of 5 seconds. Long term, we hope to remove the sync_interval configuration once these scale-related challenges are resolved (#4484).

It would also be helpful if you could share how much of an issue this is for you with your deployment and why you decided to change the agent's sync interval.

@nweisenauer-sap
Copy link
Contributor Author

Hi @rturner3,

here's some more info about our setup and why we came to the conclusion that we need to increase the sync_interval:

Our target is to support a many thousands of workloads/containers (each workload is representing a container - no K8s env) where each container contains a spire agent and the actual workload.
Each container requests the agent & workload identity at start up (node attestation is x509pop-like) and the SPIRE agent makes the workload identity available via the workload API and takes care of the rotation.
As the agent/workload relationship is static over the life-time of a container, there is no need that the agent does requests to GetBundle/GetAuthorizedEntries every 5 seconds. Additionally, the request rate will lead to a few thousands of requests per second to the SPIRE server, which is load that seems to be hard to justify (as there is no benefit in having those requests).

The sync_interval parameter seems to be a good match for our use case. Configuring a value like 24h or even 7 - 30 days would work for this setup, and achieve the goal that (1) reduce the load on the server (coming from getBundle & getAuthorizedEntries) and (2) make sure that the agent would still be able to get a new trust bundle (if required).

The upcoming solutions (#3496, #2182) will not address the issue to reduce the request rate.

We would also be open to consider further alternatives how to reduce the request rate for getBundle and getAuthorizedEntries, in case you have alternative suggestions.

Cheers

@rturner3
Copy link
Collaborator

Thanks for explaining your use case a bit more and sorry for the delayed response.

One thing I would say first is that running the agent as a sidecar process is not really endorsed as a supported deployment pattern at the moment. There has been some recent preliminary discussion about possible solutions for cases when there might be limiting factors that prevent running a host agent: #4459 (comment). I would encourage you to bring some of your requirements and environment limitations into that discussion to see how we can find a more well-supported path for this in the project (not sure if the issue I referenced is related to your deployment).

Going back to this PR specifically though, I have one concern with the changes proposed here. As currently implemented in this PR, the agent's API endpoints will not be available while the agent is retrying its sync due to not receiving SVIDs on init. This can lead to other issues upstream, such as health checks failing (which may result in restart or rollback of the agent) or workloads reporting extended errors while the agent is unavailable. I think it will be safer to do this retry after the agent has initialized its endpoints to avoid these kinds of problems.

In general, it should be valid for an agent to be up and healthy and not have any authorized identities cached yet. This can be the case in several scenarios, such as during bootstrap of a node/SPIRE deployment, if there are no workloads requiring identity running on a host, etc. If we move the retries to after the endpoints are started, we can still guarantee that an agent can report itself as healthy while waiting to receive a non-empty update from the server for authorized entries/SVIDs.

Agent manager retries synchronize every 5 seconds (defaultSyncInterval) if 0 SVIDs are present in the agent's cache after the last sync.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@nweisenauer-sap
Copy link
Contributor Author

Hi @rturner3,

I added a commit to address the problem of delaying the initialization of the agent. With the newest changes the agent's initialization remains unchanged, but the synchronization loop is changed to run every 5 seconds (or every "sync_interval"-seconds, whichever is lower) if the number of currently cached SVIDs in the agent is zero. This makes configuring very large sync_intervals viable and should not change the behavior of the agent when default or low sync_intervals are configured.

Regarding your comment against running the agent as a sidecar, while we are running in Kubernetes environments as well, the scenario that I described above (which benefits from high sync_intervals) is not a Kubernetes deployment. There is actually only one agent per "node", even though my description may have been a bit confusing. Nevertheless, for our K8s environment, we are actually interested in exploring a sidecar approach as mentioned in #4459 and will join the discussion on that one.

Cheers

Comment on lines 307 to 315
default:
m.synchronizeBackoff.Reset()
}

if m.cache.CountSVIDs() == 0 {
syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
syncInterval = m.synchronizeBackoff.NextBackOff()
}
Copy link
Collaborator

@rturner3 rturner3 Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default:
m.synchronizeBackoff.Reset()
}
if m.cache.CountSVIDs() == 0 {
syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
syncInterval = m.synchronizeBackoff.NextBackOff()
}
default:
m.synchronizeBackoff.Reset()
syncInterval = m.synchronizeBackoff.NextBackOff()
// Clamp the sync interval to the default value when the agent doesn't have any SVIDs cached
// AND the previous sync request succeeded
if m.cache.CountSVIDs() == 0 {
syncInterval = min(syncInterval, defaultSyncInterval)
}
}

This will bypass backoffs when the agent doesn't have any SVIDs cached and the previous sync failed. We need to be pretty careful to not introduce a potential DDoS scenario here in cases where the server is degraded or down for some period of time. See #4490 for some context of what can happen when the agent doesn't backoff properly.

I proposed an alternative way of doing this clamping that will still allow backoffs to work as intended.


// set sync interval to a high value to proof that synchronizer retries sync
// with the lower default interval in case 0 entries are returned
m.c.SyncInterval = 100 * defaultSyncInterval
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed in favor of setting this property when the Config struct is first initialized, which is already being done in the test on line 1010.

require.Equal(t, initTime, m.GetLastSync())

// keep clk moving so that synchronizer retries sync
go func(ctx context.Context) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit inefficient because it will be continuously executed and may have some timing races with the main goroutine. A safer and more efficient way to coordinate between the test goroutine and the goroutine inside the manager calling synchronize() would be to pass a test hook function. e.g.

Inside manager.go:

type manager struct {
	...
	synchronizeCh chan<- struct{}
}
...
func (m *manager) runSynchronizer(ctx context.Context) error {
	for {
		select {
			...
		}
		
		err := m.synchronize(ctx)
		m.synchronizeCh<-struct{}{}
		...
	}
}

In this test:

m := m.newManager(c)
syncCh := make(chan struct{}, 1)
m.synchronizeCh = syncCh
...
m.runSynchronizer()
...
// Force a synchronization
clk.Add(defaultSyncInterval)
// Block on synchronization completing
<-syncCh
// Repeat as necessary...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with changing the test to not have races anymore and submitted a new variant in my most recent commit, however, I did not adopt the proposed solution with the sync Channel. The issue I saw with it is that it would block the production code without something to drain the channel, as the test would do. Additionally, it would not really help in getting the "runSynchronizer" function past the "select" statement.
Instead I decided to reduce complexity and just use a real time clock paired with very low defaultSyncInterval settings for the test.

pkg/agent/manager/manager.go Outdated Show resolved Hide resolved
nweisenauer-sap and others added 6 commits September 20, 2023 11:28
* restored manager.go initialize, removed handylSyncError func
* clamp the sync interval to the default value when the agent doesn't have any SVIDs cached AND the previous sync request succeeded
* moved syncInterval assignment in manager_test

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
* improved TestSyncRetriesWithDefaultIntervalOnZeroSVIDSReturned
* use realtime clock instead of fake clock to remove race condition

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@@ -294,6 +297,13 @@ func (m *manager) runSynchronizer(ctx context.Context) error {
m.c.Log.WithError(err).Error("Synchronize failed")
default:
m.synchronizeBackoff.Reset()
syncInterval = m.synchronizeBackoff.NextBackOff()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not it going to result in never changing backoff?
syncInternval will no change when err != nil and we are getting nextBackOff after reset, so syncInterval will never change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @MarcosDY , I missed to add a call to NextBackOff() in the event of an error.

nweisenauer-sap and others added 2 commits October 6, 2023 12:48
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@@ -34,6 +34,9 @@ const (
synchronizeMaxInterval = 8 * time.Minute
)

// default sync interval is used between retries of initial sync
var defaultSyncInterval = 5 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a const

dir := spiretest.TempDir(t)
km := fakeagentkeymanager.New(t, dir)

clk := realClock.New()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I mentioned in my previous review, we need to come up with a solution here that doesn't involve a real clock. We have run into a lot of flakiness with tests that rely on timing that is problematic for several reasons (unclear if behavior works or not, can require multiple re-runs of CI which is time-consuming and incurs unnecessary cost).

I proposed a solution for this in the past review, did you give that a try or run into any problems with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rturner3,
yes, unfortunately the proposed solution did not work. The channel would block the production code without something to drain the channel, as the test would do. Additionally, it would not really help in getting the "runSynchronizer" function past the "select" statement.
Nevertheless, I can see the issue with using a real clock in this test and we found another way of testing our scenario by injecting a hook into the mock clock. Please take a look at the proposed test from the most recent commit.
Cheers

nweisenauer-sap and others added 2 commits October 11, 2023 09:08
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Copy link
Collaborator

@rturner3 rturner3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience @nweisenauer-sap , this is looking good to me.

@amartinezfayo amartinezfayo merged commit fe1dd06 into spiffe:main Oct 11, 2023
@amartinezfayo amartinezfayo added this to the 1.8.3 milestone Oct 18, 2023
faisal-memon pushed a commit to faisal-memon/spire that referenced this pull request Dec 2, 2023
* Agent manager retries sync during init

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initalization.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache miss in manager initialization for the 1st time agent
4 participants