
Agent manager retries sync during init #4479

Conversation

nweisenauer-sap (Contributor)

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization.

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated? Not sure if this needs to be added somewhere

Affected functionality
When an agent finishes node attestation, it creates a manager object by calling newManager. During the manager's initialization a synchronization is triggered, and authorized entries are fetched from the server's full in-memory cache. The agent then syncs entries from the server every 5 seconds by default (this can be configured via the experimental config section).
However, if this is the first time the agent performs node attestation, the synchronization during the manager's initialization returns 0 entries due to a cache miss on the server, and the agent only receives its entries in a later synchronization.
If sync_interval is set to a large duration, such as several minutes, the agent has to wait a long time until it receives its entries and is effectively useless until then.

Description of change
The spire-agent retries the initial synchronization with the server every 5 seconds, for up to 1 minute, until it receives more than 0 SVIDs in response.
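
A minimal sketch of that retry behavior follows (illustrative only; syncer, Synchronize, and CountSVIDs are placeholder names rather than the actual manager API, and the approach was later reworked during review):

package main

import (
	"context"
	"time"
)

// syncer stands in for the agent manager in this sketch; it is not the real
// SPIRE interface.
type syncer interface {
	Synchronize(ctx context.Context) error
	CountSVIDs() int
}

// initialSyncWithRetry retries the initial sync every 5 seconds, for up to
// 1 minute, and stops early once at least one SVID is cached.
func initialSyncWithRetry(ctx context.Context, s syncer) error {
	const (
		retryInterval = 5 * time.Second
		retryWindow   = 1 * time.Minute
	)
	deadline := time.Now().Add(retryWindow)
	for {
		err := s.Synchronize(ctx)
		if err == nil && s.CountSVIDs() > 0 {
			return nil
		}
		if time.Now().After(deadline) {
			// Give up after the retry window; err may be nil if the sync
			// succeeded but still returned zero SVIDs.
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryInterval):
		}
	}
}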

Which issue this PR fixes
fixes #2287

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
rturner3 (Collaborator) commented Sep 7, 2023

Hi @nweisenauer-sap, just in case you didn't know, there is some current work in flight to reduce overall propagation delay of entries and the network traffic required to keep the agent and server in sync (#3496, #2182). Since you mentioned that the problem you're seeing is with an agent configured with a higher sync interval, I'm curious if the solutions to the issues I referenced might address any concerns you have with using the default agent sync interval of 5 seconds. Long term, we hope to remove the sync_interval configuration once these scale-related challenges are resolved (#4484).

It would also be helpful if you could share how much of an issue this is for you with your deployment and why you decided to change the agent's sync interval.

nweisenauer-sap (Contributor, author)

Hi @rturner3,

here's some more info about our setup and why we came to the conclusion that we need to increase the sync_interval:

Our target is to support many thousands of workloads/containers (each workload represents a container; this is not a K8s environment), where each container contains a SPIRE agent and the actual workload.
Each container requests the agent and workload identity at startup (node attestation is x509pop-like), and the SPIRE agent makes the workload identity available via the Workload API and takes care of rotation.
As the agent/workload relationship is static over the lifetime of a container, there is no need for the agent to call GetBundle/GetAuthorizedEntries every 5 seconds. At that scale, the request rate amounts to a few thousand requests per second against the SPIRE server, which is load that seems hard to justify given that those requests provide no benefit.

The sync_interval parameter seems to be a good match for our use case. Configuring a value like 24h, or even 7-30 days, would work for this setup and achieve the goals of (1) reducing the load on the server (from GetBundle and GetAuthorizedEntries) and (2) ensuring that the agent can still pick up a new trust bundle if required.
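For a rough sense of scale (illustrative numbers, not our exact fleet size): with 10,000 agents each calling GetBundle and GetAuthorizedEntries every 5 seconds, the server sees about 10,000 × 2 / 5 s = 4,000 requests per second, whereas with a 24h sync_interval the same fleet averages roughly 0.2 requests per second.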

The upcoming solutions (#3496, #2182) will not address the need to reduce the request rate.

We would also be open to considering further alternatives for reducing the request rate for GetBundle and GetAuthorizedEntries, in case you have other suggestions.

Cheers

rturner3 (Collaborator)

Thanks for explaining your use case a bit more and sorry for the delayed response.

One thing I would say first is that running the agent as a sidecar process is not really endorsed as a supported deployment pattern at the moment. There has been some recent preliminary discussion about possible solutions for cases where limiting factors prevent running a host agent: #4459 (comment). I would encourage you to bring some of your requirements and environment limitations into that discussion so we can find a better-supported path for this in the project (I'm not sure whether the issue I referenced is related to your deployment).

Going back to this PR specifically though, I have one concern with the changes proposed here. As currently implemented in this PR, the agent's API endpoints will not be available while the agent is retrying its sync due to not receiving SVIDs on init. This can lead to other issues upstream, such as health checks failing (which may result in restart or rollback of the agent) or workloads reporting extended errors while the agent is unavailable. I think it will be safer to do this retry after the agent has initialized its endpoints to avoid these kinds of problems.

In general, it should be valid for an agent to be up and healthy and not have any authorized identities cached yet. This can be the case in several scenarios, such as during bootstrap of a node/SPIRE deployment, if there are no workloads requiring identity running on a host, etc. If we move the retries to after the endpoints are started, we can still guarantee that an agent can report itself as healthy while waiting to receive a non-empty update from the server for authorized entries/SVIDs.
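
To make that ordering concrete, here is a minimal sketch (the function names are placeholders injected as parameters, not the actual SPIRE agent wiring):

package main

import "context"

// runAgent sketches the ordering discussed above: bring up the agent's
// endpoints first, then start (and possibly retry) synchronization. Both
// steps are injected as functions because this is only an illustration.
func runAgent(ctx context.Context, startEndpoints, runSynchronizer func(context.Context) error) error {
	// Start the Workload API / health endpoints first, so the agent can
	// report itself healthy even while its SVID cache is still empty.
	if err := startEndpoints(ctx); err != nil {
		return err
	}
	// Only afterwards run the synchronizer, which may retry more aggressively
	// until the first non-empty update arrives.
	return runSynchronizer(ctx)
}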

Agent manager retries synchronization every 5 seconds (defaultSyncInterval) if 0 SVIDs are present in the agent's cache after the last sync.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
nweisenauer-sap (Contributor, author)

Hi @rturner3,

I added a commit to address the problem of delaying the initialization of the agent. With the newest changes, the agent's initialization remains unchanged, but the synchronization loop now runs every 5 seconds (or every sync_interval seconds, whichever is lower) while the number of SVIDs currently cached in the agent is zero. This makes configuring very large sync_interval values viable and should not change the agent's behavior when the default or a low sync_interval is configured.

Regarding your comment about running the agent as a sidecar: while we run in Kubernetes environments as well, the scenario I described above (which benefits from a high sync_interval) is not a Kubernetes deployment. There is actually only one agent per "node", even though my description may have been a bit confusing. Nevertheless, for our K8s environment we are interested in exploring a sidecar approach as mentioned in #4459 and will join the discussion there.

Cheers

Comment on lines 307 to 315
default:
	m.synchronizeBackoff.Reset()
}

if m.cache.CountSVIDs() == 0 {
	syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
	syncInterval = m.synchronizeBackoff.NextBackOff()
}
rturner3 (Collaborator) commented Sep 19, 2023

Suggested change

Removed:

default:
	m.synchronizeBackoff.Reset()
}
if m.cache.CountSVIDs() == 0 {
	syncInterval = min(m.synchronizeBackoff.NextBackOff(), defaultSyncInterval)
} else {
	syncInterval = m.synchronizeBackoff.NextBackOff()
}

Replaced with:

default:
	m.synchronizeBackoff.Reset()
	syncInterval = m.synchronizeBackoff.NextBackOff()
	// Clamp the sync interval to the default value when the agent doesn't have any SVIDs cached
	// AND the previous sync request succeeded
	if m.cache.CountSVIDs() == 0 {
		syncInterval = min(syncInterval, defaultSyncInterval)
	}
}

This will bypass backoffs when the agent doesn't have any SVIDs cached and the previous sync failed. We need to be pretty careful not to introduce a potential DDoS scenario here in cases where the server is degraded or down for some period of time. See #4490 for some context on what can happen when the agent doesn't back off properly.

I proposed an alternative way of doing this clamping that will still allow backoffs to work as intended.


// set sync interval to a high value to prove that the synchronizer retries
// the sync with the lower default interval in case 0 entries are returned
m.c.SyncInterval = 100 * defaultSyncInterval
Collaborator

This can be removed in favor of setting this property when the Config struct is first initialized, which is already being done in the test on line 1010.

require.Equal(t, initTime, m.GetLastSync())

// keep clk moving so that synchronizer retries sync
go func(ctx context.Context) {
Collaborator

This is a bit inefficient because it will be continuously executed and may have some timing races with the main goroutine. A safer and more efficient way to coordinate between the test goroutine and the goroutine inside the manager calling synchronize() would be to pass a test hook function, e.g.:

Inside manager.go:

type manager struct {
	...
	synchronizeCh chan<- struct{}
}
...
func (m *manager) runSynchronizer(ctx context.Context) error {
	for {
		select {
		...
		}

		err := m.synchronize(ctx)
		// Notify the test (if hooked up) that a synchronization just completed.
		m.synchronizeCh <- struct{}{}
		...
	}
}

In this test:

m := newManager(c)
syncCh := make(chan struct{}, 1)
m.synchronizeCh = syncCh
...
go m.runSynchronizer(ctx)
...
// Force a synchronization
clk.Add(defaultSyncInterval)
// Block on synchronization completing
<-syncCh
// Repeat as necessary...

nweisenauer-sap (Contributor, author)

I agree with changing the test so it no longer has races and submitted a new variant in my most recent commit; however, I did not adopt the proposed solution with the sync channel. The issue I saw with it is that it would block the production code unless something drains the channel, as the test would. Additionally, it would not really help in getting the runSynchronizer function past the select statement.
Instead, I decided to reduce complexity and just use a real-time clock paired with a very low defaultSyncInterval setting for the test.

nweisenauer-sap and others added 6 commits September 20, 2023 11:28
* restored manager.go initialize, removed handleSyncError func
* clamp the sync interval to the default value when the agent doesn't have any SVIDs cached AND the previous sync request succeeded
* moved syncInterval assignment in manager_test

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
* improved TestSyncRetriesWithDefaultIntervalOnZeroSVIDSReturned
* use realtime clock instead of fake clock to remove race condition

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@@ -294,6 +297,13 @@ func (m *manager) runSynchronizer(ctx context.Context) error {
	m.c.Log.WithError(err).Error("Synchronize failed")
default:
	m.synchronizeBackoff.Reset()
	syncInterval = m.synchronizeBackoff.NextBackOff()
MarcosDY (Collaborator)

Isn't this going to result in the backoff never changing?
syncInterval won't change when err != nil, and we are getting NextBackOff right after a Reset, so syncInterval will never change.

nweisenauer-sap (Contributor, author)

Good catch @MarcosDY, I missed adding a call to NextBackOff() in the event of an error.
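
For reference, the fix amounts to also advancing the backoff in the error branch, roughly as follows (a sketch based on the diff fragments above; the exact case condition isn't visible in the diff, so "case err != nil:" is an assumption here):

case err != nil:
	m.c.Log.WithError(err).Error("Synchronize failed")
	// Advance the backoff on failure as well, so the retry interval grows.
	syncInterval = m.synchronizeBackoff.NextBackOff()
default:
	m.synchronizeBackoff.Reset()
	syncInterval = m.synchronizeBackoff.NextBackOff()
	// Clamp to the default interval when no SVIDs are cached yet and the
	// previous sync request succeeded.
	if m.cache.CountSVIDs() == 0 {
		syncInterval = min(syncInterval, defaultSyncInterval)
	}
}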

nweisenauer-sap and others added 2 commits October 6, 2023 12:48
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
@@ -34,6 +34,9 @@ const (
	synchronizeMaxInterval = 8 * time.Minute
)

// default sync interval is used between retries of initial sync
var defaultSyncInterval = 5 * time.Second
Collaborator

This can be a const
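That is, something like:

const defaultSyncInterval = 5 * time.Second

(time.Second is itself a constant, so the expression is valid in a const declaration.)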

dir := spiretest.TempDir(t)
km := fakeagentkeymanager.New(t, dir)

clk := realClock.New()
rturner3 (Collaborator)

So as I mentioned in my previous review, we need to come up with a solution here that doesn't involve a real clock. We have run into a lot of flakiness with tests that rely on timing, which is problematic for several reasons (it's unclear whether the behavior actually works, and it can require multiple re-runs of CI, which is time-consuming and incurs unnecessary cost).

I proposed a solution for this in the previous review; did you give that a try or run into any problems with it?

nweisenauer-sap (Contributor, author)

Hi @rturner3,
yes, unfortunately the proposed solution did not work. The channel would block the production code unless something drains it, as the test would. Additionally, it would not really help in getting the runSynchronizer function past the select statement.
Nevertheless, I can see the issue with using a real clock in this test, and we found another way of testing our scenario by injecting a hook into the mock clock. Please take a look at the proposed test in the most recent commit.
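
Purely for illustration, one way such a clock hook could look (the names here are made up and the actual hook in the commit may work differently):

package main

import "time"

// fakeClock stands in for whatever mock clock the test uses; only the method
// relevant to this sketch is shown, and the real clock API may differ.
type fakeClock interface {
	After(d time.Duration) <-chan time.Time
}

// hookedClock wraps the mock clock and signals waiterCh whenever the code
// under test starts waiting, so the test can advance the clock exactly when
// the synchronizer is blocked instead of relying on real time.
type hookedClock struct {
	fakeClock
	waiterCh chan<- struct{}
}

func (c *hookedClock) After(d time.Duration) <-chan time.Time {
	ch := c.fakeClock.After(d)
	// Non-blocking notification so production-style callers are never stalled.
	select {
	case c.waiterCh <- struct{}{}:
	default:
	}
	return ch
}

The test can then block on the hook channel before advancing the clock, which avoids the race the real-clock variant had.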
Cheers

nweisenauer-sap and others added 2 commits October 11, 2023 09:08
Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
rturner3 (Collaborator) left a comment

Thanks for the patience @nweisenauer-sap, this is looking good to me.

amartinezfayo merged commit fe1dd06 into spiffe:main on Oct 11, 2023
31 checks passed
amartinezfayo added this to the 1.8.3 milestone on Oct 18, 2023
faisal-memon pushed a commit to faisal-memon/spire that referenced this pull request Dec 2, 2023
* Agent manager retries sync during init

Agent manager retries every 5 seconds for a total of 1 minute to fetch SVIDs from the server during initialization.

Signed-off-by: Nico Weisenauer <nico.weisenauer@sap.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Successfully merging this pull request may close these issues: Cache miss in manager initialization for the 1st time agent