Introduce support to save and load the CA journal from the datastore #4690

amartinezfayo · 2023-11-24T20:01:50Z

This PR introduces support to save and load the CA journal from the datastore. Since the database schema changes required for this were done in v1.8.0, this is being introduced through a feature flag. This functionality is only enabled if the ca_journal_in_datastore feature flag is enabled.

In order to identify which CA journal record belongs to the server running, the server gets the KeyManager keys. It then uses the keys obtained to find if one of the keys matches a record with that key, based on the X509 authority ID of the key.
Server starting without a CA journal record in the datastore: the CA journal is loaded from disk and a new CA journal record is created when the journal is saved with updates. The CA journal is also saved on disk.
Server starting with a CA journal record in the datastore: the CA journal is loaded from the datastore and is saved both in the datastore and on disk on updates.
Saving the CA journal both in the datastore and on disk is needed to support downgrades. We need to keep this until v1.10, when we will be able to delete the CA journal file.
A pruning strategy for stale CA journal records (of servers that are no longer active) has been implemented. CA journal records with all the authorities that have expired since more than two weeks are pruned. A conservative period has been chosen to be able to access historical data if needed, even if the record would not be useful anymore to recover the keys in a server startup.

Fixes #4463, #4464.

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

edwbuck · 2023-12-01T14:45:31Z

@MarcosDY Please start the commentary on this one, it is large and will likely take more time than a normal review.

In addition, this functionality only prepares for the next step, which includes no longer maintaining / using the on-disk footprint of the CA Journal. This step can only be performed after this PR has been merged.

MarcosDY · 2023-12-01T17:16:17Z

pkg/common/fflag/fflag.go

@@ -38,6 +38,10 @@ const (
 	// reattestable. In those cases the agent will still renew without reattesting.
 	FlagReattestToRenew Flag = "reattest_to_renew"

+	// FlagCAJournalInDatastore controls whether or not to store the CA journal
+	// of the server in the datastore.
+	FlagCAJournalInDatastore = "ca_journal_in_datastore"


may this be deprecated? since we are going to remove this to make it default

Shouldn't we keep it until we are doing all of our rotation with the database only? I agree it will be deprecated, but it seems like it's too early to remove it, considering this will be experimental in 1.9.0.

@MarcosDY are you suggesting that the flag should be true by default? That would be the process of deprecating a feature flag, and we cannot do that in 1.8.x.
Feature flags are meant to be used by developers while features are still under development or cannot be turned on to all users (like this case, because the schema change required was introduced in 1.8.0).

@edwbuck There is no need to have this feature introduced as an experimental feature in 1.9.0. All users will have the schema changes needed when they either update to 1.9.0 or fresh install 1.9.0.
In 1.9.0, this will be the default behavior to all users.

MarcosDY · 2023-12-01T17:24:59Z

pkg/server/api/localauthority/v1/service_test.go

@@ -1865,7 +1865,7 @@ func (m *fakeCAManager) PrepareJWTKey(context.Context) error {
 	return m.prepareJWTKeyErr
 }

-func (m *fakeCAManager) RotateJWTKey() {
+func (m *fakeCAManager) RotateJWTKey(_ context.Context) {


nit: same comment for RotateX509CA

Suggested change

func (m *fakeCAManager) RotateJWTKey(_ context.Context) {

func (m *fakeCAManager) RotateJWTKey(context.Context) {

MarcosDY · 2023-12-04T13:40:37Z

pkg/server/ca/manager/journal.go

-	mu      sync.RWMutex
-	entries *JournalEntries
+	mu                    sync.RWMutex
+	activeX509AuthorityID string


why is activeX509 authority id requied to be cached? why is it different to jwt authority?

The CA journal record in the database has the active X509 authority ID so we can efficiently compare the obtained X509 authority IDs from the key manager with the record's active X509 authority ID.
When we update an stored X509CA entry to have the ACTIVE status, we keep updated the active X509 authority ID in this struct, so we know what's the active X509 authority ID when we save the record in the database.

MarcosDY · 2023-12-04T13:48:26Z

pkg/server/ca/manager/journal.go

+		if err != nil {
+			return nil, fmt.Errorf("failed to load journal from datastore: %w", err)
+		}
+		if j != nil {


if this is nil, means that it was unable to fetch it from datastore, and we'll be moving into disk approach,
that is fine since we my be migrating here...
but i'm not sure, what happens if we are never able to fetch from ds and we move always to disk?
Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.
2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.

what do you think? maybe it is an overkill?

Never fetching from the data store is a scenario that needs to be considered in different migration phases onto the database rotation scheme:

No migration started - Not fetching from the data store is exactly what it wanted.

Migration in process (as it is now) - Fetch from the data store but conditionally so. If the data store doesn't have information, load the information from disk (and update the datastore).

Migration complete (which will come in a future MR) - If we cannot fetch from the data store, we will need to create initial certificate information in the data store (much like the before-effort logic, where we bootstrap the configurations where there is no external PKI and no CA journal on disk).

As we are currently in the second step above, and currently in a migration from disk to datastore, we should load the datastore from disk if we can't fetch the datastore. As for "datastore unavailable" scenarios (in contrast to the data unavailable scenarios), this is the Server's Datastore, so it should eventually trigger a server restart with enough datastore failures.

Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.

I don't think that we should log a warning. This is the expected behavior for all the systems running for the first time with this fflag enabled in 1.8.x, and will be the expected behavior for all the systems running for the first time in 1.9.0.

2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.

That's what it does. If there is no CA record for this server, it will load it from disk. If there is a failure when trying to load from disk, we fail with an error and it will try the same the next time. Note that having no journal is also a valid scenario. No CA record, no disk file, doesn't cause an error and will start with an empty journal.
The migration logic is pretty straightforward. The mechanism consists in just writing the next update on the database and on disk, when this feature is enabled. This way, the journal is always backed on disk during the 1.9.x timeframe to allow downgrades to 1.8.x.

MarcosDY · 2023-12-04T13:52:26Z

pkg/server/ca/manager/journal.go

+	// TODO: stop trying to load the journal from disk in v1.10 and delete
+	// the journal file if exists.


mmm not sure about 'delete journal file', I agree we must remove loading code, but removing files looks like an overkill

Moving the journal off the disk is to improve both deployability and security. If we leave the old journal on disk, it might leak the information necessary to perform a rotation to a CA that is very similar to the one in SPIRE.

In this case, after a successful migration of the Journal into the database, removing the Journal from the disk prevents a second attempt at migration, and greatly reduces the chance that an old journal overwrites the current journal (as we expect that the database journal will always be newer than the disk journal after this commit is merged).

We discussed this a little bit during the last maintainer's call and I think that the consensus is that the right thing to do would be to remove the journal from disk in v1.10.

MarcosDY · 2023-12-04T17:28:50Z

pkg/server/datastore/sqlstore/sqlstore.go

+
+func deleteCAJournal(tx *gorm.DB, caJournalID uint) error {
+	model := new(CAJournal)
+	if err := tx.Find(model, "id = ?", caJournalID).Error; err != nil {


are you interested in returning error when not found right? if no, you can do a tx.Where().Delete instead

Exactly, I think that's good to return an error and log it, since it's unexpected.

MarcosDY · 2023-12-04T17:31:57Z

pkg/server/datastore/sqlstore/sqlstore.go

+
+// PruneCAJournals prunes the CA journals that have all of their authorities
+// expired.
+func (ds *Plugin) PruneCAJournals(ctx context.Context, allAuthoritiesExpireBefore int64) error {


allAuthoritiesExpireBefore is confusing to me, it looks like it will remove all journals that expires before now.Add(SOME TIME).Unix() , no all that are really expired

We use "expiresBefore" in PruneBundle, PruneRegistrationEntries and pruneJoinTokens. If we change the naming we should probably do that in all those places. May we revisit that separately?
Note that we subtract the threshold duration (it's not a positive Add operation), we do Clock.Now().Add(-safetyThreshold), so this really is "prune the CA journals that have all of the authorities expired since more than 2 weeks", which is the CA journals that have all of the authorities expired before more than two weeks ago.

MarcosDY · 2023-12-04T17:32:45Z

pkg/server/datastore/sqlstore/sqlstore.go

+	for _, model := range caJournals {
+		entries := new(journal.Entries)
+		if err := proto.Unmarshal(model.Data, entries); err != nil {
+			return errs.New("unable to unmarshal entries from CA journal record: %v", err)


I think that this should be status.Errorf, I'll change that.

MarcosDY · 2023-12-04T17:36:06Z

pkg/server/datastore/sqlstore/sqlstore.go

+
+		for _, x509CA := range entries.X509CAs {
+			if x509CA.NotAfter > allAuthoritiesExpireBefore {
+				continue checkAuthorities


when not expired after that time, will not this retry for ever?

There is no retry logic here. This iterates through all the journals, and inside each journal, over all X509 CAs and JWT keys.
If x509CA.NotAfter > allAuthoritiesExpireBefore, that means that the authority expires after the expiration threshold, and the whole journal must be skipped. Otherwise, it will take the next x509CA and check the same thing. Once there are no more X509 CAs to check in this journal, it will go over the JWT keys and it will do the same.

MarcosDY · 2023-12-04T17:48:08Z

pkg/server/datastore/sqlstore/sqlstore.go

+				continue checkAuthorities
+			}
+		}
+		if err := deleteCAJournal(tx, model.ID); err != nil {


if is not clear to me... but, if a bundle expires tomorrow, then x509CA.NotAfter > allAuthoritiesExpireBefore will be false,
and marked to remove.
it sounds to me that we must search for expired bundles (from today) and then see for how much time it was expired
in our case we look for entries that are expired 2weeks ago,
and to do that we'll need to provide a duration not a date.
makes sense?

The logic is the opposite, if x509CA.NotAfter > allAuthoritiesExpireBefore the CA journal will be skipped. It essentially goes over all the authorities (including JWT keys) and checks if all of them expired more than two weeks ago. If not, the whole journal record will be skipped.

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

MarcosDY · 2024-01-02T14:30:01Z

pkg/server/ca/manager/manager_test.go

+	tomorrow := test.clock.Now().Add(time.Hour * 24).Unix()
+	beforeThreshold := test.clock.Now().Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()


nit:

tomorrow := now.Add(time.Hour * 24).Unix() beforeThreshold := now.Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()

MarcosDY · 2024-01-02T15:45:36Z

pkg/server/ca/manager/journal.go

+		if err != nil {
+			return 0, fmt.Errorf("failed to get CA journal from database: %w", err)
+		}
+		if caJournal == nil {


this caJournal is never used, since you are using := and never setting the ID, if found must not it be setting the id inside j.caJournalID?

I think that we can safely remove this call to findCAJournal since a save operation will always happen after the loadJournalFromDS operation, which is the one that calls to findCAJournal and will update j.caJournalID.

MarcosDY · 2024-01-02T16:59:14Z

pkg/server/ca/manager/journal.go

+			// There is a CA journal record that has an active X509 authority
+			// ID that matches with one of the public keys of this server. This
+			// means that this record belongs to this server.
+			j.config.log.WithFields(logrus.Fields{


this is connected to a previous comment, but looks like the response from this function is never used, so caJournalID will not be set and this log message is wrong

I think that we can safely remove the call to findCAJournal, and I'll fix to refer to telemetry.CAJournalID: caJournal.ID here.

MarcosDY · 2024-01-02T17:09:12Z

pkg/server/ca/manager/journal_test.go

 		// Verify entries is empty
-		s.AssertProtoEqual(&JournalEntries{}, journal.Entries())
+		spiretest.RequireProtoEqual(t, &journal.Entries{}, j.getEntries())


require.Empty(t, j.getEntries())?

I don't think that we can use require.Empty here.
require.Empty asserts that the specified object is empty. I.e. nil, "", false, 0 or either a slice or a channel with len == 0.
We want to assert that the referenced struct is an empty struct (&journal.Entries{}) but not assert that is nil. We are noy asserting a slice.

MarcosDY · 2024-01-02T17:14:29Z

pkg/server/ca/manager/journal_test.go

+	require.NoError(t, err)
+	require.NoError(t, j.UpdateX509CAStatus(ctx, now, journal.Status_ACTIVE))
+
+	spiretest.RequireProtoEqual(t, j.getEntries(), test.loadJournalFromDS(t).getEntries())


both loads returns pointers, so if this fail for some reason we'll get a panic,
may we verify it is not empty before trying to get entries? (same comment for all load* calls)

MarcosDY · 2024-01-02T18:33:57Z

pkg/server/ca/manager/manager_test.go

+				require.NoError(t, err)
+
+				if !j.shouldBePruned {
+					testCase.expectedCAJournals = append(testCase.expectedCAJournals, caJournal)


since expected is not set when defining testCase, and is calculated using shouldBePruned, may you just add a var at function level?
Or just add the expect list hardcoded instead

Since we need to populate the datastore with journal data, that's saved marshalled and each record is assigned an ID, I found easier to just build the expectedCAJournals list with the items returned from the SetCAJournal operation that we know that should not be pruned.

Moved expectedCAJournals to a var.

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

MarcosDY

LGTM!

…piffe#4690) * Save and load the CA journal from datastore Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

amartinezfayo added 2 commits November 24, 2023 16:10

Save and load the CA journal from datastore

6818fa0

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

Merge remote-tracking branch 'origin' into ca-journal-datastore

557fad3

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

amartinezfayo added this to the 1.8.6 milestone Nov 24, 2023

amartinezfayo requested review from evan2645, azdagron, MarcosDY and rturner3 as code owners November 24, 2023 20:01

amartinezfayo force-pushed the ca-journal-datastore branch from 2819ae5 to 0e9eff7 Compare November 24, 2023 20:23

Some fixes

2bef377

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

amartinezfayo force-pushed the ca-journal-datastore branch from 0e9eff7 to 2bef377 Compare November 25, 2023 13:42

MarcosDY self-assigned this Nov 28, 2023

MarcosDY reviewed Dec 4, 2023

View reviewed changes

azdagron modified the milestones: 1.8.7, 1.9.0 Dec 14, 2023

amartinezfayo added 2 commits December 22, 2023 19:08

Address PR comments

0180980

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

Merge remote-tracking branch 'origin/main' into ca-journal-datastore

f05b12a

amartinezfayo mentioned this pull request Dec 27, 2023

[Question] PersistentVolume for /run/spire/data #4751

Closed

MarcosDY reviewed Jan 2, 2024

View reviewed changes

amartinezfayo added 2 commits January 2, 2024 19:44

Address PR comments

4aaa2ac

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

Address PR comment

a19480f

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>

MarcosDY approved these changes Jan 3, 2024

View reviewed changes

Merge branch 'main' into ca-journal-datastore

ee7c907

MarcosDY merged commit 1c8dc49 into spiffe:main Jan 3, 2024
32 checks passed

This was referenced Jan 3, 2024

On-disk data dependency: populate the ca_journals table on CA rotation #4464

Closed

Remove server dependency on on-disk data #1840

Closed

	func (m *fakeCAManager) RotateJWTKey(_ context.Context) {
	func (m *fakeCAManager) RotateJWTKey(context.Context) {

		// TODO: stop trying to load the journal from disk in v1.10 and delete
		// the journal file if exists.

		tomorrow := test.clock.Now().Add(time.Hour * 24).Unix()
		beforeThreshold := test.clock.Now().Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()

Introduce support to save and load the CA journal from the datastore #4690

Introduce support to save and load the CA journal from the datastore #4690

Conversation

amartinezfayo commented Nov 24, 2023 • edited Loading

edwbuck commented Dec 1, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amartinezfayo Dec 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcosDY left a comment

Choose a reason for hiding this comment

amartinezfayo commented Nov 24, 2023 •

edited

Loading

amartinezfayo Dec 8, 2023 •

edited

Loading