Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support to save and load the CA journal from the datastore #4690

Merged
merged 8 commits into from
Jan 3, 2024

Conversation

amartinezfayo
Copy link
Member

@amartinezfayo amartinezfayo commented Nov 24, 2023

This PR introduces support to save and load the CA journal from the datastore. Since the database schema changes required for this were done in v1.8.0, this is being introduced through a feature flag. This functionality is only enabled if the ca_journal_in_datastore feature flag is enabled.

  • In order to identify which CA journal record belongs to the server running, the server gets the KeyManager keys. It then uses the keys obtained to find if one of the keys matches a record with that key, based on the X509 authority ID of the key.
  • Server starting without a CA journal record in the datastore: the CA journal is loaded from disk and a new CA journal record is created when the journal is saved with updates. The CA journal is also saved on disk.
  • Server starting with a CA journal record in the datastore: the CA journal is loaded from the datastore and is saved both in the datastore and on disk on updates.
  • Saving the CA journal both in the datastore and on disk is needed to support downgrades. We need to keep this until v1.10, when we will be able to delete the CA journal file.
  • A pruning strategy for stale CA journal records (of servers that are no longer active) has been implemented. CA journal records with all the authorities that have expired since more than two weeks are pruned. A conservative period has been chosen to be able to access historical data if needed, even if the record would not be useful anymore to recover the keys in a server startup.

Fixes #4463, #4464.

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
@MarcosDY MarcosDY self-assigned this Nov 28, 2023
@edwbuck
Copy link
Contributor

edwbuck commented Dec 1, 2023

@MarcosDY Please start the commentary on this one, it is large and will likely take more time than a normal review.

In addition, this functionality only prepares for the next step, which includes no longer maintaining / using the on-disk footprint of the CA Journal. This step can only be performed after this PR has been merged.

@@ -38,6 +38,10 @@ const (
// reattestable. In those cases the agent will still renew without reattesting.
FlagReattestToRenew Flag = "reattest_to_renew"

// FlagCAJournalInDatastore controls whether or not to store the CA journal
// of the server in the datastore.
FlagCAJournalInDatastore = "ca_journal_in_datastore"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may this be deprecated? since we are going to remove this to make it default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we keep it until we are doing all of our rotation with the database only? I agree it will be deprecated, but it seems like it's too early to remove it, considering this will be experimental in 1.9.0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcosDY are you suggesting that the flag should be true by default? That would be the process of deprecating a feature flag, and we cannot do that in 1.8.x.
Feature flags are meant to be used by developers while features are still under development or cannot be turned on to all users (like this case, because the schema change required was introduced in 1.8.0).

@edwbuck There is no need to have this feature introduced as an experimental feature in 1.9.0. All users will have the schema changes needed when they either update to 1.9.0 or fresh install 1.9.0.
In 1.9.0, this will be the default behavior to all users.

@@ -1865,7 +1865,7 @@ func (m *fakeCAManager) PrepareJWTKey(context.Context) error {
return m.prepareJWTKeyErr
}

func (m *fakeCAManager) RotateJWTKey() {
func (m *fakeCAManager) RotateJWTKey(_ context.Context) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same comment for RotateX509CA

Suggested change
func (m *fakeCAManager) RotateJWTKey(_ context.Context) {
func (m *fakeCAManager) RotateJWTKey(context.Context) {

mu sync.RWMutex
entries *JournalEntries
mu sync.RWMutex
activeX509AuthorityID string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is activeX509 authority id requied to be cached? why is it different to jwt authority?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CA journal record in the database has the active X509 authority ID so we can efficiently compare the obtained X509 authority IDs from the key manager with the record's active X509 authority ID.
When we update an stored X509CA entry to have the ACTIVE status, we keep updated the active X509 authority ID in this struct, so we know what's the active X509 authority ID when we save the record in the database.

if err != nil {
return nil, fmt.Errorf("failed to load journal from datastore: %w", err)
}
if j != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is nil, means that it was unable to fetch it from datastore, and we'll be moving into disk approach,
that is fine since we my be migrating here...
but i'm not sure, what happens if we are never able to fetch from ds and we move always to disk?
Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.
2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.

what do you think? maybe it is an overkill?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never fetching from the data store is a scenario that needs to be considered in different migration phases onto the database rotation scheme:

  • No migration started - Not fetching from the data store is exactly what it wanted.
  • Migration in process (as it is now) - Fetch from the data store but conditionally so. If the data store doesn't have information, load the information from disk (and update the datastore).
  • Migration complete (which will come in a future MR) - If we cannot fetch from the data store, we will need to create initial certificate information in the data store (much like the before-effort logic, where we bootstrap the configurations where there is no external PKI and no CA journal on disk).

As we are currently in the second step above, and currently in a migration from disk to datastore, we should load the datastore from disk if we can't fetch the datastore. As for "datastore unavailable" scenarios (in contrast to the data unavailable scenarios), this is the Server's Datastore, so it should eventually trigger a server restart with enough datastore failures.

Copy link
Member Author

@amartinezfayo amartinezfayo Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.

I don't think that we should log a warning. This is the expected behavior for all the systems running for the first time with this fflag enabled in 1.8.x, and will be the expected behavior for all the systems running for the first time in 1.9.0.

2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.

That's what it does. If there is no CA record for this server, it will load it from disk. If there is a failure when trying to load from disk, we fail with an error and it will try the same the next time. Note that having no journal is also a valid scenario. No CA record, no disk file, doesn't cause an error and will start with an empty journal.
The migration logic is pretty straightforward. The mechanism consists in just writing the next update on the database and on disk, when this feature is enabled. This way, the journal is always backed on disk during the 1.9.x timeframe to allow downgrades to 1.8.x.

Comment on lines +72 to +73
// TODO: stop trying to load the journal from disk in v1.10 and delete
// the journal file if exists.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm not sure about 'delete journal file', I agree we must remove loading code, but removing files looks like an overkill

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the journal off the disk is to improve both deployability and security. If we leave the old journal on disk, it might leak the information necessary to perform a rotation to a CA that is very similar to the one in SPIRE.

In this case, after a successful migration of the Journal into the database, removing the Journal from the disk prevents a second attempt at migration, and greatly reduces the chance that an old journal overwrites the current journal (as we expect that the database journal will always be newer than the disk journal after this commit is merged).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this a little bit during the last maintainer's call and I think that the consensus is that the right thing to do would be to remove the journal from disk in v1.10.


func deleteCAJournal(tx *gorm.DB, caJournalID uint) error {
model := new(CAJournal)
if err := tx.Find(model, "id = ?", caJournalID).Error; err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you interested in returning error when not found right? if no, you can do a tx.Where().Delete instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, I think that's good to return an error and log it, since it's unexpected.


// PruneCAJournals prunes the CA journals that have all of their authorities
// expired.
func (ds *Plugin) PruneCAJournals(ctx context.Context, allAuthoritiesExpireBefore int64) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allAuthoritiesExpireBefore is confusing to me, it looks like it will remove all journals that expires before now.Add(SOME TIME).Unix() , no all that are really expired

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use "expiresBefore" in PruneBundle, PruneRegistrationEntries and pruneJoinTokens. If we change the naming we should probably do that in all those places. May we revisit that separately?
Note that we subtract the threshold duration (it's not a positive Add operation), we do Clock.Now().Add(-safetyThreshold), so this really is "prune the CA journals that have all of the authorities expired since more than 2 weeks", which is the CA journals that have all of the authorities expired before more than two weeks ago.

for _, model := range caJournals {
entries := new(journal.Entries)
if err := proto.Unmarshal(model.Data, entries); err != nil {
return errs.New("unable to unmarshal entries from CA journal record: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%w?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should be status.Errorf, I'll change that.


for _, x509CA := range entries.X509CAs {
if x509CA.NotAfter > allAuthoritiesExpireBefore {
continue checkAuthorities
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when not expired after that time, will not this retry for ever?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no retry logic here. This iterates through all the journals, and inside each journal, over all X509 CAs and JWT keys.
If x509CA.NotAfter > allAuthoritiesExpireBefore, that means that the authority expires after the expiration threshold, and the whole journal must be skipped. Otherwise, it will take the next x509CA and check the same thing. Once there are no more X509 CAs to check in this journal, it will go over the JWT keys and it will do the same.

continue checkAuthorities
}
}
if err := deleteCAJournal(tx, model.ID); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is not clear to me... but, if a bundle expires tomorrow, then x509CA.NotAfter > allAuthoritiesExpireBefore will be false,
and marked to remove.
it sounds to me that we must search for expired bundles (from today) and then see for how much time it was expired
in our case we look for entries that are expired 2weeks ago,
and to do that we'll need to provide a duration not a date.
makes sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is the opposite, if x509CA.NotAfter > allAuthoritiesExpireBefore the CA journal will be skipped. It essentially goes over all the authorities (including JWT keys) and checks if all of them expired more than two weeks ago. If not, the whole journal record will be skipped.

@azdagron azdagron modified the milestones: 1.8.7, 1.9.0 Dec 14, 2023
Comment on lines 608 to 609
tomorrow := test.clock.Now().Add(time.Hour * 24).Unix()
beforeThreshold := test.clock.Now().Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

	tomorrow := now.Add(time.Hour * 24).Unix()
	beforeThreshold := now.Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()

if err != nil {
return 0, fmt.Errorf("failed to get CA journal from database: %w", err)
}
if caJournal == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this caJournal is never used, since you are using := and never setting the ID, if found must not it be setting the id inside j.caJournalID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can safely remove this call to findCAJournal since a save operation will always happen after the loadJournalFromDS operation, which is the one that calls to findCAJournal and will update j.caJournalID.

// There is a CA journal record that has an active X509 authority
// ID that matches with one of the public keys of this server. This
// means that this record belongs to this server.
j.config.log.WithFields(logrus.Fields{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is connected to a previous comment, but looks like the response from this function is never used, so caJournalID will not be set and this log message is wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can safely remove the call to findCAJournal, and I'll fix to refer to telemetry.CAJournalID: caJournal.ID here.

// Verify entries is empty
s.AssertProtoEqual(&JournalEntries{}, journal.Entries())
spiretest.RequireProtoEqual(t, &journal.Entries{}, j.getEntries())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.Empty(t, j.getEntries())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we can use require.Empty here.
require.Empty asserts that the specified object is empty. I.e. nil, "", false, 0 or either a slice or a channel with len == 0.
We want to assert that the referenced struct is an empty struct (&journal.Entries{}) but not assert that is nil. We are noy asserting a slice.

require.NoError(t, err)
require.NoError(t, j.UpdateX509CAStatus(ctx, now, journal.Status_ACTIVE))

spiretest.RequireProtoEqual(t, j.getEntries(), test.loadJournalFromDS(t).getEntries())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both loads returns pointers, so if this fail for some reason we'll get a panic,
may we verify it is not empty before trying to get entries? (same comment for all load* calls)

require.NoError(t, err)

if !j.shouldBePruned {
testCase.expectedCAJournals = append(testCase.expectedCAJournals, caJournal)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since expected is not set when defining testCase, and is calculated using shouldBePruned, may you just add a var at function level?
Or just add the expect list hardcoded instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to populate the datastore with journal data, that's saved marshalled and each record is assigned an ID, I found easier to just build the expectedCAJournals list with the items returned from the SetCAJournal operation that we know that should not be pruned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved expectedCAJournals to a var.

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Copy link
Collaborator

@MarcosDY MarcosDY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@MarcosDY MarcosDY merged commit 1c8dc49 into spiffe:main Jan 3, 2024
32 checks passed
sriyer pushed a commit to spire-vault/spire that referenced this pull request Feb 23, 2024
…piffe#4690)

* Save and load the CA journal from datastore

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
rushi47 pushed a commit to rushi47/spire that referenced this pull request Apr 11, 2024
…piffe#4690)

* Save and load the CA journal from datastore

Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

On-disk data dependency: import journal into ca_journals table
4 participants