[RFC] Federation Graduation #1619

Closed
kunzimariano opened this issue Jun 4, 2020 · 9 comments · Fixed by #1656
Comments

@kunzimariano
Member

Enabling SPIFFE Federation in SPIRE today is done through an experimental configuration flag. With the goal of promoting it to a stable feature, we need to discuss some issues with the current implementation that may be relevant to various members of the community.

SPIRE Server currently relies on a file-based configuration that is only applied during startup. This approach has some consequences:

  • Enabling federation on an already running server requires a restart.
  • When running in HA mode, it opens the possibility of a configuration mismatch between servers in the same cluster. This creates federation issues, such as not all servers exposing the bundle endpoint; when starting from scratch, two servers in the same cluster could even be started with different trust domains, although that problem might be beyond the scope of graduating federation.

Possible fix for the configuration mismatch

Introducing some form of configuration validation could deal with the mismatch issue. For that purpose, we would need a way of knowing the global cluster configuration so we can compare it with what we are trying to configure. Doing this would prevent a new server from being added, or an existing one from being changed, with a configuration that doesn't align with the rest of the cluster.

In order to change the cluster configuration intentionally, we would need a way of expressing that we want to apply configuration changes to the whole cluster, so they aren't rejected by the configuration validation. We would also need to propagate the new configuration to every server in the cluster. A flag passed when starting the server could suffice to express the intent to change the configuration.
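
To make the idea concrete, here is a minimal sketch of what that validation step could look like. The ClusterConfig type and its fields are purely hypothetical (nothing like this exists in SPIRE today); the sketch only assumes some datastore-backed source of truth for cluster-wide settings.

// Hypothetical sketch only: none of these types exist in SPIRE.
package clusterconfig

import "errors"

// ClusterConfig is an assumed datastore-backed record of the settings
// that must match across every server in an HA cluster.
type ClusterConfig struct {
    TrustDomain           string
    BundleEndpointEnabled bool
    BundleEndpointPort    int
}

// Validate rejects a server whose local configuration diverges from the
// stored cluster configuration, unless the operator explicitly asked to
// change the cluster-wide configuration (e.g. via a startup flag).
func Validate(local, cluster ClusterConfig, applyClusterChange bool) error {
    if applyClusterChange {
        // Operator intent to reconfigure the whole cluster: accept the
        // local config and propagate it as the new cluster config.
        return nil
    }
    if local != cluster {
        return errors.New("local configuration does not match the cluster configuration")
    }
    return nil
}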

Possible fix for the restart requirement

Alternatively, we could move the federation configuration out of the configuration file altogether and define an API for this purpose. Doing this would have the benefit of not needing to restart the server, and it would remove the possibility of servers ending up with different federation configurations. It would also reduce cluster management, when it comes to federation, to a single entry point. The main difference is that at least one server would have to be running in order to enable federation, as opposed to configuring it in advance.
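
As a rough sketch of what such an API surface might offer (hypothetical names only, not an existing or planned SPIRE API):

// Hypothetical sketch only: this interface and its types are illustrative
// and do not correspond to an existing SPIRE API.
package federation

import "context"

// Relationship describes how to reach the bundle endpoint of a federated
// trust domain.
type Relationship struct {
    TrustDomain  string // e.g. "spiffe://domain1.test"
    EndpointAddr string // e.g. "1.2.3.4:8443"
    UseWebPKI    bool
}

// Manager would be consulted by every server in the cluster, so federation
// state lives in one place (the datastore) rather than in per-server
// configuration files.
type Manager interface {
    CreateRelationship(ctx context.Context, r Relationship) error
    UpdateRelationship(ctx context.Context, r Relationship) error
    DeleteRelationship(ctx context.Context, trustDomain string) error
    ListRelationships(ctx context.Context) ([]Relationship, error)
}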

Request for Comments

Does this sound reasonable? Are there other approaches to consider? Should we only address one of the issues or both of them?

@azdagron
Member

azdagron commented Jun 9, 2020

There are two parts to the federation configuration:

  1. configuration for the bundle endpoint served by the SPIRE server
  2. configuration for the bundle endpoints for trust domains the SPIRE server should federate with

I think number 1 is not likely to change (with any sort of frequency) over time. I think it is a great candidate to remain in the configuration file.

Number 2, however, may require some debate. Federation relationships aren't static and may change over time. The configuration file may not be the right approach.

Regarding "cluster" configuration and mismatch detection, what would be the source of truth? How would servers interact with that source of truth? The federation related configuration isn't the only thing that needs to match in HA. Other examples include the datastore configuration. I kind of feel that solving this problem is out of scope for graduating federation.

Regarding an API driven configuration, I think that has some promise. It does require that the cluster be up and running before you can make modifications to the federation relationships, but I think that would be ok. An outside process could "reconcile" the federation relationships using the API offered by the cluster.
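
A minimal sketch of what such an outside reconciler could look like, assuming some client call (here just a function value) for whatever API the server ends up exposing; all names are illustrative:

// Hypothetical sketch of an admin workload that periodically pushes a
// desired set of federation relationships to the SPIRE server. The types
// and names are illustrative only.
package reconciler

import (
    "context"
    "log"
    "time"
)

type Relationship struct {
    TrustDomain  string
    EndpointAddr string
    UseWebPKI    bool
}

// Apply abstracts the server API call that creates or updates a single
// relationship; a real implementation would be a gRPC or HTTP client.
type Apply func(ctx context.Context, r Relationship) error

// Reconcile pushes the desired state on a fixed interval so the cluster
// converges even if an individual call fails transiently.
func Reconcile(ctx context.Context, desired []Relationship, apply Apply, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        for _, r := range desired {
            if err := apply(ctx, r); err != nil {
                log.Printf("failed to apply relationship for %s: %v", r.TrustDomain, err)
            }
        }
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}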

Outside of how federation is configured, I think there is an additional problem we might need to solve before we can "graduate" federation: namely, what happens when two or more servers race to update the federated bundle. For example, consider the following:

  1. Server A obtains bundle revision 1.
  2. Bundle is revised.
  3. Server B obtains bundle revision 2.
  4. Server B stores bundle revision 2.
  5. Server A stores bundle revision 1.

The above race would cause the bundle to stay at revision 1 until the next poll period. This may be an acceptable race since bundle revisions should be published well in advance, but nevertheless I think it's good to think through.

@APTy
Contributor

APTy commented Jun 11, 2020

While I agree federation relationships are likely to change more frequently than the bundle endpoint itself, those changes are still infrequent on an absolute scale - i.e. I'd expect to need to update federated bundles a few times a month, not a few times a day (though correct me if there are use cases that could warrant daily federated bundle updates). Given this, I'd lean towards config that can be checked into source control, as IMHO it's simpler to reason about and doesn't require another process for reconciliation.

Regarding the source of truth and races on writing the bundle, would an explicit revision number (or timestamp) solve this problem? Servers would always respect a higher revision number, and would use something like a SELECT ... FOR UPDATE to ensure the revision being written is >= the current revision number. The problem with this is that the revision number would probably leak into the config file (and require updates from users), which isn't great UX. But without some kind of optimistic locking mechanism like this, I think we need to live with the eventual consistency of the bundle object.
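
A rough sketch of that optimistic-locking idea with a revision column (the table and column names are made up and do not reflect SPIRE's actual datastore schema):

// Hypothetical sketch of optimistic locking for bundle writes: the update
// only succeeds when the revision being stored is newer than the stored
// one, so the losing side of the race becomes a no-op instead of
// overwriting newer data. Table and column names are illustrative only.
package bundlestore

import (
    "context"
    "database/sql"
    "errors"
)

func StoreBundleIfNewer(ctx context.Context, db *sql.DB, trustDomain string, revision int64, data []byte) error {
    res, err := db.ExecContext(ctx,
        `UPDATE bundles SET data = ?, revision = ? WHERE trust_domain = ? AND revision < ?`,
        data, revision, trustDomain, revision)
    if err != nil {
        return err
    }
    n, err := res.RowsAffected()
    if err != nil {
        return err
    }
    if n == 0 {
        return errors.New("stale bundle revision: a newer revision is already stored")
    }
    return nil
}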

@azdagron
Member

We (maintainers) talked about this quite a bit. There is consensus that for many environments the federation relationships between trust domains would change infrequently, and that the config file is good enough for what is likely a majority of deployments. For deployments that manage frequently changing federation relationships, nothing stops them from managing those relationships outside the config file, instead building a sidecar or admin workload that fetches and updates the federated bundles via the SPIRE server APIs.

I think we're ok moving forward with a configuration file based approach based on the current experimental configuration.

Regarding the race condition around updating bundles, there is certainly more we can do, in particular with a little more support from the datastore to detect and correct the race. However, the SPIFFE Federation specification (in progress) gives some specific guidance that a new key for a trust domain be published in advance, between 3-5x the bundle refresh hint. This means that even if the race condition is hit and a bundle update is lost until the next poll period, SPIRE should be able to correct itself well before any new keys are put in use.
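
As an illustrative example (the 1 hour value is only a placeholder): with a bundle refresh hint of 1 hour, that guidance would have a new key published 3 to 5 hours before it is used. Even if the race drops one update, the bundle is fetched again at the next hourly poll, still hours before the new key matters.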

This doesn't mean we don't try and mitigate the race condition, but I think it does mean that we can carry forward with "graduating" the federation support in SPIRE and make this more robust down the road.

@azdagron
Member

Right now the experimental configuration looks something like this:

experimental {
    bundle_endpoint_enabled = true
    bundle_endpoint_address = "0.0.0.0"
    bundle_endpoint_port = 8443
    bundle_endpoint_acme = {
        domain_name = "example.org"
        tos_accepted = true
    }

    federates_with "spiffe://domain1.test" {
        bundle_endpoint_address = "1.2.3.4"
    }
    federates_with "spiffe://domain2.test" {
        bundle_endpoint_address = "5.6.7.8"
        use_web_pki = true
    }
}

How do people feel about changing it to the following for configuring the server's bundle endpoint?

server {
    bundle_endpoint {
        address = "0.0.0.0"
        port = 8443
        acme {
            domain_name = "example.org"
        }
    }
}

And for the configuration for consuming the bundle endpoint of another trust domain?

server {
    federate_with "spiffe://domain1.test" {
        bundle_endpoint {
            address = "1.2.3.4
            use_web_pki = true
        }
    }
}

@martincapello
Contributor

Looks good to me. I'll start working on refactoring the server configuration as specified, and document it as I go.

@martincapello
Contributor

martincapello commented Jun 17, 2020

Per #1656 (comment) the configuration would now look like:

server {
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
            acme {
                domain_name = "example.org"
            }
        }
        federate_with "spiffe://domain1.test" {
            bundle_endpoint {
                address = "1.2.3.4"
                use_web_pki = true
            }
        }
    }
}

@martincapello
Contributor

Is there any issue with using federates_with instead of federate_with? We already use the former in other places, as @amartinezfayo pointed out.

@azdagron
Member

I think keeping it consistent with federates_with is probably the right call.

@martincapello
Contributor

Cool, then the configuration ends up looking like:

server {
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
            acme {
                domain_name = "example.org"
            }
        }
        federates_with "spiffe://domain1.test" {
            bundle_endpoint {
                address = "1.2.3.4"
                use_web_pki = true
            }
        }
    }
}
