[RFC] Federation Graduation #1619

Closed
kunzimariano opened this issue Jun 4, 2020 · 9 comments · Fixed by #1656
Comments

@kunzimariano
Member

Enabling SPIFFE Federation in SPIRE today is done through an experimental configuration flag. With the goal of promoting it to a stable feature, we need to discuss some issues with the current implementation that may be relevant to various members of the community.

SPIRE Server currently relies on a file-based configuration that is only applied during startup. This approach has some consequences:

  • Enabling federation on an already running server requires a restart.
  • When running in HA mode, it opens the possibility of a configuration mismatch between servers in the same cluster. This creates federation issues, such as not all servers exposing the bundle endpoint; when starting from scratch, two servers in the same cluster could even be started with different trust domains, although that problem might be beyond the scope of graduating federation.

Possible fix for the configuration mismatch

Introducing some form of configuration validation could deal with the mismatch issue. For that purpose, we would need a way of knowing the global cluster configuration so we can compare it with what we are trying to configure. Doing this would prevent a new server from being added, or an existing one from being changed, with a configuration that doesn't align with the rest of the cluster.

In order to change the cluster configuration intentionally, we would need a way of expressing that we want to apply configuration changes to the whole cluster, so they aren't rejected by the configuration validation. We would also need to propagate the new configuration to every server in the cluster. A flag passed when starting the server could suffice to express the intent to change the configuration.
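
To make the idea concrete, here is a minimal sketch of what that validation step could look like. The ClusterConfig type and its fields are purely hypothetical (nothing like this exists in SPIRE today); the sketch only assumes some datastore-backed source of truth for cluster-wide settings.

// Hypothetical sketch only: none of these types exist in SPIRE.
package clusterconfig

import "errors"

// ClusterConfig is an assumed datastore-backed record of the settings
// that must match across every server in an HA cluster.
type ClusterConfig struct {
    TrustDomain           string
    BundleEndpointEnabled bool
    BundleEndpointPort    int
}

// Validate rejects a server whose local configuration diverges from the
// stored cluster configuration, unless the operator explicitly asked to
// change the cluster-wide configuration (e.g. via a startup flag).
func Validate(local, cluster ClusterConfig, applyClusterChange bool) error {
    if applyClusterChange {
        // Operator intent to reconfigure the whole cluster: accept the
        // local config and propagate it as the new cluster config.
        return nil
    }
    if local != cluster {
        return errors.New("local configuration does not match the cluster configuration")
    }
    return nil
}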

Possible fix for the restart requirement

Alternatively, we could move the federation configuration out of the configuration file altogether and define an API for this purpose. Doing this would have the benefit of not needing to restart the server, and it would remove the possibility of servers ending up with different federation configurations. It would also reduce cluster management, when it comes to federation, to a single entry point. The main difference is that at least one server would have to be running in order to enable federation, as opposed to configuring it in advance.
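
As a rough sketch of what such an API surface might offer (hypothetical names only, not an existing or planned SPIRE API):

// Hypothetical sketch only: this interface and its types are illustrative
// and do not correspond to an existing SPIRE API.
package federation

import "context"

// Relationship describes how to reach the bundle endpoint of a federated
// trust domain.
type Relationship struct {
    TrustDomain  string // e.g. "spiffe://domain1.test"
    EndpointAddr string // e.g. "1.2.3.4:8443"
    UseWebPKI    bool
}

// Manager would be consulted by every server in the cluster, so federation
// state lives in one place (the datastore) rather than in per-server
// configuration files.
type Manager interface {
    CreateRelationship(ctx context.Context, r Relationship) error
    UpdateRelationship(ctx context.Context, r Relationship) error
    DeleteRelationship(ctx context.Context, trustDomain string) error
    ListRelationships(ctx context.Context) ([]Relationship, error)
}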

Request for Comments

Does this sound reasonable? Are there other approaches to consider? Should we only address one of the issues or both of them?

@azdagron
Member

azdagron commented Jun 9, 2020

There are two parts to the federation configuration:

  1. configuration for the bundle endpoint served by the SPIRE server
  2. configuration for the bundle endpoints for trust domains the SPIRE server should federate with

I think number 1 is not likely to change (with any sort of frequency) over time. I think it is a great candidate to remain in the configuration file.

Number 2, however, may require some debate. Federation relationships aren't static and may change over time. The configuration file may not be the right approach.

Regarding "cluster" configuration and mismatch detection, what would be the source of truth? How would servers interact with that source of truth? The federation related configuration isn't the only thing that needs to match in HA. Other examples include the datastore configuration. I kind of feel that solving this problem is out of scope for graduating federation.

Regarding an API driven configuration, I think that has some promise. It does require that the cluster be up and running before you can make modifications to the federation relationships, but I think that would be ok. An outside process could "reconcile" the federation relationships using the API offered by the cluster.
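
A minimal sketch of what such an outside reconciler could look like, assuming some client call (here just a function value) for whatever API the server ends up exposing; all names are illustrative:

// Hypothetical sketch of an admin workload that periodically pushes a
// desired set of federation relationships to the SPIRE server. The types
// and names are illustrative only.
package reconciler

import (
    "context"
    "log"
    "time"
)

type Relationship struct {
    TrustDomain  string
    EndpointAddr string
    UseWebPKI    bool
}

// Apply abstracts the server API call that creates or updates a single
// relationship; a real implementation would be a gRPC or HTTP client.
type Apply func(ctx context.Context, r Relationship) error

// Reconcile pushes the desired state on a fixed interval so the cluster
// converges even if an individual call fails transiently.
func Reconcile(ctx context.Context, desired []Relationship, apply Apply, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        for _, r := range desired {
            if err := apply(ctx, r); err != nil {
                log.Printf("failed to apply relationship for %s: %v", r.TrustDomain, err)
            }
        }
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}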

Outside of how federation is configured, I think there is an additional problem we might need to solve before we can "graduate" federation: namely, what happens when two or more servers race to update the federated bundle. For example, consider the following:

  1. Server A obtains bundle revision 1.
  2. Bundle is revised.
  3. Server B obtains bundle revision 2.
  4. Server B stores bundle revision 2.
  5. Server A stores bundle revision 1.

The above race would cause the bundle to stay at revision 1 until the next poll period. This may be an acceptable race since bundle revisions should be published well in advance, but nevertheless I think it's good to think through.

@APTy
Contributor

APTy commented Jun 11, 2020

While I agree federation relationships are likely to change more frequently than the bundle endpoint itself, those changes are still infrequent on an absolute scale - i.e. I'd expect to need to update federated bundles a few times a month, not a few times a day (though correct me if there are use cases that could warrant daily federated bundle updates). Given this, I'd lean towards config that can be checked into source control, as IMHO it's simpler to reason about and doesn't require another process for reconciliation.

Regarding the source of truth and races on writing the bundle, would an explicit revision number (or timestamp) solve this problem? Servers would always respect a higher revision number, and would use something like a SELECT ... FOR UPDATE to ensure the revision being written is >= the current revision number. The problem with this is that the revision number would probably leak into the config file (and require updates from users), which isn't great UX. But without some kind of optimistic locking mechanism like this, I think we need to live with the eventual consistency of the bundle object.
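
A rough sketch of that optimistic-locking idea with a revision column (the table and column names are made up and do not reflect SPIRE's actual datastore schema):

// Hypothetical sketch of optimistic locking for bundle writes: the update
// only succeeds when the revision being stored is newer than the stored
// one, so the losing side of the race becomes a no-op instead of
// overwriting newer data. Table and column names are illustrative only.
package bundlestore

import (
    "context"
    "database/sql"
    "errors"
)

func StoreBundleIfNewer(ctx context.Context, db *sql.DB, trustDomain string, revision int64, data []byte) error {
    res, err := db.ExecContext(ctx,
        `UPDATE bundles SET data = ?, revision = ? WHERE trust_domain = ? AND revision < ?`,
        data, revision, trustDomain, revision)
    if err != nil {
        return err
    }
    n, err := res.RowsAffected()
    if err != nil {
        return err
    }
    if n == 0 {
        return errors.New("stale bundle revision: a newer revision is already stored")
    }
    return nil
}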

@azdagron
Member

We (maintainers) talked about this quite a bit. There is consensus that for many environments the federation relationships between trust domains would change infrequently, and that the config file is good enough for what is likely a majority of deployments. For deployments that manage frequently changing federation relationships, nothing stops them from managing those relationships outside the config file, instead building a sidecar or admin workload that fetches and updates the federated bundles via the SPIRE server APIs.

I think we're ok moving forward with a configuration file based approach based on the current experimental configuration.

Regarding the race condition around updating bundles, there is certainly more we can do, in particular with a little more support from the datastore to detect and correct the race. However, the SPIFFE Federation specification (in progress) gives some specific guidance that a new key for a trust domain be published in advance, between 3-5x the bundle refresh hint. This means that even if the race condition is hit and a bundle update is lost until the next poll period, SPIRE should be able to correct itself well before any new keys are put in use.
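
As an illustrative example (the 1 hour value is only a placeholder): with a bundle refresh hint of 1 hour, that guidance would have a new key published 3 to 5 hours before it is used. Even if the race drops one update, the bundle is fetched again at the next hourly poll, still hours before the new key matters.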

This doesn't mean we don't try and mitigate the race condition, but I think it does mean that we can carry forward with "graduating" the federation support in SPIRE and make this more robust down the road.

@azdagron
Member

Right now the experimental configuration looks something like this:

experimental {
    bundle_endpoint_enabled = true
    bundle_endpoint_address = "0.0.0.0"
    bundle_endpoint_port = 8443
    bundle_endpoint_acme = {
        domain_name = "example.org"
        tos_accepted = true
    }

    federates_with "spiffe://domain1.test" {
        bundle_endpoint_address = "1.2.3.4"
    }
    federates_with "spiffe://domain2.test" {
        bundle_endpoint_address = "5.6.7.8"
        use_web_pki = true
    }
}

How do people feel about changing it to the following for configuring the server's bundle endpoint?

server {
    bundle_endpoint {
        address = "0.0.0.0"
        port = 8443
        acme {
            domain_name = "example.org"
        }
    }
}

And for the configuration for consuming the bundle endpoint of another trust domain?

server {
    federate_with "spiffe://domain1.test" {
        bundle_endpoint {
            address = "1.2.3.4
            use_web_pki = true
        }
    }
}

@martincapello
Contributor

Looks good to me. I'll start working on refactoring the server configuration as specified, and document it as I go.

@martincapello
Contributor

martincapello commented Jun 17, 2020

Per #1656 (comment) the configuration would now look like:

server {
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
            acme {
                domain_name = "example.org"
            }
        }
        federate_with "spiffe://domain1.test" {
            bundle_endpoint {
                address = "1.2.3.4"
                use_web_pki = true
            }
        }
    }
}

@martincapello
Contributor

Is there any issue with using federates_with instead of federate_with? We already use the former in other places, as @amartinezfayo pointed out.

@azdagron
Member

I think keeping it consistent with federates_with is probably the right call.

@martincapello
Contributor

Cool, then the configuration ends up looking like:

server {
    federation {
        bundle_endpoint {
            address = "0.0.0.0"
            port = 8443
            acme {
                domain_name = "example.org"
            }
        }
        federates_with "spiffe://domain1.test" {
            bundle_endpoint {
                address = "1.2.3.4"
                use_web_pki = true
            }
        }
    }
}
