| 1 | +A83: xDS GCP Authentication Filter |
| 2 | +---- |
| 3 | +* Author(s): @markdroth |
| 4 | +* Approver: @ejona86, @dfawley |
| 5 | +* Status: {Draft, In Review, Ready for Implementation, Implemented} |
| 6 | +* Implemented in: <language, ...> |
| 7 | +* Last updated: 2024-10-25 |
| 8 | +* Discussion at: https://groups.google.com/g/grpc-io/c/76a0zWJChX4 |
| 9 | + |
| 10 | +## Abstract |
| 11 | + |
| 12 | +In service mesh environments, there are cases where intermediate proxies |
| 13 | +make it impossible to rely on mTLS for end-to-end authentication. These |
| 14 | +cases can be addressed instead by the use of service account identity |
| 15 | +JWT tokens. The xDS [GCP Authentication |
| 16 | +filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/gcp_authn_filter) |
| 17 | +provides a mechanism for attaching such JWT tokens as gRPC call |
| 18 | +credentials on GCP. We will add support for this filter in gRPC. |
| 19 | + |
| 20 | +## Background |
| 21 | + |
| 22 | +gRPC already supports a framework for xDS HTTP filters, as described in |
| 23 | +[gRFC A39][A39]. We will support the GCP Authentication filter under |
| 24 | +this framework. |
| 25 | + |
| 26 | +### Related Proposals: |
| 27 | +* [gRFC A39: xDS HTTP Filters][A39] |
| 28 | +* [gRFC A60: xDS-Based Stateful Session Affinity for Weighted Clusters][A60] |
| 29 | +* [gRFC A74: xDS Config Tears][A74] |
| 30 | +* [RFC-7519: JSON Web Token (JWT)][RFC-7519] |
| 31 | + |
| 32 | +[A39]: A39-xds-http-filters.md |
| 33 | +[A60]: A60-xds-stateful-session-affinity-weighted-clusters.md |
| 34 | +[A74]: A74-xds-config-tears.md |
| 35 | +[RFC-7519]: https://datatracker.ietf.org/doc/html/rfc7519 |
| 36 | + |
| 37 | +## Proposal |
| 38 | + |
| 39 | +We will support the GCP Authentication xDS HTTP filter in the gRPC client. |
| 40 | + |
| 41 | +### Call Credentials |
| 42 | + |
| 43 | +Note: This section is intended for gRPC implementations that need to |
| 44 | +implement a new call credential type for GCP service account identity |
| 45 | +tokens. Implementations that already support this functionality (e.g., |
| 46 | +by depending on an external Google Auth library) may continue to use |
| 47 | +their existing functionality, even if the behavior differs in small ways |
| 48 | +from what is described in this section. |
| 49 | + |
| 50 | +gRPC should support a GcpServiceAccountIdentityCallCredentials call |
| 51 | +credentials type, which is not xDS-specific. This credential type will |
| 52 | +be instantiated with one parameter, which is the audience to be encoded |
| 53 | +into the JWT token. The credential object will handle fetching the |
| 54 | +token on-demand and caching it based on the token's expiration time. |
| 55 | + |
| 56 | +To handle potential clock skew issues and to account for processing time |
| 57 | +on the server, the credential will set the cache expiration time to be |
| 58 | +30 seconds before the expiration time encoded in the token. All logic |
| 59 | +in the call credential code will use this modified expiration time |
| 60 | +instead of the expiration time encoded in the token. |
| 61 | + |
| 62 | +When the credential is asked for a token for a data |
| 63 | +plane RPC, if the token is not yet cached or the cached |
| 64 | +token will expire within some fixed refresh interval |
| 65 | +(typically 1 minute), the credential will start an HTTP request (if there |
| 66 | +is not already one pending) to |
| 67 | +`http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=[AUDIENCE]`, |
| 68 | +where `[AUDIENCE]` is replaced with the audience specified when the |
| 69 | +credential object was instantiated. The HTTP request will include the |
| 70 | +header `Metadata-Flavor: Google`. |
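
As an illustration only, the request described above might be constructed as in the following Python sketch. The helper name `build_identity_request` is hypothetical, and percent-encoding the audience in the query string is an assumption that the text does not spell out. The sketch only builds the request; it does not send it.

```python
import urllib.parse
import urllib.request

# URL and header name/value are taken from the text above.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_identity_request(audience: str) -> urllib.request.Request:
    """Build (but do not send) the metadata server request for `audience`."""
    # Assumption: the audience is percent-encoded as a query parameter.
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )
```

A real implementation would issue this request asynchronously and share a single pending request among all queued data plane RPCs.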
| 71 | + |
| 72 | +When a data plane RPC starts, if the token is cached and is not expired, |
| 73 | +the token will immediately be added to the RPC, and the RPC will continue. |
| 74 | +Otherwise (i.e., before the token is initially obtained or after the |
| 75 | +cached token has expired), the data plane RPC will be queued until the |
| 76 | +HTTP request completes. When the HTTP request completes, the result |
| 77 | +(either success or failure, as described below) will be applied to all |
| 78 | +queued data plane RPCs. |
| 79 | + |
| 80 | +Note that when the token's expiration time is less than the refresh |
| 81 | +interval in the future, a new data plane RPC being started will trigger |
| 82 | +a new HTTP request, but the cached token value will still be used for |
| 83 | +that data plane RPC. This pre-emptive re-fetching is intended to avoid |
| 84 | +periodic latency spikes when refreshing the token. |
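
The per-RPC decision described in the preceding paragraphs can be summarized in a small sketch. This is not a prescribed implementation: the names `TokenDecision` and `decide` are hypothetical, `cache_expiry` is assumed to be the already-adjusted expiration time (30 seconds before the token's `exp`), and backoff handling is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

REFRESH_INTERVAL_SECONDS = 60.0  # "typically 1 minute" per the text


@dataclass
class TokenDecision:
    use_cached_token: bool  # attach the cached token immediately
    start_fetch: bool       # start an HTTP request (none is pending yet)
    queue_rpc: bool         # hold the RPC until the HTTP request completes


def decide(now: float, cache_expiry: Optional[float],
           fetch_pending: bool) -> TokenDecision:
    """Decide what to do for one data plane RPC starting at time `now`."""
    have_valid = cache_expiry is not None and now < cache_expiry
    # Refresh if there is no usable token, or if it expires soon.
    needs_refresh = (
        not have_valid or cache_expiry - now < REFRESH_INTERVAL_SECONDS
    )
    return TokenDecision(
        use_cached_token=have_valid,
        start_fetch=needs_refresh and not fetch_pending,
        queue_rpc=not have_valid,
    )
```

For example, with a token whose adjusted expiration is 30 seconds away, the RPC proceeds with the cached token while a refresh is started in the background.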
| 85 | + |
| 86 | +If the HTTP request fails, all queued data plane RPCs will be failed |
| 87 | +with a gRPC status determined based on the returned HTTP status. If the |
| 88 | +returned HTTP status maps to `UNAVAILABLE` in [HTTP to gRPC Status Code |
| 89 | +Mapping](https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md), |
| 90 | +then the data plane RPCs will be failed with status `UNAVAILABLE`; |
| 91 | +otherwise, they will be failed with status `UNAUTHENTICATED`. If the |
| 92 | +request fails without an HTTP status (e.g., an I/O error), all queued |
| 93 | +data plane RPCs will be failed with `UNAVAILABLE` status. |
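
A minimal sketch of this status selection follows. The function name `status_for_failed_fetch` is hypothetical, and statuses are represented as plain strings for brevity. The set of HTTP codes reflects the entries that the linked mapping document assigns to `UNAVAILABLE` (429, 502, 503, and 504).

```python
from typing import Optional

# HTTP statuses that the HTTP-to-gRPC mapping document maps to UNAVAILABLE.
_HTTP_MAPS_TO_UNAVAILABLE = frozenset({429, 502, 503, 504})


def status_for_failed_fetch(http_status: Optional[int]) -> str:
    """Return the gRPC status name used to fail queued data plane RPCs.

    `http_status` is None when the request failed without an HTTP status
    (e.g., an I/O error).
    """
    if http_status is None:
        return "UNAVAILABLE"
    if http_status in _HTTP_MAPS_TO_UNAVAILABLE:
        return "UNAVAILABLE"
    return "UNAUTHENTICATED"
```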
| 94 | + |
| 95 | +If the HTTP request succeeds, the body of the response will contain the |
| 96 | +JWT token, which the client will cache. The client does not need to
| 97 | +do full [RFC-7519] validation of the token (that is the responsibility |
| 98 | +of the server side), but it does need to extract the `exp` field for |
| 99 | +caching purposes. If the `exp` field cannot be extracted (i.e., the JWT |
| 100 | +token is invalid), all queued data plane RPCs will be failed with status |
| 101 | +`UNAUTHENTICATED`. Otherwise, the cache is updated, and the returned |
| 102 | +token is added to all queued data plane RPCs, which may then continue. |
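
Extracting `exp` without full validation might look like the following sketch, which decodes only the payload segment of the token. The function name `extract_cache_expiry` is hypothetical, and treating any token that does not have exactly three dot-separated segments as invalid is an assumption of this sketch. The 30-second adjustment comes from the caching rule described earlier in this section.

```python
import base64
import json
from typing import Optional

EXPIRY_SKEW_SECONDS = 30  # cache entry expires 30 seconds before `exp`


def extract_cache_expiry(token: str) -> Optional[int]:
    """Return `exp` minus the skew, or None if `exp` cannot be extracted."""
    parts = token.strip().split(".")
    if len(parts) != 3:  # header.payload.signature
        return None
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    if isinstance(exp, bool) or not isinstance(exp, (int, float)):
        return None
    return int(exp) - EXPIRY_SKEW_SECONDS
```

A `None` result corresponds to the invalid-token case above, in which queued data plane RPCs are failed with `UNAUTHENTICATED`.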
| 103 | + |
| 104 | +If the HTTP request does not result in the cache being updated (i.e., |
| 105 | +if the HTTP request fails or if it returns an invalid JWT token), |
| 106 | +[backoff](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) |
| 107 | +must be applied before the next attempt may be started. If a data |
| 108 | +plane RPC is started when there is no cached token available and while |
| 109 | +in backoff delay, it will be failed with the status from the last HTTP |
| 110 | +request attempt. When the backoff delay expires, the next data plane |
| 111 | +RPC will trigger a new attempt. Note that no attempt should be started |
| 112 | +until and unless a data plane RPC is started, since we do not want to |
| 113 | +unnecessarily retry if the channel is idle. The backoff state will be |
| 114 | +reset once there is a successful HTTP request. |
| 115 | + |
| 116 | +To add the token to a data plane RPC, the call credential will add a |
| 117 | +header named `authorization`. The header value will be the string |
| 118 | +`Bearer ` (note trailing space) followed by the token value. |
| 119 | + |
| 120 | +### xDS HTTP Filter Configuration |
| 121 | + |
| 122 | +The xDS HTTP filter will be configured via the |
| 123 | +[`extensions.filters.http.gcp_authn.v3.GcpAuthnFilterConfig` |
| 124 | +message](https://github.com/envoyproxy/envoy/blob/c16faca3619fb44c24b12d15aad8a797b9e210ab/api/envoy/extensions/filters/http/gcp_authn/v3/gcp_authn.proto#L27). |
| 125 | +The fields will be interpreted as follows:
| 126 | +- `cache_config`: Optional. Within this message: |
| 127 | +  - `cache_size`: Optional. If set, must be greater than 0. Defaults
| 128 | +    to 10. Implementations that cannot support caches as large as
| 129 | +    `UINT64_MAX` may cap this value at their maximum supported size.
| 130 | +- `http_uri`: Ignored by gRPC. |
| 131 | +- `token_header`: Ignored by gRPC. |
| 132 | +- `retry_policy`: Ignored by gRPC. |
| 133 | +- `cluster`: Ignored by gRPC. |
| 134 | +- `timeout`: Ignored by gRPC. |
| 135 | + |
| 136 | +Note that this filter does not support having its config overridden in a |
| 137 | +`typed_per_filter_config` field on a per-route, per-virtualhost, or |
| 138 | +per-clusterweight basis. If the filter's config message appears in a |
| 139 | +`typed_per_filter_config` field, it will be validated as part of the |
| 140 | +normal resource validation, but the configuration will not actually be |
| 141 | +used. |
| 142 | + |
| 143 | +### xDS Cluster Metadata |
| 144 | + |
| 145 | +The GCP Authentication filter uses cluster metadata from the |
| 146 | +[`Cluster.metadata` |
| 147 | +field](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/cluster/v3/cluster.proto#L1092) |
| 148 | +to configure the audience. We will process this field when validating |
| 149 | +the CDS resource and convert it into a map, which will be added to |
| 150 | +the parsed cluster resource that is passed to the XdsClient watcher. |
| 151 | + |
| 152 | +The metadata field is a message that actually contains two maps: |
| 153 | +- [`filter_metadata`](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/core/v3/base.proto#L248): |
| 154 | + This map contains `google.protobuf.Struct` values, which we will |
| 155 | + convert to parsed JSON form. |
| 156 | +- [`typed_filter_metadata`](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/core/v3/base.proto#L257): |
| 157 | + This map contains `google.protobuf.Any` fields. To support this, we |
| 158 | + will use a registry-like approach for metadata types (may be an actual |
| 159 | + registry or just a block of code that supports the known protobuf |
| 160 | + message types) that handles parsing the `google.protobuf.Any` field |
| 161 | + and converting it to some internal form appropriate for the implementation
| 162 | + (e.g., JSON or a native struct). |
| 163 | + |
| 164 | +The value for a given metadata key will come from only one of the |
| 165 | +two maps; the value from `filter_metadata` will be used only if the |
| 166 | +key is not present or is of an unknown protobuf message type in |
| 167 | +`typed_filter_metadata`. In the resulting map in the parsed cluster |
| 168 | +resource, the map value will contain the type of the original message |
| 169 | +(`google.protobuf.Struct` if it came from the `filter_metadata` map) and |
| 170 | +a parsed representation of the content. The parsed representation may |
| 171 | +be either JSON or the appropriate internal form, depending on which of |
| 172 | +the two maps the entry came from. |
| 173 | + |
| 174 | +The logic to validate cluster metadata will look something like this |
| 175 | +(pseudo-code): |
| 176 | + |
| 177 | +```python |
| 178 | +parsed_metadata = {} # Value is either JSON or parsed object |
| 179 | +# First process typed_filter_metadata. |
| 180 | +for key, any_field in cluster_metadata.typed_filter_metadata.items():
| 181 | +  parser = metadata_registry.FindParser(any_field.type_url)
| 182 | +  if parser is not None:
| 183 | +    value = parser.Parse(any_field.value)
| 184 | +    if value is None:
| 185 | +      return NACK # Parsing failed, reject resource
| 186 | +    parsed_metadata[key] = value
| 187 | +# Now process filter_metadata. We look only at keys that were not
| 188 | +# already added from typed_filter_metadata.
| 189 | +for key, struct_field in cluster_metadata.filter_metadata.items():
| 190 | +  if key not in parsed_metadata:
| 191 | +    parsed_metadata[key] = ConvertToJson(struct_field)
| 192 | +``` |
| 193 | + |
| 194 | +For now, the only registered metadata type we support is |
| 195 | +[`extensions.filters.http.gcp_authn.v3.Audience`](https://github.com/envoyproxy/envoy/blob/c16faca3619fb44c24b12d15aad8a797b9e210ab/api/envoy/extensions/filters/http/gcp_authn/v3/gcp_authn.proto#L66). |
| 196 | +In this message, the `url` field must be non-empty; if empty, the |
| 197 | +resource will be NACKed. The parsed representation of this message can |
| 198 | +be a simple string. |
| 199 | + |
| 200 | +### xDS ConfigSelector Behavior |
| 201 | + |
| 202 | +As per [gRFC A60][A60], we currently pass the selected cluster name via |
| 203 | +a call attribute for access in filters. However, the filters will now |
| 204 | +also need access to the CDS resource for the selected cluster, so that |
| 205 | +the GCP Authentication filter can access the cluster metadata for the |
| 206 | +selected cluster. This data is available via the `XdsConfig` attribute |
| 207 | +introduced in [A74]. If the xDS ConfigSelector is not already passing |
| 208 | +that attribute to the filters, it will need to be changed to do so. |
| 209 | + |
| 210 | +### Filter Call Credentials Cache |
| 211 | + |
| 212 | +The filter will maintain a cache of |
| 213 | +GcpServiceAccountIdentityCallCredentials instances, one for each audience, |
| 214 | +along with a last-used list that tracks how recently the entries |
| 215 | +have been used. As an entry is used, it is moved to the front of the |
| 216 | +last-used list. The maximum number of entries in the cache is bounded |
| 217 | +by the config field `cache_config.cache_size`; if the cache exceeds that |
| 218 | +size, then entries will be removed starting from the end of the |
| 219 | +last-used list. |
| 220 | + |
| 221 | +Note that the `cache_config.cache_size` parameter in the filter config |
| 222 | +is a channel-level parameter, not settable per-route, and we want the |
| 223 | +cache itself to be shared across all routes. Implementations that create |
| 224 | +separate filter/interceptor instances for each route should share the |
| 225 | +cache between those instances. |
| 226 | + |
| 227 | +It is desirable to avoid losing this cache when we get an xDS Listener or |
| 228 | +RouteConfiguration update, so that we don't wind up needlessly refetching |
| 229 | +tokens after the update. Implementations should provide a mechanism for |
| 230 | +new instances of the filter to retain the cache from previous instances. |
| 231 | + |
| 232 | +If an LDS update changes the cache size, the filter must apply that |
| 233 | +change to the cache. If the cache currently has more entries in it than |
| 234 | +the new cache size, then the least recently used entries will be removed |
| 235 | +to make the cache adhere to the new size limit. |
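
One possible shape for such a cache is sketched below using Python's `collections.OrderedDict` as the last-used list. The class name `CallCredsCache`, its methods, and the `factory` callback are hypothetical and only illustrate the move-to-front, evict-from-end, and resize behavior described in this section. Thread safety and sharing across filter instances are not shown.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

C = TypeVar("C")  # the call credentials type


class CallCredsCache(Generic[C]):
    """LRU cache of call credentials, keyed by audience."""

    def __init__(self, max_size: int, factory: Callable[[str], C]) -> None:
        self._max_size = max_size
        self._factory = factory
        # Front of the dict is the most recently used entry.
        self._entries: "OrderedDict[str, C]" = OrderedDict()

    def get(self, audience: str) -> C:
        """Return the credentials for `audience`, creating them if needed."""
        if audience not in self._entries:
            self._entries[audience] = self._factory(audience)
        self._entries.move_to_end(audience, last=False)  # move to front
        self._evict()
        return self._entries[audience]

    def set_max_size(self, max_size: int) -> None:
        """Apply a new `cache_size`, e.g. after an LDS update."""
        self._max_size = max_size
        self._evict()

    def audiences(self) -> List[str]:
        """Audiences in order from most to least recently used."""
        return list(self._entries)

    def _evict(self) -> None:
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=True)  # drop least recently used
```

Because the entry being requested is always moved to the front before eviction, it is never the one removed, provided the maximum size is at least 1 (which the config validation guarantees).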
| 236 | + |
| 237 | +### Filter Behavior |
| 238 | + |
| 239 | +When the filter processes the RPC's initial metadata, it will first |
| 240 | +check to see what cluster the RPC is being sent to. If the RPC is being |
| 241 | +sent to a route that uses a cluster specifier plugin instead of a fixed |
| 242 | +cluster, then the filter is a no-op. Otherwise, the filter will attempt |
| 243 | +to determine the audience by looking at the CDS resource for the cluster |
| 244 | +that the RPC is being sent to. |
| 245 | + |
| 246 | +If the CDS resource is not available (e.g., because the client received an |
| 247 | +error without having previously received a valid resource, or because the |
| 248 | +server indicated that the resource has been deleted), then the filter will |
| 249 | +fail the RPC with status `UNAVAILABLE`. Note that this does yield |
| 250 | +sub-optimal behavior for wait_for_ready RPCs, since we will fail them |
| 251 | +instead of queuing them, but we don't currently have a good alternative: |
| 252 | +the filter cannot queue the call until the client gets a valid CDS |
| 253 | +resource, because once that happens, a new instance of the filter will be |
| 254 | +swapped in for subsequent calls, but the queued call would already be tied |
| 255 | +to the original filter instance, which will never see the update. |
| 256 | + |
| 257 | +Otherwise, the filter will look in the CDS resource's metadata for |
| 258 | +a key corresponding to the filter's instance name. Note that |
| 259 | +in Envoy, the cluster metadata keys must exactly match the |
| 260 | +legacy filter name (e.g., "envoy.filters.http.gcp_authn"). |
| 261 | +However, as per envoyproxy/envoy#34251, it is desirable |
| 262 | +to instead use the HTTP filter instance name from the [`HttpFilter.name` |
| 263 | +field](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#L1149). |
| 264 | +We will implement that behavior in gRPC. |
| 265 | + |
| 266 | +If the cluster metadata does not contain a key matching the filter's |
| 267 | +instance name, then the filter is a no-op. If a cluster metadata entry |
| 268 | +exists for the filter's instance name, but the entry is of a type other |
| 269 | +than `extensions.filters.http.gcp_authn.v3.Audience`, then the filter |
| 270 | +will fail data plane RPCs with status `UNAVAILABLE`. Otherwise, the |
| 271 | +audience is the value of the `url` field in the `Audience` proto. |
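
The audience lookup described above could be sketched as follows. All names here (`Audience`, `FilterError`, `resolve_audience`) are hypothetical stand-ins, with `Audience` representing the parsed form of the `extensions.filters.http.gcp_authn.v3.Audience` metadata entry. Passing `None` for the metadata map is this sketch's way of modeling an unavailable CDS resource. The cluster specifier plugin case is not modeled.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Audience:
    """Stand-in for the parsed Audience cluster metadata entry."""
    url: str


class FilterError(Exception):
    """Signals that the data plane RPC must be failed with `status`."""

    def __init__(self, status: str, message: str) -> None:
        super().__init__(message)
        self.status = status


def resolve_audience(
    filter_instance_name: str,
    cluster_metadata: Optional[Mapping[str, object]],
) -> Optional[str]:
    """Return the audience, or None if the filter is a no-op for this RPC."""
    if cluster_metadata is None:  # CDS resource not available
        raise FilterError("UNAVAILABLE", "CDS resource not available")
    entry = cluster_metadata.get(filter_instance_name)
    if entry is None:
        return None  # no entry for this filter instance: no-op
    if not isinstance(entry, Audience):
        raise FilterError("UNAVAILABLE", "unexpected metadata type")
    return entry.url
```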
| 272 | + |
| 273 | +The filter will then check to see if it already has a cached |
| 274 | +GcpServiceAccountIdentityCallCredentials instance for the specified |
| 275 | +audience. If it does not, it will create a new instance, adding it to |
| 276 | +its cache, removing the least recently used entry from the cache if the |
| 277 | +cache is already at its max size. It will then attach that |
| 278 | +GcpServiceAccountIdentityCallCredentials instance to the RPC. |
| 279 | + |
| 280 | +Note that implementations must ensure that the token is not added to |
| 281 | +RPCs sent on insecure connections. However, the GCP Authentication |
| 282 | +filter will run before load balancing has chosen a connection, so the |
| 283 | +filter cannot directly add the token to the RPC. Instead, it must add |
| 284 | +the call credential to the RPC, and the call credential will do the work |
| 285 | +of adding the token to the RPC later, after load balancing has chosen a |
| 286 | +connection. |
| 287 | + |
| 288 | +### Temporary environment variable protection |
| 289 | + |
| 290 | +Support for the GCP Authentication filter in the xDS HTTP filter |
| 291 | +registry and the `extensions.filters.http.gcp_authn.v3.Audience` |
| 292 | +entry in the metadata registry will be guarded by the |
| 293 | +`GRPC_EXPERIMENTAL_XDS_GCP_AUTHENTICATION_FILTER` env var. The env var |
| 294 | +guard will be removed once the feature passes interop tests. |
| 295 | + |
| 296 | +## Rationale |
| 297 | + |
| 298 | +It is not our intention to support this mechanism for GCP only; in |
| 299 | +principle, it should be possible to support JWT identity tokens for any |
| 300 | +cloud provider. However, at present, the existing xDS HTTP filter |
| 301 | +supports only GCP, so that's what we're initially focusing on, for |
| 302 | +compatibility with Envoy. We would be open to future contributions from |
| 303 | +the OSS community to provide similar functionality for other cloud |
| 304 | +providers, in both gRPC and Envoy. |
| 305 | + |
| 306 | +Note that the cache structure in this design is a bit different from |
| 307 | +Envoy's implementation. In Envoy, the GCP Authentication filter directly |
| 308 | +maintains a single cache containing the tokens for each audience, with |
| 309 | +expiration based on the tokens' expiration times. In contrast, gRPC |
| 310 | +will essentially have a two-level cache here: the filter will maintain a |
| 311 | +cache of GcpServiceAccountIdentityCallCredentials instances for each |
| 312 | +audience with expiration based on their respective last-used times, |
| 313 | +and each of those GcpServiceAccountIdentityCallCredentials instances |
| 314 | +will internally cache the token for its audience based on the token's |
| 315 | +expiration time. In the majority of cases, this is expected to result |
| 316 | +in the same behavior, although it is conceivably possible for there to |
| 317 | +be edge cases where a given GcpServiceAccountIdentityCallCredentials |
| 318 | +instance is retained in the cache due to being used more recently even |
| 319 | +though it has actually been failing to obtain a token. However, this |
| 320 | +approach allows for cleaner code and better reuse of existing call |
| 321 | +credentials implementations in some languages. |
| 322 | + |
| 323 | +## Implementation |
| 324 | + |
| 325 | +C-core implementation: |
| 326 | +- generalize CDS metadata handling (https://github.com/grpc/grpc/pull/37468) |
| 327 | +- implement GcpServiceAccountIdentityCredentials |
| 328 | + (https://github.com/grpc/grpc/pull/37544) |
| 329 | +- validate Audience cluster metadata (https://github.com/grpc/grpc/pull/37566) |
| 330 | +- implement GCP auth filter (https://github.com/grpc/grpc/pull/37550) |
| 331 | +- mechanism for retaining cache across xDS updates |
| 332 | + (https://github.com/grpc/grpc/pull/37646) |
| 333 | + |
| 334 | +Will be implemented in all other languages, timelines TBD. |