Commit b638bb0

A83: xDS GCP Authentication Filter (#438)
* A83: xDS GCP Authentication Filter
* add mailing list link
* use standalone call credential type
* note that A74 is a prereq
* various updates
* review comments
* clarify wording
* refresh interval can vary a bit
* include both early expiration and pre-emptive refresh
* fix status mapping
* clarify cluster metadata handling
* reuse existing attribute for A74 instead of adding a new one
* cache size defaults to 10 and cannot be set to 0
* note that the filter does not support config overrides
* clarify wording on CDS metadata access
* cluster metadata map must contain proto type of value
* edge cases in filter behavior
* more info on cache behavior, and add list of PRs in C-core
* code review comments
* fix handling of RPCs while in backoff
* generalize wording about cache size
1 parent 71c5851 commit b638bb0

1 file changed

+334
-0
lines changed

A83-xds-gcp-authn-filter.md

Lines changed: 334 additions & 0 deletions
@@ -0,0 +1,334 @@
A83: xDS GCP Authentication Filter
----
* Author(s): @markdroth
* Approver: @ejona86, @dfawley
* Status: {Draft, In Review, Ready for Implementation, Implemented}
* Implemented in: <language, ...>
* Last updated: 2024-10-25
* Discussion at: https://groups.google.com/g/grpc-io/c/76a0zWJChX4

## Abstract

In service mesh environments, there are cases where intermediate proxies
make it impossible to rely on mTLS for end-to-end authentication. These
cases can be addressed instead by the use of service account identity
JWT tokens. The xDS [GCP Authentication
filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/gcp_authn_filter)
provides a mechanism for attaching such JWT tokens as gRPC call
credentials on GCP. We will add support for this filter in gRPC.

## Background

gRPC already supports a framework for xDS HTTP filters, as described in
[gRFC A39][A39]. We will support the GCP Authentication filter under
this framework.

### Related Proposals:
* [gRFC A39: xDS HTTP Filters][A39]
* [gRFC A60: xDS-Based Stateful Session Affinity for Weighted Clusters][A60]
* [gRFC A74: xDS Config Tears][A74]
* [RFC-7519: JSON Web Token (JWT)][RFC-7519]

[A39]: A39-xds-http-filters.md
[A60]: A60-xds-stateful-session-affinity-weighted-clusters.md
[A74]: A74-xds-config-tears.md
[RFC-7519]: https://datatracker.ietf.org/doc/html/rfc7519

## Proposal

We will support the GCP Authentication xDS HTTP filter in the gRPC client.

### Call Credentials

Note: This section is intended for gRPC implementations that need to
implement a new call credential type for GCP service account identity
tokens. Implementations that already support this functionality (e.g.,
by depending on an external Google Auth library) may continue to use
their existing functionality, even if the behavior differs in small ways
from what is described in this section.

gRPC should support a GcpServiceAccountIdentityCallCredentials call
credentials type, which is not xDS-specific. This credential type will
be instantiated with one parameter, which is the audience to be encoded
into the JWT token. The credential object will handle fetching the
token on-demand and caching it based on the token's expiration time.

To handle potential clock skew issues and to account for processing time
on the server, the credential will set the cache expiration time to be
30 seconds before the expiration time encoded in the token. All logic
in the call credential code will use this modified expiration time
instead of the expiration time encoded in the token.

When the credential is asked for a token for a data
plane RPC, if the token is not yet cached or the cached
token will expire within some fixed refresh interval
(typically 1 minute), the credential will start an HTTP request (if there
is not already one pending) to
`http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=[AUDIENCE]`,
where `[AUDIENCE]` is replaced with the audience specified when the
credential object was instantiated. The HTTP request will include the
header `Metadata-Flavor: Google`.

When a data plane RPC starts, if the token is cached and is not expired,
the token will immediately be added to the RPC, and the RPC will continue.
Otherwise (i.e., before the token is initially obtained or after the
cached token has expired), the data plane RPC will be queued until the
HTTP request completes. When the HTTP request completes, the result
(either success or failure, as described below) will be applied to all
queued data plane RPCs.

Note that when the token's expiration time is less than the refresh
interval in the future, a new data plane RPC being started will trigger
a new HTTP request, but the cached token value will still be used for
that data plane RPC. This pre-emptive re-fetching is intended to avoid
periodic latency spikes when refreshing the token.

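The interaction between early expiration, pre-emptive refresh, and
queuing can be summarized as a small decision function. The following is
only a sketch: the `Decision` type, the `decide` function, and the fixed
60-second refresh interval are hypothetical names and values chosen for
illustration, and a real implementation would also need locking and the
actual HTTP fetch.

```python
from dataclasses import dataclass
from typing import Optional

EARLY_EXPIRATION_SECS = 30  # cache expires 30 seconds before the token's `exp`
REFRESH_INTERVAL_SECS = 60  # "typically 1 minute"; assumed fixed for this sketch


@dataclass
class Decision:
    use_cached_token: bool  # attach the cached token to the RPC right away
    queue_rpc: bool         # hold the RPC until the HTTP request completes
    start_fetch: bool       # start an HTTP request now


def decide(now: float, token_exp: Optional[float], fetch_pending: bool) -> Decision:
    """token_exp is the `exp` of the cached token, or None if nothing is cached."""
    may_start = not fetch_pending  # never start a second concurrent request
    if token_exp is None:
        return Decision(False, True, may_start)
    cache_expiry = token_exp - EARLY_EXPIRATION_SECS  # modified expiration time
    if now >= cache_expiry:
        # Cached token is treated as expired: queue the RPC behind a fetch.
        return Decision(False, True, may_start)
    if cache_expiry - now < REFRESH_INTERVAL_SECS:
        # Still usable, but close to expiry: use it and refresh pre-emptively.
        return Decision(True, False, may_start)
    return Decision(True, False, False)
```

Note that the refresh check uses the modified expiration time rather than
the raw `exp` value, as the text above requires for all call credential
logic.
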
If the HTTP request fails, all queued data plane RPCs will be failed
with a gRPC status determined based on the returned HTTP status. If the
returned HTTP status maps to `UNAVAILABLE` in [HTTP to gRPC Status Code
Mapping](https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md),
then the data plane RPCs will be failed with status `UNAVAILABLE`;
otherwise, they will be failed with status `UNAUTHENTICATED`. If the
request fails without an HTTP status (e.g., an I/O error), all queued
data plane RPCs will be failed with `UNAVAILABLE` status.

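As an illustration of this rule, the sketch below hard-codes the HTTP
statuses that the linked mapping document assigns to `UNAVAILABLE` (429,
502, 503, and 504 at the time of writing). The function name and the use
of plain strings for gRPC status codes are assumptions made for the
example.

```python
from typing import Optional

# HTTP statuses that the HTTP-to-gRPC mapping document maps to UNAVAILABLE.
_HTTP_STATUSES_MAPPED_TO_UNAVAILABLE = frozenset({429, 502, 503, 504})


def grpc_status_for_failed_fetch(http_status: Optional[int]) -> str:
    """Status for queued RPCs; http_status is None for I/O-level failures."""
    if http_status is None:
        return "UNAVAILABLE"
    if http_status in _HTTP_STATUSES_MAPPED_TO_UNAVAILABLE:
        return "UNAVAILABLE"
    # Everything else (400, 401, 403, 404, 500, ...) is UNAUTHENTICATED.
    return "UNAUTHENTICATED"
```
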
If the HTTP request succeeds, the body of the response will contain the
JWT token, which the client will cache. The client does not need to
do full [RFC-7519] validation of the token (that is the responsibility
of the server side), but it does need to extract the `exp` field for
caching purposes. If the `exp` field cannot be extracted (i.e., the JWT
token is invalid), all queued data plane RPCs will be failed with status
`UNAUTHENTICATED`. Otherwise, the cache is updated, and the returned
token is added to all queued data plane RPCs, which may then continue.

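Extracting `exp` without validating the signature only requires decoding
the middle (payload) segment of the token. A minimal sketch, assuming the
token uses the usual three-segment compact form; the function name
`extract_exp` is hypothetical:

```python
import base64
import json
from typing import Optional


def extract_exp(token: str) -> Optional[int]:
    """Returns the `exp` claim, or None if it cannot be extracted."""
    parts = token.split(".")
    if len(parts) != 3:  # header.payload.signature
        return None
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    if isinstance(exp, bool) or not isinstance(exp, (int, float)):
        return None
    return int(exp)
```

A `None` result corresponds to the invalid-token case above, in which
queued RPCs fail with `UNAUTHENTICATED`.
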
If the HTTP request does not result in the cache being updated (i.e.,
if the HTTP request fails or if it returns an invalid JWT token),
[backoff](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)
must be applied before the next attempt may be started. If a data
plane RPC is started when there is no cached token available and while
in backoff delay, it will be failed with the status from the last HTTP
request attempt. When the backoff delay expires, the next data plane
RPC will trigger a new attempt. Note that no attempt should be started
until and unless a data plane RPC is started, since we do not want to
unnecessarily retry if the channel is idle. The backoff state will be
reset once there is a successful HTTP request.

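One way to satisfy the "no retry while idle" requirement is to avoid
timers entirely and consult the backoff deadline only when an RPC
arrives. The sketch below illustrates that idea. The class and method
names are hypothetical, the delay parameters are placeholder assumptions
(the linked backoff document also specifies jitter, which is omitted
here), and the caller is assumed to supply the current time.

```python
from typing import Optional


class FetchBackoff:
    def __init__(self, initial: float = 1.0, multiplier: float = 1.6,
                 maximum: float = 120.0):
        self._initial = initial
        self._multiplier = multiplier
        self._maximum = maximum
        self._next_delay = initial
        self._retry_at: Optional[float] = None  # None means not in backoff
        self._last_status: Optional[str] = None

    def on_fetch_failed(self, now: float, status: str) -> None:
        """Called for HTTP failures and for responses with an invalid token."""
        self._last_status = status
        self._retry_at = now + self._next_delay
        self._next_delay = min(self._next_delay * self._multiplier, self._maximum)

    def on_fetch_succeeded(self) -> None:
        """A successful request resets the backoff state."""
        self._next_delay = self._initial
        self._retry_at = None
        self._last_status = None

    def on_rpc_without_token(self, now: float) -> str:
        """Returns "fail:<status>" while in backoff, otherwise "fetch"."""
        if self._retry_at is not None and now < self._retry_at:
            return f"fail:{self._last_status}"
        return "fetch"  # the arriving RPC triggers the next attempt
```
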
To add the token to a data plane RPC, the call credential will add a
header named `authorization`. The header value will be the string
`Bearer ` (note trailing space) followed by the token value.

### xDS HTTP Filter Configuration

The xDS HTTP filter will be configured via the
[`extensions.filters.http.gcp_authn.v3.GcpAuthnFilterConfig`
message](https://github.com/envoyproxy/envoy/blob/c16faca3619fb44c24b12d15aad8a797b9e210ab/api/envoy/extensions/filters/http/gcp_authn/v3/gcp_authn.proto#L27).
The fields will be interpreted as follows:
- `cache_config`: Optional. Within this message:
  - `cache_size`: Optional. If set, must be greater than 0. Defaults
    to 10. Implementations that cannot support caches as large as
    `UINT64_MAX` may cap this value at their maximum supported size.
- `http_uri`: Ignored by gRPC.
- `token_header`: Ignored by gRPC.
- `retry_policy`: Ignored by gRPC.
- `cluster`: Ignored by gRPC.
- `timeout`: Ignored by gRPC.

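The `cache_size` rules can be expressed as a small validation helper.
This sketch assumes the filter config has already been converted to a
plain dictionary (a stand-in for the parsed proto) and that a
`ValueError` causes the resource to be NACKed; the function name and the
`max_supported` default are hypothetical.

```python
from typing import Optional

DEFAULT_CACHE_SIZE = 10


def parse_cache_size(filter_config: dict, max_supported: int = 2**31 - 1) -> int:
    """Returns the effective cache size; raises ValueError to reject the config."""
    cache_config: Optional[dict] = filter_config.get("cache_config")
    if cache_config is None or "cache_size" not in cache_config:
        return DEFAULT_CACHE_SIZE  # both the message and the field are optional
    size = cache_config["cache_size"]
    if size <= 0:
        raise ValueError("cache_config.cache_size must be greater than 0")
    # Implementations may cap the value at their maximum supported size.
    return min(size, max_supported)
```

All other fields (`http_uri`, `token_header`, and so on) are simply not
read.
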
Note that this filter does not support having its config overridden in a
`typed_per_filter_config` field on a per-route, per-virtualhost, or
per-clusterweight basis. If the filter's config message appears in a
`typed_per_filter_config` field, it will be validated as part of the
normal resource validation, but the configuration will not actually be
used.

### xDS Cluster Metadata

The GCP Authentication filter uses cluster metadata from the
[`Cluster.metadata`
field](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/cluster/v3/cluster.proto#L1092)
to configure the audience. We will process this field when validating
the CDS resource and convert it into a map, which will be added to
the parsed cluster resource that is passed to the XdsClient watcher.

The metadata field is a message that actually contains two maps:
- [`filter_metadata`](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/core/v3/base.proto#L248):
  This map contains `google.protobuf.Struct` values, which we will
  convert to parsed JSON form.
- [`typed_filter_metadata`](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/config/core/v3/base.proto#L257):
  This map contains `google.protobuf.Any` fields. To support this, we
  will use a registry-like approach for metadata types (may be an actual
  registry or just a block of code that supports the known protobuf
  message types) that handles parsing the `google.protobuf.Any` field
  and converting it to some internal form appropriate for the implementation
  (e.g., JSON or a native struct).

The value for a given metadata key will come from only one of the
two maps; the value from `filter_metadata` will be used only if the
key is not present or is of an unknown protobuf message type in
`typed_filter_metadata`. In the resulting map in the parsed cluster
resource, the map value will contain the type of the original message
(`google.protobuf.Struct` if it came from the `filter_metadata` map) and
a parsed representation of the content. The parsed representation may
be either JSON or the appropriate internal form, depending on which of
the two maps the entry came from.

The logic to validate cluster metadata will look something like this
(pseudo-code):

```python
parsed_metadata = {}  # Value is either JSON or parsed object
# First process typed_filter_metadata.
for key, any_field in cluster_metadata.typed_filter_metadata.items():
    parser = metadata_registry.FindParser(any_field.type_url)
    if parser is not None:
        value = parser.Parse(any_field.value)
        if value is None:
            return NACK  # Parsing failed, reject resource
        parsed_metadata[key] = value
# Now process filter_metadata.  We look only at keys that were not
# already added from typed_filter_metadata.
for key, struct_field in cluster_metadata.filter_metadata.items():
    if key not in parsed_metadata:
        parsed_metadata[key] = ConvertToJson(struct_field)
```

For now, the only registered metadata type we support is
[`extensions.filters.http.gcp_authn.v3.Audience`](https://github.com/envoyproxy/envoy/blob/c16faca3619fb44c24b12d15aad8a797b9e210ab/api/envoy/extensions/filters/http/gcp_authn/v3/gcp_authn.proto#L66).
In this message, the `url` field must be non-empty; if empty, the
resource will be NACKed. The parsed representation of this message can
be a simple string.

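For readers who want something executable, the following is a simplified
variant of the validation logic with a one-entry registry for `Audience`.
It is a sketch under several assumptions: each `typed_filter_metadata`
value is modeled as a `(type_name, fields_dict)` tuple instead of a real
`google.protobuf.Any`, a hypothetical `NackError` exception stands in for
rejecting the resource, and each result stores the original type next to
the parsed value.

```python
from typing import Any, Callable, Dict, Tuple

AUDIENCE_TYPE = "extensions.filters.http.gcp_authn.v3.Audience"
STRUCT_TYPE = "google.protobuf.Struct"


class NackError(Exception):
    """Hypothetical signal that the CDS resource must be rejected."""


def parse_audience(fields: Dict[str, Any]) -> str:
    url = fields.get("url", "")
    if not url:
        raise NackError("Audience.url must be non-empty")
    return url  # the parsed representation is a simple string


# Registry of known typed metadata parsers, keyed by message type.
METADATA_PARSERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    AUDIENCE_TYPE: parse_audience,
}


def validate_cluster_metadata(
    typed_filter_metadata: Dict[str, Tuple[str, Dict[str, Any]]],
    filter_metadata: Dict[str, Dict[str, Any]],
) -> Dict[str, Tuple[str, Any]]:
    parsed: Dict[str, Tuple[str, Any]] = {}
    # Typed entries win, but only when their type is known to the registry.
    for key, (type_name, fields) in typed_filter_metadata.items():
        parser = METADATA_PARSERS.get(type_name)
        if parser is not None:
            parsed[key] = (type_name, parser(fields))
    # Fall back to filter_metadata for keys not already populated.
    for key, struct in filter_metadata.items():
        if key not in parsed:
            parsed[key] = (STRUCT_TYPE, struct)
    return parsed
```
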
### xDS ConfigSelector Behavior

As per [gRFC A60][A60], we currently pass the selected cluster name via
a call attribute for access in filters. However, the filters will now
also need access to the CDS resource for the selected cluster, so that
the GCP Authentication filter can access the cluster metadata for the
selected cluster. This data is available via the `XdsConfig` attribute
introduced in [A74]. If the xDS ConfigSelector is not already passing
that attribute to the filters, it will need to be changed to do so.

### Filter Call Credentials Cache

The filter will maintain a cache of
GcpServiceAccountIdentityCallCredentials instances, one for each audience,
along with a last-used list that tracks how recently the entries
have been used. As an entry is used, it is moved to the front of the
last-used list. The maximum number of entries in the cache is bounded
by the config field `cache_config.cache_size`; if the cache exceeds that
size, then entries will be removed starting from the end of the
last-used list.

Note that the `cache_config.cache_size` parameter in the filter config
is a channel-level parameter, not settable per-route, and we want the
cache itself to be shared across all routes. Implementations that create
separate filter/interceptor instances for each route should share the
cache between those instances.

It is desirable to avoid losing this cache when we get an xDS Listener or
RouteConfiguration update, so that we don't wind up needlessly refetching
tokens after the update. Implementations should provide a mechanism for
new instances of the filter to retain the cache from previous instances.

If an LDS update changes the cache size, the filter must apply that
change to the cache. If the cache currently has more entries in it than
the new cache size, then the least recently used entries will be removed
to make the cache adhere to the new size limit.

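The cache described above is a conventional LRU keyed by audience. A
minimal single-threaded sketch using `collections.OrderedDict` follows.
The class name, the `factory` callback that stands in for constructing a
GcpServiceAccountIdentityCallCredentials instance, and the `audiences`
helper are all hypothetical. In this sketch the most recently used entry
sits at the end of the ordered dict, which is the mirror image of the
"front of the last-used list" wording but behaves the same way.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class CallCredsCache:
    def __init__(self, max_size: int, factory: Callable[[str], Any]):
        if max_size <= 0:
            raise ValueError("max_size must be greater than 0")
        self._max_size = max_size
        self._factory = factory
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, audience: str) -> Any:
        """Returns the credentials for audience, creating them if needed."""
        if audience not in self._entries:
            self._entries[audience] = self._factory(audience)
        self._entries.move_to_end(audience)  # mark as most recently used
        self._evict()
        return self._entries[audience]

    def set_max_size(self, max_size: int) -> None:
        """Applies a new cache size, e.g. after an LDS update."""
        if max_size <= 0:
            raise ValueError("max_size must be greater than 0")
        self._max_size = max_size
        self._evict()

    def audiences(self) -> List[str]:
        """Cached audiences, least recently used first."""
        return list(self._entries)

    def _evict(self) -> None:
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # drop least recently used
```

Sharing one such object across per-route filter instances, and handing it
to replacement filter instances after an xDS update, is one way to meet
the retention requirements above.
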
### Filter Behavior

When the filter processes the RPC's initial metadata, it will first
check to see what cluster the RPC is being sent to. If the RPC is being
sent to a route that uses a cluster specifier plugin instead of a fixed
cluster, then the filter is a no-op. Otherwise, the filter will attempt
to determine the audience by looking at the CDS resource for the cluster
that the RPC is being sent to.

If the CDS resource is not available (e.g., because the client received an
error without having previously received a valid resource, or because the
server indicated that the resource has been deleted), then the filter will
fail the RPC with status `UNAVAILABLE`. Note that this does yield
sub-optimal behavior for wait_for_ready RPCs, since we will fail them
instead of queuing them, but we don't currently have a good alternative:
the filter cannot queue the call until the client gets a valid CDS
resource, because once that happens, a new instance of the filter will be
swapped in for subsequent calls, but the queued call would already be tied
to the original filter instance, which will never see the update.

Otherwise, the filter will look in the CDS resource's metadata for
a key corresponding to the filter's instance name. Note that
in Envoy, the cluster metadata keys must exactly match the
legacy filter name (e.g., "envoy.filters.http.gcp_authn").
However, as per envoyproxy/envoy#34251, it is desirable
to instead use the HTTP filter instance name from the [`HttpFilter.name`
field](https://github.com/envoyproxy/envoy/blob/7436690884f70b5550b6953988d05818bae3d087/api/envoy/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#L1149).
We will implement that behavior in gRPC.

If the cluster metadata does not contain a key matching the filter's
instance name, then the filter is a no-op. If a cluster metadata entry
exists for the filter's instance name, but the entry is of a type other
than `extensions.filters.http.gcp_authn.v3.Audience`, then the filter
will fail data plane RPCs with status `UNAVAILABLE`. Otherwise, the
audience is the value of the `url` field in the `Audience` proto.

The filter will then check to see if it already has a cached
GcpServiceAccountIdentityCallCredentials instance for the specified
audience. If it does not, it will create a new instance, adding it to
its cache, removing the least recently used entry from the cache if the
cache is already at its max size. It will then attach that
GcpServiceAccountIdentityCallCredentials instance to the RPC.

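The per-RPC decision up to the point of selecting an audience can be
sketched as a single function. The representation used here is an
assumption: the cluster metadata is a dict mapping keys to
`(type_name, parsed_value)` tuples, `None` stands for an unavailable CDS
resource, and the result is a hypothetical `(action, detail)` tuple
rather than any real gRPC API.

```python
from typing import Any, Dict, Optional, Tuple

AUDIENCE_TYPE = "extensions.filters.http.gcp_authn.v3.Audience"


def resolve_audience(
    uses_cluster_specifier_plugin: bool,
    cluster_metadata: Optional[Dict[str, Tuple[str, Any]]],
    filter_instance_name: str,
) -> Tuple[str, Optional[str]]:
    """Returns ("noop", None), ("fail", status), or ("audience", url)."""
    if uses_cluster_specifier_plugin:
        return ("noop", None)
    if cluster_metadata is None:  # CDS resource not available
        return ("fail", "UNAVAILABLE")
    entry = cluster_metadata.get(filter_instance_name)
    if entry is None:  # no key matching the filter's instance name
        return ("noop", None)
    type_name, value = entry
    if type_name != AUDIENCE_TYPE:  # key present but of the wrong type
        return ("fail", "UNAVAILABLE")
    return ("audience", value)
```

On an `"audience"` result, the filter would then look up or create the
call credentials for that audience in its cache and attach them to the
RPC.
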
Note that implementations must ensure that the token is not added to
RPCs sent on insecure connections. However, the GCP Authentication
filter will run before load balancing has chosen a connection, so the
filter cannot directly add the token to the RPC. Instead, it must add
the call credential to the RPC, and the call credential will do the work
of adding the token to the RPC later, after load balancing has chosen a
connection.

### Temporary environment variable protection

Support for the GCP Authentication filter in the xDS HTTP filter
registry and the `extensions.filters.http.gcp_authn.v3.Audience`
entry in the metadata registry will be guarded by the
`GRPC_EXPERIMENTAL_XDS_GCP_AUTHENTICATION_FILTER` env var. The env var
guard will be removed once the feature passes interop tests.

## Rationale

It is not our intention to support this mechanism for GCP only; in
principle, it should be possible to support JWT identity tokens for any
cloud provider. However, at present, the existing xDS HTTP filter
supports only GCP, so that's what we're initially focusing on, for
compatibility with Envoy. We would be open to future contributions from
the OSS community to provide similar functionality for other cloud
providers, in both gRPC and Envoy.

Note that the cache structure in this design is a bit different from
Envoy's implementation. In Envoy, the GCP Authentication filter directly
maintains a single cache containing the tokens for each audience, with
expiration based on the tokens' expiration times. In contrast, gRPC
will essentially have a two-level cache here: the filter will maintain a
cache of GcpServiceAccountIdentityCallCredentials instances for each
audience with expiration based on their respective last-used times,
and each of those GcpServiceAccountIdentityCallCredentials instances
will internally cache the token for its audience based on the token's
expiration time. In the majority of cases, this is expected to result
in the same behavior, although it is conceivably possible for there to
be edge cases where a given GcpServiceAccountIdentityCallCredentials
instance is retained in the cache due to being used more recently even
though it has actually been failing to obtain a token. However, this
approach allows for cleaner code and better reuse of existing call
credentials implementations in some languages.

## Implementation

C-core implementation:
- generalize CDS metadata handling (https://github.com/grpc/grpc/pull/37468)
- implement GcpServiceAccountIdentityCredentials
  (https://github.com/grpc/grpc/pull/37544)
- validate Audience cluster metadata (https://github.com/grpc/grpc/pull/37566)
- implement GCP auth filter (https://github.com/grpc/grpc/pull/37550)
- mechanism for retaining cache across xDS updates
  (https://github.com/grpc/grpc/pull/37646)

Will be implemented in all other languages, timelines TBD.
