Using cached EDS resources #26749

adisuissa · 2023-04-13T20:01:45Z

Title: Using cached EDS resources

Description:
Currently, after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used with an empty assignment. This may break ongoing traffic if the xDS server fails to reply in a timely manner. As noted in #13009, this can be solved if Envoy caches the EDS resources.

Unfortunately, Envoy cannot immediately use the cached resource, because there are scenarios where the Cluster update requires an updated ClusterLoadAssignment (CLA), and using the cached resource will break traffic. An example is when a non-TLS cluster is upgraded to support TLS (see #11877, and this test): using the previously assigned CLA will break traffic. A possible way to solve this while still adhering to the xDS protocol is to require that, upon a material change to the Cluster, the xDS server also modifies the EDS service_name (string service_name = 2; at line 207 of envoy/api/envoy/config/cluster/v3/cluster.proto, commit 25c0e7c). However, many servers do not adhere to this, and to ensure backwards compatibility, a different approach is needed.

Suggested approach
The idea is to allow fallback to a cached EDS resource as follows:

1. When receiving a new/updated EDS-Cluster, wait for a ClusterLoadAssignment (CLA) for that cluster (the Cluster is in a warming state).
2. If the CLA doesn't arrive within a certain timeframe, fetch the CLA resource from the cache, if it exists, and make the Cluster active.
3. Otherwise, make the Cluster active with an empty assignment.
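
The fallback order above can be sketched as follows. This is a minimal illustration under stated assumptions, not Envoy's implementation: ClusterLoadAssignment here is a simplified stand-in for the real proto, and resolve_assignment and the dict-based cache are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ClusterLoadAssignment:
    """Simplified stand-in for the CLA proto (assumption for this sketch)."""
    cluster_name: str
    endpoints: List[str] = field(default_factory=list)


def resolve_assignment(
    cluster_name: str,
    received: Optional[ClusterLoadAssignment],
    cache: Dict[str, ClusterLoadAssignment],
) -> Tuple[ClusterLoadAssignment, str]:
    """Choose the assignment used to activate a warming cluster.

    `received` is the CLA that arrived before the warming timeout,
    or None if the timeout fired first.
    """
    if received is not None:
        # Step 1: the xDS server answered in time, use the fresh CLA.
        return received, "fresh"
    cached = cache.get(cluster_name)
    if cached is not None:
        # Step 2: timeout, but a previously received CLA is cached.
        return cached, "cached"
    # Step 3: timeout and nothing cached, same as today's behavior.
    return ClusterLoadAssignment(cluster_name), "empty"
```

The third branch is identical to the current behavior, which is why the proposal cannot break traffic in a way that is not already possible today.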

Note that this doesn't introduce a new way to break traffic; it only prevents breaking traffic in the case where the CLA isn't modified and/or isn't sent by the xDS server. If a material change to the cluster occurs but an updated assignment isn't received, traffic will be broken, just as happens today.

Plan

Introduce a cache for EDS resources.
Plumb the cache into the ClusterManagerFactory.
Update EdsClusterImpl to:
- Store a resource in the cache.
- Fetch the resource upon timeout.
- Ensure that the resource TTL is enforced for cached resources.
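
A rough sketch of the store, fetch, and TTL behavior listed in the plan is shown below. The class name EdsResourceCache, its methods, and the choice to expire entries lazily on fetch are assumptions made for this illustration; the actual design may enforce TTLs differently (for example with timers).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class EdsResourceCache:
    """Hypothetical cache keyed by resource name with optional per-entry TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (resource, absolute expiry time or None for "no TTL")
        self._entries: Dict[str, Tuple[object, Optional[float]]] = {}

    def store(self, name: str, resource: object,
              ttl_seconds: Optional[float] = None) -> None:
        """Insert or overwrite the cached resource for `name`."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._entries[name] = (resource, expires_at)

    def fetch(self, name: str) -> Optional[object]:
        """Return the cached resource, or None if missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        resource, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            # Enforce the TTL: an expired entry must never be served.
            del self._entries[name]
            return None
        return resource

    def remove(self, name: str) -> None:
        """Drop the entry for `name` if present."""
        self._entries.pop(name, None)
```

Injecting the clock keeps the TTL logic testable without sleeping. Memory grows with the number of stored entries, one per EDS resource.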

Special considerations
The solution will need to clarify the lifetime of the cached resources.
The solution will target EDS resources only, as other resources (such as RDS) are currently "cached" and adhere to the protocol. In the future, the cache may be used for other types of resources, if needed.

Other implications
As with any cache, this will require additional memory: O(N), where N is the number of CLA resources (or EDS clusters). Note that the resource is only accessed by the main thread (no copies to worker threads are needed).

github-actions · 2023-05-14T00:03:14Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions · 2023-05-21T04:01:47Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Continuation of PR #28079 (as part of the work for issue #26749).

Currently, after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints. This PR adds caching to the GrpcMux. The GrpcMux object adds an EDS resource to the cache when it is received or updated, and removes it when there are no longer any subscriptions (watchers) for it. A runtime flag is added to disable the use of the cache; it will be enabled in a future PR when ADS is used. The next PR will plumb this into ADS and add fetching of resources from the cache as part of the EdsClusterImpl. The entire change can be viewed here: adisuissa/envoy@f0b7ac8

Risk Level: Low - the disabled runtime flag should prevent the use of the cache in non-test code.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: N/A (future PR).
Platform Specific Features: N/A.
Runtime guard: disabled by default: envoy_restart_features_use_eds_cache_for_ads
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
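
The cache bookkeeping described in this PR (add on receive or update, evict when the last watcher goes away, do nothing when the flag is off) might look roughly like the sketch below. CachingMux and its methods are hypothetical names for this illustration and do not mirror the GrpcMux interface.

```python
from collections import defaultdict
from typing import Dict


class CachingMux:
    """Toy mux that caches resources only while someone is watching them."""

    def __init__(self, cache_enabled: bool = False) -> None:
        # Mirrors the "disabled by default" runtime guard (assumption).
        self.cache_enabled = cache_enabled
        self._watchers: Dict[str, int] = defaultdict(int)
        self.cache: Dict[str, object] = {}

    def add_watch(self, name: str) -> None:
        """Register one more subscription (watcher) for `name`."""
        self._watchers[name] += 1

    def remove_watch(self, name: str) -> None:
        """Drop one watcher; evict the cached resource when none remain."""
        if self._watchers.get(name, 0) == 0:
            return
        self._watchers[name] -= 1
        if self._watchers[name] == 0:
            del self._watchers[name]
            self.cache.pop(name, None)

    def on_resource(self, name: str, resource: object) -> None:
        """Handle a received or updated resource from the xDS server."""
        if self.cache_enabled and self._watchers.get(name, 0) > 0:
            self.cache[name] = resource
```

Tying eviction to the watcher count bounds the cache to resources that are still subscribed, which gives one possible answer to the resource-lifetime question raised in the issue.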

This is the last PR to close #26749.

In this PR the EdsClusterImpl is modified to use the cached resource when a warming timeout occurs. The PR only includes support for EDS caching when ADS is used. The runtime guard envoy.restart_features.use_eds_cache_for_ads was introduced for backward compatibility. Although this change modifies Envoy's behavior with EDS, the intention is not to break the current behavior (e.g., modifying a cluster from non-TLS to TLS will still work as it did previously; see #26749 for more information). The cache will use additional memory.

EDIT: Following an internal discussion, the runtime guard is set to false by default to allow gradual rollout of this feature and testing it in production.

Risk Level: Medium - changes behavior of EDS over ADS.
Testing: Added unit and integration tests.
Signed-off-by: Adi Suissa-Peleg <adip@google.com>

adisuissa added the bug, triage, area/xds, and area/eds labels and removed the bug and triage labels Apr 13, 2023

adisuissa self-assigned this Apr 13, 2023

adisuissa mentioned this issue Apr 13, 2023

eds: introducing EDS resources cache #26748

Closed

github-actions bot added the stale label May 14, 2023

github-actions bot closed this as not planned May 21, 2023

adisuissa mentioned this issue Jun 21, 2023

eds: introducing EDS resources cache #28079

Merged

KBaichoo reopened this Jun 22, 2023

KBaichoo added the no stalebot label and removed the stale label Jun 22, 2023

adisuissa mentioned this issue Jul 7, 2023

eds: Adding eds caching support to grpc-mux #28273

Merged

adisuissa mentioned this issue Jul 17, 2023

Converge to unified gRPC mux #28442

Open

adisuissa mentioned this issue Aug 7, 2023

eds-caching: introduce the EDS caching into EdsClusterImpl #28877

Merged

htuch closed this as completed in #28877 Aug 14, 2023

valerian-roche mentioned this issue Jul 29, 2024

Support multiple ADS configuration to allow distinct control-planes to serve parts of the configuration #35483

Closed

valerian-roche mentioned this issue Sep 12, 2024

Unexpected gRPC Timeout on EDS Update with Delta xDS envoyproxy/go-control-plane#1001

Closed
