Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using cached EDS resources #26749

Closed
adisuissa opened this issue Apr 13, 2023 · 2 comments · Fixed by #28877
Closed

Using cached EDS resources #26749

adisuissa opened this issue Apr 13, 2023 · 2 comments · Fixed by #28877
Assignees
Labels
area/eds area/xds no stalebot Disables stalebot from closing an issue

Comments

@adisuissa
Copy link
Contributor

Title: Using cached EDS resources

Description:
Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used with an empty assignment. This may break ongoing traffic, if the xDS server fails to reply in a timely manner. As noted in #13009 this can be solved if Envoy will cache the EDS resources.

Unfortunately Envoy cannot immediately use the cached resource because there are scenarios where the Cluster update requires an updated ClusterLoadAssignment (CLA), and using the cached resource will break traffic. An example is when a non-TLS cluster is upgraded to supporting TLS (see #11877, and this test) and using the previously assigned CLA will break traffic. A possible way to solve this while still adhering to the xDS protocol is to require that upon a material change to the Cluster the xDS-server will also modify the [Eds service_name]

string service_name = 2;
). However, many servers do not adhere to this, and to ensure backwards compatibility, a different approach is needed.

Suggested approach
The idea is to allow fallback to a cached EDS resource as follows:

  1. When receiving a new/updated EDS-Cluster, wait for a ClusterLoadAssignment (CLA) for that cluster (Cluster is in a warming state).
  2. If the CLA doesn't arrive within a certain timeframe, fetch the CLA resource from the cache, if exists, and make the Cluster active.
  3. Otherwise, make the Cluster active with an empty assignment.

Note that this doesn't introduce a new way to break traffic, just prevents breaking traffic in the case where the CLA isn't modified and/or isn't sent from the xDS server. If a material change to the cluster occurs, but an updated assignment isn't received, traffic will be broken, just as happens today.

Plan

  1. Introduce a cache for EDS resources.
  2. Plumb the cache into the ClusterManagerFactory.
  3. Update EdsClusterImpl to:
    • Store a resource in the cache.
    • Fetch the resource upon timeout.
    • Ensure that the resource TTL is enforced for cached resources.

Special considerations
The solution will need to clarify the lifetime of the cached resources.
The solution will target EDS resources only, as other resource (such as RDS) are currently "cached" and adhere to the protocol. In the future, the cache may be used for other types of resources, if needed.

Other implications
As with any cache, this will require additional memory. An additional O(N) memory will be required, where N is the number of CLA resources (or EDS clusters). Note that the resource is only accessed by the main thread (no copies to worker threads are needed).

@adisuissa adisuissa added bug triage Issue requires triage area/xds area/eds and removed bug triage Issue requires triage labels Apr 13, 2023
@adisuissa adisuissa self-assigned this Apr 13, 2023
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 14, 2023
@github-actions
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 21, 2023
@KBaichoo KBaichoo reopened this Jun 22, 2023
@KBaichoo KBaichoo added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels Jun 22, 2023
htuch pushed a commit that referenced this issue Aug 2, 2023
Continuation of PR #28079 (as part of the work for issue #26749).

Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints.

This PR adds the use of caching into the GrpcMux. The GrpcMux object adds an EDS resource to the cache when it is received/updated, and removes it when there are no longer subscriptions (watchers).

A runtime flag is added to disable the use of the cache, and will be enabled in a future PR when ADS is used.

Next PR will plumb this into ADS, and add fetching of resources from the cache as part of the EdsClusterImpl.

The entire change can be looked here: adisuissa/envoy@f0b7ac8

Risk Level: Low - the disabled runtime flag should prevent the use of the cache in non-tests code.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: N/A (future PR).
Platform Specific Features: N/A.
Runtime guard: disabled by default: envoy_restart_features_use_eds_cache_for_ads

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
htuch pushed a commit that referenced this issue Aug 14, 2023
This is the last PR to close #26749.

In this PR the EdsClusterImpl is modified to use the cached resource when a warming timeout occurs.
The PR only includes support for EDS caching when ADS is used.
The runtime guard envoy.restart_features.use_eds_cache_for_ads was introduced for backward compatibility.
Although this change modifies Envoy's behavior with EDS, the intention is to not breaking the current behavior (e.g., modifying a cluster from non-TLS to TLS will still work as it did previously, see #26749 for more information).
The cache will incur more memory.

EDIT: Following an internal discussion, the runtime guard is set to false by default to allow gradual rollout of this feature, and testing it in production.

Risk Level: Medium - changes behavior of EDS over ADS.
Testing: Added unit and integration tests.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eds area/xds no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants