-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using cached EDS resources #26749
Comments
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
Continuation of PR #28079 (as part of the work for issue #26749). Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints. This PR adds the use of caching into the GrpcMux. The GrpcMux object adds an EDS resource to the cache when it is received/updated, and removes it when there are no longer subscriptions (watchers). A runtime flag is added to disable the use of the cache, and will be enabled in a future PR when ADS is used. Next PR will plumb this into ADS, and add fetching of resources from the cache as part of the EdsClusterImpl. The entire change can be looked here: adisuissa/envoy@f0b7ac8 Risk Level: Low - the disabled runtime flag should prevent the use of the cache in non-tests code. Testing: Added unit tests. Docs Changes: N/A. Release Notes: N/A (future PR). Platform Specific Features: N/A. Runtime guard: disabled by default: envoy_restart_features_use_eds_cache_for_ads Signed-off-by: Adi Suissa-Peleg <adip@google.com>
This is the last PR to close #26749. In this PR the EdsClusterImpl is modified to use the cached resource when a warming timeout occurs. The PR only includes support for EDS caching when ADS is used. The runtime guard envoy.restart_features.use_eds_cache_for_ads was introduced for backward compatibility. Although this change modifies Envoy's behavior with EDS, the intention is to not breaking the current behavior (e.g., modifying a cluster from non-TLS to TLS will still work as it did previously, see #26749 for more information). The cache will incur more memory. EDIT: Following an internal discussion, the runtime guard is set to false by default to allow gradual rollout of this feature, and testing it in production. Risk Level: Medium - changes behavior of EDS over ADS. Testing: Added unit and integration tests. Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Title: Using cached EDS resources
Description:
Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used with an empty assignment. This may break ongoing traffic, if the xDS server fails to reply in a timely manner. As noted in #13009 this can be solved if Envoy will cache the EDS resources.
Unfortunately Envoy cannot immediately use the cached resource because there are scenarios where the Cluster update requires an updated ClusterLoadAssignment (CLA), and using the cached resource will break traffic. An example is when a non-TLS cluster is upgraded to supporting TLS (see #11877, and this test) and using the previously assigned CLA will break traffic. A possible way to solve this while still adhering to the xDS protocol is to require that upon a material change to the Cluster the xDS-server will also modify the [Eds
service_name
]envoy/api/envoy/config/cluster/v3/cluster.proto
Line 207 in 25c0e7c
Suggested approach
The idea is to allow fallback to a cached EDS resource as follows:
Note that this doesn't introduce a new way to break traffic, just prevents breaking traffic in the case where the CLA isn't modified and/or isn't sent from the xDS server. If a material change to the cluster occurs, but an updated assignment isn't received, traffic will be broken, just as happens today.
Plan
Special considerations
The solution will need to clarify the lifetime of the cached resources.
The solution will target EDS resources only, as other resource (such as RDS) are currently "cached" and adhere to the protocol. In the future, the cache may be used for other types of resources, if needed.
Other implications
As with any cache, this will require additional memory. An additional O(N) memory will be required, where N is the number of CLA resources (or EDS clusters). Note that the resource is only accessed by the main thread (no copies to worker threads are needed).
The text was updated successfully, but these errors were encountered: