Support large clusters without throttling #247
Comments
/cc @craiglpeters |
We've also suffered from this problem when we had 9+ small clusters with autoscaler. Each cluster had no more than 5 VMs. |
We are also getting this issue in AKS Engine. We documented the whole issue, base replication case, current environment, and attempts tried over at Azure/aks-engine#1860 (comment). Kubernetes version: v1.15.5 (1 master, 4 nodes) |
Looks like since |
In an effort to reduce the number of calls that are made in an idle cluster, we have merged kubernetes/kubernetes#83685. What does this PR change? It introduces two read types for data in the cache:
- Default read type: if the data in the cache is expired, new data is fetched from the cloud, stored in the cache and returned to the caller.

For instance, in a cluster with 1 VMSS, 100 nodes and 100 disks, the majority of calls from the cloud provider are performed as part of the reconciliation loop run by the volume controller. This loop runs every 1 minute to ensure all mounted disks are attached, which means that every minute a call is made for each disk to verify it is attached; in this scenario that can account for ~6k-7k calls/h. Every time a disk is attached/detached, the cache entry for the VM instance is invalidated. The next time the reconcile loop runs, new data is fetched and we can validate that the state looks fine after the initial Attach/Detach. This means it's safe to read stale data from the cache for the disk reconcile loop. With this change, the total number of calls to ARM for an idle cluster with 100 disks is greatly reduced.

Behavioral changes
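For illustration only, here is a minimal Go sketch of the stale-read idea described above; it is not the implementation merged in kubernetes/kubernetes#83685, and the names (`TimedCache`, `ReadDefault`, `ReadUnsafe`) are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ReadType controls whether an expired cache entry may still be served.
type ReadType int

const (
	// ReadDefault refetches from the cloud when the entry has expired.
	ReadDefault ReadType = iota
	// ReadUnsafe returns the expired (stale) entry without a cloud call,
	// which is acceptable for the periodic disk-reconcile loop.
	ReadUnsafe
)

type entry struct {
	data      interface{}
	updatedAt time.Time
}

// TimedCache is an illustrative cache keyed by resource name.
type TimedCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	store  map[string]*entry
	getter func(key string) (interface{}, error) // fetches from ARM on miss/expiry
}

func NewTimedCache(ttl time.Duration, getter func(string) (interface{}, error)) *TimedCache {
	return &TimedCache{ttl: ttl, store: map[string]*entry{}, getter: getter}
}

// Get returns cached data, honoring the requested read type.
func (c *TimedCache) Get(key string, rt ReadType) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.store[key]
	fresh := ok && time.Since(e.updatedAt) < c.ttl
	if fresh || (ok && rt == ReadUnsafe) {
		return e.data, nil // serve fresh data, or stale data when the caller allows it
	}

	data, err := c.getter(key)
	if err != nil {
		return nil, err
	}
	c.store[key] = &entry{data: data, updatedAt: time.Now()}
	return data, nil
}

func main() {
	calls := 0
	c := NewTimedCache(time.Minute, func(key string) (interface{}, error) {
		calls++
		return fmt.Sprintf("vm-status-%s", key), nil
	})
	c.Get("vmss-0_0", ReadDefault) // first read fetches from the "cloud"
	c.Get("vmss-0_0", ReadUnsafe)  // reconcile loop reuses the cached entry
	fmt.Println("cloud calls:", calls) // -> 1
}
```

Only the first read triggers a cloud call; subsequent reconcile-loop reads tolerate stale data and stay local.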
|
Is action 2 (switching to LIST instead of GET for VMSS, VMSSVM and VM) applicable to the VMAS implementation of cluster-autoscaler? |
@tdihp we're planning to do that, but it's not added yet in cluster-autoscaler. |
/cc |
Does kubernetes/autoscaler#2527 address the usage of LIST instead of GET in the cluster-autoscaler? |
/milestone v1.18 |
yeah I think so. |
Status update: validation results on Kubernetes v1.18
Validation scenario: scaling one VMSS up and down between 1 node and 100 nodes.
Without the fix (v1.18.0-alpha.1):
With the fix (latest master branch):
Here is a chart of the number of throttled and successful requests: |
Since code freeze has started in Kubernetes, the unit tests part will be moved to v1.19 (tracked by #310). |
Most items have been done, and the remaining tests (including e2e and unit tests) are tracked in separate issues. /close |
@feiskyer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Client-side rate limiting is proving problematic for fresh installs and scaling operations [1]. Azure ARM throttling is applied at the subscription level, so client-side rate limiting helps prevent clusters sharing the same subscription from disrupting each other. However, there are lower limits that apply at the SP/tenant and resource level; e.g. ARM limits the number of write calls per service principal to 1200/hour [2]. Since we ensure particular SPs per cluster via the Cloud Credential Operator, it should be relatively safe to disable client-side rate limiting. Orthogonally to this, some improvements to the rate limiting and backoff mechanisms are being added to the cloud provider [3].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1782516
[2] https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling
[3] kubernetes-sigs/cloud-provider-azure#247
Is your feature request related to a problem? / Why is this needed
From the Kubernetes cloud provider side, when a cluster grows to 500+ nodes, throttling issues are observed from ARM and CRP in the form of 429 responses. What's worse, the cluster is unable to recover from this state unless the controller manager is manually shut down. This is happening for a couple of reasons:
From the CRP side, when the call rate limit overage is severe, CRP will stop accepting any requests and will block all client calls for the particular operation group until the call rate goes down significantly for some minimum “penalty window”. This penalty window is what's returned in the Retry-After header. If a high number of concurrent callers rush to make the same calls again later, they can cause the hard throttling to start again. Exponential back-off on throttling errors is recommended to mitigate the throttling issue.
Describe the solution you'd like in detail
To support large clusters without throttling, the following changes are proposed for the short term (during Kubernetes v1.17):
1) Honor Retry-After headers and get rid of Go SDK silent retries
When the cloud provider reaches the request limit, ARM returns a 429 HTTP status code and a Retry-After header. The cloud provider should wait for Retry-After seconds before sending the next request. To get this header, the current Go SDK logic should be replaced with our own implementation. go-autorest has added a send decorator via context in #417; the Kubernetes cloud provider also needs to adopt it and switch to the cloud provider's own retry logic.
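As a rough sketch of what honoring Retry-After could look like (using a plain net/http client; this is not the actual cloud provider client code, and the helper names are invented):

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// retryAfter extracts the server-mandated delay from a 429 response, falling
// back to exponential backoff with jitter when the header is absent.
func retryAfter(resp *http.Response, attempt int) time.Duration {
	if resp != nil {
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				return time.Duration(secs) * time.Second
			}
		}
	}
	backoff := time.Duration(1<<attempt) * time.Second
	return backoff + time.Duration(rand.Int63n(int64(time.Second)))
}

// doWithRetry issues the request and, on HTTP 429, waits for the Retry-After
// delay before retrying, instead of letting the SDK retry immediately.
// It assumes a body-less request so the same *http.Request can be reused.
func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err = client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		wait := retryAfter(resp, attempt)
		fmt.Printf("throttled (429), waiting %s before retry %d\n", wait, attempt+1)
		resp.Body.Close()
		time.Sleep(wait)
	}
	return resp, fmt.Errorf("request still throttled after %d retries", maxRetries)
}

func main() {
	hits := 0
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits++
		if hits == 1 {
			w.Header().Set("Retry-After", "1") // ARM-style throttling response
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	resp, err := doWithRetry(srv.Client(), req, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("final status:", resp.StatusCode) // -> 200 after one throttled attempt
}
```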
2) Switching to LIST instead of GET for VMSS, VMSSVM and VM
After switching to the LIST API, the cloud provider would invoke LIST VMSS first and then LIST the instances of each VMSS. This way, the request count improves to O(number of VMSS). Multiple VMs are paged by LIST, and VMSS returns 120 VMs with instanceView per page.
When updating the caches for a VMSS, we should also avoid concurrent GETs by adding per-entry locks (an entry being the list of VMs in a VMSS).
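A hedged sketch of the per-entry locking idea, assuming a simple in-memory map keyed by VMSS name (the real cache keys and types in the cloud provider differ):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// vmssCache keeps one lock per scale set so that only one goroutine refreshes
// the VM list of a given VMSS at a time; concurrent callers wait for that
// refresh instead of issuing duplicate LIST calls.
type vmssCache struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // one lock per cache entry (per VMSS)
	data  map[string][]string    // VMSS name -> instance names
	list  func(vmss string) ([]string, error)
}

func newVMSSCache(list func(string) ([]string, error)) *vmssCache {
	return &vmssCache{
		locks: map[string]*sync.Mutex{},
		data:  map[string][]string{},
		list:  list,
	}
}

// entryLock lazily creates and returns the lock guarding a single VMSS entry.
func (c *vmssCache) entryLock(vmss string) *sync.Mutex {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.locks[vmss]
	if !ok {
		l = &sync.Mutex{}
		c.locks[vmss] = l
	}
	return l
}

// getVMs fills the entry with a single LIST call even under concurrency.
func (c *vmssCache) getVMs(vmss string) ([]string, error) {
	l := c.entryLock(vmss)
	l.Lock()
	defer l.Unlock()

	c.mu.Lock()
	vms, ok := c.data[vmss]
	c.mu.Unlock()
	if ok {
		return vms, nil
	}

	vms, err := c.list(vmss) // a paged LIST against ARM in the real provider
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.data[vmss] = vms
	c.mu.Unlock()
	return vms, nil
}

func main() {
	var calls int32
	c := newVMSSCache(func(vmss string) ([]string, error) {
		atomic.AddInt32(&calls, 1)
		time.Sleep(50 * time.Millisecond) // simulate a slow ARM LIST
		return []string{vmss + "_0", vmss + "_1"}, nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.getVMs("vmss-a")
		}()
	}
	wg.Wait()
	fmt.Println("LIST calls issued:", atomic.LoadInt32(&calls)) // -> 1
}
```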
3) Reuse the outdated caches on throttling
Though there are already caches in the cloud provider, when they are outdated or have just been PUT, ARM requests are required to refresh them. In cases where the subscription is throttled, all the caches may be outdated and there is no way to recover until ARM starts accepting requests again.
The Azure cloud provider refreshes the cache mainly for three changes: the power state, the LB configuration and the data disks. It's OK to reuse the outdated caches under the assumptions below:
When the requests are throttled, the cloud provider would retry with its backoff configuration. If it still fails with throttling errors after the retries, the cloud provider would extend the cache TTL based on the Retry-After header (which means the outdated caches would still be used for another Retry-After delay).
Reusing the caches works for most cases, but there is one exception: AzureDisk attach/detach failures. There are a couple of reasons that may cause these, and PUTting again with the cached VM would usually still fail. So for this case, the cloud provider would invalidate and refresh the cache.
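A small sketch of the TTL-extension behavior, assuming a simple timed cache; the error type and field names here are illustrative, not the cloud provider's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// cacheEntry holds cached ARM data together with its expiry.
type cacheEntry struct {
	data      string
	expiresAt time.Time
}

// throttlingError models a 429 carrying the Retry-After delay returned by ARM.
type throttlingError struct {
	retryAfter time.Duration
}

func (e *throttlingError) Error() string {
	return fmt.Sprintf("throttled, retry after %s", e.retryAfter)
}

// refresh tries to renew an expired entry. If ARM throttles the request even
// after retries, the stale entry is kept and its TTL is extended by the
// Retry-After delay, so callers keep using the outdated data instead of
// hammering ARM.
func refresh(entry *cacheEntry, ttl time.Duration, fetch func() (string, error)) *cacheEntry {
	if time.Now().Before(entry.expiresAt) {
		return entry // still fresh
	}
	data, err := fetch()
	if err != nil {
		if te, ok := err.(*throttlingError); ok {
			entry.expiresAt = time.Now().Add(te.retryAfter)
			return entry // reuse the outdated entry for another Retry-After window
		}
		return entry // other errors: also fall back to the stale entry in this sketch
	}
	return &cacheEntry{data: data, expiresAt: time.Now().Add(ttl)}
}

func main() {
	stale := &cacheEntry{data: "vm-running", expiresAt: time.Now().Add(-time.Minute)}
	got := refresh(stale, 10*time.Minute, func() (string, error) {
		return "", &throttlingError{retryAfter: 2 * time.Minute}
	})
	fmt.Println(got.data, "valid until", got.expiresAt.Format(time.Kitchen))
}
```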
When throttled, the x-ms-failure-cause header determines which service is throttling. We should log this so that it's easy to know why the request is throttled.
4) Extend cache TTL from 1 minute to 10 minutes
To reduce the total number of requests, 10 minutes would be used by default. The cache TTL would also be configurable via the cloud provider config file (cloud-config), so that customers can easily tune it later; see the sketch below.
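As an illustrative sketch only, the TTL override could be surfaced through cloud-config along these lines; the field name vmCacheTTLInSeconds is hypothetical, not the actual cloud provider option:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Config is an illustrative slice of the azure.json cloud-config; the field
// name below is hypothetical, not the exact key used by the cloud provider.
type Config struct {
	// VMCacheTTLInSeconds overrides the default 10-minute VM cache TTL.
	VMCacheTTLInSeconds int `json:"vmCacheTTLInSeconds,omitempty"`
}

// vmCacheTTL applies the 10-minute default when the field is unset.
func (c *Config) vmCacheTTL() time.Duration {
	if c.VMCacheTTLInSeconds <= 0 {
		return 10 * time.Minute
	}
	return time.Duration(c.VMCacheTTLInSeconds) * time.Second
}

func main() {
	raw := []byte(`{"vmCacheTTLInSeconds": 300}`)
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println("VM cache TTL:", cfg.vmCacheTTL()) // -> 5m0s
}
```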
Describe alternatives you've considered
Azure Resource Graph (ARG) also supports a higher volume of queries; however, it's not considered here because a few issues are not resolved yet:
Cross-project improvements
The cloud provider is not the only source of rate limit issues; there are also a lot of addons running on the cluster, e.g. cluster-autoscaler and pod identity.
All of those projects should reduce ARM requests and fix 429 issues. We're planning to implement this as a library in the cloud provider, so that other projects can easily vendor and reuse the same logic.
Work items
These work items target Kubernetes v1.18 and would be cherry-picked to v1.15-v1.17:
Rewrite the Azure clients to support per-client rate limiting, backoff retries and honoring of Retry-After headers. These items would only be included in v1.18: