Support large clusters without throttling #247
Comments
/cc @craiglpeters |
We've also suffered from this problem when we had 9+ small clusters with autoscaler. Each cluster had no more than 5 VMs. |
We are also getting this issue in AKS Engine. We documented the whole issue, base replication case, current environment, and attempts tried over at Azure/aks-engine#1860 (comment). Kubernetes version: v1.15.5 (1 master, 4 nodes) |
Looks like since |
In an effort to reduce the number of calls that are made in an idle cluster, we have merged kubernetes/kubernetes#83685. What does this PR change? It introduces two read types for data in the cache:
- Default read type: if the data in the cache is expired, new data is fetched from the cloud, stored in the cache and returned to the caller.

For instance, in a cluster with 1 VMSS, 100 nodes and 100 disks, the majority of calls from the cloud provider are performed as part of the reconciliation loop run by the volume controller. This loop runs every 1 minute to ensure all mounted disks are attached, which means that every minute a call is made for each disk to verify it is attached; in this scenario that can account for ~6k-7k calls/h. Every time a disk is attached/detached, the cache entry for the VM instance is invalidated. The next time the reconcile loop runs, new data is fetched and we can validate that the state looks fine after the initial Attach/Detach. This means it's safe to read stale data from the cache for the disk reconcile loop. With this change, the total number of calls to ARM for an idle cluster with 100 disks is greatly reduced.

Behavioral changes
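For illustration only, here is a minimal Go sketch of the stale-read idea described above; it is not the implementation merged in kubernetes/kubernetes#83685, and the names (`TimedCache`, `ReadDefault`, `ReadUnsafe`) are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ReadType controls whether an expired cache entry may still be served.
type ReadType int

const (
	// ReadDefault refetches from the cloud when the entry has expired.
	ReadDefault ReadType = iota
	// ReadUnsafe returns the expired (stale) entry without a cloud call,
	// which is acceptable for the periodic disk-reconcile loop.
	ReadUnsafe
)

type entry struct {
	data      interface{}
	updatedAt time.Time
}

// TimedCache is an illustrative cache keyed by resource name.
type TimedCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	store  map[string]*entry
	getter func(key string) (interface{}, error) // fetches from ARM on miss/expiry
}

func NewTimedCache(ttl time.Duration, getter func(string) (interface{}, error)) *TimedCache {
	return &TimedCache{ttl: ttl, store: map[string]*entry{}, getter: getter}
}

// Get returns cached data, honoring the requested read type.
func (c *TimedCache) Get(key string, rt ReadType) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.store[key]
	fresh := ok && time.Since(e.updatedAt) < c.ttl
	if fresh || (ok && rt == ReadUnsafe) {
		return e.data, nil // serve fresh data, or stale data when the caller allows it
	}

	data, err := c.getter(key)
	if err != nil {
		return nil, err
	}
	c.store[key] = &entry{data: data, updatedAt: time.Now()}
	return data, nil
}

func main() {
	calls := 0
	c := NewTimedCache(time.Minute, func(key string) (interface{}, error) {
		calls++
		return fmt.Sprintf("vm-status-%s", key), nil
	})
	c.Get("vmss-0_0", ReadDefault) // first read fetches from the "cloud"
	c.Get("vmss-0_0", ReadUnsafe)  // reconcile loop reuses the cached entry
	fmt.Println("cloud calls:", calls) // -> 1
}
```

Only the first read triggers a cloud call; subsequent reconcile-loop reads tolerate stale data and stay local.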
|
Is action 2 (switching to LIST instead of GET for VMSS, VMSSVM and VM) applicable to the VMAS implementation of cluster-autoscaler? |
@tdihp we're planning to do that, but it's not added yet in cluster-autoscaler. |
/cc |
Does kubernetes/autoscaler#2527 address the usage of LIST instead of GET in the cluster-autoscaler? |
/milestone v1.18 |
yeah I think so. |
Status update: validation results on Kubernetes v1.18
Validation scenario: scaling one VMSS up and down between 1 node and 100 nodes.
Without the fix (v1.18.0-alpha.1):
With the fix (latest master branch):
Here is a chart of the number of throttled and successful requests: |
Since code freeze has started in Kubernetes, the unit tests part will be moved to v1.19 (tracked by #310). |
Most items have been done, and the remaining tests (including e2e and unit tests) are tracked in separate issues. /close |
@feiskyer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Client-side rate limiting is proving problematic for fresh installs and scaling operations [1]. Azure ARM throttling is applied at the subscription level, so client-side rate limiting helps prevent clusters sharing the same subscription from disrupting each other. However, there are lower limits that apply at the SP/tenant and resource level; e.g. ARM limits the number of write calls per service principal to 1200/hour [2]. Since we ensure particular SPs per cluster via the Cloud Credential Operator, it should be relatively safe to disable client-side rate limiting. Orthogonally to this, some improvements to the rate limiting and backoff mechanisms are being added to the cloud provider [3].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1782516
[2] https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling
[3] kubernetes-sigs/cloud-provider-azure#247
Is your feature request related to a problem? / Why is this needed
From the Kubernetes cloud provider side, when a cluster grows to 500+ nodes, throttling issues are observed from ARM and CRP in the form of 429 responses. What's worse, the cluster is unable to recover from this state unless the controller manager is manually shut down. This is happening for a couple of reasons:
From the CRP side, when the call rate limit overage is severe, CRP will stop accepting any requests and will block all client calls for the particular operation group until the call rate goes down significantly for some minimum “penalty window”. This penalty window is what's returned in the Retry-After header. If a high number of concurrent callers rush to make the same calls again later, they can cause the hard throttling to start again. Exponential back-off on throttling errors is recommended to mitigate the throttling issue.
Describe the solution you'd like in detail
To support large clusters without throttling, the following changes are proposed for the short term (during Kubernetes v1.17):
1) Honor Retry-After headers and get rid of Go SDK silent retries
When the cloud provider reaches the request limit, ARM returns a 429 HTTP status code and a Retry-After header. The cloud provider should wait for Retry-After seconds before sending the next request. To get this header, the current Go SDK logic should be replaced with our own implementation. go-autorest has added a send decorator via context in #417; the Kubernetes cloud provider also needs to adopt it and switch to the cloud provider's own retry logic.
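As a rough sketch of what honoring Retry-After could look like (using a plain net/http client; this is not the actual cloud provider client code, and the helper names are invented):

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// retryAfter extracts the server-mandated delay from a 429 response, falling
// back to exponential backoff with jitter when the header is absent.
func retryAfter(resp *http.Response, attempt int) time.Duration {
	if resp != nil {
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				return time.Duration(secs) * time.Second
			}
		}
	}
	backoff := time.Duration(1<<attempt) * time.Second
	return backoff + time.Duration(rand.Int63n(int64(time.Second)))
}

// doWithRetry issues the request and, on HTTP 429, waits for the Retry-After
// delay before retrying, instead of letting the SDK retry immediately.
// It assumes a body-less request so the same *http.Request can be reused.
func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	var resp *http.Response
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err = client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		wait := retryAfter(resp, attempt)
		fmt.Printf("throttled (429), waiting %s before retry %d\n", wait, attempt+1)
		resp.Body.Close()
		time.Sleep(wait)
	}
	return resp, fmt.Errorf("request still throttled after %d retries", maxRetries)
}

func main() {
	hits := 0
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits++
		if hits == 1 {
			w.Header().Set("Retry-After", "1") // ARM-style throttling response
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	resp, err := doWithRetry(srv.Client(), req, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("final status:", resp.StatusCode) // -> 200 after one throttled attempt
}
```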
2) Switching to LIST instead of GET for VMSS, VMSSVM and VM
After switching to the LIST API, the cloud provider would invoke LIST VMSS first and then LIST the instances of each VMSS. This way, the request count improves to O(number of VMSS). Multiple VMs are paged by LIST, and VMSS returns 120 VMs with instanceView per page.
When updating the caches for a VMSS, we should also avoid concurrent GETs by adding per-entry locks (an entry being the list of VMs in a VMSS).
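A hedged sketch of the per-entry locking idea, assuming a simple in-memory map keyed by VMSS name (the real cache keys and types in the cloud provider differ):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// vmssCache keeps one lock per scale set so that only one goroutine refreshes
// the VM list of a given VMSS at a time; concurrent callers wait for that
// refresh instead of issuing duplicate LIST calls.
type vmssCache struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // one lock per cache entry (per VMSS)
	data  map[string][]string    // VMSS name -> instance names
	list  func(vmss string) ([]string, error)
}

func newVMSSCache(list func(string) ([]string, error)) *vmssCache {
	return &vmssCache{
		locks: map[string]*sync.Mutex{},
		data:  map[string][]string{},
		list:  list,
	}
}

// entryLock lazily creates and returns the lock guarding a single VMSS entry.
func (c *vmssCache) entryLock(vmss string) *sync.Mutex {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.locks[vmss]
	if !ok {
		l = &sync.Mutex{}
		c.locks[vmss] = l
	}
	return l
}

// getVMs fills the entry with a single LIST call even under concurrency.
func (c *vmssCache) getVMs(vmss string) ([]string, error) {
	l := c.entryLock(vmss)
	l.Lock()
	defer l.Unlock()

	c.mu.Lock()
	vms, ok := c.data[vmss]
	c.mu.Unlock()
	if ok {
		return vms, nil
	}

	vms, err := c.list(vmss) // a paged LIST against ARM in the real provider
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.data[vmss] = vms
	c.mu.Unlock()
	return vms, nil
}

func main() {
	var calls int32
	c := newVMSSCache(func(vmss string) ([]string, error) {
		atomic.AddInt32(&calls, 1)
		time.Sleep(50 * time.Millisecond) // simulate a slow ARM LIST
		return []string{vmss + "_0", vmss + "_1"}, nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.getVMs("vmss-a")
		}()
	}
	wg.Wait()
	fmt.Println("LIST calls issued:", atomic.LoadInt32(&calls)) // -> 1
}
```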
3) Reuse the outdated caches on throttling
Though there are already caches in the cloud provider, when they are outdated or have just been PUT, ARM requests are required to refresh them. In cases where the subscription is throttled, all the caches may be outdated and there is no way to recover until ARM starts accepting requests again.
The Azure cloud provider refreshes the cache mainly for three changes: the power state, the LB configuration and the data disks. It's OK to reuse the outdated caches under the assumptions below:
When the requests are throttled, the cloud provider would retry with its backoff configuration. If it still fails with throttling errors after the retries, the cloud provider would extend the cache TTL based on the Retry-After header (which means the outdated caches would still be used for another Retry-After delay).
Reusing the caches works for most cases, but there is one exception: AzureDisk attach/detach failures. There are a couple of reasons that may cause these, and PUTting again with the cached VM would usually still fail. So for this case, the cloud provider would invalidate and refresh the cache.
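A small sketch of the TTL-extension behavior, assuming a simple timed cache; the error type and field names here are illustrative, not the cloud provider's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// cacheEntry holds cached ARM data together with its expiry.
type cacheEntry struct {
	data      string
	expiresAt time.Time
}

// throttlingError models a 429 carrying the Retry-After delay returned by ARM.
type throttlingError struct {
	retryAfter time.Duration
}

func (e *throttlingError) Error() string {
	return fmt.Sprintf("throttled, retry after %s", e.retryAfter)
}

// refresh tries to renew an expired entry. If ARM throttles the request even
// after retries, the stale entry is kept and its TTL is extended by the
// Retry-After delay, so callers keep using the outdated data instead of
// hammering ARM.
func refresh(entry *cacheEntry, ttl time.Duration, fetch func() (string, error)) *cacheEntry {
	if time.Now().Before(entry.expiresAt) {
		return entry // still fresh
	}
	data, err := fetch()
	if err != nil {
		if te, ok := err.(*throttlingError); ok {
			entry.expiresAt = time.Now().Add(te.retryAfter)
			return entry // reuse the outdated entry for another Retry-After window
		}
		return entry // other errors: also fall back to the stale entry in this sketch
	}
	return &cacheEntry{data: data, expiresAt: time.Now().Add(ttl)}
}

func main() {
	stale := &cacheEntry{data: "vm-running", expiresAt: time.Now().Add(-time.Minute)}
	got := refresh(stale, 10*time.Minute, func() (string, error) {
		return "", &throttlingError{retryAfter: 2 * time.Minute}
	})
	fmt.Println(got.data, "valid until", got.expiresAt.Format(time.Kitchen))
}
```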
When throttled, the x-ms-failure-cause header determines which service is throttling. We should log this so that it's easy to know why the request is throttled.
4) Extend cache TTL from 1 minute to 10 minutes
To reduce the total number of requests, 10 minutes would be used by default. The cache TTL would also be configurable via the cloud provider config file (cloud-config), so that customers can easily tune it later; see the sketch below.
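As an illustrative sketch only, the TTL override could be surfaced through cloud-config along these lines; the field name vmCacheTTLInSeconds is hypothetical, not the actual cloud provider option:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Config is an illustrative slice of the azure.json cloud-config; the field
// name below is hypothetical, not the exact key used by the cloud provider.
type Config struct {
	// VMCacheTTLInSeconds overrides the default 10-minute VM cache TTL.
	VMCacheTTLInSeconds int `json:"vmCacheTTLInSeconds,omitempty"`
}

// vmCacheTTL applies the 10-minute default when the field is unset.
func (c *Config) vmCacheTTL() time.Duration {
	if c.VMCacheTTLInSeconds <= 0 {
		return 10 * time.Minute
	}
	return time.Duration(c.VMCacheTTLInSeconds) * time.Second
}

func main() {
	raw := []byte(`{"vmCacheTTLInSeconds": 300}`)
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println("VM cache TTL:", cfg.vmCacheTTL()) // -> 5m0s
}
```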
Describe alternatives you've considered
Azure Resource Graph (ARG) also supports a higher volume of queries; however, it's not considered here because a few issues are not resolved yet:
Cross-project improvements
The cloud provider is not the only source of rate limit issues; there are also a lot of addons running on the cluster, e.g. cluster-autoscaler and pod identity.
All of those projects should reduce ARM requests and fix 429 issues. We're planning to implement this as a library in the cloud provider, so that other projects can easily vendor and reuse the same logic.
Work items
These work items target Kubernetes v1.18 and would be cherry-picked to v1.15-v1.17:
Rewrite the Azure clients to support per-client rate limiting, backoff retries and honoring of Retry-After headers. These items would only be included in v1.18: