CA(AliCloud):Large sudden increase in instances not managed by ASG can impact CA performance #6748
Labels: area/cluster-autoscaler, kind/feature, lifecycle/rotten
Currently, the implementation of the `NodeGroupForNode` function in the AliCloudProvider relies on two caches within `autoScalingGroups`: `instanceToAsg` and `instancesNotInManagedAsg`. `instanceToAsg` records instances in managed ASGs. `instancesNotInManagedAsg` records instances not in managed ASGs.

When an unknown instance is encountered, a `regenerateCache` is triggered. If the instance turns out not to belong to a managed ASG, it is saved to `instancesNotInManagedAsg`, which avoids calling `regenerateCache` again on subsequent lookups of the same instance.

This functionality is correct. However, if a large number of nodes are created from instances not in managed ASGs, the first call to `NodeGroupForNode->GetAsgForInstance->FindForInstance` for each of those instances will trigger a `regenerateCache`. If the managed ASGs also contain a significant number of instances at that time, `regenerateCache` is called frequently, and each call carries a considerable time overhead.

For example, if there are 1000 machines in managed ASGs and 2000 instances not in managed ASGs are suddenly added:
`UpdateNodes->updateReadinessStats->NodeGroupForNode`

This call chain is executed on every `runOnce`, and each node invokes `NodeGroupForNode` once. Each of the 2000 newly added instances not in managed ASGs triggers a `regenerateCache`. The regeneration only enumerates instances in managed ASGs, yet it still executes 2000 times. Assuming each `regenerateCache` call takes 10 seconds, this function will take:

2000 instances * 10 seconds/instance = 20000 seconds (about 5.5 hours).
This significantly extends the duration of a single `runOnce` operation. During this period, if new pending pods appear, the Cluster Autoscaler might not function as expected, posing a severe risk.
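
The behavior described above can be reproduced with a small simulation. The sketch below is not the real provider code: `asgCache`, `findForInstance`, `simulateLoop`, and `buildScenario` are hypothetical names, the lookup order is an assumption based on the description in this issue, and the cloud API is replaced by an in-memory map. It only counts how many times `regenerateCache` would run during one pass over all nodes, and multiplies that count by the 10 seconds per call assumed above.

```go
package main

import (
	"fmt"
	"time"
)

// asgCache is a simplified, hypothetical stand-in for the provider's
// autoScalingGroups cache. It is not the real cluster-autoscaler type.
type asgCache struct {
	instanceToAsg            map[string]string   // instance ID -> managed ASG ID
	instancesNotInManagedAsg map[string]struct{} // instances known to be unmanaged
	managed                  map[string]string   // stands in for the cloud API
	regenerations            int                 // number of regenerateCache calls
}

// regenerateCache rebuilds instanceToAsg from the managed ASGs only.
func (c *asgCache) regenerateCache() {
	c.regenerations++
	fresh := make(map[string]string, len(c.managed))
	for id, asg := range c.managed {
		fresh[id] = asg
	}
	c.instanceToAsg = fresh
}

// findForInstance follows the lookup order assumed from the issue text:
// check both caches, regenerate on a miss, then remember unmanaged instances.
func (c *asgCache) findForInstance(id string) (string, bool) {
	if _, unmanaged := c.instancesNotInManagedAsg[id]; unmanaged {
		return "", false
	}
	if asg, ok := c.instanceToAsg[id]; ok {
		return asg, true
	}
	c.regenerateCache() // unknown instance: one expensive regeneration
	if asg, ok := c.instanceToAsg[id]; ok {
		return asg, true
	}
	c.instancesNotInManagedAsg[id] = struct{}{}
	return "", false
}

// simulateLoop looks up every node once, as a single runOnce pass would,
// and returns how many regenerations that pass caused.
func simulateLoop(c *asgCache, ids []string) int {
	before := c.regenerations
	for _, id := range ids {
		c.findForInstance(id)
	}
	return c.regenerations - before
}

// buildScenario creates a warm cache holding managedCount managed instances
// and returns it with the IDs of all nodes, managed and unmanaged.
func buildScenario(managedCount, unmanagedCount int) (*asgCache, []string) {
	c := &asgCache{
		instanceToAsg:            map[string]string{},
		instancesNotInManagedAsg: map[string]struct{}{},
		managed:                  map[string]string{},
	}
	ids := make([]string, 0, managedCount+unmanagedCount)
	for i := 0; i < managedCount; i++ {
		id := fmt.Sprintf("i-managed-%d", i)
		c.managed[id] = "asg-1"
		c.instanceToAsg[id] = "asg-1" // already cached before the burst
		ids = append(ids, id)
	}
	for i := 0; i < unmanagedCount; i++ {
		ids = append(ids, fmt.Sprintf("i-unmanaged-%d", i))
	}
	return c, ids
}

func main() {
	c, ids := buildScenario(1000, 2000)
	first := simulateLoop(c, ids)
	second := simulateLoop(c, ids)
	perCall := 10 * time.Second // per-call cost assumed in the issue
	fmt.Println("first pass regenerations:", first)
	fmt.Println("second pass regenerations:", second)
	fmt.Println("estimated first-pass overhead:", time.Duration(first)*perCall)
}
```

Under these assumptions the first pass performs one regeneration per unmanaged instance (2000 in total) and the second pass performs none, which matches the claim that the cost is concentrated in the `runOnce` that first sees the new instances.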