ISSUE TYPE
- Bug Report
COMPONENT NAME
VM Autoscale Feature
CLOUDSTACK VERSION
4.19.0
CONFIGURATION
Any Autoscale VM Setup
OS / ENVIRONMENT
Linux Ubuntu
SUMMARY
When an Autoscale Group is set up, it works fine under normal conditions.
However, when the Autoscale Group is given a template that cannot start a VM (for whatever reason, e.g. corrupted template, storage issues, etc.), the VM goes through the start -> stop -> error states.
When a VM enters the error state, the Autoscale Group creates another VM as a retry.
If that VM also fails, it tries again, and this repeats in an infinite loop.
We hit this issue ourselves when our storage service went down. We normally have approximately 15 VMs in our setup, but due to this bug we ended up with 30,000 VMs because autoscaling never stopped. Refer to screenshot below:
Note: Max Replica = 4 and Min Replica = 2 for each Autoscale Rule.
To resolve this issue, we had to disable the autoscale rules and manually delete the affected VMs from the DB using the command below:
update cloud.vm_instance set state='Destroyed', removed ='2024-06-30 02:59:49' where name like 'autoscale%' and state = 'error'
We managed to resolve it ourselves because we use it within known groups of users. However, if this had happened on a public cloud, where the users are unknown, it would have been catastrophic, because we would have had to contact each and every customer to repeat the above steps. (A cloud operator should not have access to, or interfere with, customer environments.)
STEPS TO REPRODUCE
Upload a template that cannot be spun up (e.g. corrupted) and use it in an Autoscale Group, or simulate a storage failure.
EXPECTED RESULTS
A maximum number of retries when a scale-up attempt results in a VM in the 'Error' state, to prevent infinite scaling.
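The retry cap requested here could look roughly like the sketch below. This is only an illustration of the idea, not CloudStack code: `ScaleUpGuard`, `max_retries`, and the `deploy_vm` callback are hypothetical names, and the guard assumes the caller can tell whether a deployment ended in a running VM or in the 'Error' state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScaleUpGuard:
    """Hypothetical guard that caps consecutive failed scale-up attempts."""

    max_retries: int = 3           # assumed setting, not an existing CloudStack option
    consecutive_failures: int = 0
    suspended: bool = False

    def try_scale_up(self, deploy_vm: Callable[[], bool]) -> bool:
        """Attempt one deployment; return True if the VM started successfully."""
        if self.suspended:
            # Cap already reached: do not create any more VMs.
            return False
        if deploy_vm():
            # A successful start resets the failure counter.
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_retries:
            # Stop scaling until an operator investigates and re-enables it.
            self.suspended = True
        return False


# Usage: a deployment that always fails (e.g. corrupted template).
attempts = []


def broken_deploy() -> bool:
    attempts.append(1)  # record that a VM creation was attempted
    return False        # VM ended in the 'Error' state


guard = ScaleUpGuard(max_retries=3)
for _ in range(100):    # 100 monitoring cycles
    guard.try_scale_up(broken_deploy)

print(len(attempts), guard.suspended)
```

With a guard like this, 100 monitoring cycles against a broken template produce only `max_retries` VM creation attempts, after which the group stays suspended instead of creating VMs indefinitely.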
ACTUAL RESULTS
Infinite VM scaling, causing huge resource consumption and bottlenecking our cloud services.