[Autoscaler] [Azure] Cleaning up extra resources (MSI, VNET, NSG) during cluster teardown #57610
base: master
Conversation
…uster teardown
Signed-off-by: Mark Rossett <marosset@microsoft.com>
It looks like preserving some of these resources may have been by design, but I've also run into issues trying to run …
"Failed to delete role assignment %s before MSI creation: %s", | ||
role_assignment_guid, | ||
e, | ||
) |
Bug: Role Deletion Race Condition
The role assignment deletion logic has a race condition. If the initial query for an existing role assignment fails, the code still attempts deletion but then skips the post-deletion verification polling. This means deployment might proceed before the role assignment is fully removed, potentially causing failures.
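One way the verification gap could be closed is to always poll after issuing the delete, regardless of whether the pre-delete query succeeded. A minimal sketch, assuming hypothetical `query_fn`/`delete_fn` stand-ins for the Azure SDK calls (not the PR's actual helpers):

```python
import time

from azure.core.exceptions import ResourceNotFoundError


def ensure_role_assignment_deleted(query_fn, delete_fn, guid,
                                   max_polls=10, poll_interval=2.0):
    """Delete a role assignment, then poll until it is confirmed gone."""
    try:
        delete_fn(guid)
    except ResourceNotFoundError:
        return  # already gone; nothing to verify

    # Verify removal unconditionally, so callers never proceed to
    # deployment while the deletion is still in flight.
    for _ in range(max_polls):
        try:
            query_fn(guid)  # raises ResourceNotFoundError once gone
        except ResourceNotFoundError:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"role assignment {guid} still present after polling")
```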
Code Review
This pull request introduces a robust cleanup mechanism for Azure resources (MSI, VNet, NSG, role assignments) during cluster teardown (`ray down`). This is a valuable improvement that prevents resource leakage and subsequent errors on `ray up`. The implementation is thorough, using deterministic naming for cleanup, pre-emptive deletion to avoid conflicts, and retry logic to handle Azure's eventual consistency. My review includes a few suggestions to improve code clarity and reduce redundancy, such as refactoring a large cleanup function and removing some duplicate code.
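For context on the "deterministic naming" mentioned above: the cleanup path later recomputes the role-assignment GUID from the cluster name via `_generate_arm_guid`. A minimal sketch of one way such a deterministic GUID can be derived (the namespace choice is illustrative; the PR's actual implementation may differ):

```python
import uuid


def generate_arm_guid(name: str) -> str:
    # uuid5 is deterministic: the same namespace and name always produce
    # the same GUID, so teardown can recompute the exact role-assignment
    # ID that was created during "ray up" without storing extra state.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))


# e.g. generate_arm_guid("ray-mycluster-abc123-ra") is stable across runs
```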
cached_config["_config_cache_path"] = cache_key | ||
if "provider" in cached_config: | ||
cached_config["provider"]["_config_cache_path"] = cache_key |
The `_config_cache_path` is being added to both the root of `cached_config` and the `provider` sub-dictionary. The Azure node provider, which uses this path for cleanup, only reads it from `provider_config` (which corresponds to `cached_config["provider"]`). To avoid redundancy and improve clarity, you can remove the assignment to the root `cached_config`.
cached_config["_config_cache_path"] = cache_key | |
if "provider" in cached_config: | |
cached_config["provider"]["_config_cache_path"] = cache_key | |
if "provider" in cached_config: | |
cached_config["provider"]["_config_cache_path"] = cache_key |
resolved_config["_config_cache_path"] = cache_key | ||
resolved_config["provider"]["_config_cache_path"] = cache_key |
Similar to the change for cached configs, the `_config_cache_path` is added to both the root `resolved_config` and the `provider` sub-dictionary. Since the Azure node provider reads this from the `provider` config, the assignment to the root dictionary appears to be redundant and can be removed for clarity.
resolved_config["_config_cache_path"] = cache_key | |
resolved_config["provider"]["_config_cache_path"] = cache_key | |
resolved_config["provider"]["_config_cache_path"] = cache_key |
```python
def cleanup_cluster_resources(self):
    """Delete shared cluster infrastructure (MSI, NSG, Subnet, VNet).

    This should only be called after all cluster nodes have been terminated.
    These resources are referenced by their Azure resource IDs in provider_config.

    Note: Includes retry logic with exponential backoff to handle cases where
    NICs are still being deleted asynchronously.
    """
    resource_group = self.provider_config["resource_group"]
    msi_principal_id: Optional[str] = None

    # Extract resource names from Azure resource IDs
    # Format: /subscriptions/{sub}/resourceGroups/{rg}/providers/{provider}/{type}/{name}
    def get_resource_name_from_id(resource_id):
        if resource_id:
            return resource_id.split("/")[-1]
        return None

    # Helper function to retry deletion with exponential backoff
    def retry_delete(delete_fn, max_retries=5, initial_delay=2):
        """Retry a delete operation with exponential backoff."""
        delay = initial_delay
        for attempt in range(max_retries):
            try:
                return delete_fn()
            except Exception as e:
                error_msg = str(e)
                # Check if error is due to resources still being deleted
                if "InUse" in error_msg and attempt < max_retries - 1:
                    logger.info(
                        f"Resource still in use, retrying in {delay}s "
                        f"(attempt {attempt + 1}/{max_retries})..."
                    )
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
                else:
                    raise

    # Delete MSI (Managed Service Identity)
    msi_id = self.provider_config.get("msi")
    if msi_id:
        msi_name = get_resource_name_from_id(msi_id)
        if msi_name:
            try:
                get_identity = get_azure_sdk_function(
                    client=self.resource_client.resources,
                    function_name="get_by_id",
                )
                existing_msi = get_identity(msi_id, "2023-01-31")
                msi_principal_id = getattr(existing_msi, "properties", {}).get(
                    "principalId"
                )
            except ResourceNotFoundError:
                msi_principal_id = None
            except Exception as exc:
                logger.warning(
                    "Failed to query MSI %s for principal ID prior to deletion: %s",
                    msi_name,
                    exc,
                )
            try:
                logger.info(f"Deleting Managed Service Identity: {msi_name}")
                delete = get_azure_sdk_function(
                    client=self.resource_client.resources,
                    function_name="delete_by_id",
                )
                delete(resource_id=msi_id, api_version="2023-01-31").wait()
                logger.info(f"Successfully deleted MSI: {msi_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"MSI {msi_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete MSI {msi_name}: {e}")

    # Delete Subnet first (must be deleted before NSG and VNet)
    # Subnet ID format: .../virtualNetworks/{vnet}/subnets/{subnet}
    subnet_id = self.provider_config.get("subnet")
    vnet_name = None
    if subnet_id:
        subnet_name = get_resource_name_from_id(subnet_id)
        # Extract vnet name (second to last component)
        if subnet_id and "/virtualNetworks/" in subnet_id:
            parts = subnet_id.split("/")
            vnet_idx = parts.index("virtualNetworks")
            if vnet_idx + 1 < len(parts):
                vnet_name = parts[vnet_idx + 1]

        if subnet_name and vnet_name:
            try:
                logger.info(f"Deleting Subnet: {subnet_name} in VNet: {vnet_name}")

                def delete_subnet():
                    delete = get_azure_sdk_function(
                        client=self.network_client.subnets, function_name="delete"
                    )
                    delete(
                        resource_group_name=resource_group,
                        virtual_network_name=vnet_name,
                        subnet_name=subnet_name,
                    ).wait()

                retry_delete(delete_subnet)
                logger.info(f"Successfully deleted Subnet: {subnet_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"Subnet {subnet_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete Subnet {subnet_name}: {e}")

    # Delete NSG (Network Security Group) after subnet is deleted
    nsg_id = self.provider_config.get("nsg")
    if nsg_id:
        nsg_name = get_resource_name_from_id(nsg_id)
        if nsg_name:
            try:
                logger.info(f"Deleting Network Security Group: {nsg_name}")

                def delete_nsg():
                    delete = get_azure_sdk_function(
                        client=self.network_client.network_security_groups,
                        function_name="delete",
                    )
                    delete(
                        resource_group_name=resource_group,
                        network_security_group_name=nsg_name,
                    ).wait()

                retry_delete(delete_nsg)
                logger.info(f"Successfully deleted NSG: {nsg_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"NSG {nsg_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete NSG {nsg_name}: {e}")

    # Delete VNet (Virtual Network) after subnet and NSG are deleted
    if subnet_id and vnet_name:
        try:
            logger.info(f"Deleting Virtual Network: {vnet_name}")

            def delete_vnet():
                delete = get_azure_sdk_function(
                    client=self.network_client.virtual_networks,
                    function_name="delete",
                )
                delete(
                    resource_group_name=resource_group,
                    virtual_network_name=vnet_name,
                ).wait()

            retry_delete(delete_vnet)
            logger.info(f"Successfully deleted VNet: {vnet_name}")
        except ResourceNotFoundError:
            logger.info(
                f"VNet {vnet_name} not found, may have been already deleted"
            )
        except Exception as e:
            logger.warning(f"Failed to delete VNet {vnet_name}: {e}")

    # Delete the role assignment associated with this cluster (if any)
    subscription_id = self.provider_config.get("subscription_id")
    unique_id = self.provider_config.get("unique_id")
    if subscription_id and unique_id:
        cluster_id = f"{self.cluster_name}-{unique_id}"
        role_assignment_name = f"ray-{cluster_id}-ra"
        role_assignment_guid = _generate_arm_guid(role_assignment_name)
        role_assignment_id = (
            f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers"
            f"/Microsoft.Authorization/roleAssignments/{role_assignment_guid}"
        )

        if msi_principal_id:
            _delete_role_assignments_for_principal(
                self.resource_client, resource_group, msi_principal_id
            )

        delete_role_assignment = get_azure_sdk_function(
            client=self.resource_client.resources, function_name="delete_by_id"
        )
        try:
            delete_lro = delete_role_assignment(
                resource_id=role_assignment_id,
                api_version="2022-04-01",
            )
            if hasattr(delete_lro, "wait"):
                delete_lro.wait()
            logger.info(
                "Deleted role assignment %s for cluster %s",
                role_assignment_guid,
                self.cluster_name,
            )
        except ResourceNotFoundError:
            logger.debug(
                "Role assignment %s not found during cleanup",
                role_assignment_guid,
            )
        except Exception as e:
            logger.warning(
                "Failed to delete role assignment %s: %s",
                role_assignment_guid,
                e,
            )

    # Remove cached references to deleted shared resources.
    for key in ("msi", "nsg", "subnet"):
        if key in self.provider_config:
            self.provider_config.pop(key, None)

    cache_path = self.provider_config.get("_config_cache_path")
    if cache_path:
        try:
            if os.path.exists(cache_path):
                os.remove(cache_path)
                logger.info(
                    "Deleted cached Ray config at %s after resource cleanup",
                    cache_path,
                )
        except Exception as e:
            logger.warning(
                "Failed to delete cached Ray config %s: %s", cache_path, e
            )
        finally:
            self.provider_config.pop("_config_cache_path", None)
```
This function is quite long and handles the deletion of multiple distinct resource types. To improve readability and maintainability, consider refactoring this into smaller, more focused private methods for each resource type being cleaned up (e.g., `_cleanup_msi`, `_cleanup_subnet`, `_cleanup_nsg`, `_cleanup_vnet`, and `_cleanup_role_assignments`). Each of these helper methods could encapsulate the logic for retrieving the resource name, deleting it, and handling specific errors, returning any necessary information (like the MSI principal ID) to the main `cleanup_cluster_resources` function. This would make the main function a much clearer, high-level sequence of cleanup steps.
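A minimal skeleton of that suggested structure, with illustrative method names and return values (not code from the PR):

```python
def cleanup_cluster_resources(self):
    """High-level teardown sequence; each step is a focused helper."""
    msi_principal_id = self._cleanup_msi()   # returns principal ID or None
    vnet_name = self._cleanup_subnet()       # returns enclosing VNet name
    self._cleanup_nsg()
    self._cleanup_vnet(vnet_name)
    self._cleanup_role_assignments(msi_principal_id)

def _cleanup_msi(self):
    """Delete the cluster MSI; return its principal ID (or None) so the
    role-assignment step can also remove assignments for that principal."""
    msi_id = self.provider_config.get("msi")
    if not msi_id:
        return None
    # ... query principal ID, issue delete, handle ResourceNotFoundError ...
    return None
```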
@jackfrancis could you review this one?
Why are these changes needed?

After running `ray down`, several resources (a managed service identity, network security group, etc.) are left in the subscription, and re-running `ray up` (without specifying `--no-config-cache`) will sometimes run into errors because of this.

Related issue number

Fixes: #55392
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.