
Conversation

Contributor

@marosset marosset commented Oct 9, 2025

Why are these changes needed?

After running ray down, several resources (a managed service identity, network security group, etc.) are left in the subscription, and re-running ray up (without specifying --no-config-cache) will sometimes run into errors because of this.

Related issue number

Fixes: #55392

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…uster teardown

Signed-off-by: Mark Rossett <marosset@microsoft.com>
@marosset marosset requested a review from a team as a code owner October 9, 2025 20:07
Contributor Author

marosset commented Oct 9, 2025

It looks like preserving some of these resources may have been by design, but I've also run into issues trying to run ray up with the same config file multiple times due to these resources staying around.

"Failed to delete role assignment %s before MSI creation: %s",
role_assignment_guid,
e,
)

Bug: Role Deletion Race Condition

The role assignment deletion logic has a race condition. If the initial query for an existing role assignment fails, the code still attempts deletion but then skips the post-deletion verification polling. This means deployment might proceed before the role assignment is fully removed, potentially causing failures.
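A minimal sketch of one way to close that gap, assuming a hypothetical poll_until_gone helper that always verifies removal after the delete call, whether or not the initial lookup succeeded (the helper and its parameters are illustrative, not part of this PR):

import time

from azure.core.exceptions import ResourceNotFoundError


def poll_until_gone(get_by_id, resource_id, api_version, timeout_s=60, interval_s=2):
    """Return True once the resource lookup raises ResourceNotFoundError, False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            get_by_id(resource_id, api_version)
        except ResourceNotFoundError:
            return True  # role assignment is fully gone; safe to proceed with MSI creation
        time.sleep(interval_s)
    return False  # caller can log a warning or raise before proceeding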


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust cleanup mechanism for Azure resources (MSI, VNet, NSG, Role Assignments) during cluster teardown (ray down). This is a valuable improvement that prevents resource leakage and subsequent errors on ray up. The implementation is thorough, using deterministic naming for cleanup, pre-emptive deletion to avoid conflicts, and retry logic to handle Azure's eventual consistency. My review includes a few suggestions to improve code clarity and reduce redundancy, such as refactoring a large cleanup function and removing some duplicate code.
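As an illustration of the deterministic-naming idea, a helper such as _generate_arm_guid can derive the same GUID on every run from the role assignment name, so ray down can recompute what ray up created. A sketch assuming a name-based uuid5 (the namespace value here is illustrative, not necessarily the one used in this PR):

import uuid

# Any fixed namespace UUID works; the same (namespace, name) pair always
# yields the same GUID, which is what makes cleanup deterministic.
_RAY_AZURE_NAMESPACE = uuid.UUID("a7f3c5d1-0000-4000-8000-000000000000")


def _generate_arm_guid(name: str) -> str:
    """Derive a stable role-assignment GUID from a name like 'ray-<cluster>-<unique_id>-ra'."""
    return str(uuid.uuid5(_RAY_AZURE_NAMESPACE, name))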

Comment on lines +372 to +374
cached_config["_config_cache_path"] = cache_key
if "provider" in cached_config:
    cached_config["provider"]["_config_cache_path"] = cache_key

medium

The _config_cache_path is being added to both the root of cached_config and the provider sub-dictionary. The Azure node provider, which uses this path for cleanup, only reads it from provider_config (which corresponds to cached_config["provider"]). To avoid redundancy and improve clarity, you can remove the assignment to the root cached_config.

Suggested change
-cached_config["_config_cache_path"] = cache_key
-if "provider" in cached_config:
-    cached_config["provider"]["_config_cache_path"] = cache_key
+if "provider" in cached_config:
+    cached_config["provider"]["_config_cache_path"] = cache_key

Comment on lines +422 to +423
resolved_config["_config_cache_path"] = cache_key
resolved_config["provider"]["_config_cache_path"] = cache_key

medium

Similar to the change for cached configs, the _config_cache_path is added to both the root resolved_config and the provider sub-dictionary. Since the Azure node provider reads this from the provider config, the assignment to the root dictionary appears to be redundant and can be removed for clarity.

Suggested change
-resolved_config["_config_cache_path"] = cache_key
-resolved_config["provider"]["_config_cache_path"] = cache_key
+resolved_config["provider"]["_config_cache_path"] = cache_key

Comment on lines +648 to +874
def cleanup_cluster_resources(self):
    """Delete shared cluster infrastructure (MSI, NSG, Subnet, VNet).

    This should only be called after all cluster nodes have been terminated.
    These resources are referenced by their Azure resource IDs in provider_config.

    Note: Includes retry logic with exponential backoff to handle cases where
    NICs are still being deleted asynchronously.
    """
    resource_group = self.provider_config["resource_group"]
    msi_principal_id: Optional[str] = None

    # Extract resource names from Azure resource IDs
    # Format: /subscriptions/{sub}/resourceGroups/{rg}/providers/{provider}/{type}/{name}
    def get_resource_name_from_id(resource_id):
        if resource_id:
            return resource_id.split("/")[-1]
        return None

    # Helper function to retry deletion with exponential backoff
    def retry_delete(delete_fn, max_retries=5, initial_delay=2):
        """Retry a delete operation with exponential backoff."""
        delay = initial_delay
        for attempt in range(max_retries):
            try:
                return delete_fn()
            except Exception as e:
                error_msg = str(e)
                # Check if error is due to resources still being deleted
                if "InUse" in error_msg and attempt < max_retries - 1:
                    logger.info(
                        f"Resource still in use, retrying in {delay}s (attempt {attempt + 1}/{max_retries})..."
                    )
                    time.sleep(delay)
                    delay *= 2  # Exponential backoff
                else:
                    raise

    # Delete MSI (Managed Service Identity)
    msi_id = self.provider_config.get("msi")
    if msi_id:
        msi_name = get_resource_name_from_id(msi_id)
        if msi_name:
            try:
                get_identity = get_azure_sdk_function(
                    client=self.resource_client.resources,
                    function_name="get_by_id",
                )
                existing_msi = get_identity(msi_id, "2023-01-31")
                msi_principal_id = getattr(existing_msi, "properties", {}).get(
                    "principalId"
                )
            except ResourceNotFoundError:
                msi_principal_id = None
            except Exception as exc:
                logger.warning(
                    "Failed to query MSI %s for principal ID prior to deletion: %s",
                    msi_name,
                    exc,
                )
            try:
                logger.info(f"Deleting Managed Service Identity: {msi_name}")
                delete = get_azure_sdk_function(
                    client=self.resource_client.resources,
                    function_name="delete_by_id",
                )
                delete(resource_id=msi_id, api_version="2023-01-31").wait()
                logger.info(f"Successfully deleted MSI: {msi_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"MSI {msi_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete MSI {msi_name}: {e}")

    # Delete Subnet first (must be deleted before NSG and VNet)
    # Subnet ID format: .../virtualNetworks/{vnet}/subnets/{subnet}
    subnet_id = self.provider_config.get("subnet")
    vnet_name = None
    if subnet_id:
        subnet_name = get_resource_name_from_id(subnet_id)
        # Extract vnet name (second to last component)
        if subnet_id and "/virtualNetworks/" in subnet_id:
            parts = subnet_id.split("/")
            vnet_idx = parts.index("virtualNetworks")
            if vnet_idx + 1 < len(parts):
                vnet_name = parts[vnet_idx + 1]

        if subnet_name and vnet_name:
            try:
                logger.info(f"Deleting Subnet: {subnet_name} in VNet: {vnet_name}")

                def delete_subnet():
                    delete = get_azure_sdk_function(
                        client=self.network_client.subnets, function_name="delete"
                    )
                    delete(
                        resource_group_name=resource_group,
                        virtual_network_name=vnet_name,
                        subnet_name=subnet_name,
                    ).wait()

                retry_delete(delete_subnet)
                logger.info(f"Successfully deleted Subnet: {subnet_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"Subnet {subnet_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete Subnet {subnet_name}: {e}")

    # Delete NSG (Network Security Group) after subnet is deleted
    nsg_id = self.provider_config.get("nsg")
    if nsg_id:
        nsg_name = get_resource_name_from_id(nsg_id)
        if nsg_name:
            try:
                logger.info(f"Deleting Network Security Group: {nsg_name}")

                def delete_nsg():
                    delete = get_azure_sdk_function(
                        client=self.network_client.network_security_groups,
                        function_name="delete",
                    )
                    delete(
                        resource_group_name=resource_group,
                        network_security_group_name=nsg_name,
                    ).wait()

                retry_delete(delete_nsg)
                logger.info(f"Successfully deleted NSG: {nsg_name}")
            except ResourceNotFoundError:
                logger.info(
                    f"NSG {nsg_name} not found, may have been already deleted"
                )
            except Exception as e:
                logger.warning(f"Failed to delete NSG {nsg_name}: {e}")

    # Delete VNet (Virtual Network) after subnet and NSG are deleted
    if subnet_id and vnet_name:
        try:
            logger.info(f"Deleting Virtual Network: {vnet_name}")

            def delete_vnet():
                delete = get_azure_sdk_function(
                    client=self.network_client.virtual_networks,
                    function_name="delete",
                )
                delete(
                    resource_group_name=resource_group,
                    virtual_network_name=vnet_name,
                ).wait()

            retry_delete(delete_vnet)
            logger.info(f"Successfully deleted VNet: {vnet_name}")
        except ResourceNotFoundError:
            logger.info(
                f"VNet {vnet_name} not found, may have been already deleted"
            )
        except Exception as e:
            logger.warning(f"Failed to delete VNet {vnet_name}: {e}")

    # Delete the role assignment associated with this cluster (if any)
    subscription_id = self.provider_config.get("subscription_id")
    unique_id = self.provider_config.get("unique_id")
    if subscription_id and unique_id:
        cluster_id = f"{self.cluster_name}-{unique_id}"
        role_assignment_name = f"ray-{cluster_id}-ra"
        role_assignment_guid = _generate_arm_guid(role_assignment_name)
        role_assignment_id = (
            f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers"
            f"/Microsoft.Authorization/roleAssignments/{role_assignment_guid}"
        )

        if msi_principal_id:
            _delete_role_assignments_for_principal(
                self.resource_client, resource_group, msi_principal_id
            )

        delete_role_assignment = get_azure_sdk_function(
            client=self.resource_client.resources, function_name="delete_by_id"
        )
        try:
            delete_lro = delete_role_assignment(
                resource_id=role_assignment_id,
                api_version="2022-04-01",
            )
            if hasattr(delete_lro, "wait"):
                delete_lro.wait()
            logger.info(
                "Deleted role assignment %s for cluster %s",
                role_assignment_guid,
                self.cluster_name,
            )
        except ResourceNotFoundError:
            logger.debug(
                "Role assignment %s not found during cleanup",
                role_assignment_guid,
            )
        except Exception as e:
            logger.warning(
                "Failed to delete role assignment %s: %s",
                role_assignment_guid,
                e,
            )

    # Remove cached references to deleted shared resources.
    for key in ("msi", "nsg", "subnet"):
        if key in self.provider_config:
            self.provider_config.pop(key, None)

    cache_path = self.provider_config.get("_config_cache_path")
    if cache_path:
        try:
            if os.path.exists(cache_path):
                os.remove(cache_path)
            logger.info(
                "Deleted cached Ray config at %s after resource cleanup",
                cache_path,
            )
        except Exception as e:
            logger.warning(
                "Failed to delete cached Ray config %s: %s", cache_path, e
            )
        finally:
            self.provider_config.pop("_config_cache_path", None)

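For context, the cleanup above is only meant to run once every node is gone. A rough sketch of the intended ordering during ray down (the surrounding teardown function and terminate calls are simplified placeholders, not the exact autoscaler code path):

def teardown_cluster(node_provider, node_ids):
    # 1. Terminate head and worker nodes first so their NICs and VMs stop
    #    referencing the shared subnet, NSG, and MSI.
    for node_id in node_ids:
        node_provider.terminate_node(node_id)

    # 2. Only then delete the shared infrastructure; retry_delete inside
    #    cleanup_cluster_resources absorbs transient "InUse" errors while
    #    Azure finishes deleting NICs asynchronously.
    node_provider.cleanup_cluster_resources()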

medium

This function is quite long and handles the deletion of multiple distinct resource types. To improve readability and maintainability, consider refactoring this into smaller, more focused private methods for each resource type being cleaned up (e.g., _cleanup_msi, _cleanup_subnet, _cleanup_nsg, _cleanup_vnet, and _cleanup_role_assignments). Each of these helper methods could encapsulate the logic for retrieving the resource name, deleting it, and handling specific errors, returning any necessary information (like the MSI principal ID) to the main cleanup_cluster_resources function. This would make the main function a much clearer, high-level sequence of cleanup steps.
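A rough sketch of that structure, using the reviewer's proposed helper names (the helpers themselves are hypothetical and would hold the per-resource logic already shown above):

def cleanup_cluster_resources(self):
    """High-level sequence of cleanup steps; each helper owns the lookup,
    deletion, retry, and error-logging logic for one resource type."""
    msi_principal_id = self._cleanup_msi()
    vnet_name = self._cleanup_subnet()
    self._cleanup_nsg()
    self._cleanup_vnet(vnet_name)
    self._cleanup_role_assignments(msi_principal_id)
    # ...followed by the existing provider_config / cached-config cleanup.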

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels on Oct 10, 2025
Collaborator

jjyao commented Oct 10, 2025

@jackfrancis could you review this one?

@jjyao jjyao added the go (add ONLY when ready to merge, run all tests) label on Oct 10, 2025

Labels

  • azure
  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[Core] [Azure] - Some azure resources are not cleaned up during ray down

2 participants