[Spot] Stop the cluster if the cancellation fails #1998
Conversation
Thanks @Michaelvll, some comments. The "stop" in the PR title needs to change.
sky/core.py
Outdated
        operation='cancelling jobs',
    )
else:
    _, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
Can this return None if the entire spot cluster is preempted?
Good point! Just changed the implementation to make it more robust.
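For illustration, a minimal sketch of the guard being discussed, assuming `refresh_cluster_status_handle()` returns a `None` handle once the whole cluster is gone; names mirror the snippet above, and the exception's constructor arguments are simplified, not the PR's exact code:

```python
# Sketch only: guard against the handle disappearing when the entire
# spot cluster, head node included, has been preempted.
_, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
if handle is None:
    # Nothing is left to cancel on; surface this instead of failing later.
    raise exceptions.ClusterNotUpError(
        f'Cluster {cluster_name!r} is no longer up; nothing to cancel.')
```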
        operation='cancelling jobs',
    )
if not _ignore_cluster_aliveness:
    handle = backend_utils.check_cluster_available(
Can we document the diff between check_cluster_available vs. refresh_cluster_status_handle? Perhaps add it to the former's docstring?
I re-implemented this part and added a comment below. PTAL.
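For reference, a hypothetical docstring along the lines requested; the wording below is illustrative, not the PR's actual text, and the signature is a guess beyond the `operation` argument seen in the call above:

```python
def check_cluster_available(cluster_name: str, operation: str):
    """Check that the cluster is available (UP) before an operation.

    Unlike refresh_cluster_status_handle(), which only refreshes and
    returns the latest (status, handle) pair and leaves the policy to
    the caller, this helper raises when the operation cannot proceed.

    Raises:
        exceptions.ClusterNotUpError: if the cluster is not UP.
        exceptions.NotSupportedError: if the cluster is not based on
            CloudVmRayBackend.
    """
```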
LGTM @Michaelvll - please ensure spot tests pass.
sky/spot/recovery_strategy.py
Outdated
        f'\n  Detailed exception: {e}')
logger.info(
    'Failed to cancel the job on the cluster. The cluster '
    'might be already down or the head node is preempted. '
Suggested change:
- 'might be already down or the head node is preempted. '
+ 'might be already down or the head node is preempted.'
sky/core.py
Outdated
except exceptions.ClusterNotUpError as e:
    if not _ignore_cluster_aliveness:
        raise
    if (not isinstance(e.handle, backends.CloudVmRayResourceHandle) or
From the docstring of check_cluster_available() it looks like

    exceptions.NotSupportedError: if the cluster is not based on
        CloudVmRayBackend.

will be thrown, so this check here is not needed?
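In code, the suggestion amounts to something like the sketch below: if `check_cluster_available()` already raises `NotSupportedError` for clusters not backed by CloudVmRayBackend, the extra isinstance check can be dropped. Names follow the snippets above; this is a sketch, not the PR's final code:

```python
try:
    handle = backend_utils.check_cluster_available(
        cluster_name, operation='cancelling jobs')
except exceptions.ClusterNotUpError as e:
    if not _ignore_cluster_aliveness:
        raise
    # NotSupportedError would already have been raised for a cluster that
    # is not based on CloudVmRayBackend, so e.handle can be used here
    # without re-checking its type.
    handle = e.handle
```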
* stop the cluster if the cancellation fails
* Allow cancel for partial cluster
* format
* terminate instead of stop
* format
* rename
* Address comments
* address comments
* format
This PR is an attempt to fix a potential bug when running a multi-node spot job. If the head node is preempted, the worker nodes may still have the job running there, holding system resources (e.g. GPU memory), and our original cancellation of the job takes no effect because the head node no longer exists. We add a safeguard to stop all the nodes if the cancellation fails.
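A rough sketch of that safeguard; this is illustrative rather than the PR's exact code in sky/spot/recovery_strategy.py (which, per a later commit, terminates rather than stops), and `logger` is assumed to be the module's logger:

```python
try:
    # Best-effort cancellation, even if the cluster is only partially up.
    sky.cancel(cluster_name, all=True, _ignore_cluster_aliveness=True)
except Exception as e:  # pylint: disable=broad-except
    logger.info('Failed to cancel the job on the cluster. The cluster '
                'might be already down or the head node is preempted. '
                f'\n  Detailed exception: {e}')
    # Safeguard: bring every node down so a stale job cannot keep holding
    # resources (e.g. GPU memory) on surviving workers.
    sky.down(cluster_name)
```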
Previously, the `sky.cancel` in the recovery_strategy does not take effect even if the worker node is preempted. To reproduce:

`sky launch -c test-4-nodes --cloud aws --use-spot --cpus 2+ sleep 10000`
`python -c "import sky; sky.cancel('test-4-nodes', all=True)"`

The cancel fails, and if we log into the cluster and run `ray job list`, the job is still running.

To fix this issue, we add `_ignore_cluster_aliveness` to the `sky.cancel` function (see the signature sketch after the test list), so that we try our best to cancel the job even if the cluster is partially up. If the cancellation fails, the recovery_strategy will stop the cluster to be safe.

Tested (run the relevant ones):
* `sky spot launch --num-nodes 4 sleep 1000`; kill the head node in the console; the whole cluster is terminated and re-launched.
* `sky spot launch --num-nodes 4 sleep 1000`; kill the worker node in the console; the worker node is re-launched; checked the ray job on the head node and the previous job is cancelled correctly.
* `pytest tests/test_smoke.py --managed-spot`
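For concreteness, a hypothetical signature for the extended `sky.cancel`: the first three parameters match the public API, while the placement and default of the private flag this PR adds are guesses:

```python
from typing import List, Optional

def cancel(cluster_name: str,
           all: bool = False,
           job_ids: Optional[List[int]] = None,
           _ignore_cluster_aliveness: bool = False) -> None:
    """Cancel jobs on a cluster.

    With _ignore_cluster_aliveness=True, cancellation is attempted on a
    best-effort basis even when the cluster is only partially up.
    """
    ...
```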