[Spot] Stop the cluster if the cancellation fails #1998
Conversation
Thanks @Michaelvll, some comments. The "stop" in the PR title needs to change.
sky/core.py
Outdated
        operation='cancelling jobs',
    )
else:
    _, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
Can this return None if the entire spot cluster is preempted?
Good point! Just changed the implementation to make it more robust.
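For illustration, a minimal sketch of the guard being discussed, assuming `refresh_cluster_status_handle()` returns a `None` handle once the whole cluster is gone; names mirror the snippet above, and the exception's constructor arguments are simplified, not the PR's exact code:

```python
# Sketch only: guard against the handle disappearing when the entire
# spot cluster, head node included, has been preempted.
_, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
if handle is None:
    # Nothing is left to cancel on; surface this instead of failing later.
    raise exceptions.ClusterNotUpError(
        f'Cluster {cluster_name!r} is no longer up; nothing to cancel.')
```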
        operation='cancelling jobs',
    )
if not _ignore_cluster_aliveness:
    handle = backend_utils.check_cluster_available(
Can we document the diff between check_cluster_available vs. refresh_cluster_status_handle? Perhaps add it to the former's docstring?
I re-implemented this part and added a comment below. PTAL.
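For reference, a hypothetical docstring along the lines requested; the wording below is illustrative, not the PR's actual text, and the signature is a guess beyond the `operation` argument seen in the call above:

```python
def check_cluster_available(cluster_name: str, operation: str):
    """Check that the cluster is available (UP) before an operation.

    Unlike refresh_cluster_status_handle(), which only refreshes and
    returns the latest (status, handle) pair and leaves the policy to
    the caller, this helper raises when the operation cannot proceed.

    Raises:
        exceptions.ClusterNotUpError: if the cluster is not UP.
        exceptions.NotSupportedError: if the cluster is not based on
            CloudVmRayBackend.
    """
```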
LGTM @Michaelvll - please ensure spot tests pass.
sky/spot/recovery_strategy.py
Outdated
        f'\n  Detailed exception: {e}')
logger.info(
    'Failed to cancel the job on the cluster. The cluster '
    'might be already down or the head node is preempted. '
Suggested change:
- 'might be already down or the head node is preempted. '
+ 'might be already down or the head node is preempted.'
sky/core.py
Outdated
except exceptions.ClusterNotUpError as e:
    if not _ignore_cluster_aliveness:
        raise
    if (not isinstance(e.handle, backends.CloudVmRayResourceHandle) or
From the docstring of check_cluster_available() it looks like

    exceptions.NotSupportedError: if the cluster is not based on
        CloudVmRayBackend.

will be thrown, so this check here is not needed?
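In code, the suggestion amounts to something like the sketch below: if `check_cluster_available()` already raises `NotSupportedError` for clusters not backed by CloudVmRayBackend, the extra isinstance check can be dropped. Names follow the snippets above; this is a sketch, not the PR's final code:

```python
try:
    handle = backend_utils.check_cluster_available(
        cluster_name, operation='cancelling jobs')
except exceptions.ClusterNotUpError as e:
    if not _ignore_cluster_aliveness:
        raise
    # NotSupportedError would already have been raised for a cluster that
    # is not based on CloudVmRayBackend, so e.handle can be used here
    # without re-checking its type.
    handle = e.handle
```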
* stop the cluster if the cancellation fails
* Allow cancel for partial cluster
* format
* terminate instead of stop
* format
* rename
* Address comments
* address comments
* format
This PR is an attempt to fix a potential bug when running a multi-node spot job. If the head node is preempted, the worker nodes may still have the job running there, holding system resources (e.g. GPU memory), and our original cancellation of the job takes no effect because the head node no longer exists. We add a safeguard to stop all the nodes if the cancellation fails.
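A rough sketch of that safeguard; this is illustrative rather than the PR's exact code in sky/spot/recovery_strategy.py (which, per a later commit, terminates rather than stops), and `logger` is assumed to be the module's logger:

```python
try:
    # Best-effort cancellation, even if the cluster is only partially up.
    sky.cancel(cluster_name, all=True, _ignore_cluster_aliveness=True)
except Exception as e:  # pylint: disable=broad-except
    logger.info('Failed to cancel the job on the cluster. The cluster '
                'might be already down or the head node is preempted. '
                f'\n  Detailed exception: {e}')
    # Safeguard: bring every node down so a stale job cannot keep holding
    # resources (e.g. GPU memory) on surviving workers.
    sky.down(cluster_name)
```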
Previously, the `sky.cancel` in the recovery_strategy does not take effect even if the worker node is preempted. To reproduce:

`sky launch -c test-4-nodes --cloud aws --use-spot --cpus 2+ sleep 10000`
`python -c "import sky; sky.cancel('test-4-nodes', all=True)"`

The cancel fails, and if we log into the cluster and run `ray job list`, the job is still running.

To fix this issue, we add `_ignore_cluster_aliveness` to the `sky.cancel` function (see the signature sketch after the test list), so that we try our best to cancel the job even if the cluster is partially up. If the cancellation fails, the recovery_strategy will stop the cluster to be safe.

Tested (run the relevant ones):
* `sky spot launch --num-nodes 4 sleep 1000`; kill the head node in the console; the whole cluster is terminated and re-launched.
* `sky spot launch --num-nodes 4 sleep 1000`; kill the worker node in the console; the worker node is re-launched; checked the ray job on the head node and the previous job is cancelled correctly.
* `pytest tests/test_smoke.py --managed-spot`
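For concreteness, a hypothetical signature for the extended `sky.cancel`: the first three parameters match the public API, while the placement and default of the private flag this PR adds are guesses:

```python
from typing import List, Optional

def cancel(cluster_name: str,
           all: bool = False,
           job_ids: Optional[List[int]] = None,
           _ignore_cluster_aliveness: bool = False) -> None:
    """Cancel jobs on a cluster.

    With _ignore_cluster_aliveness=True, cancellation is attempted on a
    best-effort basis even when the cluster is only partially up.
    """
    ...
```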