
[Spot] Stop the cluster if the cancellation fails #1998

Merged: 9 commits from stop-spot-if-cancel-fails into master on Jun 2, 2023

Conversation

@Michaelvll (Collaborator) commented on May 31, 2023

This PR is an attempt to fix a potential bug when running a multi-node spot job. If the head node is preempted, the worker nodes may still have the job running on them, holding system resources (e.g. GPU memory), and our original cancellation of the job takes no effect because the head node no longer exists. We add a safeguard to stop all the nodes if the cancellation fails.

Previously, the sky.cancel in the recovery_strategy did not take effect even if only a worker node was preempted. To reproduce:

  1. sky launch -c test-4-nodes --cloud aws --use-spot --cpus 2+ sleep 10000
  2. manually terminate a worker node
  3. python -c "import sky; sky.cancel('test-4-nodes', all=True)". The cancel fails, and if we log into the cluster and run ray job list, the job is still running.

To fix this issue, we add _ignore_cluster_aliveness to the sky.cancel function, so that we try our best to cancel the job even if the cluster is only partially up. If the cancellation still fails, the recovery_strategy will stop the cluster to be safe.
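To make the intended behavior concrete, here is a minimal sketch of the safeguard; the function name is hypothetical and the call sites are assumptions for illustration, not the merged code in sky/spot/recovery_strategy.py:

    # Minimal sketch only; _terminate_on_cancel_failure is a hypothetical name.
    import sky

    def _terminate_on_cancel_failure(cluster_name: str) -> None:
        try:
            # Best-effort cancellation, even when the cluster is only
            # partially up (the flag added in this PR).
            sky.cancel(cluster_name, all=True, _ignore_cluster_aliveness=True)
        except Exception as e:  # pylint: disable=broad-except
            # If the cancellation fails (e.g. the head node was preempted),
            # bring the whole cluster down so stray jobs cannot keep holding
            # resources such as GPU memory.
            print(f'Cancellation failed ({e}); terminating {cluster_name}.')
            sky.down(cluster_name)

Note that the commit history below ("terminate instead of stop") indicates the final behavior terminates rather than stops the cluster.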

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky spot launch --num-nodes 4 sleep 1000; kill the head node in the console; the whole cluster is terminated and re-launched.
    • sky spot launch --num-nodes 4 sleep 1000; kill a worker node in the console; the worker node is re-launched, and checking the ray job on the head node shows the previous job is cancelled correctly.
  • All spot smoke tests: pytest tests/test_smoke.py --managed-spot

@concretevitamin (Member) left a comment


Thanks @Michaelvll, some comments. The "stop" in the PR title needs to change.

sky/core.py Outdated
        operation='cancelling jobs',
    )
else:
    _, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
concretevitamin (Member):


Can this return None if the entire spot cluster is preempted?

Michaelvll (Collaborator, Author):


Good point! Just changed the implementation to make it more robust.

            operation='cancelling jobs',
        )
    if not _ignore_cluster_aliveness:
        handle = backend_utils.check_cluster_available(
concretevitamin (Member):


Can we document the difference between check_cluster_available vs. refresh_cluster_status_handle? Perhaps add it to the former's docstr?

Michaelvll (Collaborator, Author):


I re-implemented this part and added a comment below. PTAL.
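
For readers following the thread, a rough sketch of how the two helpers differ, based on the snippets in this conversation; the exact signatures and return values beyond what the diff shows are assumptions:

    from sky import exceptions
    from sky.backends import backend_utils

    cluster_name = 'test-4-nodes'

    # refresh_cluster_status_handle queries the cloud and returns
    # (status, handle) without raising; the handle can be None if the
    # cluster no longer exists (the case raised in the comment above).
    status, handle = backend_utils.refresh_cluster_status_handle(cluster_name)
    if handle is None:
        print(f'{cluster_name} no longer exists.')

    # check_cluster_available raises instead of returning None, e.g.
    # exceptions.ClusterNotUpError when the cluster is down or partially up.
    try:
        handle = backend_utils.check_cluster_available(
            cluster_name, operation='cancelling jobs')
    except exceptions.ClusterNotUpError as e:
        # Per this PR, e.handle may still describe the (partially up) cluster.
        print(f'Cluster is not available: {e}')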

sky/core.py (outdated; review thread resolved)
sky/spot/recovery_strategy.py (review thread resolved)
@concretevitamin (Member) left a comment


LGTM @Michaelvll - please ensure spot tests pass.

f'\n Detailed exception: {e}')
logger.info(
'Failed to cancel the job on the cluster. The cluster '
'might be already down or the head node is preempted. '
concretevitamin (Member):


Suggested change:
- 'might be already down or the head node is preempted. '
+ 'might be already down or the head node is preempted.'

sky/core.py Outdated
    except exceptions.ClusterNotUpError as e:
        if not _ignore_cluster_aliveness:
            raise
        if (not isinstance(e.handle, backends.CloudVmRayResourceHandle) or
concretevitamin (Member):


From the docstr of check_cluster_available() it looks like

        exceptions.NotSupportedError: if the cluster is not based on
          CloudVmRayBackend.

will be thrown, and this check here is not needed?
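
A minimal sketch of the simplification being suggested, assuming check_cluster_available() itself raises exceptions.NotSupportedError for clusters not backed by CloudVmRayBackend; the helper name and flag come from the diff above, and the wrapper function is hypothetical:

    from sky import exceptions
    from sky.backends import backend_utils

    def _cancel_handle(cluster_name: str, ignore_cluster_aliveness: bool):
        try:
            return backend_utils.check_cluster_available(
                cluster_name, operation='cancelling jobs')
        except exceptions.ClusterNotUpError as e:
            if not ignore_cluster_aliveness:
                raise
            # No isinstance check needed here: a non-CloudVmRayBackend cluster
            # would already have raised exceptions.NotSupportedError above.
            return e.handle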

sky/core.py (outdated; review thread resolved)
@Michaelvll Michaelvll merged commit 21fd00b into master Jun 2, 2023
@Michaelvll Michaelvll deleted the stop-spot-if-cancel-fails branch June 2, 2023 08:18
concretevitamin pushed a commit that referenced this pull request Jun 3, 2023
* stop the cluster if the cancellation fails

* Allow cancel for partial cluster

* format

* terminate instead of stop

* format

* rename

* Address coments

* address comments

* format
concretevitamin pushed a commit that referenced this pull request Jun 4, 2023