Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UX] Better logging message for operators on the clusters terminated manually in the cloud #2389

Merged
merged 7 commits into from
Aug 30, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Aug 11, 2023

Fixes #2298.

Note this only works for the spot clusters or the clusters with autostop/down setup, as the status refresh will not be triggered for the operators on the normal clusters.

Previously:

sky logs test-warning-spot
ValueError: Cluster 'test-warning-spot' does not exist.

Now:

sky logs test-warning-spot
ValueError: Cluster 'test-warning-spot' was preempted or manually terminated in console.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-warning-spot --cpus 2 --cloud gcp --use-spot
    • manually terminate the cluster in the console
    • sky logs test-warning-spot
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@concretevitamin
Copy link
Member

Nice @Michaelvll! QQ: does this cover all ops that auto refresh, e.g., queue as mentioned in the issue?

@Michaelvll
Copy link
Collaborator Author

Nice @Michaelvll! QQ: does this cover all ops that auto refresh, e.g., queue as mentioned in the issue?

Yes, it works for all the operators with the auto refresh.

One thing to note is that it only works for the cluster with autostop/down setup or the spot cluster, as the normal cluster will not trigger the refresh. If a normal cluster is terminated in the console manually, the skypilot will still try to ssh into it and timeout after a while.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks @Michaelvll!

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
@concretevitamin
Copy link
Member

One potential issue -- I had two clusters, dbg and spot controller. The former was manually killed in console. Then,

» sky queue
Fetching and parsing job queue...
ValueError: Cluster 'dbg' not found on the cloud provider. It was either preempted, autodowned, or manually terminated in console.

» sky queue                     1 ↵
Fetching and parsing job queue...
Failed to get the job queue for cluster 'sky-spot-controller-8a3968f2'.
  sky.exceptions.ClusterNotUpError: Getting the job queue: skipped for cluster 'sky-spot-controller-8a3968f2' (status: STOPPED). It is only allowed for UP clusters.

A multi-cluster command like sky queue raised in the middle rather than processing all clusters. Should we / how should we handle this?

Michaelvll and others added 2 commits August 22, 2023 10:54
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@Michaelvll
Copy link
Collaborator Author

One potential issue -- I had two clusters, dbg and spot controller. The former was manually killed in console. Then,

» sky queue
Fetching and parsing job queue...
ValueError: Cluster 'dbg' not found on the cloud provider. It was either preempted, autodowned, or manually terminated in console.

» sky queue                     1 ↵
Fetching and parsing job queue...
Failed to get the job queue for cluster 'sky-spot-controller-8a3968f2'.
  sky.exceptions.ClusterNotUpError: Getting the job queue: skipped for cluster 'sky-spot-controller-8a3968f2' (status: STOPPED). It is only allowed for UP clusters.

A multi-cluster command like sky queue raised in the middle rather than processing all clusters. Should we / how should we handle this?

Thanks for catching this @concretevitamin! We previously forgot to catch the ValueError, which causes the problem. It should now be fixed. : )

@Michaelvll Michaelvll merged commit 59dc4b4 into master Aug 30, 2023
17 checks passed
@Michaelvll Michaelvll deleted the operator-cluster-terminated branch August 30, 2023 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ops with auto-refresh should warn about clusters terminated on the cloud
2 participants