[UX] Better logging message for operators on the clusters terminated manually in the cloud #2389

Michaelvll · 2023-08-11T18:45:42Z

Note: this only works for spot clusters or clusters with autostop/autodown set up, because the status refresh is not triggered for operations on normal clusters.

Previously:

sky logs test-warning-spot
ValueError: Cluster 'test-warning-spot' does not exist.

Now:

sky logs test-warning-spot
ValueError: Cluster 'test-warning-spot' was preempted or manually terminated in console.
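A minimal sketch of the idea behind the new message, not the actual SkyPilot code. Every name here (`ClusterRecord`, `terminated_on_cloud`, `check_cluster_exists`) is hypothetical. It assumes a status refresh has already recorded that the cluster disappeared from the cloud, and it separates that case from a cluster name that was never known locally:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterRecord:
    """Hypothetical local record of a cluster (not SkyPilot's real schema)."""
    name: str
    # Assumed to be set by a status refresh when the cloud provider no
    # longer reports any instance for this cluster.
    terminated_on_cloud: bool = False


def check_cluster_exists(name: str,
                         records: Dict[str, ClusterRecord]) -> ClusterRecord:
    """Return the record for `name`, or raise ValueError explaining why not."""
    record = records.get(name)
    if record is None:
        # The name was never known locally.
        raise ValueError(f'Cluster {name!r} does not exist.')
    if record.terminated_on_cloud:
        # The cluster was known, but the cloud no longer has it.
        raise ValueError(f'Cluster {name!r} was preempted or manually '
                         'terminated in console.')
    return record
```

The point of the split is that the user sees why the cluster is gone instead of a generic "does not exist".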

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
- sky launch -c test-warning-spot --cpus 2 --cloud gcp --use-spot
- manually terminate the cluster in the console
- sky logs test-warning-spot
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

concretevitamin · 2023-08-11T19:32:17Z

Nice @Michaelvll! QQ: does this cover all ops that auto refresh, e.g., queue as mentioned in the issue?

Michaelvll · 2023-08-11T21:09:12Z

> Nice @Michaelvll! QQ: does this cover all ops that auto refresh, e.g., queue as mentioned in the issue?

Yes, it works for all operations that auto-refresh.

One thing to note is that it only works for clusters with autostop/autodown set up or for spot clusters, as a normal cluster will not trigger the refresh. If a normal cluster is manually terminated in the console, SkyPilot will still try to SSH into it and time out after a while.
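The condition described above can be sketched roughly as follows. This is an illustration under stated assumptions, not SkyPilot's implementation: `ClusterInfo`, its fields, and `triggers_status_refresh` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterInfo:
    """Hypothetical summary of a cluster's settings."""
    name: str
    use_spot: bool = False
    # Idle minutes before autostop/autodown; None means it is not configured.
    autostop_minutes: Optional[int] = None


def triggers_status_refresh(cluster: ClusterInfo) -> bool:
    """Whether an operation on this cluster would refresh its status first.

    Spot clusters can be preempted and autostop/autodown clusters can go
    away on their own, so only those are assumed to be re-checked against
    the cloud. A normal cluster is assumed to still be there.
    """
    return cluster.use_spot or cluster.autostop_minutes is not None
```

Under this sketch, a normal cluster killed in the console is never re-checked, which matches the SSH timeout behavior mentioned above.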

concretevitamin

Nice, thanks @Michaelvll!

concretevitamin · 2023-08-21T15:26:04Z

One potential issue -- I had two clusters, dbg and spot controller. The former was manually killed in console. Then,

» sky queue
Fetching and parsing job queue...
ValueError: Cluster 'dbg' not found on the cloud provider. It was either preempted, autodowned, or manually terminated in console.

» sky queue                     1 ↵
Fetching and parsing job queue...
Failed to get the job queue for cluster 'sky-spot-controller-8a3968f2'.
  sky.exceptions.ClusterNotUpError: Getting the job queue: skipped for cluster 'sky-spot-controller-8a3968f2' (status: STOPPED). It is only allowed for UP clusters.

A multi-cluster command like sky queue raised in the middle rather than processing all clusters. Should we / how should we handle this?

Michaelvll · 2023-08-22T18:01:36Z

> One potential issue -- I had two clusters, dbg and spot controller. The former was manually killed in console. Then,
>
> » sky queue
> Fetching and parsing job queue...
> ValueError: Cluster 'dbg' not found on the cloud provider. It was either preempted, autodowned, or manually terminated in console.
>
> » sky queue                     1 ↵
> Fetching and parsing job queue...
> Failed to get the job queue for cluster 'sky-spot-controller-8a3968f2'.
>   sky.exceptions.ClusterNotUpError: Getting the job queue: skipped for cluster 'sky-spot-controller-8a3968f2' (status: STOPPED). It is only allowed for UP clusters.
>
> A multi-cluster command like sky queue raised in the middle rather than processing all clusters. Should we / how should we handle this?

Thanks for catching this @concretevitamin! We previously forgot to catch the ValueError, which caused the problem. It should now be fixed. : )
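For illustration, the multi-cluster handling could look roughly like the sketch below. It is not the code from this PR: `collect_job_queues` and `get_queue` are hypothetical names, and `ClusterNotUpError` is a local stand-in for the exception shown in the output above. The idea is to catch the per-cluster error, record it, and keep going.

```python
from typing import Callable, Dict, List, Tuple


class ClusterNotUpError(Exception):
    """Local stand-in for the exception type seen in the output above."""


def collect_job_queues(
    cluster_names: List[str],
    get_queue: Callable[[str], List[str]],
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Fetch the job queue of every cluster without aborting on one failure.

    `get_queue` is a hypothetical callable that returns the job list for a
    cluster, or raises ValueError / ClusterNotUpError if it cannot.
    """
    queues: Dict[str, List[str]] = {}
    failures: List[Tuple[str, str]] = []
    for name in cluster_names:
        try:
            queues[name] = get_queue(name)
        except (ValueError, ClusterNotUpError) as e:
            # Record the failure and continue with the remaining clusters.
            failures.append((name, f'{type(e).__name__}: {e}'))
    return queues, failures
```

A caller can then print the collected failures after the successful queues, so one terminated cluster does not hide the rest.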

Michaelvll added 2 commits August 11, 2023 11:35

Show the cluster manually terminated for operators

498080f

Fix message

b12cbb6

concretevitamin reviewed Aug 16, 2023

Michaelvll added 3 commits August 18, 2023 12:47

Better logging

2cf55ba

format

949ad71

Merge branch 'master' of github.com:skypilot-org/skypilot into operator-cluster-terminated

26c1544

Michaelvll requested a review from concretevitamin August 20, 2023 23:48

concretevitamin approved these changes Aug 21, 2023

Michaelvll and others added 2 commits August 22, 2023 10:54

Update sky/backends/backend_utils.py

e90ed38

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

Fix multiple cluster case

c7ee0c8

Michaelvll merged commit 59dc4b4 into master Aug 30, 2023
17 checks passed

Michaelvll deleted the operator-cluster-terminated branch August 30, 2023 07:30

concretevitamin mentioned this pull request Sep 4, 2023

sky queue shows unnecessary error for auto-down cluster #1753

Closed
