[ux] cache cluster status of autostop or spot clusters for 2s #4332

Merged
5 commits merged into skypilot-org:master from cache-cluster-status on Nov 15, 2024

Conversation

@cg505 (Collaborator) commented on Nov 11, 2024:

Previously, every time we wanted the status of a cluster with spot VMs or with autostop set, we would fetch the latest actual status from the cloud. This is needed because such clusters can be terminated from "outside", so the state in the local state database may be out of date.

However, we often end up fetching the status multiple times in a single invocation. For instance, sky launch checks the status in cli.py, then again almost immediately afterwards as part of the provisioning code path.

To mitigate this, we keep track of the last time we fetched the status from the cloud. If that fetch was within the past 2 seconds, we assume it is still accurate (that is, that the cluster has not been terminated or stopped since then).
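
Roughly, the check looks like the sketch below. The helper name and constant are illustrative rather than the actual SkyPilot code; the real logic lives in sky/backends/backend_utils.py and uses the new status_updated_at column in the clusters table.

    import time

    # Illustrative constant; the real name and value live in the implementation.
    _CLUSTER_STATUS_CACHE_DURATION_SECONDS = 2

    def _use_cached_status(record: dict) -> bool:
        """Return True if the locally stored status is fresh enough to trust."""
        status_updated_at = record.get('status_updated_at')
        if status_updated_at is None:
            # Never refreshed from the cloud (e.g. a pre-upgrade record).
            return False
        return (time.time() - status_updated_at <
                _CLUSTER_STATUS_CACHE_DURATION_SECONDS)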

Caveats:

  • The updated-timestamp check/set is not atomic, so if multiple parallel invocations check the status, they may all see that it is out of date and then all try to refresh it.
    • Edit: fixed this in the latest version (see the sketch after this list).
    • This is equivalent to the current behavior, but the optimization would not take effect in this case.
  • It is possible that the cluster is terminated or stopped in the 2 seconds between the cached status fetch and our use of it. This could cause further operations (e.g. a job launch) to fail and potentially crash SkyPilot.
    • This race is already possible in master, since there is always some delay between when we check the status and when we launch a job or do whatever else we want to do with the cluster. But the window for a potential race is now widened by up to 2 seconds.
    • This could be fixed by changing the status check to also signal an "intent to use" the cluster, which would reset the idle time when it fetches the status (atomically).
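
For the first caveat, a minimal sketch of the atomic variant, assuming a per-cluster file lock (the helper names and lock path are hypothetical; the actual fix is in sky/backends/backend_utils.py):

    import filelock

    def _refresh_cluster_status_atomically(cluster_name, get_local_record,
                                           fetch_from_cloud):
        # Hypothetical helpers: get_local_record reads the clusters table;
        # fetch_from_cloud queries the cloud and writes back status_updated_at.
        lock_path = f'/tmp/.sky_status_refresh_{cluster_name}.lock'  # illustrative
        with filelock.FileLock(lock_path):
            record = get_local_record(cluster_name)
            # Re-check freshness while holding the lock, so parallel invocations
            # that queued behind an in-flight refresh reuse its result instead
            # of hitting the cloud again.
            if _use_cached_status(record):
                return record
            return fetch_from_cloud(cluster_name)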

Performance impact:

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually used sky launch --fast on many autostop clusters to try to make it fail.
  • All smoke tests: pytest tests/test_smoke.py
  • pytest tests/test_smoke.py --managed-jobs
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 marked this pull request as draft November 13, 2024 01:07
@cg505 cg505 marked this pull request as ready for review November 13, 2024 23:27
@romilbhardwaj (Collaborator) left a comment:

Thanks @cg505!

sky/backends/backend_utils.py (outdated review thread, resolved)
sky/backends/backend_utils.py (outdated review thread, resolved)
Comment on lines 2030 to 2035
if 0 <= cluster_status_lock_timeout < time.perf_counter(
) - start_time:
    logger.debug(
        'Refreshing status: Failed get the lock for cluster '
        f'{cluster_name!r}. Using the cached status.')
    return record
@romilbhardwaj (Collaborator) commented:

Maybe I misread something, but doesn't the presence of this if block violate correctness?

E.g., if two concurrent requests (r1 and r2) come in, and r1 acquires the lock and takes a long time to refresh, shouldn't r2 wait for r1 to complete, especially since _must_refresh_cluster_status was previously evaluated to True?

@cg505 (Collaborator, Author) replied:

You're correct, but this is the current behavior. It allows us to at least get something back in the case where, e.g., a long-running provision is holding the lock.
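
In other words, the pattern is roughly the sketch below (illustrative only; the actual retry loop in sky/backends/backend_utils.py differs in its details):

    import time
    import filelock

    def _query_with_lock_timeout(cluster_name, lock_path, cached_record,
                                 cluster_status_lock_timeout, refresh):
        # Keep retrying the per-cluster lock; once the configured timeout has
        # elapsed, fall back to the cached record rather than blocking forever
        # behind, e.g., a long-running provision that holds the lock.
        start_time = time.perf_counter()
        while True:
            try:
                with filelock.FileLock(lock_path, timeout=1):
                    return refresh(cluster_name)
            except filelock.Timeout:
                pass
            if 0 <= cluster_status_lock_timeout < time.perf_counter() - start_time:
                return cached_record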

@romilbhardwaj (Collaborator) replied:

I see - can we add a quick comment noting this behavior?

sky/backends/backend_utils.py (outdated review thread, resolved)
Comment on lines +620 to +623
'select name, launched_at, handle, last_use, status, autostop, '
'metadata, to_down, owner, cluster_hash, storage_mounts_metadata, '
'cluster_ever_up, status_updated_at from clusters '
'order by launched_at desc').fetchall()
@romilbhardwaj (Collaborator) commented:

Any reason to move away from *?

@cg505 (Collaborator, Author) replied on Nov 15, 2024:

It is robust against column order.

Edit: to expand on this, there is no guarantee about the order of columns that select * will return. For instance, if you and I are both developing features that each add a column, and we then merge both changes, this will break: each of us will have added our own new column before the other's, so the order of the columns in our global state DBs will differ.

I've already hit this bug a few times between this change, #4289, and the stuff we were testing yesterday.
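
To make the failure mode concrete, here is a toy sqlite3 example (not SkyPilot code) in which the same two columns were added in a different order on two branches:

    import sqlite3

    # Two state databases whose clusters tables gained the same two columns in
    # a different order (e.g. after merging two branches).
    db_a = sqlite3.connect(':memory:')
    db_a.execute('create table clusters (name text, status text, '
                 'status_updated_at integer, my_col text)')
    db_a.execute('insert into clusters values (?, ?, ?, ?)',
                 ('c1', 'UP', 1731600000, 'x'))

    db_b = sqlite3.connect(':memory:')
    db_b.execute('create table clusters (name text, status text, '
                 'my_col text, status_updated_at integer)')
    db_b.execute('insert into clusters values (?, ?, ?, ?)',
                 ('c1', 'UP', 'x', 1731600000))

    # select * returns columns in table-definition order, so positional
    # unpacking of the same logical row differs between the two databases:
    print(db_a.execute('select * from clusters').fetchone())  # (..., 1731600000, 'x')
    print(db_b.execute('select * from clusters').fetchone())  # (..., 'x', 1731600000)

    # Naming the columns pins the order regardless of schema history:
    query = 'select name, status, status_updated_at from clusters'
    print(db_a.execute(query).fetchone())  # ('c1', 'UP', 1731600000)
    print(db_b.execute(query).fetchone())  # ('c1', 'UP', 1731600000)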

@cg505 (Collaborator, Author) added:

See also #4211, same class of issues.

@romilbhardwaj (Collaborator) replied:

I see. Is backward compatibility still preserved? E.g., if a cluster was launched before this PR and state.db doesn't contain status_updated_at, but on upgrading to this branch this line tries to select status_updated_at, will that work?

(I think it should still work because create_table is called at module initialization, but just want to double check).

@cg505 (Collaborator, Author) replied:

Yes, I believe that should be fine. In fact, even with select *, this method would crash if status_updated_at were missing, because we would not have enough columns to unpack.
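
The usual backward-compatibility pattern looks roughly like the sketch below; the helper name and wiring here are hypothetical, while the real migration happens when create_table is called at module initialization, as noted above.

    import sqlite3

    def _add_column_if_missing(conn: sqlite3.Connection, table: str,
                               column: str, column_type: str) -> None:
        # Add a column that an older state.db lacks, so explicit column lists
        # (and positional unpacking) keep working after an upgrade.
        existing = [row[1] for row in conn.execute(f'PRAGMA table_info({table})')]
        if column not in existing:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN {column} {column_type}')
            conn.commit()

    # e.g. _add_column_if_missing(conn, 'clusters', 'status_updated_at', 'INTEGER')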

@romilbhardwaj (Collaborator) left a comment:

Thanks @cg505! Left some minor comments on previous threads, but otherwise lgtm!

@cg505 cg505 enabled auto-merge November 15, 2024 23:27
@cg505 cg505 added this pull request to the merge queue Nov 15, 2024
Merged via the queue into skypilot-org:master with commit 88813ce Nov 15, 2024
20 checks passed
@cg505 cg505 deleted the cache-cluster-status branch November 15, 2024 23:35