-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Support launching a new cluster from an existing cluster's disk #2098
Conversation
1b0136a
to
7a4d629
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome @Michaelvll! Leaving some comments first, only a partial pass.
Some UX notes:
- I tried to pass a smaller disk size than an existing cluster
» sky launch --clone dbg -c smoke --region us-west2 --cloud gcp --disk-size 50 --cpus 4
...
Launching a new cluster 'smoke' from the disk of 'dbg'. Proceed? [Y/n]:
<spinner -- creating image>
Image 'projects/skypilot-xxxx/global/images/skypilot-dbg-xxxxxxxx' for 'dbg' created successfully on GCP. Overriding task's image_id.
ValueError: Image 'projects/skypilot-xxxx/global/images/skypilot-dbg-xxxxxxxx' is 256.0GB, which is larger than the specified disk_size: 50 GB. Please specify a larger disk_size to use this image.
The error shows up only after the image creation (which is now leaked). Can we pre-check?
Btw, this also makes me wonder what _is_image_managed
means. The above ended up with an error - did we remove the auto-created image managed by us?
- A successful clone:
Launching a new cluster 'smoke' from the disk of 'dbg'. Proceed? [Y/n]:
Image 'projects/skypilot-375900/global/images/skypilot-dbg-1687454357' for 'dbg' created successfully on GCP. Overriding task's image_id.
The second line probably should have a logging prefix, otherwise easy to miss. Also, green the line?
An odd behavior - I think I did the following:
A day after,
On GCP console the image is indeed not there. Could it be |
Good catch! It was because we called the |
fix fix better logging mypy
rename func Add image id in the message format longer wait time
7230966
to
6df687d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll! Did a pass, some comments.
f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} ' | ||
f'{bold}sky.spot_launch(){reset}.') | ||
|
||
if Stage.CLONE_DISK in stages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the 3 if clauses should be placed under an "if not cluster_exists:". Not feeling strongly, just felt that may be more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we can move the CLONE_DISK stage up before the if above, so that we can still have the two original if in if not cluster_exists
. The downside with this is that, it can be a little bit confusing when the order of the stages should be CLONE_DISK -> OPTIMIZE -> PROVISION ...
, but now it is easier to read it as CLONE_DISK -> PROVISION -> OPTIMIZE -> PROVISION...
.
stream_logs=False) | ||
|
||
instance_ids = json.loads(stdout.strip()) | ||
if len(instance_ids) != 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Although we checked the existance of the cluster in the caller. There might be some case where user manually terminates the original cluster, causing an issue. Added a condition for it.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…to clone-disk-from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for solving a longstanding issue from GPU unavailability @Michaelvll!
@@ -103,8 +103,22 @@ AWS accounts can be attached with a policy that limits the permissions of the ac | |||
] | |||
} | |||
|
|||
4. Click **Next: Tags** and follow the instructions to finish creating the policy. You can give the policy a descriptive name, such as ``minimal-skypilot-policy``. | |||
5. Go back to the previous window and click on the refresh button, and you can now search for the policy you just created. | |||
4. **Optional**: To allow ``--clone-disk-from`` to work, you need to add the following permissions to the policy above as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4. **Optional**: To allow ``--clone-disk-from`` to work, you need to add the following permissions to the policy above as well. | |
4. **Optional**: To allow ``sky launch --clone-disk-from`` to work, you need to add the following permissions to the policy above as well. |
I think it may be ok to leave this as a comment, to see if the feature is more widely used first. OK either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I tends to have it in the doc, as this doc page is not target for normal users but for admins, i.e., there is limited downside/burden this new entry will bring to a normal user. And it is easier to point our pilot user to this page, when they want to use this feature.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…to clone-disk-from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested:
-
pytest tests/test_smoke.py::test_clone_disk_aws --aws
-
pytest tests/test_smoke.py::test_clone_disk_gcp
-
pytest tests/test_smoke.py
-
pytest tests/test_smoke.py --aws
Fixes #2050. This PR only supports AWS and GCP as an experimental feature and unblocks the users.
It is blocked by #2087sky launch --clone-disk-from existing-cluster -c new-cluster --cloud aws --region us-west-2 task.yaml
A problem with the image copying is that the copying can take significant of time.
TODOs:
Tested (run the relevant ones):
bash format.sh
sky launch -c min --cloud aws --cpus 2 --region us-east-1
;ssh min
;echo 'hi' > user_file.txt
sky launch --clone min -c new-min --cloud aws --region us-west-2
sky launch --clone min -c new-min2
sky down new-min new-min2
sky launch -c min --cloud gcp --cpus 2 --region us-central1
;ssh min
;echo 'hi' > user_file.txt
sky launch --clone min -c new-min --cloud gcp --region us-east1 --disk-size 50
sky launch --clone min -c new-min --cloud gcp --region us-east1
sky launch --clone min -c new-min2
sky down new-min new-min2
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh