Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support launching a new cluster from an existing cluster's disk #2098

Merged
merged 37 commits into from
Jun 28, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jun 16, 2023

Fixes #2050. This PR only supports AWS and GCP as an experimental feature and unblocks the users.

It is blocked by #2087

sky launch --clone-disk-from existing-cluster -c new-cluster --cloud aws --region us-west-2 task.yaml

A problem with the image copying is that the copying can take significant of time.

TODOs:

  • Support GCP
  • Clean up the image after the target cluster being tear down.
  • docs for additional permissions required

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • AWS (with both admin and minimal permission with the new permissions):
    • sky launch -c min --cloud aws --cpus 2 --region us-east-1; ssh min; echo 'hi' > user_file.txt
    • sky launch --clone min -c new-min --cloud aws --region us-west-2
    • sky launch --clone min -c new-min2
    • sky down new-min new-min2
    • GCP:
    • sky launch -c min --cloud gcp --cpus 2 --region us-central1; ssh min; echo 'hi' > user_file.txt
    • sky launch --clone min -c new-min --cloud gcp --region us-east1 --disk-size 50
    • sky launch --clone min -c new-min --cloud gcp --region us-east1
    • sky launch --clone min -c new-min2
    • sky down new-min new-min2
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll added the blocked PR blocked by other issues label Jun 16, 2023
@Michaelvll Michaelvll marked this pull request as draft June 16, 2023 22:54
@Michaelvll Michaelvll marked this pull request as draft June 16, 2023 22:54
@Michaelvll Michaelvll marked this pull request as ready for review June 16, 2023 23:13
@Michaelvll Michaelvll removed the blocked PR blocked by other issues label Jun 19, 2023
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @Michaelvll! Leaving some comments first, only a partial pass.

Some UX notes:

  1. I tried to pass a smaller disk size than an existing cluster
» sky launch --clone dbg -c smoke --region us-west2 --cloud gcp --disk-size 50 --cpus 4
...
Launching a new cluster 'smoke' from the disk of 'dbg'. Proceed? [Y/n]:
<spinner -- creating image>
Image 'projects/skypilot-xxxx/global/images/skypilot-dbg-xxxxxxxx' for 'dbg' created successfully on GCP. Overriding task's image_id.
ValueError: Image 'projects/skypilot-xxxx/global/images/skypilot-dbg-xxxxxxxx' is 256.0GB, which is larger than the specified disk_size: 50 GB. Please specify a larger disk_size to use this image.

The error shows up only after the image creation (which is now leaked). Can we pre-check?

Btw, this also makes me wonder what _is_image_managed means. The above ended up with an error - did we remove the auto-created image managed by us?

  1. A successful clone:

Launching a new cluster 'smoke' from the disk of 'dbg'. Proceed? [Y/n]:
Image 'projects/skypilot-375900/global/images/skypilot-dbg-1687454357' for 'dbg' created successfully on GCP. Overriding task's image_id.

The second line probably should have a logging prefix, otherwise easy to miss. Also, green the line?

sky/cli.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/utils/subprocess_utils.py Outdated Show resolved Hide resolved
@concretevitamin
Copy link
Member

An odd behavior - I think I did the following:

sky launch --clone dbg -c smoke --region us-west2 --cloud gcp --cpus 4  # Success.
sky down dbg

A day after,

sky cpunode -c smoke -i120
...
ValueError: Image 'projects/skypilot-12345/global/images/skypilot-dbg-xxxxx54357' not found in GCP.

On GCP console the image is indeed not there. Could it be sky down dbg deleted the image?

@Michaelvll
Copy link
Collaborator Author

Good catch! It was because we called the post_teardown_cleanup during the status refresh, which previously will delete image when the cluster is stopped. We should not delete the image when stopping the cluster. Fixed in the latest commit.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! Did a pass, some comments.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} '
f'{bold}sky.spot_launch(){reset}.')

if Stage.CLONE_DISK in stages:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the 3 if clauses should be placed under an "if not cluster_exists:". Not feeling strongly, just felt that may be more readable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we can move the CLONE_DISK stage up before the if above, so that we can still have the two original if in if not cluster_exists. The downside with this is that, it can be a little bit confusing when the order of the stages should be CLONE_DISK -> OPTIMIZE -> PROVISION ..., but now it is easier to read it as CLONE_DISK -> PROVISION -> OPTIMIZE -> PROVISION....

sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/aws.py Outdated Show resolved Hide resolved
sky/clouds/aws.py Show resolved Hide resolved
stream_logs=False)

instance_ids = json.loads(stdout.strip())
if len(instance_ids) != 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be 0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Although we checked the existance of the cluster in the caller. There might be some case where user manually terminates the original cluster, causing an issue. Added a condition for it.

sky/utils/subprocess_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for solving a longstanding issue from GPU unavailability @Michaelvll!

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
@@ -103,8 +103,22 @@ AWS accounts can be attached with a policy that limits the permissions of the ac
]
}

4. Click **Next: Tags** and follow the instructions to finish creating the policy. You can give the policy a descriptive name, such as ``minimal-skypilot-policy``.
5. Go back to the previous window and click on the refresh button, and you can now search for the policy you just created.
4. **Optional**: To allow ``--clone-disk-from`` to work, you need to add the following permissions to the policy above as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
4. **Optional**: To allow ``--clone-disk-from`` to work, you need to add the following permissions to the policy above as well.
4. **Optional**: To allow ``sky launch --clone-disk-from`` to work, you need to add the following permissions to the policy above as well.

I think it may be ok to leave this as a comment, to see if the feature is more widely used first. OK either way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I tends to have it in the doc, as this doc page is not target for normal users but for admins, i.e., there is limited downside/burden this new entry will bring to a normal user. And it is easier to point our pilot user to this page, when they want to use this feature.

docs/source/cloud-setup/cloud-permissions/gcp.rst Outdated Show resolved Hide resolved
Copy link
Collaborator Author

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested:

  • pytest tests/test_smoke.py::test_clone_disk_aws --aws
  • pytest tests/test_smoke.py::test_clone_disk_gcp
  • pytest tests/test_smoke.py
  • pytest tests/test_smoke.py --aws

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support migration of a stopped cluster to another region
2 participants