
[Core] Support launching a new cluster from an existing cluster's disk #2098
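
For context, a minimal usage sketch of the feature under review (the flag name follows the diffs below; cluster and task names are illustrative):

    # The source cluster must be stopped before its disk can be cloned.
    sky stop my-cluster
    # Launch a new cluster whose disk is cloned from the stopped cluster.
    sky launch -c my-cluster-copy --clone-disk-from my-cluster task.yaml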

Merged Jun 28, 2023 · 37 commits · Changes shown from 6 commits

Commits
f447e0f
wip
Michaelvll Jun 15, 2023
419a002
delete the correct image_id
Michaelvll Jun 16, 2023
38d4693
not support migrate to existing cluster
Michaelvll Jun 16, 2023
5f003ee
Add run and retry
Michaelvll Jun 17, 2023
3360d34
Remove image after cluster is removed
Michaelvll Jun 17, 2023
c12fda0
make clone available for GCP
Michaelvll Jun 18, 2023
5c2c3db
fix image deletion
Michaelvll Jun 18, 2023
9cd39de
add permission docs
Michaelvll Jun 18, 2023
8625d7f
add permissions for the cloning in GCP
Michaelvll Jun 18, 2023
9c2b965
rename the feature
Michaelvll Jun 19, 2023
9091d92
format
Michaelvll Jun 19, 2023
b8f2c88
Address comments
Michaelvll Jun 23, 2023
14a3a99
format
Michaelvll Jun 23, 2023
2b1e2be
format
Michaelvll Jun 23, 2023
6df687d
only delete image when termination happens
Michaelvll Jun 24, 2023
ce6bf86
fix
Michaelvll Jun 25, 2023
cf0f066
Merge branch 'master' of github.com:skypilot-org/skypilot into clone-…
Michaelvll Jun 26, 2023
2651b18
address comments
Michaelvll Jun 27, 2023
f3776f7
Merge branch 'master' of github.com:skypilot-org/skypilot into clone-…
Michaelvll Jun 27, 2023
f1c98f7
Update sky/resources.py
Michaelvll Jun 27, 2023
4a778d9
fix docs
Michaelvll Jun 27, 2023
20e7da5
Merge branch 'clone-disk-from' of github.com:skypilot-org/skypilot in…
Michaelvll Jun 27, 2023
30205fd
comment
Michaelvll Jun 27, 2023
4427e47
address
Michaelvll Jun 27, 2023
e8f2bd6
Update sky/backends/backend_utils.py
Michaelvll Jun 27, 2023
2073c70
stack trace
Michaelvll Jun 27, 2023
b549691
Merge branch 'clone-disk-from' of github.com:skypilot-org/skypilot in…
Michaelvll Jun 27, 2023
8b2a94a
Merge branch 'master' of github.com:skypilot-org/skypilot into clone-…
Michaelvll Jun 27, 2023
3514029
doc numerating
Michaelvll Jun 27, 2023
8d9baa3
revert unnecessary changes
Michaelvll Jun 27, 2023
b5276bd
add smoke tests
Michaelvll Jun 27, 2023
254de30
fix test
Michaelvll Jun 27, 2023
e1e53e7
longer timeout
Michaelvll Jun 28, 2023
7b85601
sleep before image creation
Michaelvll Jun 28, 2023
ee59b10
longer wait time
Michaelvll Jun 28, 2023
4954fcb
longer wait time
Michaelvll Jun 28, 2023
f6c77a5
fix azure test
Michaelvll Jun 28, 2023
2 changes: 1 addition & 1 deletion docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -39,7 +39,7 @@ You can grant those accesses via GCP's `IAM & Admin console <https://console.clo
Minimal Permissions
-----------------------

The :ref:`Medium Permissions <medium-permissions>` assigns admin permissions for some GCP services to the user. If you would like to grant finer-grained and more minimal permissions to your users in your organization / project, you can create a custom role by following the steps below:
The :ref:`Medium Permissions <gcp-medium-permissions>` assigns admin permissions for some GCP services to the user. If you would like to grant finer-grained and more minimal permissions to your users in your organization / project, you can create a custom role by following the steps below:

User
~~~~~~~~~~~~
14 changes: 10 additions & 4 deletions sky/backends/backend_utils.py
@@ -1725,6 +1725,7 @@ def check_can_clone_disk_and_override_task(
The task to use and the resource handle of the source cluster.

Raises:
ValueError: If the source cluster does not exist.
exceptions.NotSupportedError: If the source cluster is not valid or the
task is not compatible to clone disk from the source cluster.
"""
@@ -1742,8 +1743,9 @@
if source_cluster_status != status_lib.ClusterStatus.STOPPED:
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
f'Cannot clone disk from cluster {source_cluster_status!r}. Please stop the '
f'cluster first: sky stop {cluster_name}.')
f'Cannot clone disk from cluster {cluster_name!r} '
f'({source_cluster_status!r}). Please stop the '
f'cluster first: sky stop {cluster_name}')

if target_cluster_name is not None:
target_cluster_status, _ = refresh_cluster_status_handle(
@@ -1752,15 +1754,19 @@
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
f'The target cluster {target_cluster_name!r} already exists. Cloning '
'disk is only supported when creating a new cluster.')
'disk is only supported when creating a new cluster. To fix: specify '
'a new target cluster name.')

assert len(task.resources) == 1, task.resources
task_resources = list(task.resources)[0]
if handle.launched_resources.disk_size > task_resources.disk_size:
# The target cluster's disk should be at least as large as the source.
with ux_utils.print_exception_no_traceback():
target_cluster_name_str = f' {target_cluster_name!r}'
if target_cluster_name is None:
target_cluster_name_str = ''
raise exceptions.NotSupportedError(
f'The target cluster {target_cluster_name!r} should have a disk size '
f'The target cluster {target_cluster_name_str} should have a disk size '
f'of at least {handle.launched_resources.disk_size} GB to clone the '
f'disk from {cluster_name!r}.')
override_param = {}
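
A minimal sketch of the validation rules enforced above, as a standalone function (a simplified assumption, not the real helper's signature):

from typing import Optional

def can_clone_disk(source_status: str, target_exists: bool,
                   source_disk_gb: int, target_disk_gb: int) -> Optional[str]:
    """Return None if cloning is allowed, else a human-readable error."""
    if source_status != 'STOPPED':
        # The source cluster must be stopped first (sky stop <name>).
        return 'source cluster must be stopped before cloning its disk'
    if target_exists:
        # Cloning is only supported when creating a new cluster.
        return 'target cluster already exists; specify a new cluster name'
    if target_disk_gb < source_disk_gb:
        # The target disk must be at least as large as the source disk.
        return f'target disk must be at least {source_disk_gb} GB'
    return None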
11 changes: 6 additions & 5 deletions sky/backends/cloud_vm_ray_backend.py
@@ -3577,9 +3577,9 @@ def post_teardown_cleanup(self,
cluster_name=handle.cluster_name,
stdout=tpu_stdout,
stderr=tpu_stderr))
if handle.launched_resources.is_image_managed and terminate:
# Delete the image when terminating the cluster and the image is
# created by SkyPilot (--clone-disk-from)
if (terminate and handle.launched_resources.is_image_managed is True):
# Delete the image when terminating a "cloned" cluster, i.e.,
# whose image is created by SkyPilot (--clone-disk-from)
logger.debug(f'Deleting image {handle.launched_resources.image_id}')
cluster_resources = handle.launched_resources
cluster_cloud = cluster_resources.cloud
@@ -3591,8 +3591,9 @@
cluster_cloud.delete_image(image_id,
handle.launched_resources.region)
except exceptions.CommandError as e:
logger.debug(
f'Failed to delete image {image_id}: '
logger.warning(
f'Failed to delete cloned image {image_id}. Please '
'remove it manually to avoid image leakage. Details: '
f'{common_utils.format_exception(e, use_bracket=True)}')

# The cluster file must exist because the cluster_yaml will only
65 changes: 53 additions & 12 deletions sky/clouds/aws.py
@@ -689,7 +689,48 @@ def _get_disk_specs(cls, disk_tier: Optional[str]) -> Dict[str, Any]:
}

@classmethod
def _query_instance_property_and_retry(
def check_quota_available(cls,
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
"""Check if AWS quota is available for `instance_type` in `region`.

AWS-specific implementation of check_quota_available. The function works by
matching the instance_type to the corresponding AWS quota code, and then using
the boto3 Python API to query the region for the specific quota code.

Returns:
False if the quota is found to be zero, and True otherwise.
Raises:
ImportError: if the dependencies for AWS are not able to be installed.
botocore.exceptions.ClientError: error in Boto3 client request.
"""

from sky.clouds.service_catalog import aws_catalog # pylint: disable=import-outside-toplevel,unused-import

quota_code = aws_catalog.get_quota_code(instance_type, use_spot)

if quota_code is None:
# Quota code not found in the catalog for the chosen instance_type, try provisioning anyway
return True

client = aws.client('service-quotas', region_name=region)
try:
response = client.get_service_quota(ServiceCode='ec2',
QuotaCode=quota_code)
except aws.botocore_exceptions().ClientError:
# Botocore client connection not established, try provisioning anyways
return True

if response['Quota']['Value'] == 0:
# Quota found to be zero, do not try provisioning
return False

# Quota found to be greater than zero, try provisioning
return True

@classmethod
def _query_instance_property_with_retries(
cls,
tag_filters: Dict[str, str],
region: str,
@@ -698,8 +739,8 @@
filter_str = ' '.join(f'Name=tag:{key},Values={value}'
for key, value in tag_filters.items())
query_cmd = (f'aws ec2 describe-instances --filters {filter_str} '
f'--region {region} --query {query} --output json')
returncode, stdout, stderr = subprocess_utils.run_and_retry(
f'--region {region} --query "{query}" --output json')
returncode, stdout, stderr = subprocess_utils.run_with_retries(
query_cmd,
retry_returncode=[255],
retry_stderrs=[
@@ -725,7 +766,7 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],
}

assert region is not None, (tag_filters, region)
returncode, stdout, stderr = cls._query_instance_property_and_retry(
returncode, stdout, stderr = cls._query_instance_property_with_retries(
tag_filters, region, query='Reservations[].Instances[].State.Name')

if returncode != 0:
@@ -751,7 +792,7 @@ def create_image_from_cluster(cls, cluster_name: str,
del zone # unused
assert region is not None, (tag_filters, region)
image_name = f'skypilot-{cluster_name}-{int(time.time())}'
returncode, stdout, stderr = cls._query_instance_property_and_retry(
returncode, stdout, stderr = cls._query_instance_property_with_retries(
tag_filters, region, query='Reservations[].Instances[].InstanceId')

subprocess_utils.handle_returncode(
@@ -765,14 +806,14 @@
if len(instance_ids) != 1:
Member: Can be 0?

Collaborator (author): Good point! Although we check the existence of the cluster in the caller, there might be cases where the user manually terminates the original cluster, causing an issue. Added a condition for it.

with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
f'More than one instance found: '
f'{stdout + stderr}')
'Only support creating image from single '
f'instance, but got: {instance_ids}')

instance_id = instance_ids[0]
create_image_cmd = (
f'aws ec2 create-image --region {region} --instance-id {instance_id} '
f'--name {image_name} --output text')
returncode, image_id, stderr = subprocess_utils.run_and_retry(
returncode, image_id, stderr = subprocess_utils.run_with_retries(
create_image_cmd,
retry_returncode=[255],
)
@@ -792,7 +833,7 @@ def create_image_from_cluster(cls, cluster_name: str,
wait_image_cmd = (
f'aws ec2 wait image-available --region {region} --image-ids {image_id}'
)
returncode, stdout, stderr = subprocess_utils.run_and_retry(
returncode, stdout, stderr = subprocess_utils.run_with_retries(
wait_image_cmd,
retry_returncode=[255],
)
@@ -819,7 +860,7 @@ def maybe_move_image(cls, image_id: str, source_region: str,
f'--source-image-id {image_id} '
f'--source-region {source_region} '
f'--region {target_region} --output text')
returncode, target_image_id, stderr = subprocess_utils.run_and_retry(
returncode, target_image_id, stderr = subprocess_utils.run_with_retries(
copy_image_cmd,
retry_returncode=[255],
)
@@ -838,7 +879,7 @@ def maybe_move_image(cls, image_id: str, source_region: str,
wait_image_cmd = (
f'aws ec2 wait image-available --region {target_region} '
f'--image-ids {target_image_id}')
subprocess_utils.run_and_retry(
subprocess_utils.run_with_retries(
wait_image_cmd,
max_retry=5,
retry_returncode=[255],
@@ -863,7 +904,7 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None:
assert region is not None, (image_id, region)
delete_image_cmd = (f'aws ec2 deregister-image --region {region} '
f'--image-id {image_id}')
returncode, _, stderr = subprocess_utils.run_and_retry(
returncode, _, stderr = subprocess_utils.run_with_retries(
delete_image_cmd,
retry_returncode=[255],
)
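
Taken together, the AWS changes above implement the image lifecycle below. A condensed sketch reusing the commands from the diff (region, instance id, and image name are illustrative; error handling elided):

import time
from sky.utils import subprocess_utils

region = 'us-east-1'                    # illustrative
instance_id = 'i-0123456789abcdef0'     # illustrative
image_name = f'skypilot-my-cluster-{int(time.time())}'

# 1) Create an AMI from the stopped source instance.
returncode, image_id, stderr = subprocess_utils.run_with_retries(
    f'aws ec2 create-image --region {region} --instance-id {instance_id} '
    f'--name {image_name} --output text',
    retry_returncode=[255])
image_id = image_id.strip()  # stdout carries the new AMI id

# 2) Wait until the AMI becomes available before launching from it.
subprocess_utils.run_with_retries(
    f'aws ec2 wait image-available --region {region} --image-ids {image_id}',
    retry_returncode=[255])

# 3) When the cloned cluster is terminated, deregister the managed image.
subprocess_utils.run_with_retries(
    f'aws ec2 deregister-image --region {region} --image-id {image_id}',
    retry_returncode=[255])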
2 changes: 1 addition & 1 deletion sky/clouds/cloud.py
@@ -539,7 +539,7 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],
raise NotImplementedError

# === Image related methods ===
# These three tmethods are used to create, move and delete images. They
# These three methods are used to create, move and delete images. They
# are currently only used in `sky launch --clone-disk-from` to clone a
# cluster's disk to launch a new cluster.
# It is not required to implement these methods for clouds that do not
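
A minimal sketch of that three-method interface (method names match the diffs; the exact signatures are abridged assumptions):

from typing import Dict, Optional

class Cloud:
    """Abridged sketch: only the image-related hooks discussed above."""

    @classmethod
    def create_image_from_cluster(cls, cluster_name: str,
                                  tag_filters: Dict[str, str],
                                  region: Optional[str],
                                  zone: Optional[str]) -> str:
        """Create an image from the cluster's disk; return its image id."""
        raise NotImplementedError

    @classmethod
    def maybe_move_image(cls, image_id: str, source_region: str,
                         target_region: str) -> str:
        """Copy the image across regions if needed; return the usable id."""
        raise NotImplementedError

    @classmethod
    def delete_image(cls, image_id: str, region: Optional[str]) -> None:
        """Delete a SkyPilot-managed image created by --clone-disk-from."""
        raise NotImplementedError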
10 changes: 5 additions & 5 deletions sky/clouds/gcp.py
@@ -838,7 +838,7 @@ def create_image_from_cluster(cls, cluster_name: str,
instance_name_cmd = ('gcloud compute instances list '
f'--filter="({label_filter_str})" '
'--format="json(name)"')
returncode, stdout, stderr = subprocess_utils.run_and_retry(
returncode, stdout, stderr = subprocess_utils.run_with_retries(
instance_name_cmd,
retry_returncode=[255],
)
@@ -853,15 +853,15 @@
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
'Only support creating image from single '
f' instance, but got multiple instances: {instance_names}')
f'instance, but got: {instance_names}')
instance_name = instance_names[0]['name']

image_name = f'skypilot-{cluster_name}-{int(time.time())}'
create_image_cmd = (f'gcloud compute images create {image_name} '
f'--source-disk {instance_name} '
f'--source-disk-zone {zone}')
logger.debug(create_image_cmd)
subprocess_utils.run_and_retry(
subprocess_utils.run_with_retries(
create_image_cmd,
retry_returncode=[255],
)
@@ -874,7 +874,7 @@

image_uri_cmd = (f'gcloud compute images describe {image_name} '
'--format="get(selfLink)"')
returncode, stdout, stderr = subprocess_utils.run_and_retry(
returncode, stdout, stderr = subprocess_utils.run_with_retries(
image_uri_cmd,
retry_returncode=[255],
)
@@ -904,7 +904,7 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None:
del region # Unused.
image_name = image_id.rpartition('/')[2]
delete_image_cmd = f'gcloud compute images delete {image_name} --quiet'
returncode, _, stderr = subprocess_utils.run_and_retry(
returncode, _, stderr = subprocess_utils.run_with_retries(
delete_image_cmd,
retry_returncode=[255],
)
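
The GCP path is analogous but images the boot disk directly; a condensed sketch of the commands above (zone and names illustrative; error handling elided):

import time
from sky.utils import subprocess_utils

zone = 'us-central1-a'                  # illustrative
instance_name = 'my-cluster-head'       # illustrative
image_name = f'skypilot-my-cluster-{int(time.time())}'

# 1) Create an image from the instance's boot disk.
subprocess_utils.run_with_retries(
    f'gcloud compute images create {image_name} '
    f'--source-disk {instance_name} --source-disk-zone {zone}',
    retry_returncode=[255])

# 2) Resolve the fully qualified image URI, used as the image id.
returncode, image_uri, stderr = subprocess_utils.run_with_retries(
    f'gcloud compute images describe {image_name} --format="get(selfLink)"',
    retry_returncode=[255])

# 3) On termination, delete the image by its short name.
subprocess_utils.run_with_retries(
    f'gcloud compute images delete {image_name} --quiet',
    retry_returncode=[255])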
49 changes: 25 additions & 24 deletions sky/execution.py
@@ -280,34 +280,35 @@ def _execute(
f'Backend {backend.NAME} does not support autostop, please try '
f'{backends.CloudVmRayBackend.NAME}')

if (Stage.PROVISION in stages and task.use_spot and
not _is_launched_by_spot_controller and not cluster_exists):
yellow = colorama.Fore.YELLOW
bold = colorama.Style.BRIGHT
reset = colorama.Style.RESET_ALL
logger.info(
f'{yellow}Launching an unmanaged spot task, which does not '
f'automatically recover from preemptions.{reset}\n{yellow}To '
'get automatic recovery, use managed spot instead: '
f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} '
f'{bold}sky.spot_launch(){reset}.')

if Stage.CLONE_DISK in stages:
Member: Seems like the three if clauses should be placed under an "if not cluster_exists:". Not feeling strongly, just felt that may be more readable.

Collaborator (author): Actually, we can move the CLONE_DISK stage up before the if above, so that the two original ifs still sit under "if not cluster_exists:". The downside is that it can be a little confusing: the conceptual order of the stages is CLONE_DISK -> OPTIMIZE -> PROVISION ..., whereas the code now reads as CLONE_DISK -> PROVISION -> OPTIMIZE -> PROVISION ....

task = _maybe_clone_disk_from_cluster(clone_disk_from, cluster_name,
task)

if Stage.OPTIMIZE in stages and not cluster_exists:
if task.best_resources is None:
# TODO: fix this for the situation where number of requested
# accelerators is not an integer.
if isinstance(backend, backends.CloudVmRayBackend):
# TODO: adding this check because docker backend on a
# no-credential machine should not enter optimize(), which
# would directly error out ('No cloud is enabled...'). Fix
# by moving `sky check` checks out of optimize()?
dag = sky.optimize(dag, minimize=optimize_target)
task = dag.tasks[0] # Keep: dag may have been deep-copied.
assert task.best_resources is not None, task
if not cluster_exists:
if (Stage.PROVISION in stages and task.use_spot and
not _is_launched_by_spot_controller):
yellow = colorama.Fore.YELLOW
bold = colorama.Style.BRIGHT
reset = colorama.Style.RESET_ALL
logger.info(
f'{yellow}Launching an unmanaged spot task, which does not '
f'automatically recover from preemptions.{reset}\n{yellow}To '
'get automatic recovery, use managed spot instead: '
f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} '
f'{bold}sky.spot_launch(){reset}.')

if Stage.OPTIMIZE in stages:
if task.best_resources is None:
# TODO: fix this for the situation where number of requested
# accelerators is not an integer.
if isinstance(backend, backends.CloudVmRayBackend):
# TODO: adding this check because docker backend on a
# no-credential machine should not enter optimize(), which
# would directly error out ('No cloud is enabled...'). Fix
# by moving `sky check` checks out of optimize()?
dag = sky.optimize(dag, minimize=optimize_target)
task = dag.tasks[0] # Keep: dag may have been deep-copied.
assert task.best_resources is not None, task

backend.register_info(dag=dag,
optimize_target=optimize_target,
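
A toy, runnable distillation of the gating discussed in the thread above (names are simplified stand-ins, not the real _execute):

from enum import Enum, auto

class Stage(Enum):
    CLONE_DISK = auto()
    PROVISION = auto()
    OPTIMIZE = auto()

def plan(stages, cluster_exists: bool, use_spot: bool) -> list:
    """Return the steps that would run, mirroring the reordered flow."""
    steps = []
    if Stage.CLONE_DISK in stages:
        steps.append('clone disk')   # runs before the gate, per the review
    if not cluster_exists:           # single gate for the remaining checks
        if Stage.PROVISION in stages and use_spot:
            steps.append('warn: unmanaged spot task')
        if Stage.OPTIMIZE in stages:
            steps.append('optimize dag')
    return steps

print(plan({Stage.CLONE_DISK, Stage.PROVISION, Stage.OPTIMIZE},
           cluster_exists=False, use_spot=True))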
3 changes: 1 addition & 2 deletions sky/resources.py
@@ -58,7 +58,6 @@ def __init__(
All fields are optional. ``Resources.is_launchable`` decides whether
the Resources is fully specified to launch an instance.


Examples:
.. code-block:: python

@@ -339,7 +338,7 @@ def disk_tier(self) -> str:
return self._disk_tier

@property
def is_image_managed(self) -> Optional[str]:
def is_image_managed(self) -> Optional[bool]:
return self._is_image_managed

def _set_cpus(
2 changes: 1 addition & 1 deletion sky/utils/subprocess_utils.py
@@ -122,7 +122,7 @@ def kill_children_processes(first_pid_to_kill: Optional[int] = None,
pass


def run_and_retry(
def run_with_retries(
cmd: str,
max_retry: int = 3,
retry_returncode: Optional[List[int]] = None,
Expand Down