update ray node provider to 2.0.0
update patches

Adapt to ray functions in 2.0.0

update azure-cli version for faster installation

format

[Onprem] Automatically install sky dependencies (#1116)

* Remove root user, move ray cluster to admin

* Automatically install sky dependencies

* Fix admin alignment

* Fix PR

* Address romil's comments

* F

* Addressed Romil's comments

Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (#1207)

* Add --retry-until-up flag for interactive nodes

* Add --region flag for interactive nodes

* Add --idle-minutes-to-autostop flag for interactive nodes

* Add --zone flag for interactive nodes

* Update help messages

* Address nit

Add all region option in catalog fetcher and speed up azure fetcher (#1204)

* Port changes

* format

* add t2a exclusion back

* fix A100 for GCP

* fix aws fetching for p4de.24xlarge

* Fill GPUInfo

* fix

* address part of comments

* address comments

* add test for A100

* patch GpuInfo

* Add generation info

* Add capabilities back to azure and fix aws

* fix azure catalog

* format

* lint

* remove zone from azure

* fix azure

* Add analyze for csv

* update catalog analysis

* format

* backward compatible for azure_catalog

* yapf

* fix GCP catalog

* fix A100-80GB

* format

* increase version number

* only keep useful columns for aws

* remove capabilities from azure

* add az to AWS

Revert "Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes" (#1220)

Revert "Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (#1207)"

This reverts commit f06416d.

[Storage] Add `StorageMode` to __init__ (#1223)

* Add storage mode to __init__

* fix

[Example] Minimal containerized app example (#1212)

* Container example

* parenthesis

* Add explicit StorageMode

* lint

Fix Mac Version in Setup.py (#1224)

Fix mac

Reduce iops for aws instances (#1221)

* set the default iops to be same as console for AWS

* fix

Revert "Reduce iops for aws instances" (#1229)

Revert "Reduce iops for aws instances (#1221)"

This reverts commit 29f1458.

update back compat test
Michaelvll committed Oct 16, 2022
1 parent f29da80 commit 06afd93
Showing 22 changed files with 154 additions and 123 deletions.
6 changes: 4 additions & 2 deletions sky/backends/cloud_vm_ray_backend.py
@@ -401,6 +401,8 @@ def add_epilogue(self) -> None:
# Need this to set the job status in ray job to be FAILED.
sys.exit(1)
else:
sys.stdout.flush()
sys.stderr.flush()
job_lib.set_status({self.job_id!r}, job_lib.JobStatus.SUCCEEDED)
# This waits for all streaming logs to finish.
time.sleep(1)
@@ -2066,8 +2068,8 @@ def _exec_code_on_head(
else:
job_submit_cmd = (
f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
f'--address=http://127.0.0.1:8265 --job-id {ray_job_id} '
'--no-wait -- '
f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
'--no-wait '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')

returncode, stdout, stderr = self.run_on_head(handle,
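The hunk above swaps `--job-id` for `--submission-id` and drops the `--` separator before the quoted entrypoint. A minimal sketch of how the resulting command string is assembled, assuming a hypothetical helper `build_job_submit_cmd` and placeholder argument values (neither is part of this commit):

```python
def build_job_submit_cmd(cd: str, remote_log_dir: str, ray_job_id: str,
                         executable: str, script_path: str,
                         remote_log_path: str) -> str:
    """Hypothetical helper mirroring the f-string in _exec_code_on_head."""
    return (
        f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
        # After this commit the job is identified by --submission-id.
        f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
        # The entrypoint follows --no-wait directly, without a `--` separator.
        '--no-wait '
        f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')


# Placeholder values for illustration only.
cmd = build_job_submit_cmd(
    cd='cd ~/sky_workdir',
    remote_log_dir='~/sky_logs/demo',
    ray_job_id='1-demo',
    executable='python3',
    script_path='~/.sky/sky_app/sky_job_1',
    remote_log_path='~/sky_logs/demo/run.log',
)
print(cmd)
```

The string is only built here, not executed; running it requires a Ray head node listening on port 8265.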
4 changes: 2 additions & 2 deletions sky/setup_files/setup.py
@@ -68,7 +68,7 @@ def parse_footnote(readme: str) -> str:
'PrettyTable',
# Lower local ray version is not fully supported, due to the
# autoscaler issues (also tracked in #537).
'ray[default]>=1.9.0,<=1.13.0',
'ray[default]>=1.9.0',
'rich',
'tabulate',
'filelock', # TODO(mraheja): Enforce >=3.6.0 when python version is >= 3.7
@@ -96,7 +96,7 @@ def parse_footnote(readme: str) -> str:
],
# TODO(zongheng): azure-cli is huge and takes a long time to install.
# Tracked in: https://github.com/Azure/azure-cli/issues/7387
'azure': ['azure-cli==2.31.0', 'azure-core'],
'azure': ['azure-cli==2.39.0', 'azure-core'],
'gcp': ['google-api-python-client', 'google-cloud-storage'],
'docker': ['docker'],
}
12 changes: 6 additions & 6 deletions sky/skylet/LICENCE
@@ -203,16 +203,16 @@
--------------------------------------------------------------------------------

Code in providers/azure from
https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/_azure
Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
https://github.com/ray-project/ray/tree/ray-2.0.0/python/ray/autoscaler/_private/_azure
Git commit of the release 2.0.0: cba26cc83f6b5b8a2ff166594a65cb74c0ec8740

Code in providers/gcp from
https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/gcp
Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
https://github.com/ray-project/ray/tree/ray-2.0.0/python/ray/autoscaler/_private/gcp
Git commit of the release 2.0.0: cba26cc83f6b5b8a2ff166594a65cb74c0ec8740

Code in providers/aws from
https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/aws
Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
https://github.com/ray-project/ray/tree/ray-2.0.0/python/ray/autoscaler/_private/aws
Git commit of the release 2.0.0: cba26cc83f6b5b8a2ff166594a65cb74c0ec8740


Copyright 2016-2022 Ray developers
2 changes: 1 addition & 1 deletion sky/skylet/constants.py
@@ -2,7 +2,7 @@

SKY_LOGS_DIRECTORY = '~/sky_logs'
SKY_REMOTE_WORKDIR = '~/sky_workdir'
SKY_REMOTE_RAY_VERSION = '1.13.0'
SKY_REMOTE_RAY_VERSION = '2.0.0'

# TODO(mluo): Make explicit `sky launch -c <name> ''` optional.
UNINITIALIZED_ONPREM_CLUSTER_MESSAGE = (
20 changes: 15 additions & 5 deletions sky/skylet/job_lib.py
@@ -7,6 +7,7 @@
import pathlib
import shlex
import time
import typing
from typing import Any, Dict, List, Optional

import filelock
@@ -18,6 +19,9 @@
from sky.utils import db_utils
from sky.utils import log_utils

if typing.TYPE_CHECKING:
from ray.dashboard.modules.job.pydantic_models import JobDetails

logger = sky_logging.init_logger(__name__)

_JOB_STATUS_LOCK = '~/.sky/locks/.job_{}.lock'
@@ -342,13 +346,19 @@ def update_job_status(job_owner: str,

job_client = _create_ray_job_submission_client()

# In ray 1.13.0, job_client.list_jobs returns a dict of job_id to job_info,
# where job_info contains the job status (str).
ray_job_infos = job_client.list_jobs()
# In ray 2.0.0, job_client.list_jobs returns a list of JobDetails,
# which contains the job status (str) and submission_id (str).
job_details_list: List['JobDetails'] = job_client.list_jobs()

job_details = dict()
ray_job_ids_set = set(ray_job_ids)
for job_detail in job_details_list:
if job_detail.submission_id in ray_job_ids_set:
job_details[job_detail.submission_id] = job_detail
job_statuses: List[JobStatus] = [None] * len(ray_job_ids)
for i, ray_job_id in enumerate(ray_job_ids):
if ray_job_id in ray_job_infos:
ray_status = ray_job_infos[ray_job_id].status
if ray_job_id in job_details:
ray_status = job_details[ray_job_id].status
job_statuses[i] = _RAY_TO_JOB_STATUS_MAP[ray_status]

assert len(job_statuses) == len(job_ids), (job_statuses, job_ids)
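The change above adapts to `list_jobs()` returning a list of `JobDetails` rather than a dict keyed by job id, by first indexing the list on `submission_id`. A self-contained sketch of that re-indexing step, assuming a hypothetical `FakeJobDetails` stand-in that carries only the two fields used here (the real class lives in Ray and is not imported):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeJobDetails:
    """Hypothetical stand-in for Ray's JobDetails; illustration only."""
    submission_id: Optional[str]
    status: str


def statuses_for(ray_job_ids: List[str],
                 job_details_list: List[FakeJobDetails]) -> List[Optional[str]]:
    """Return one status per requested id, or None if Ray does not know it."""
    wanted = set(ray_job_ids)
    # Index the list by submission_id, keeping only the ids we asked about.
    by_id: Dict[str, FakeJobDetails] = {
        d.submission_id: d
        for d in job_details_list
        if d.submission_id in wanted
    }
    # Preserve the order of the requested ids.
    return [by_id[i].status if i in by_id else None for i in ray_job_ids]


result = statuses_for(
    ['1-alice', '2-alice'],
    [FakeJobDetails('2-alice', 'RUNNING'), FakeJobDetails('9-bob', 'SUCCEEDED')],
)
print(result)  # [None, 'RUNNING']
```

Building the dict once keeps each lookup constant-time, which matters only when the job list is long; for a handful of jobs a linear scan would behave the same.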
12 changes: 7 additions & 5 deletions sky/skylet/providers/aws/cloudwatch/cloudwatch_helper.py
@@ -1,13 +1,15 @@
import botocore
import copy
import hashlib
import json
import os
import logging
import os
import time
import hashlib
from typing import Any, Dict, List, Union, Tuple
from typing import Any, Dict, List, Tuple, Union

import botocore

from sky.skylet.providers.aws.utils import client_cache, resource_cache
from ray.autoscaler.tags import TAG_RAY_CLUSTER_NAME, NODE_KIND_HEAD, TAG_RAY_NODE_KIND
from ray.autoscaler.tags import NODE_KIND_HEAD, TAG_RAY_CLUSTER_NAME, TAG_RAY_NODE_KIND

logger = logging.getLogger(__name__)

46 changes: 23 additions & 23 deletions sky/skylet/providers/aws/config.py
@@ -1,30 +1,29 @@
from distutils.version import StrictVersion
from functools import lru_cache
from functools import partial
import copy
import itertools
import json
import logging
import os
import time
from distutils.version import StrictVersion
from functools import lru_cache, partial
from typing import Any, Dict, List, Optional, Set, Tuple
import logging

import boto3
import botocore

from ray.autoscaler._private.util import check_legacy_fields
from ray.autoscaler.tags import NODE_TYPE_LEGACY_HEAD, NODE_TYPE_LEGACY_WORKER
from ray.autoscaler._private.providers import _PROVIDER_PRETTY_NAMES
from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
CloudwatchHelper as cwh,
)
from sky.skylet.providers.aws.utils import (
LazyDefaultDict,
handle_boto_error,
resource_cache,
)
from ray.autoscaler._private.cli_logger import cli_logger, cf
from ray.autoscaler._private.cli_logger import cf, cli_logger
from ray.autoscaler._private.event_system import CreateClusterEvent, global_event_system
from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
CloudwatchHelper as cwh,
)
from ray.autoscaler._private.providers import _PROVIDER_PRETTY_NAMES
from ray.autoscaler._private.util import check_legacy_fields
from ray.autoscaler.tags import NODE_TYPE_LEGACY_HEAD, NODE_TYPE_LEGACY_WORKER

logger = logging.getLogger(__name__)

@@ -33,20 +32,21 @@
DEFAULT_RAY_IAM_ROLE = RAY + "-v1"
SECURITY_GROUP_TEMPLATE = RAY + "-{}"

DEFAULT_AMI_NAME = "AWS Deep Learning AMI (Ubuntu 18.04) V30.0"
# V61.0 has CUDA 11.2
DEFAULT_AMI_NAME = "AWS Deep Learning AMI (Ubuntu 18.04) V61.0"

# Obtained from https://aws.amazon.com/marketplace/pp/B07Y43P7X5 on 8/4/2020.
# Obtained from https://aws.amazon.com/marketplace/pp/B07Y43P7X5 on 6/10/2022.
DEFAULT_AMI = {
"us-east-1": "ami-029510cec6d69f121", # US East (N. Virginia)
"us-east-2": "ami-08bf49c7b3a0c761e", # US East (Ohio)
"us-west-1": "ami-0cc472544ce594a19", # US West (N. California)
"us-west-2": "ami-0a2363a9cff180a64", # US West (Oregon)
"ca-central-1": "ami-0a871851b2ab39f01", # Canada (Central)
"eu-central-1": "ami-049fb1ea198d189d7", # EU (Frankfurt)
"eu-west-1": "ami-0abcbc65f89fb220e", # EU (Ireland)
"eu-west-2": "ami-0755b39fd4dab7cbe", # EU (London)
"eu-west-3": "ami-020485d8df1d45530", # EU (Paris)
"sa-east-1": "ami-058a6883cbdb4e599", # SA (Sao Paulo)
"us-east-1": "ami-0dd6adfad4ad37eec", # US East (N. Virginia)
"us-east-2": "ami-0c77cd5ca05bf1281", # US East (Ohio)
"us-west-1": "ami-020ab1b368a5ed1db", # US West (N. California)
"us-west-2": "ami-0387d929287ab193e", # US West (Oregon)
"ca-central-1": "ami-07dbafdbd38f18d98", # Canada (Central)
"eu-central-1": "ami-0383bd0c1fc4c63ec", # EU (Frankfurt)
"eu-west-1": "ami-0a074b0a311a837ac", # EU (Ireland)
"eu-west-2": "ami-094ba2b4651f761ca", # EU (London)
"eu-west-3": "ami-031da10fbf225bf5f", # EU (Paris)
"sa-east-1": "ami-0be7c1f1dd96d7337", # SA (Sao Paulo)
}

# todo: cli_logger should handle this assert properly
48 changes: 25 additions & 23 deletions sky/skylet/providers/aws/node_provider.py
@@ -1,36 +1,39 @@
import copy
import threading
from collections import defaultdict, OrderedDict
import logging
import threading
import time
from collections import defaultdict, OrderedDict
from typing import Any, Dict, List

import botocore
from boto3.resources.base import ServiceResource

from ray.autoscaler.node_provider import NodeProvider
from ray.autoscaler.tags import (
TAG_RAY_CLUSTER_NAME,
TAG_RAY_NODE_NAME,
TAG_RAY_LAUNCH_CONFIG,
TAG_RAY_NODE_KIND,
TAG_RAY_USER_NODE_TYPE,
try:
import ray._private.ray_constants as ray_constants
except ImportError:
# SkyPilot: for local ray version lower than 2.0.0
import ray.ray_constants as ray_constants
from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
CloudwatchHelper,
CLOUDWATCH_AGENT_INSTALLED_AMI_TAG,
CLOUDWATCH_AGENT_INSTALLED_TAG,
)
from ray.autoscaler._private.constants import BOTO_MAX_RETRIES, BOTO_CREATE_MAX_RETRIES
from sky.skylet.providers.aws.config import bootstrap_aws
from ray.autoscaler._private.log_timer import LogTimer

from sky.skylet.providers.aws.utils import (
boto_exception_handler,
resource_cache,
client_cache,
)
from ray.autoscaler._private.cli_logger import cli_logger, cf
import ray.ray_constants as ray_constants

from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
CloudwatchHelper,
CLOUDWATCH_AGENT_INSTALLED_AMI_TAG,
CLOUDWATCH_AGENT_INSTALLED_TAG,
from ray.autoscaler._private.constants import BOTO_MAX_RETRIES, BOTO_CREATE_MAX_RETRIES
from ray.autoscaler._private.log_timer import LogTimer
from ray.autoscaler.node_provider import NodeProvider
from ray.autoscaler.tags import (
TAG_RAY_CLUSTER_NAME,
TAG_RAY_LAUNCH_CONFIG,
TAG_RAY_NODE_KIND,
TAG_RAY_NODE_NAME,
TAG_RAY_USER_NODE_TYPE,
)

logger = logging.getLogger(__name__)
@@ -56,7 +59,7 @@ def from_aws_format(tags):
return tags


def make_ec2_client(region, max_retries, aws_credentials=None):
def make_ec2_resource(region, max_retries, aws_credentials=None):
"""Make client, retrying requests up to `max_retries`."""
aws_credentials = aws_credentials or {}
return resource_cache("ec2", region, max_retries, **aws_credentials)
@@ -67,7 +70,7 @@ def list_ec2_instances(
) -> List[Dict[str, Any]]:
"""Get all instance-types/resources available in the user's AWS region.
Args:
region (str): the region of the AWS provider. e.g., "us-west-2".
region: the region of the AWS provider. e.g., "us-west-2".
Returns:
final_instance_types: a list of instances. An example of one element in
the list:
@@ -101,12 +104,12 @@ def __init__(self, provider_config, cluster_name):
self.cache_stopped_nodes = provider_config.get("cache_stopped_nodes", True)
aws_credentials = provider_config.get("aws_credentials")

self.ec2 = make_ec2_client(
self.ec2 = make_ec2_resource(
region=provider_config["region"],
max_retries=BOTO_MAX_RETRIES,
aws_credentials=aws_credentials,
)
self.ec2_fail_fast = make_ec2_client(
self.ec2_fail_fast = make_ec2_resource(
region=provider_config["region"],
max_retries=0,
aws_credentials=aws_credentials,
@@ -494,7 +497,6 @@ def terminate_node(self, node_id):
# asynchronous or error, which would result in a use after free error.
# If this leak becomes bad, we can garbage collect the tag cache when
# the node cache is updated.
pass

def _check_ami_cwa_installation(self, config):
response = self.ec2.meta.client.describe_images(ImageIds=[config["ImageId"]])
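The `try`/`except ImportError` around `ray_constants` in this file lets the provider load on both Ray 2.0.0 (`ray._private.ray_constants`) and older local installs (`ray.ray_constants`). The same fallback can be sketched generically with `importlib`; the helper name `import_first_available` is hypothetical, and the demo uses a standard-library module so it runs without Ray installed:

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def import_first_available(module_names: Sequence[str]) -> ModuleType:
    """Hypothetical helper: import the first module path that exists."""
    last_error: Optional[ImportError] = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as e:  # ModuleNotFoundError is a subclass.
            last_error = e
    raise ImportError(
        f'None of {list(module_names)} could be imported') from last_error


# With Ray installed, the paths from the diff would be tried in this order:
#   import_first_available(['ray._private.ray_constants', 'ray.ray_constants'])
# Demo with a deliberately missing module followed by a stdlib one.
mod = import_first_available(['no_such_module_for_demo_xyz', 'json'])
print(mod.__name__)  # json
```

The plain `try`/`except` in the diff is the simpler choice for two candidates; a helper like this only pays off when more than two module paths need to be tried.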
12 changes: 8 additions & 4 deletions sky/skylet/providers/aws/utils.py
@@ -1,11 +1,13 @@
from collections import defaultdict
from functools import lru_cache

import boto3
from boto3.exceptions import ResourceNotExistsError
from boto3.resources.base import ServiceResource
from botocore.client import BaseClient
from botocore.config import Config
import boto3

from ray.autoscaler._private.cli_logger import cli_logger, cf
from ray.autoscaler._private.cli_logger import cf, cli_logger
from ray.autoscaler._private.constants import BOTO_MAX_RETRIES


@@ -141,7 +143,9 @@ def __exit__(self, type, value, tb):


@lru_cache()
def resource_cache(name, region, max_retries=BOTO_MAX_RETRIES, **kwargs):
def resource_cache(
name, region, max_retries=BOTO_MAX_RETRIES, **kwargs
) -> ServiceResource:
cli_logger.verbose(
"Creating AWS resource `{}` in `{}`", cf.bold(name), cf.bold(region)
)
@@ -157,7 +161,7 @@ def resource_cache(name, region, max_retries=BOTO_MAX_RETRIES, **kwargs):


@lru_cache()
def client_cache(name, region, max_retries=BOTO_MAX_RETRIES, **kwargs):
def client_cache(name, region, max_retries=BOTO_MAX_RETRIES, **kwargs) -> BaseClient:
try:
# try to re-use a client from the resource cache first
return resource_cache(name, region, max_retries, **kwargs).meta.client
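Both `resource_cache` and `client_cache` are wrapped in `@lru_cache()`, so repeated calls with the same arguments reuse one object, while a different `max_retries` yields a separate one (which is how `self.ec2` and `self.ec2_fail_fast` in `node_provider.py` end up distinct). A sketch of that behaviour, assuming a hypothetical `FakeResource` class in place of a real boto3 resource:

```python
from functools import lru_cache


class FakeResource:
    """Hypothetical stand-in for a boto3 ServiceResource; illustration only."""

    def __init__(self, name: str, region: str, max_retries: int):
        self.name = name
        self.region = region
        self.max_retries = max_retries


@lru_cache()
def fake_resource_cache(name, region, max_retries=3, **kwargs) -> FakeResource:
    # lru_cache keys on the call arguments, so every positional and keyword
    # value passed here must be hashable.
    return FakeResource(name, region, max_retries)


a = fake_resource_cache('ec2', 'us-west-2', 3)
b = fake_resource_cache('ec2', 'us-west-2', 3)
c = fake_resource_cache('ec2', 'us-west-2', 0)

print(a is b)  # True: identical arguments hit the cache.
print(a is c)  # False: a different max_retries creates a new object.
```

Because the cache key is built from the arguments, passing an unhashable value such as a dict as a keyword argument would raise `TypeError`.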
2 changes: 1 addition & 1 deletion sky/skylet/providers/azure/config.py
@@ -1,7 +1,7 @@
import json
import logging
from pathlib import Path
import random
from pathlib import Path
from typing import Any, Callable

from azure.common.credentials import get_cli_profile
12 changes: 6 additions & 6 deletions sky/skylet/providers/azure/node_provider.py
@@ -10,18 +10,18 @@
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import DeploymentMode

from sky.skylet.providers.azure.config import (
bootstrap_azure,
get_azure_sdk_function,
)
from ray.autoscaler.node_provider import NodeProvider
from ray.autoscaler.tags import (
TAG_RAY_CLUSTER_NAME,
TAG_RAY_NODE_NAME,
TAG_RAY_NODE_KIND,
TAG_RAY_LAUNCH_CONFIG,
TAG_RAY_NODE_KIND,
TAG_RAY_NODE_NAME,
TAG_RAY_USER_NODE_TYPE,
)
from sky.skylet.providers.azure.config import (
bootstrap_azure,
get_azure_sdk_function,
)

VM_NAME_MAX_LEN = 64
VM_NAME_UUID_LEN = 8