Upgrade ray version to 1.13 #969

Merged 41 commits on Aug 4, 2022
55966de
retry for azure head timeout
Michaelvll Jun 27, 2022
56d55e8
check more matching
Michaelvll Jun 27, 2022
65fdfdf
upgrade ray version to 1.13
Michaelvll Jul 13, 2022
840fe06
upgrade AMI for aws
Michaelvll Jul 13, 2022
461ea74
Merge branch 'retry-for-azure' of github.com:concretevitamin/sky-expe…
Michaelvll Jul 13, 2022
03c205c
Fix ray quote
Michaelvll Jul 13, 2022
8b3aeea
Add comment
Michaelvll Jul 13, 2022
9e7a71a
Add retries
Michaelvll Jul 13, 2022
3ba21a8
fix smoke
Michaelvll Jul 13, 2022
25ddfec
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Jul 26, 2022
1f47f25
Fix merge problem
Michaelvll Jul 26, 2022
fd0d561
Fix constants
Michaelvll Jul 26, 2022
f4cc64d
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Jul 26, 2022
c1760c7
Fix compatibility
Michaelvll Jul 27, 2022
4c8f91f
Fix logging
Michaelvll Jul 27, 2022
d802c7a
fix template
Michaelvll Jul 27, 2022
9b0242e
Fix onprem submission
Michaelvll Jul 27, 2022
c5bef37
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Jul 29, 2022
1d655f8
Address comments
Michaelvll Jul 30, 2022
538f9b7
Fix progress bar
Michaelvll Jul 31, 2022
9380bc1
Fix comment
Michaelvll Jul 31, 2022
5af2e43
Add backward compatibility test
Michaelvll Jul 31, 2022
90018f8
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Jul 31, 2022
81f0582
format
Michaelvll Jul 31, 2022
d33ae50
address python version
Michaelvll Aug 1, 2022
d38132e
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Aug 1, 2022
d4385b9
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Aug 1, 2022
fc64862
Upgrade node_providers with ray==1.13
Michaelvll Aug 1, 2022
705cfa3
fix
Michaelvll Aug 2, 2022
90bbf21
Add job onprem fix
michaelzhiluo Aug 3, 2022
fea8fa1
Fix job status query for logging
Michaelvll Aug 3, 2022
1dc1bc5
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Aug 4, 2022
0423636
Fix merge conflict
Michaelvll Aug 4, 2022
8196bea
Make the job id fetching more robust
Michaelvll Aug 4, 2022
4330f6f
fix
Michaelvll Aug 4, 2022
2f86b8c
Only show the usage policy when the entrypoint is used
Michaelvll Aug 4, 2022
c4cc62e
fix test
Michaelvll Aug 4, 2022
0677837
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Aug 4, 2022
58f317d
fix
Michaelvll Aug 4, 2022
87adc90
format
Michaelvll Aug 4, 2022
9c7fdfb
longer timeout for multi_echo
Michaelvll Aug 4, 2022
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Expand Up @@ -16,7 +16,7 @@ We use GitHub to track issues and features. For new contributors, we recommend l

### Installing SkyPilot for development
```bash
# SkyPilot requires python >= 3.6 and < 3.10.
# SkyPilot requires python >= 3.6.
# You can just install the dependencies for
# certain clouds, e.g., ".[aws,azure,gcp]"
pip install -e ".[all]"
2 changes: 1 addition & 1 deletion docs/source/getting-started/installation.rst
Expand Up @@ -5,7 +5,7 @@ Install SkyPilot using pip:

.. code-block:: console

$ # SkyPilot requires python >= 3.6 and < 3.10.
$ # SkyPilot requires python >= 3.6.
$ git clone ssh://git@github.com/skypilot-org/skypilot.git
$ cd skypilot

6 changes: 3 additions & 3 deletions docs/source/reference/local/setup.rst
Expand Up @@ -14,15 +14,15 @@ For further reference, `here <https://docs.ray.io/en/latest/ray-core/configure.h
Installing SkyPilot dependencies
-----------------------------------

SkyPilot On-prem requires :code:`python3`, :code:`ray==1.10.0`, and :code:`sky` to be setup on all local nodes and globally available to all users.
SkyPilot On-prem requires :code:`python3`, :code:`ray==1.13.0`, and :code:`sky` to be setup on all local nodes and globally available to all users.

To install Ray and SkyPilot for all users, run the following commands on all local nodes:

.. code-block:: console

$ sudo -H pip3 install ray[default]==1.10.0
$ sudo -H pip3 install ray[default]==1.13.0

$ # SkyPilot requires python >= 3.6 and < 3.10.
$ # SkyPilot requires python >= 3.6.
$ git clone ssh://git@github.com/skypilot-org/skypilot.git
$ cd skypilot
$ sudo -H pip3 install -e .
2 changes: 1 addition & 1 deletion examples/local/cluster-config.yaml
Expand Up @@ -4,7 +4,7 @@
# The system administrator must have `sudo` access to the local nodes.
# Requirements:
# 1) Python (> 3.6) on all nodes.
# 2) Ray CLI (= 1.10.0) on all nodes.
# 2) Ray CLI (= 1.13.0) on all nodes.
#
# Example usage:
# >> sky admin deploy cluster-config.yaml
2 changes: 1 addition & 1 deletion examples/resnet_distributed_tf_app.py
Expand Up @@ -77,7 +77,7 @@ def run_fn(node_rank: int, ip_list: List[str]) -> Optional[str]:
train.set_outputs('resnet-model-dir', estimated_size_gigabytes=0.1)
train.set_resources(sky.Resources(sky.AWS(), accelerators='V100'))

sky.launch(dag, cluster_name=cluster)
sky.launch(dag, cluster_name=cluster, retry_until_up=True)


if __name__ == '__main__':
6 changes: 3 additions & 3 deletions sky/backends/backend_utils.py
Expand Up @@ -37,6 +37,7 @@
from sky import backends
from sky import check as sky_check
from sky import clouds
from sky import constants
from sky import exceptions
from sky import global_user_state
from sky import sky_logging
Expand All @@ -63,7 +64,6 @@
SKY_REMOTE_APP_DIR = '~/.sky/sky_app'
SKY_RAY_YAML_REMOTE_PATH = '~/.sky/sky_ray.yml'
IP_ADDR_REGEX = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
SKY_REMOTE_RAY_VERSION = '1.10.0'
SKY_REMOTE_PATH = '~/.sky/sky_wheels'
SKY_USER_FILE_PATH = '~/.sky/generated'

Expand Down Expand Up @@ -623,7 +623,7 @@ def write_cluster_config(to_provision: 'resources.Resources',
# GCP only.
'gcp_project_id': gcp_project_id,
# Ray version.
'ray_version': SKY_REMOTE_RAY_VERSION,
'ray_version': constants.SKY_REMOTE_RAY_VERSION,
# Cloud credentials for cloud storage.
'credentials': credentials,
# Sky remote utils.
Expand Down Expand Up @@ -1113,7 +1113,7 @@ def _ray_launch_hash(cluster_name: str, ray_config: Dict[str, Any]) -> Set[str]:
return set(ray_launch_hashes)
with subpress_output():
ray_config = ray_commands._bootstrap_config(ray_config) # pylint: disable=protected-access
# Adopted from https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/autoscaler/_private/node_launcher.py#L46-L54
# Adopted from https://github.com/ray-project/ray/blob/ray-1.13.0/python/ray/autoscaler/_private/node_launcher.py#L56-L64
# TODO(zhwu): this logic is duplicated from the ray code above (keep in sync).
launch_hashes = set()
head_node_type = ray_config['head_node_type']
34 changes: 21 additions & 13 deletions sky/backends/cloud_vm_ray_backend.py
Expand Up @@ -6,6 +6,7 @@
import json
import os
import pathlib
import re
import signal
import subprocess
import tempfile
Expand All @@ -22,6 +23,7 @@
from sky import backends
from sky import clouds
from sky import cloud_stores
from sky import constants
from sky import exceptions
from sky import global_user_state
from sky import resources as resources_lib
Expand Down Expand Up @@ -53,15 +55,13 @@

SKY_REMOTE_APP_DIR = backend_utils.SKY_REMOTE_APP_DIR
SKY_REMOTE_WORKDIR = backend_utils.SKY_REMOTE_WORKDIR
SKY_LOGS_DIRECTORY = job_lib.SKY_LOGS_DIRECTORY
SKY_REMOTE_RAY_VERSION = backend_utils.SKY_REMOTE_RAY_VERSION

logger = sky_logging.init_logger(__name__)

_PATH_SIZE_MEGABYTES_WARN_THRESHOLD = 256

# Timeout for provision a cluster and wait for it to be ready in seconds.
_NODES_LAUNCHING_PROGRESS_TIMEOUT = 30
_NODES_LAUNCHING_PROGRESS_TIMEOUT = 60

# Time gap between retries after failing to provision in all possible places.
# Used only if --retry-until-up is set.
Expand Down Expand Up @@ -92,6 +92,8 @@

_MAX_RAY_UP_RETRY = 5

_JOB_ID_PATTERN = re.compile(r'Job ID: ([0-9]+)')


def _get_cluster_config_template(cloud):
cloud_to_template = {
Expand Down Expand Up @@ -1198,7 +1200,7 @@ def _ensure_cluster_ray_started(self,
if isinstance(launched_resources.cloud, clouds.Local):
raise RuntimeError(
'The command `ray status` errored out on the head node '
'of the local cluster. Check if ray[default]==1.10.0 '
'of the local cluster. Check if ray[default]==1.13.0 '
'is installed or running correctly.')
backend.run_on_head(handle, 'ray stop', use_cached_head_ip=False)
log_lib.run_with_log(
Expand Down Expand Up @@ -1426,7 +1428,8 @@ def __setstate__(self, state):

def __init__(self):
self.run_timestamp = backend_utils.get_run_timestamp()
self.log_dir = os.path.join(SKY_LOGS_DIRECTORY, self.run_timestamp)
self.log_dir = os.path.join(constants.SKY_LOGS_DIRECTORY,
self.run_timestamp)
# Do not make directories to avoid create folder for commands that
# do not need it (`sky status`, `sky logs` ...)
# os.makedirs(self.log_dir, exist_ok=True)
Expand Down Expand Up @@ -1639,13 +1642,12 @@ def _provision(self,
# to SUCCEEDED, the cluster is STOPPED by `sky stop`.
# 2. On next `sky start`, it gets reset to FAILED.
cmd = job_lib.JobLibCodeGen.fail_all_jobs_in_progress()
returncode, _, stderr = self.run_on_head(handle,
cmd,
require_outputs=True)
returncode, stdout, stderr = self.run_on_head(
handle, cmd, require_outputs=True)
subprocess_utils.handle_returncode(
returncode, cmd,
'Failed to set previously in-progress jobs to FAILED',
stderr)
stdout + stderr)

with timeline.Event('backend.provision.post_process'):
global_user_state.add_or_update_cluster(cluster_name,
Expand Down Expand Up @@ -1828,8 +1830,9 @@ def _exec_code_on_head(
else:
job_submit_cmd = (
f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
f'--address=127.0.0.1:8265 --job-id {ray_job_id} --no-wait '
f'-- "{executable} -u {script_path} > {remote_log_path} 2>&1"')
f'--address=http://127.0.0.1:8265 --job-id {ray_job_id} '
'--no-wait -- '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')

returncode, stdout, stderr = self.run_on_head(handle,
job_submit_cmd,
Expand Down Expand Up @@ -1895,7 +1898,7 @@ def _setup_and_create_job_cmd_on_local_head(
switch_user_cmd = ' '.join(switch_user_cmd)
job_submit_cmd = (
'ray job submit '
f'--address=127.0.0.1:8265 --job-id {ray_job_id} --no-wait '
f'--address=http://127.0.0.1:8265 --job-id {ray_job_id} --no-wait '
f'-- {switch_user_cmd}')
return job_submit_cmd

Expand All @@ -1914,7 +1917,12 @@ def _add_job(self, handle: ResourceHandle, job_name: str,
'Failed to fetch job id.',
job_id_str + stderr)
try:
job_id = int(job_id_str)
job_id_match = _JOB_ID_PATTERN.search(job_id_str)
if job_id_match is not None:
job_id = int(job_id_match.group(1))
else:
# For backward compatibility.
job_id = int(job_id_str)
except ValueError as e:
logger.error(stderr)
raise ValueError(f'Failed to parse job id: {job_id_str}; '
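
The job id parsing in the hunk above can be read in isolation as the following minimal sketch. The regular expression and the integer fallback come from the diff; the function name `parse_job_id` and the exact error message are illustrative assumptions, not part of the PR.

```python
import re

# Pattern taken from the diff: the remote side may print "Job ID: <n>"
# surrounded by other output (e.g. warnings emitted by newer Ray versions).
_JOB_ID_PATTERN = re.compile(r'Job ID: ([0-9]+)')


def parse_job_id(output: str) -> int:
    """Extract the job id from the head node's output.

    `parse_job_id` is a hypothetical helper name used only for this sketch.
    """
    match = _JOB_ID_PATTERN.search(output)
    if match is not None:
        return int(match.group(1))
    try:
        # Backward compatibility: older remote code printed only the integer.
        return int(output)
    except ValueError as e:
        raise ValueError(f'Failed to parse job id: {output!r}') from e
```

Searching for a labelled id rather than casting the whole string means extra lines in the output no longer break the parse, while clusters running the older remote code (bare integer output) keep working.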
4 changes: 2 additions & 2 deletions sky/benchmark/benchmark_utils.py
Expand Up @@ -21,6 +21,7 @@

import sky
from sky import backends
from sky import constants
from sky import data
from sky import global_user_state
from sky import sky_logging
Expand All @@ -39,7 +40,6 @@
logger = sky_logging.init_logger(__name__)
console = rich_console.Console()

_SKY_LOGS_DIRECTORY = job_lib.SKY_LOGS_DIRECTORY
_SKY_LOCAL_BENCHMARK_DIR = os.path.expanduser('~/.sky/benchmarks')
_SKY_REMOTE_BENCHMARK_DIR = '~/.sky/sky_benchmark_dir'
# NOTE: This must be the same as _SKY_REMOTE_BENCHMARK_DIR
Expand Down Expand Up @@ -489,7 +489,7 @@ def launch_benchmark_clusters(benchmark: str, clusters: List[str],

# Save stdout/stderr from cluster launches.
run_timestamp = backend_utils.get_run_timestamp()
log_dir = os.path.join(_SKY_LOGS_DIRECTORY, run_timestamp)
log_dir = os.path.join(constants.SKY_LOGS_DIRECTORY, run_timestamp)
log_dir = os.path.expanduser(log_dir)
logger.info(
f'{colorama.Fore.YELLOW}To view stdout/stderr from individual '
17 changes: 10 additions & 7 deletions sky/cli.py
Expand Up @@ -2450,13 +2450,16 @@ def spot_launch(
f'Launching managed spot job {name} from spot controller...',
fg='yellow')
click.echo('Launching spot controller...')
sky.launch(dag,
stream_logs=True,
cluster_name=controller_name,
detach_run=detach_run,
idle_minutes_to_autostop=spot_lib.
SPOT_CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP,
is_spot_controller_task=True)
sky.launch(
dag,
stream_logs=True,
cluster_name=controller_name,
detach_run=detach_run,
idle_minutes_to_autostop=spot_lib.
SPOT_CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP,
is_spot_controller_task=True,
retry_until_up=True,
)


@spot.command('status', cls=_DocumentedCodeCommand)
4 changes: 4 additions & 0 deletions sky/constants.py
@@ -0,0 +1,4 @@
"""Constants for SkyPilot."""

SKY_LOGS_DIRECTORY = '~/sky_logs'
SKY_REMOTE_RAY_VERSION = '1.13.0'
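
As a rough illustration of how callers elsewhere in this PR consume the shared constant, the sketch below builds a per-run log directory from it. The constant value is copied from the new file; `make_log_dir` and the sample timestamp are hypothetical and only mirror the `os.path.join` / `os.path.expanduser` pattern visible in `sky/benchmark/benchmark_utils.py`.

```python
import os

# Value copied from sky/constants.py in this PR.
SKY_LOGS_DIRECTORY = '~/sky_logs'


def make_log_dir(run_timestamp: str) -> str:
    """Join the shared logs directory with a run timestamp.

    Hypothetical helper: the leading '~' is only meaningful to a shell,
    so it is expanded before the path is used locally.
    """
    log_dir = os.path.join(SKY_LOGS_DIRECTORY, run_timestamp)
    return os.path.expanduser(log_dir)


if __name__ == '__main__':
    # The timestamp format here is an assumption made for the example.
    print(make_log_dir('sky-2022-08-04-12-00-00-000000'))
```

Keeping the value in a single module means the backend and benchmark code no longer need to import it from `job_lib` or `backend_utils`.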
4 changes: 2 additions & 2 deletions sky/design_docs/onprem-design.md
Expand Up @@ -9,7 +9,7 @@
- Does not support different types of accelerators within the same node (intranode).

## Installing Ray and SkyPilot
- Admin installs Ray==1.10.0 and SkyPilot globally on all machines. It is assumed that the admin regularly keeps SkyPilot updated on the cluster.
- Admin installs Ray==1.13.0 and SkyPilot globally on all machines. It is assumed that the admin regularly keeps SkyPilot updated on the cluster.
- Python >= 3.6 for all users.
- When a regular user runs `sky launch`, a local version of SkyPilot will be installed on the machine for each user. The local installation of Ray is specified in `sky/templates/local-ray.yml.j2`.

Expand All @@ -36,7 +36,7 @@ ray.get(ray.remote(f).remote())
```

- Therefore, SkyPilot On-prem transparently includes user-switching so that SkyPilot tasks are still run as the calling, unprivileged user. This user-switching (`sudo -H su --login [USER]` in appropriate places) works as follows:
- In `sky/backends/cloud_vm_ray_backend.py::_setup_and_create_job_cmd_on_local_head`, switching between users is called during Ray job submission. The command `ray job submit --address=127.0.0.1:8265 --job-id {ray_job_id} -- sudo -H su --login [SSH_USER] -c \"[JOB_COMMAND]\"` switches job submission execution from admin back to the original user `SSH_USER`. The `JOB_COMMAND` argument runs a bash script with the user's run commands.
- In `sky/backends/cloud_vm_ray_backend.py::_setup_and_create_job_cmd_on_local_head`, switching between users is called during Ray job submission. The command `ray job submit --address=http://127.0.0.1:8265 --job-id {ray_job_id} -- sudo -H su --login [SSH_USER] -c \"[JOB_COMMAND]\"` switches job submission execution from admin back to the original user `SSH_USER`. The `JOB_COMMAND` argument runs a bash script with the user's run commands.
- In `sky/skylet/log_lib.py::run_bash_command_with_log`, there is also another `sudo -H su` command to switch users. The function `run_bash_command_with_log` is part of the `RayCodeGen` job execution script uploaded to remote for job submission (located in `~/.sky/sky_app/sky_app_[JOB_ID].py`). This program initially runs under the calling user, but it executes the function `run_bash_command_with_log` from the context of the admin, as the function is executed within the Ray cluster as a Ray remote function (see above for why all Ray remote functions are run under admin).
- SkyPilot ensures Ray-related environment variables (that are critical for execution) are preserved across switching users (check with `examples/env_check.yaml`).
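
The user-switching job submission described above could be assembled roughly as follows. This is a sketch under stated assumptions: the `ray job submit` flags, the `http://127.0.0.1:8265` address and the `sudo -H su --login` prefix come from this document, while the function name `build_local_job_submit_cmd` and the use of `shlex.quote` for escaping (the design doc shows escaped double quotes instead) are illustrative choices.

```python
import shlex


def build_local_job_submit_cmd(ray_job_id: str, ssh_user: str,
                               job_command: str) -> str:
    """Build a `ray job submit` command that runs the job as `ssh_user`.

    Hypothetical helper for illustration; quoting via shlex is an assumption.
    """
    # Switch from the admin (who owns the Ray cluster) back to the caller.
    switch_user_cmd = ' '.join([
        'sudo', '-H', 'su', '--login',
        shlex.quote(ssh_user), '-c',
        shlex.quote(job_command)
    ])
    # The address carries an explicit http:// scheme, as changed in this PR.
    return ('ray job submit '
            '--address=http://127.0.0.1:8265 '
            f'--job-id {ray_job_id} --no-wait '
            f'-- {switch_user_cmd}')


if __name__ == '__main__':
    print(build_local_job_submit_cmd('3-alice', 'alice', 'bash run.sh'))
```

Everything after the `--` separator is handed to Ray as the entrypoint, so the `sudo -H su --login` prefix is what makes the job run as the unprivileged user.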

84 changes: 36 additions & 48 deletions sky/global_user_state.py
Expand Up @@ -11,8 +11,6 @@
import os
import pathlib
import pickle
import sqlite3
import threading
import time
import typing
from typing import Any, Dict, List, Optional
Expand All @@ -31,52 +29,42 @@
pathlib.Path(_DB_PATH).parents[0].mkdir(parents=True, exist_ok=True)


class _SQLiteConn(threading.local):
"""Thread-local connection to the sqlite3 database."""

def __init__(self, db_path: str):
super().__init__()
self.db_path = db_path
self.conn = sqlite3.connect(db_path)
self.cursor = self.conn.cursor()
self._create_tables()

def _create_tables(self):
# Table for Clusters
self.cursor.execute("""\
CREATE TABLE IF NOT EXISTS clusters (
name TEXT PRIMARY KEY,
launched_at INTEGER,
handle BLOB,
last_use TEXT,
status TEXT,
autostop INTEGER DEFAULT -1)""")
# Table for configs (e.g. enabled clouds)
self.cursor.execute("""\
CREATE TABLE IF NOT EXISTS config (
key TEXT PRIMARY KEY, value TEXT)""")
# Table for Storage
self.cursor.execute("""\
CREATE TABLE IF NOT EXISTS storage (
name TEXT PRIMARY KEY,
launched_at INTEGER,
handle BLOB,
last_use TEXT,
status TEXT)""")
# For backward compatibility.
# TODO(zhwu): Remove this function after all users have migrated to
# the latest version of SkyPilot.
# Add autostop column to clusters table
db_utils.add_column_to_table(self.cursor, self.conn, 'clusters',
'autostop', 'INTEGER DEFAULT -1')

db_utils.add_column_to_table(self.cursor, self.conn, 'clusters',
'metadata', 'TEXT DEFAULT "{}"')

self.conn.commit()


_DB = _SQLiteConn(_DB_PATH)
def create_table(cursor, conn):
# Table for Clusters
cursor.execute("""\
CREATE TABLE IF NOT EXISTS clusters (
name TEXT PRIMARY KEY,
launched_at INTEGER,
handle BLOB,
last_use TEXT,
status TEXT,
autostop INTEGER DEFAULT -1)""")
# Table for configs (e.g. enabled clouds)
cursor.execute("""\
CREATE TABLE IF NOT EXISTS config (
key TEXT PRIMARY KEY, value TEXT)""")
# Table for Storage
cursor.execute("""\
CREATE TABLE IF NOT EXISTS storage (
name TEXT PRIMARY KEY,
launched_at INTEGER,
handle BLOB,
last_use TEXT,
status TEXT)""")
# For backward compatibility.
# TODO(zhwu): Remove this function after all users have migrated to
# the latest version of SkyPilot.
# Add autostop column to clusters table
db_utils.add_column_to_table(cursor, conn, 'clusters', 'autostop',
'INTEGER DEFAULT -1')

db_utils.add_column_to_table(cursor, conn, 'clusters', 'metadata',
'TEXT DEFAULT "{}"')

conn.commit()


_DB = db_utils.SQLiteConn(_DB_PATH, create_table)
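
The diff does not show `db_utils.SQLiteConn` itself. Judging from the removed `_SQLiteConn` class and the new call site, it plausibly looks like the sketch below: a thread-local connection that takes the table-creation callback as a parameter. Treat the class body as an assumption reconstructed from the removed code, not as the actual contents of `sky/utils/db_utils.py`.

```python
import sqlite3
import threading
from typing import Callable


class SQLiteConn(threading.local):
    """Thread-local connection to a sqlite3 database (assumed shape)."""

    def __init__(self, db_path: str,
                 create_table: Callable[[sqlite3.Cursor, sqlite3.Connection],
                                        None]):
        # threading.local re-runs __init__ with the same arguments the first
        # time each new thread touches the object, so every thread gets its
        # own connection and cursor.
        super().__init__()
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        # The caller supplies the schema, e.g. the module-level
        # `create_table(cursor, conn)` defined in global_user_state.py.
        create_table(self.cursor, self.conn)
```

Passing the schema in as a callback lets several modules share one connection class while each keeps its own table definitions.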


class ClusterStatus(enum.Enum):
1 change: 1 addition & 0 deletions sky/setup_files/setup.py
Expand Up @@ -87,6 +87,7 @@
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
],
description='SkyPilot',
long_description=__doc__.replace('\n', ' '),