Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade ray version to 1.13 #969

Merged
merged 41 commits into from
Aug 4, 2022
Merged

Upgrade ray version to 1.13 #969

merged 41 commits into from
Aug 4, 2022

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jul 13, 2022

Closes #957, #953

TODO:

  • Make this PR backward compatible.

Tested:

  • tests/run_smoke_tests.sh (with the changes from Check long cluster names only for GCP; robustify smoke tests. #966 )
  • Test staled job status sky launch -c test-stale 'echo start; sleep 200000'; sky queue test-stale shows the following
    sky queue test-stale         ✭ ✈ ✱
    Fetching and parsing job queue...
    
    Sky Job Queue of Cluster test-stale
    ID  NAME     USER  SUBMITTED    STARTED      DURATION  RESOURCES     STATUS     LOG                                        
    3   sky-cmd  zhwu  48 secs ago  45 secs ago  45s       1x [CPU:0.5]  RUNNING    ~/sky_logs/sky-2022-07-13-02-19-27-623838  
    2   sky-cmd  zhwu  3 mins ago   3 mins ago   1m 53s    1x [CPU:0.5]  CANCELLED  ~/sky_logs/sky-2022-07-13-02-16-31-016826  
    1   sky-cmd  zhwu  58 mins ago  58 mins ago  56m 48s   1x [CPU:0.5]  CANCELLED  ~/sky_logs/sky-2022-07-13-01-16-51-440867  
    
    sky stop test-stale; sky start test-stale; sky queue test-stale shows the following:
    sky queue test-stale           ✭ ✱
    Fetching and parsing job queue...
    
    Sky Job Queue of Cluster test-stale
    ID  NAME     USER  SUBMITTED   STARTED     DURATION  RESOURCES     STATUS     LOG                                        
    3   sky-cmd  zhwu  3 mins ago  3 mins ago  3m 46s    1x [CPU:0.5]  FAILED     ~/sky_logs/sky-2022-07-13-02-19-27-623838  
    2   sky-cmd  zhwu  6 mins ago  6 mins ago  1m 53s    1x [CPU:0.5]  CANCELLED  ~/sky_logs/sky-2022-07-13-02-16-31-016826  
    1   sky-cmd  zhwu  1 hr ago    1 hr ago    56m 48s   1x [CPU:0.5]  CANCELLED  ~/sky_logs/sky-2022-07-13-01-16-51-440867  
    
  • sky admin deploy; sky launch -c aws-onprem '': deploy onprem on instances with ray==1.13 installed.

Backward Compatibility Test (Adapted from #1005):

#!/bin/bash
set -ex

CLUSTER_NAME="test-back-compat"
. $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh

git clone git@github.com:sky-proj/sky.git sky-master || true


# Create environment for compatibility tests
conda create -n sky-back-compat-master -y --clone sky-dev
conda create -n sky-back-compat-current -y --clone sky-dev

conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
pip uninstall skypilot -y

cd sky-master
git pull origin master
pip install -e ".[all]"
cd -

conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
pip uninstall -y skypilot
pip install -e ".[all]"


# OK: can sky exec --cloud gcp and launch
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
which sky
sky launch --cloud gcp -c ${CLUSTER_NAME} examples/minimal.yaml

conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky exec --cloud gcp ${CLUSTER_NAME} examples/minimal.yaml
sky launch --cloud gcp -c ${CLUSTER_NAME} examples/minimal.yaml
sky queue

# OK: can sky stop + sky start + sky exec --cloud gcp
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-2 examples/minimal.yaml
conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky stop -y ${CLUSTER_NAME}-2
sky start -y ${CLUSTER_NAME}-2
sky exec --cloud gcp ${CLUSTER_NAME}-2 examples/minimal.yaml

# OK: can `sky autostop` + `sky status -r`
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-3 examples/minimal.yaml
conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky autostop -y -i0 ${CLUSTER_NAME}-3
sleep 100
sky status -r | grep ${CLUSTER_NAME}-3 | grep STOPPED


# OK: (1 node) can sky launch --cloud gcp + sky exec --cloud gcp + sky queue + sky logs
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-4 examples/minimal.yaml
sky stop -y ${CLUSTER_NAME}-4
conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-4 examples/minimal.yaml
sky queue ${CLUSTER_NAME}-4
sky logs ${CLUSTER_NAME}-4 1 --status
sky logs ${CLUSTER_NAME}-4 2 --status
sky logs ${CLUSTER_NAME}-4 1
sky logs ${CLUSTER_NAME}-4 2

# OK: (1 node) can sky start + sky exec --cloud gcp + sky queue + sky logs
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-5 examples/minimal.yaml
sky stop -y ${CLUSTER_NAME}-5
conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky start -y ${CLUSTER_NAME}-5
sky queue ${CLUSTER_NAME}-5
sky logs ${CLUSTER_NAME}-5 1 --status
sky logs ${CLUSTER_NAME}-5 1
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-5 examples/minimal.yaml
sky queue ${CLUSTER_NAME}-5
sky logs ${CLUSTER_NAME}-5 2 --status
sky logs ${CLUSTER_NAME}-5 2

# OK: (2 nodes) can sky launch --cloud gcp + sky exec --cloud gcp + sky queue + sky logs
conda activate sky-back-compat-master
rm -r  ~/.sky/sky_wheels || true
sky launch --cloud gcp -y -c ${CLUSTER_NAME}-6 examples/multi_hostname.yaml
sky stop -y ${CLUSTER_NAME}-6
conda activate sky-back-compat-current
rm -r  ~/.sky/sky_wheels || true
sky start -y ${CLUSTER_NAME}-6
sky queue ${CLUSTER_NAME}-6
sky logs ${CLUSTER_NAME}-6 1 --status
sky logs ${CLUSTER_NAME}-6 1
sky exec --cloud gcp ${CLUSTER_NAME}-6 examples/multi_hostname.yaml
sky queue ${CLUSTER_NAME}-6
sky logs ${CLUSTER_NAME}-6 2 --status
sky logs ${CLUSTER_NAME}-6 2

@michaelzhiluo
Copy link
Collaborator

Sky On-prem #763 relies a lot on ray==1.10.0. Maybe merge this branch after the PR is merged?

@Michaelvll
Copy link
Collaborator Author

Yes, I will wait for the on-prem PR, but please let me know the exact dependencies used in the on-prem PR, when you get time.

@concretevitamin concretevitamin self-requested a review July 15, 2022 16:14
@Michaelvll Michaelvll added the blocked PR blocked by other issues label Jul 15, 2022
@Michaelvll Michaelvll mentioned this pull request Jul 21, 2022
1 task
@Michaelvll Michaelvll removed the blocked PR blocked by other issues label Jul 21, 2022
@Michaelvll Michaelvll marked this pull request as draft July 21, 2022 18:46
@Michaelvll Michaelvll marked this pull request as ready for review July 27, 2022 17:09
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Jul 29, 2022

Tested again with smoke_test, only failed with test_inline_spot_env, due to the problem fixed in #1021

Remaining TODO: test on-prem cluster.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The back-compat tests are fantastic @Michaelvll! We can consider adding it as a file in tests/ later.

Should we reflect that Python 3.10 is supported now?

sky/skylet/ray_patches/worker.py.patch Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/constants.py Show resolved Hide resolved
sky/skylet/ray_patches/cli.py.patch Outdated Show resolved Hide resolved
sky/skylet/providers/aws/config.py Outdated Show resolved Hide resolved
sky/templates/azure-ray.yml.j2 Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! Just a few remaining questions, plus wdyt of the Python versioning question here: #969 (review)?

sky/skylet/job_lib.py Outdated Show resolved Hide resolved
tests/backward_comaptibility_tests.sh Outdated Show resolved Hide resolved
sky/templates/azure-ray.yml.j2 Show resolved Hide resolved
sky/skylet/ray_patches/worker.py.patch Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/skylet/providers/aws/config.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! LGTM.

Some thought for the future: I spent some time diffing this PR's skylet/providers/ with Ray 1.13's counterparts, and checking if the diffs include our patches. This verification is probably not exact due to eyeballing. In the future, we should consider transitioning skylet/providers/ into patches as well for easier maintenance.

sky/skylet/providers/azure/node_provider.py Outdated Show resolved Hide resolved
https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/_private/gcp
Git Revision: ef9d9df4e7454d428a958281e9de333795dccb44
https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/gcp
Git Revision: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these hashes correct?

https://github.com/ray-project/ray/blob/4ce38d001dbbe09cd21c497fedd03d692b2be3e/python/ray/autoscaler/_private/gcp/config.py

returns 404. I checked that gcp/config.py latest commit in the 1.13 branch is 6d83a3f2832fc24f63af1c1270afdd4850d701db.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I put the latest git commit for the tag ray-1.13.0. Let me fix it.

@michaelzhiluo
Copy link
Collaborator

Just fixed a submission features indirectly affected by 1.13 for Onprem, thanks @Michaelvll for most of the relevant Ray upgrades for Sky On-prem.

@michaelzhiluo
Copy link
Collaborator

michaelzhiluo commented Aug 3, 2022

Just logging stuff here; Further discussion with @Michaelvll , it turns out to be a much deeper bug and Zhanghao found a nice fix. The bug was that there is a status transition error from job_lib.query_job_status (in tail_logs) to job_lib.get_status (in _folow_job_logs). This issue will be fixed in a separate PR.

@Michaelvll Michaelvll mentioned this pull request Aug 3, 2022
1 task
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Aug 4, 2022

Tested:

  • ./tests/run_smoke_tests.sh
  • pytest ./tests/test_onprem.py
  • ./tests/run_smoke_tests.sh::test_minimal with python=3.10

@Michaelvll Michaelvll merged commit 8b15747 into master Aug 4, 2022
@Michaelvll Michaelvll deleted the upgrade-ray-1.13 branch August 4, 2022 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade ray version to support python 3.10
3 participants