[Backward Compatibility][Spot] Avoid cluster leakage by ray yaml overwritten and reduce spot controller cost on AWS #1235
Conversation
Nice @Michaelvll! Questions:
- Does the mechanism here also fix #640 (Back-compat: don't overwrite certain cluster configs in yaml if cluster exists)? E.g., put "everything below and including file_mounts can and should be overwritten" into the key set.
- Would https://github.com/skypilot-org/skypilot/blob/master/tests/backward_comaptibility_tests.sh catch the back-compat issue? Thinking about whether next time we can just enforce running that test to catch such issues more easily.
This is a great point! I added the keys used for calculating the launch hash.
I just updated the script so that it can capture the problem of unexpectedly launching another instance. The script needs some refactoring; we can probably do that in another PR?
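For illustration, here is a minimal sketch (not the actual test script) of the kind of check that could catch this regression: after the cluster was launched with the old version, restart it with the new version and fail if extra instances appear. The instance tag filter and the `-y` flag usage below are assumptions for the sketch.

```python
import subprocess

import boto3


def count_cluster_instances(cluster_name: str, region: str = 'us-east-1') -> int:
    """Count non-terminated EC2 instances whose Name tag starts with the cluster name.

    Assumption: instances of the cluster carry a Name tag prefixed with the
    cluster name; this is illustrative, not the exact tagging scheme.
    """
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:Name', 'Values': [f'{cluster_name}*']},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running', 'stopped']},
    ])
    return sum(len(r['Instances']) for r in resp['Reservations'])


def check_restart_does_not_leak(cluster_name: str, expected_nodes: int) -> None:
    # The cluster was launched and stopped with the old SkyPilot version;
    # now restart it with the new version and make sure no duplicate
    # cluster was provisioned.
    subprocess.run(['sky', 'start', '-y', cluster_name], check=True)
    found = count_cluster_instances(cluster_name)
    assert found == expected_nodes, (
        f'Expected {expected_nodes} instances for {cluster_name}, found {found}: '
        'restart may have provisioned a duplicate cluster.')
```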
Awesome! We can update the PR title/description to reflect that:
- this fixes #640 (Back-compat: don't overwrite certain cluster configs in yaml if cluster exists), ensuring that restarting stopped clusters does not create duplicate instances
- this reduces spot controller cost by 20%
# - keeping the auth is not enough because the content of the key file will be used
#   for calculating the hash.
# TODO(zhwu): Keep in sync with the fields used in https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/python/ray/autoscaler/_private/commands.py#L687-L701
_RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY = {
Is it true that none of the following contribute to the launch hash?
- `resources` under `ray.head.default`
- any of:
  cluster_name: {{cluster_name}}
  # The maximum number of workers nodes to launch in addition to the head node.
  max_workers: {{num_nodes - 1}}
  upscaling_speed: {{num_nodes - 1}}
  idle_timeout_minutes: 60
- `head_node_type: ray.head.default`

Just figuring out whether we should preserve these fields.
I just added `cluster_name` to the key set. For `resources` and `head_node_type`, I think it would be better to add them when we actually want to modify those fields and find that they affect backward compatibility in the future.

As far as I understand, `max_workers`, `upscaling_speed`, and `idle_timeout_minutes` won't affect the launch hash.
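To make the mechanism concrete, here is a minimal sketch (not the exact SkyPilot implementation) of how selected fields from an existing cluster yaml could be copied back into the newly generated config so that Ray's launch hash stays unchanged for clusters created by an older version. The key paths and helper name below are illustrative assumptions; the real set lives in `_RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY`.

```python
import copy

import yaml

# Illustrative key paths only; not the actual contents of
# _RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY.
_KEYS_TO_RESTORE = [
    ('cluster_name',),
    ('auth',),
    ('available_node_types', 'ray.head.default', 'node_config'),
]


def _restore_launch_fields(new_config: dict, old_config: dict) -> dict:
    """Copy launch-hash-relevant fields from the old yaml into the new one.

    This keeps Ray's launch hash stable, so restarting a stopped cluster
    reuses the existing instances instead of provisioning new ones.
    """
    restored = copy.deepcopy(new_config)
    for key_path in _KEYS_TO_RESTORE:
        src, dst = old_config, restored
        # Walk down to the parent of the leaf key in both configs.
        for key in key_path[:-1]:
            if key not in src:
                break
            src = src[key]
            dst = dst.setdefault(key, {})
        else:
            leaf = key_path[-1]
            if leaf in src:
                dst[leaf] = copy.deepcopy(src[leaf])
    return restored


# Usage sketch: only restore when a config for this cluster already exists.
# with open(old_yaml_path) as f:
#     old_config = yaml.safe_load(f)
# new_config = _restore_launch_fields(new_config, old_config)
```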
sky/backends/backend_utils.py
Outdated
@@ -733,8 +750,11 @@ def write_cluster_config(to_provision: 'resources.Resources',
    yaml_path = _get_yaml_path_from_cluster_name(cluster_name)
    old_yaml_content = None
    if os.path.exists(yaml_path):
        with open(yaml_path, 'r') as f:
            old_yaml_content = f.read()
        if force_overwrite:
When would it be the case that the cluster doesn't exist but this file exists? If this is an exceptional case to guard against, rename it to `keep_launch_fields_in_existing_config: bool`?
The generated config yaml files may not be correctly deleted if an error happens during `sky down`. I did find multiple yaml files in the folder `~/.sky/generated` that do not belong to any existing cluster.

Good point! Let me rename the variable.
LGTM - this is awesome to have @Michaelvll.
Tested:
…written and reduce spot controller cost on AWS (skypilot-org#1235)
* set the default iops to be same as console for AWS
* fix
* add backward compatibility
* Address comments
* fix backward_compatibility_test
* Add backward test for discarding old cluster
* update backward
* less output
* address comments
This resubmits #1221 and adds backward compatibility for it.
Update (Oct 13th):
Our current default instances created on AWS have volumes with the maximum IOPS, i.e. 16000, whereas an instance created from the AWS console only gets 3000 IOPS by default.
The extra volume costs (16000 - 3000) * $0.005 / 30 / 24 ≈ $0.09 / hour, which is about 20% of the cost of our default CPU machine on AWS (m6i.2xlarge).
Todo item: make the IOPS configurable across clouds (probably simplified to three tiers: low, medium, high).
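As a rough illustration of that todo item, a disk tier could map onto the EBS settings of the Ray AWS node config along the lines below. The tier names, IOPS values, and function name are placeholder assumptions, not the final design.

```python
# Hypothetical mapping from a disk tier to gp3 EBS settings; the exact
# tier names and values are assumptions, not SkyPilot's final design.
_DISK_TIER_TO_IOPS = {
    'low': 3000,     # gp3 baseline, matches the AWS console default
    'medium': 6000,
    'high': 16000,   # gp3 maximum, the previous SkyPilot default
}


def make_block_device_mapping(disk_tier: str, disk_size_gb: int) -> dict:
    """Build an EBS block device mapping for the Ray AWS node_config."""
    return {
        'DeviceName': '/dev/sda1',
        'Ebs': {
            'VolumeSize': disk_size_gb,
            'VolumeType': 'gp3',
            'Iops': _DISK_TIER_TO_IOPS[disk_tier],
        },
    }
```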
Tested:
- `sky cpunode -c test-iops`
- `sky launch --disk-size 1024`
- `sky launch --disk-size 50`
- `sky gpunode -c test-t4 --gpus t4`
- `sky launch -c bk-iops --num-nodes 2` (before this PR); `sky stop bk-iops`; `sky start bk-iops`: it starts the same 2 instances.