[Backward Compatibility][Spot] Avoid cluster leakage by ray yaml overwritten and reduce spot controller cost on AWS #1235
Conversation
Nice @Michaelvll! Questions:
- Does the mechanism here also fix #640 (Back-compat: don't overwrite certain cluster configs in yaml if cluster exists)? E.g., put "everything below and including file_mounts can and should be overwritten" into the key set.
- Would https://github.com/skypilot-org/skypilot/blob/master/tests/backward_comaptibility_tests.sh catch the back-compat issue? Thinking about whether next time we can just enforce running that test to catch such issues more easily.
This is a great point! I added the keys used for calculating the launch hash.
I just updated the script so that it can capture the problem of unexpectedly launching another instance. The script needs some refactoring; we can probably do that in another PR?
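For illustration, here is a minimal sketch (not the actual test script) of the kind of check that could catch this regression: after the cluster was launched with the old version, restart it with the new version and fail if extra instances appear. The instance tag filter and the `-y` flag usage below are assumptions for the sketch.

```python
import subprocess

import boto3


def count_cluster_instances(cluster_name: str, region: str = 'us-east-1') -> int:
    """Count non-terminated EC2 instances whose Name tag starts with the cluster name.

    Assumption: instances of the cluster carry a Name tag prefixed with the
    cluster name; this is illustrative, not the exact tagging scheme.
    """
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:Name', 'Values': [f'{cluster_name}*']},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running', 'stopped']},
    ])
    return sum(len(r['Instances']) for r in resp['Reservations'])


def check_restart_does_not_leak(cluster_name: str, expected_nodes: int) -> None:
    # The cluster was launched and stopped with the old SkyPilot version;
    # now restart it with the new version and make sure no duplicate
    # cluster was provisioned.
    subprocess.run(['sky', 'start', '-y', cluster_name], check=True)
    found = count_cluster_instances(cluster_name)
    assert found == expected_nodes, (
        f'Expected {expected_nodes} instances for {cluster_name}, found {found}: '
        'restart may have provisioned a duplicate cluster.')
```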
Awesome! We can update the PR title/description to reflect that:
- this fixes #640 (Back-compat: don't overwrite certain cluster configs in yaml if cluster exists), ensuring that restarting stopped clusters does not create duplicate instances
- this reduces spot controller cost by 20%
# - keeping the auth is not enough because the content of the key file will be used
#   for calculating the hash.
# TODO(zhwu): Keep in sync with the fields used in https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/python/ray/autoscaler/_private/commands.py#L687-L701
_RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY = {
Is it true that none of the following contribute to the launch hash?
- `resources` under `ray.head.default`
- any of:
  cluster_name: {{cluster_name}}
  # The maximum number of workers nodes to launch in addition to the head node.
  max_workers: {{num_nodes - 1}}
  upscaling_speed: {{num_nodes - 1}}
  idle_timeout_minutes: 60
- `head_node_type: ray.head.default`

Just figuring out whether we should preserve these fields.
I just added `cluster_name` to the key set. For `resources` and `head_node_type`, I think it would be better to add them when we actually want to modify those fields and find that they affect backward compatibility in the future.

As far as I understand, `max_workers`, `upscaling_speed`, and `idle_timeout_minutes` won't affect the launch hash.
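To make the mechanism concrete, here is a minimal sketch (not the exact SkyPilot implementation) of how selected fields from an existing cluster yaml could be copied back into the newly generated config so that Ray's launch hash stays unchanged for clusters created by an older version. The key paths and helper name below are illustrative assumptions; the real set lives in `_RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY`.

```python
import copy

import yaml

# Illustrative key paths only; not the actual contents of
# _RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY.
_KEYS_TO_RESTORE = [
    ('cluster_name',),
    ('auth',),
    ('available_node_types', 'ray.head.default', 'node_config'),
]


def _restore_launch_fields(new_config: dict, old_config: dict) -> dict:
    """Copy launch-hash-relevant fields from the old yaml into the new one.

    This keeps Ray's launch hash stable, so restarting a stopped cluster
    reuses the existing instances instead of provisioning new ones.
    """
    restored = copy.deepcopy(new_config)
    for key_path in _KEYS_TO_RESTORE:
        src, dst = old_config, restored
        # Walk down to the parent of the leaf key in both configs.
        for key in key_path[:-1]:
            if key not in src:
                break
            src = src[key]
            dst = dst.setdefault(key, {})
        else:
            leaf = key_path[-1]
            if leaf in src:
                dst[leaf] = copy.deepcopy(src[leaf])
    return restored


# Usage sketch: only restore when a config for this cluster already exists.
# with open(old_yaml_path) as f:
#     old_config = yaml.safe_load(f)
# new_config = _restore_launch_fields(new_config, old_config)
```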
sky/backends/backend_utils.py
Outdated
@@ -733,8 +750,11 @@ def write_cluster_config(to_provision: 'resources.Resources',
    yaml_path = _get_yaml_path_from_cluster_name(cluster_name)
    old_yaml_content = None
    if os.path.exists(yaml_path):
        with open(yaml_path, 'r') as f:
            old_yaml_content = f.read()
        if force_overwrite:
When would it be the case that the cluster doesn't exist but this file exists? If this is an exceptional case to guard against, rename it to `keep_launch_fields_in_existing_config: bool`?
The generated config yaml files may not be correctly deleted if an error happens during `sky down`. I did find multiple yaml files in the folder `~/.sky/generated` that do not belong to any existing cluster.

Good point! Let me rename the variable.
LGTM - this is awesome to have @Michaelvll.
Tested:
…written and reduce spot controller cost on AWS (skypilot-org#1235)
* set the default iops to be same as console for AWS
* fix
* add backward compatibility
* Address comments
* fix backward_compatibility_test
* Add backward test for discarding old cluster
* update backward
* less output
* address comments
This resubmits #1221 and adds backward compatibility for it.
Update (Oct 13th):
Our current default instances created on AWS have volumes with the maximum IOPS, i.e. 16000, whereas an instance created from the AWS console only gets 3000 IOPS by default.
The extra volume costs (16000 - 3000) * $0.005 / 30 / 24 ≈ $0.09 / hour, which is about 20% of the cost of our default CPU machine on AWS (m6i.2xlarge).
Todo item: make the IOPS configurable across clouds (probably simplified to three tiers: low, medium, high).
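As a rough illustration of that todo item, a disk tier could map onto the EBS settings of the Ray AWS node config along the lines below. The tier names, IOPS values, and function name are placeholder assumptions, not the final design.

```python
# Hypothetical mapping from a disk tier to gp3 EBS settings; the exact
# tier names and values are assumptions, not SkyPilot's final design.
_DISK_TIER_TO_IOPS = {
    'low': 3000,     # gp3 baseline, matches the AWS console default
    'medium': 6000,
    'high': 16000,   # gp3 maximum, the previous SkyPilot default
}


def make_block_device_mapping(disk_tier: str, disk_size_gb: int) -> dict:
    """Build an EBS block device mapping for the Ray AWS node_config."""
    return {
        'DeviceName': '/dev/sda1',
        'Ebs': {
            'VolumeSize': disk_size_gb,
            'VolumeType': 'gp3',
            'Iops': _DISK_TIER_TO_IOPS[disk_tier],
        },
    }
```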
Tested:
- `sky cpunode -c test-iops`
- `sky launch --disk-size 1024`
- `sky launch --disk-size 50`
- `sky gpunode -c test-t4 --gpus t4`
- `sky launch -c bk-iops --num-nodes 2` (before this PR); `sky stop bk-iops`; `sky start bk-iops`: it starts the same 2 instances.