-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Fix the optimization for IP fetching in sky launch
#2400
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me! Left several nits:
sky/backends/cloud_vm_ray_backend.py
Outdated
# Optimization: Try parse internal head ip from 'ray start' stdout. | ||
# The line looks like: 'Local node IP: <internal head_ip>\n' | ||
position = stdout.rfind('Local node IP') | ||
line = stdout[position + 1:].partition('\n')[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why +1
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Changed to position:
instead.
internal_ips = self._internal_ips | ||
if internal_ips is not None: | ||
return internal_ips | ||
self.update_cluster_ips(max_attempts=max_attempts) | ||
return self._internal_ips | ||
internal_ips = self._internal_ips | ||
assert internal_ips is not None, 'update_cluster_ips failed.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we raise an error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be the internal error, as the self.cached_internal_ips
should not be None after the update. We should probably still use assert
self.update_cluster_ips(max_attempts=max_attempts) | ||
return self._external_ips | ||
external_ips = self._external_ips | ||
assert external_ips is not None, 'update_cluster_ips failed.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
sky/backends/backend_utils.py
Outdated
if external_ips is None or len(external_ips) == 0: | ||
raise exceptions.FetchIPError( | ||
reason=exceptions.FetchIPError.Reason.HEAD) | ||
# TODO(zhwu): check the correctness of stopped TPU VM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a TODO for the case in #2304 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted it to the previous implementation to make it faster. : )
Co-authored-by: Tian Xia <cblmemo@gmail.com>
…nto optimize-head-ip
I fixed some problems with the TPU pod. It now passes all smoke tests for GCP and AWS. PTAL @cblmemo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks great to me! Left one nit and two questions that I'm not entirely sure I understand:
- Why do we need manually pass
port
argument when initializingSSHCommandRunner
? - I noticed that our IP is ephemeral. Is there a possibility that
prev_handle
might contain stale IP if wesky stop
someUP
cluster?
if isinstance(self.launched_resources.cloud, clouds.Kubernetes): | ||
head_port = backend_utils.get_head_ssh_port( | ||
self, use_cache=False, max_attempts=max_attempts) | ||
# TODO(romilb): Multinode doesn't work with Kubernetes yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we keep this TODO?
Thanks for the review @cblmemo!
The main reason is for the type checking. It seems if we do not manually pass that argument,
It is ok to pass the staled IP, because we will update the IPs later in the code once the VM is provisioned. skypilot/sky/backends/cloud_vm_ray_backend.py Line 1596 in 95ac231
And in the update, we will check if the IP from the skypilot/sky/backends/cloud_vm_ray_backend.py Lines 2349 to 2371 in 95ac231
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me!
The previous optimization for IP fetching does not work as we did not retrieve the internal IP from the provision output, causing the
sky launch
will always fetch the IP addresses causing additional overheads.This PR fixes the optimization. The following is the profiling, which shows
~9s
faster for launching a new cluster and~13s
faster forsky launch
an existing cluster. Note the optimization is for single node only at the moment, but since many users are using single node, it should be helpful.Profiling (average over 5 runs, on GCP with 2 CPUs,
time sky launch -y -d
):sky launch
a new cluster: 2m 41ssky launch
an existing cluster: 1m 26ssky launch
a new cluster: 2m 32ssky launch
an existing cluster: 1m 13sTODO:
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh