Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix the optimization for IP fetching in sky launch #2400

Merged
merged 39 commits into from
Aug 17, 2023
Merged
Changes from 1 commit
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
b843b33
wip
Michaelvll Aug 11, 2023
e06c3b5
wip
Michaelvll Aug 12, 2023
575b327
working
Michaelvll Aug 12, 2023
7d7963a
explicitly update cluster ips
Michaelvll Aug 14, 2023
4edc463
optimize
Michaelvll Aug 14, 2023
7411f53
Fix comment
Michaelvll Aug 14, 2023
d65bc3c
Add comments
Michaelvll Aug 14, 2023
30d5e21
format
Michaelvll Aug 14, 2023
fb422cf
linter
Michaelvll Aug 14, 2023
5d1e499
Fix head ip fetching in run_on_head
Michaelvll Aug 14, 2023
8e68d44
fix head ip fetching
Michaelvll Aug 14, 2023
6af9a6f
fix status refresh
Michaelvll Aug 14, 2023
35e5823
format
Michaelvll Aug 14, 2023
d450f78
Use cached external ips instead
Michaelvll Aug 14, 2023
2b64383
format
Michaelvll Aug 15, 2023
b35aec6
Update sky/backends/cloud_vm_ray_backend.py
Michaelvll Aug 15, 2023
c97f37d
Fix ports
Michaelvll Aug 15, 2023
0efbf65
Merge branch 'optimize-head-ip' of github.com:skypilot-org/skypilot i…
Michaelvll Aug 15, 2023
0e7ca13
use retry
Michaelvll Aug 15, 2023
1c19a82
format
Michaelvll Aug 15, 2023
5f2890e
typo
Michaelvll Aug 15, 2023
037c332
minor fix
Michaelvll Aug 15, 2023
d69507d
Merge branch 'master' of github.com:skypilot-org/skypilot into optimi…
Michaelvll Aug 15, 2023
b646bf5
Merge branch 'master' of github.com:skypilot-org/skypilot into optimi…
Michaelvll Aug 15, 2023
d9deb2b
change the logging to logger.debug
Michaelvll Aug 15, 2023
9ac5b9d
Use the previously ips to optimize the ip fetching for existing cluster
Michaelvll Aug 15, 2023
50e7e7f
unbounded variable
Michaelvll Aug 15, 2023
4cacb4b
format
Michaelvll Aug 15, 2023
74ea57a
print output refresh failed
Michaelvll Aug 16, 2023
e6bcdd7
Add more logging
Michaelvll Aug 16, 2023
ad9d77e
more logging
Michaelvll Aug 16, 2023
845bbbc
more output
Michaelvll Aug 16, 2023
a590c39
Ensure ray cluster started
Michaelvll Aug 16, 2023
0d2cd6c
Fix tpu catched IPs
Michaelvll Aug 16, 2023
10f9f8e
Fix ports for tpu pod
Michaelvll Aug 16, 2023
42d4fba
format
Michaelvll Aug 16, 2023
5f39b02
failover error for TPU pod
Michaelvll Aug 16, 2023
95ac231
Add stderr for the failed ip refresh
Michaelvll Aug 16, 2023
a67e60e
Add the comment back
Michaelvll Aug 17, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 16 additions & 11 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -2374,29 +2374,31 @@ def _internal_ips(self) -> Optional[List[str]]:
return [ips[0] for ips in self.stable_internal_external_ips]
return None

def internal_ips(
self,
max_attempts: int = _FETCH_IP_MAX_ATTEMPTS) -> Optional[List[str]]:
def internal_ips(self,
max_attempts: int = _FETCH_IP_MAX_ATTEMPTS) -> List[str]:
internal_ips = self._internal_ips
if internal_ips is not None:
return internal_ips
self.update_cluster_ips(max_attempts=max_attempts)
return self._internal_ips
internal_ips = self._internal_ips
assert internal_ips is not None, 'update_cluster_ips failed.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we raise an error here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the internal error, as the self.cached_internal_ips should not be None after the update. We should probably still use assert

return internal_ips

@property
def _external_ips(self) -> Optional[List[str]]:
if self.stable_internal_external_ips is not None:
return [ips[1] for ips in self.stable_internal_external_ips]
return None

def external_ips(
self,
max_attempts: int = _FETCH_IP_MAX_ATTEMPTS) -> Optional[List[str]]:
def external_ips(self,
max_attempts: int = _FETCH_IP_MAX_ATTEMPTS) -> List[str]:
external_ips = self._external_ips
if external_ips is not None:
return self._external_ips
return external_ips
self.update_cluster_ips(max_attempts=max_attempts)
return self._external_ips
external_ips = self._external_ips
assert external_ips is not None, 'update_cluster_ips failed.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

return external_ips

def external_ssh_ports(
self,
Expand Down Expand Up @@ -2426,7 +2428,7 @@ def cluster_yaml(self):

@property
def head_ip(self):
external_ips = self.external_ips()
external_ips = self._external_ips
if external_ips is not None:
return external_ips[0]
return None
Expand Down Expand Up @@ -3943,6 +3945,8 @@ def run_on_head(
) -> Union[int, Tuple[int, str, str]]:
"""Runs 'cmd' on the cluster's head node.

It will try to fetch the head node IP if it is not cached.

Args:
handle: The ResourceHandle to the cluster.
cmd: The command to run.
Expand All @@ -3967,7 +3971,8 @@ def run_on_head(
exceptions.FetchIPError: If the head node IP cannot be fetched.
"""
# This will try to fetch the head node IP if it is not cached.
head_ip = handle.head_ip
external_ips = handle.external_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS)
head_ip = external_ips[0]
head_ssh_port = backend_utils.get_head_ssh_port(handle,
_FETCH_IP_MAX_ATTEMPTS)
ssh_credentials = backend_utils.ssh_credential_from_yaml(
Expand Down
Loading