
[Core] Fix the optimization for IP fetching in sky launch #2400

Merged · 39 commits merged into master from optimize-head-ip on Aug 17, 2023

Conversation

@Michaelvll (Collaborator) commented Aug 14, 2023

The previous optimization for IP fetching did not work: we did not retrieve the internal IP from the provision output, so sky launch would always fetch the IP addresses, incurring additional overhead.

This PR fixes the optimization. The profiling below shows ~9s faster launch for a new cluster and ~13s faster sky launch on an existing cluster. Note that the optimization currently covers single-node clusters only, but since many users run single-node clusters, it should be helpful.
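Conceptually, the fix makes the provision step hand its IPs to the cluster handle so later steps hit the cache. A minimal sketch of that idea (the class and signature here are simplified and hypothetical, not the actual SkyPilot API):

from typing import List, Optional

class ClusterHandle:
    """Simplified stand-in for the backend's cluster handle."""

    def __init__(self) -> None:
        self.cached_external_ips: Optional[List[str]] = None
        self.cached_internal_ips: Optional[List[str]] = None

    def update_cluster_ips(
            self,
            external_ips: Optional[List[str]] = None,
            internal_ips: Optional[List[str]] = None) -> None:
        # If the provision output already reported the IPs, cache them
        # directly; only query the cloud again when they are missing.
        if external_ips is not None:
            self.cached_external_ips = external_ips
        if internal_ips is not None:
            self.cached_internal_ips = internal_ips
        # (The real method falls back to querying the cloud when no
        # IPs are provided; omitted in this sketch.)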

Profiling (average over 5 runs, on GCP with 2 CPUs, time sky launch -y -d):

  • Original:
    sky launch a new cluster: 2m 41s
    sky launch an existing cluster: 1m 26s
  • Current PR:
    sky launch a new cluster: 2m 32s
    sky launch an existing cluster: 1m 13s

TODO:

  • Test TPU pod

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@cblmemo (Collaborator) left a comment

The code looks good to me! Left several nits:

sky/backends/cloud_vm_ray_backend.py
# Optimization: Try parse internal head ip from 'ray start' stdout.
# The line looks like: 'Local node IP: <internal head_ip>\n'
position = stdout.rfind('Local node IP')
line = stdout[position + 1:].partition('\n')[0]
@cblmemo (Collaborator):

Why +1 here?

@Michaelvll (Collaborator, Author):

Good point! Changed to stdout[position:] instead.
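For reference, a small self-contained sketch of the corrected parsing (the sample stdout string is illustrative; the real one comes from 'ray start'):

stdout = 'Some ray output...\nLocal node IP: 10.128.0.2\nMore output\n'
# Slice from the match itself rather than one character past it (the
# '+ 1' flagged above), then take the rest of that line.
position = stdout.rfind('Local node IP')
line = stdout[position:].partition('\n')[0]
internal_ip = line.split(':', maxsplit=1)[1].strip()
print(internal_ip)  # -> 10.128.0.2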

internal_ips = self._internal_ips
if internal_ips is not None:
    return internal_ips
self.update_cluster_ips(max_attempts=max_attempts)
internal_ips = self._internal_ips
assert internal_ips is not None, 'update_cluster_ips failed.'
return internal_ips
@cblmemo (Collaborator):

Shall we raise an error here?

@Michaelvll (Collaborator, Author):

It should be an internal error, as self.cached_internal_ips should not be None after the update. We should probably still use an assert.
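To make the distinction concrete, a hypothetical helper (check_ips is not in the PR; exceptions.FetchIPError is the exception type this PR uses, see the snippet further below): raise a typed, user-facing error when the cloud genuinely returns nothing, and assert on conditions that only a bug in our own code could violate.

from typing import List, Optional

from sky import exceptions

def check_ips(external_ips: Optional[List[str]],
              cached_ips: Optional[List[str]]) -> None:
    # User-facing failure: the cloud returned no head IP, so raise a
    # typed exception that callers are expected to catch and retry on.
    if external_ips is None or len(external_ips) == 0:
        raise exceptions.FetchIPError(
            reason=exceptions.FetchIPError.Reason.HEAD)
    # Internal invariant: after update_cluster_ips() the cache must be
    # populated; None here means a bug in our own code, hence assert.
    assert cached_ips is not None, 'update_cluster_ips failed.'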

self.update_cluster_ips(max_attempts=max_attempts)
external_ips = self._external_ips
assert external_ips is not None, 'update_cluster_ips failed.'
return external_ips
@cblmemo (Collaborator):

ditto

if external_ips is None or len(external_ips) == 0:
    raise exceptions.FetchIPError(
        reason=exceptions.FetchIPError.Reason.HEAD)
# TODO(zhwu): check the correctness of stopped TPU VM
@cblmemo (Collaborator):

Shall we add a TODO for the case in #2304?

@Michaelvll (Collaborator, Author):

I reverted it to the previous implementation to make it faster. : )

@Michaelvll (Collaborator, Author) commented:

I fixed some problems with the TPU pod. It now passes all smoke tests for GCP and AWS. PTAL @cblmemo.

@cblmemo (Collaborator) left a comment

The code looks great to me! Left one nit and two questions about parts I'm not sure I fully understand:

  • Why do we need to manually pass the port argument when initializing SSHCommandRunner?
  • I noticed that our IPs are ephemeral. Is there a possibility that prev_handle might contain a stale IP if we sky stop an UP cluster?

if isinstance(self.launched_resources.cloud, clouds.Kubernetes):
    head_port = backend_utils.get_head_ssh_port(
        self, use_cache=False, max_attempts=max_attempts)
# TODO(romilb): Multinode doesn't work with Kubernetes yet.
@cblmemo (Collaborator):

Shall we keep this TODO?

@Michaelvll (Collaborator, Author) commented Aug 17, 2023

Thanks for the review @cblmemo!

Why do we need to manually pass the port argument when initializing SSHCommandRunner?

The main reason is type checking. It seems that if we do not pass that argument manually, mypy complains because **ssh_credentials is Dict[str, str], while the remaining parameter in that function (i.e., port) has type int.
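A hedged illustration of that mypy behavior (the constructor signature below is a simplification, not the actual SSHCommandRunner):

from typing import Dict

class SSHCommandRunner:
    def __init__(self, ip: str, port: int = 22,
                 ssh_user: str = '', ssh_private_key: str = '') -> None:
        self.ip, self.port = ip, port

ssh_credentials: Dict[str, str] = {
    'ssh_user': 'ubuntu',
    'ssh_private_key': '~/.ssh/sky-key',
}

# With only the **unpack, mypy must match the str-valued dict against
# every remaining keyword parameter, including the int-typed 'port':
#     SSHCommandRunner('1.2.3.4', **ssh_credentials)  # mypy error
# Passing port explicitly leaves only str parameters for the unpack:
runner = SSHCommandRunner('1.2.3.4', port=22, **ssh_credentials)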

I noticed that our IPs are ephemeral. Is there a possibility that prev_handle might contain a stale IP if we sky stop an UP cluster?

It is OK to pass the stale IP, because we will update the IPs later in the code once the VM is provisioned.

handle.update_cluster_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS,

And in the update, we will check whether the IPs from ray up match the ones in the cache. If they do not match, we will use the new ones and update the internal IPs as well.

def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
    return (ips is not None and len(ips) == self.num_node_ips and
            all(ip is not None for ip in ips))

if is_provided_ips_valid(external_ips):
    logger.debug(f'Using provided external IPs: {external_ips}')
    cluster_external_ips = typing.cast(List[str], external_ips)
else:
    cluster_external_ips = backend_utils.get_node_ips(
        self.cluster_yaml,
        self.launched_nodes,
        handle=self,
        head_ip_max_attempts=max_attempts,
        worker_ip_max_attempts=max_attempts,
        get_internal_ips=False)

if self.cached_external_ips == cluster_external_ips:
    # Optimization: If the cached external IPs are the same as the
    # newly fetched external IPs, we can skip fetching the internal
    # IPs, since the cached IPs are up-to-date.
    logger.debug('Skipping the fetching of internal IPs as the cached '
                 'external IPs matches the newly fetched ones.')
    return

@cblmemo (Collaborator) left a comment
Looks great to me!

Michaelvll merged commit 06a927c into master on Aug 17, 2023 · 17 checks passed
Michaelvll deleted the optimize-head-ip branch on August 17, 2023 at 21:48