Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lambda Cloud] Update Regex of Internal IP for H100 Support #1969

Merged

Conversation

zetavg
Copy link
Contributor

@zetavg zetavg commented May 19, 2023

According to RFC 1918, the IP range 172.16.0.0 - 172.31.255.255 can be used as internal IP address, which Lambda Cloud seems to be actually doing so for the new H100 GPU instances. (img)

Therefore, users may get a Failed to obtain private IP from node error while launching those new instances on Lambda Cloud. (#1948)

This PR modifies the regex used by ray get-head-ip to also match IPs 172.16.* - 172.32.*.

The new Regex is tested as: https://regex101.com/r/jHHWti/2 and with

curl https://gist.githubusercontent.com/zetavg/8712abde098f3b287dca0f3f9ea9e40f/raw/ee924d9a8ad4c98f0798233f846b43c35caf798c/ip.txt | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

on a Lambda Cloud H100 instance.

Breaking Changes / Possible Issues

With old instance types that do have a 10.x.x.x internal IP, the updated command tends to use the later-listed 172.x.x.x IP instead of the original 10.x.

I haven't used multiple clusters yet and I have not test if this will break anything.

I 05-20 00:53:55 cloud_vm_ray_backend.py:1480] Launching on Lambda us-east-1
I 05-20 00:54:06 log_utils.py:89] Head node is up.
I 05-20 00:54:34 cloud_vm_ray_backend.py:1293] Successfully provisioned or found existing VM.
W 05-20 00:54:40 backend_utils.py:1277] Detected more than 1 IP from the output of the `ray get-head-ip` command. This could happen if there is extra output from it, which should be inspected below.
W 05-20 00:54:40 backend_utils.py:1277] Proceeding with the last detected IP (172.17.0.1) as head IP.
W 05-20 00:54:40 backend_utils.py:1277] == Output ==
W 05-20 00:54:40 backend_utils.py:1277] 10.19.94.189
W 05-20 00:54:40 backend_utils.py:1277] 172.17.0.1
W 05-20 00:54:40 backend_utils.py:1277] == Output ends ==

Still trying to resolve tests/test_smoke.py:41 ModuleNotFoundError: No module named 'sky' while running pytest tests/test_smoke.py.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@concretevitamin
Copy link
Member

Thanks @zetavg! Will test it soon. cc @ewzeng: does https://cloud.lambdalabs.com/api/v1/docs#operation/getInstance this API return private IP or public IP? If so, would it be more robust?

@ewzeng
Copy link
Collaborator

ewzeng commented May 20, 2023

Thanks @zetavg! I am on the road right now, but I will test it out when I get back.

@concretevitamin It returns the public IP :/

@concretevitamin
Copy link
Member

concretevitamin commented May 20, 2023

Tested launching A100:1 and A10:1; like @zetavg said, I do see the following extra warning now (IP values are different):

...
W 05-20 00:54:40 backend_utils.py:1277] Detected more than 1 IP from the output of the `ray get-head-ip` command. This could happen if there is extra output from it, which should be inspected below.
W 05-20 00:54:40 backend_utils.py:1277] Proceeding with the last detected IP (172.17.0.1) as head IP.
W 05-20 00:54:40 backend_utils.py:1277] == Output ==
W 05-20 00:54:40 backend_utils.py:1277] 10.19.94.189
W 05-20 00:54:40 backend_utils.py:1277] 172.17.0.1
W 05-20 00:54:40 backend_utils.py:1277] == Output ends ==

This is probably because

$ ip -4 -br addr show
lo               UNKNOWN        127.0.0.1/8
enp5s0           UP             10.19.17.68/20
docker0          DOWN           172.17.0.1/16

Is it possible to modify the grepping to match an UP network interface only? In this case, I believe the actual private IP is 10.19.17.68.

@zetavg
Copy link
Contributor Author

zetavg commented May 20, 2023

Oh, I see. Thanks for the testing!

Is it possible to modify the grepping to match an UP network interface only?

I think we can add grep UP to filter out lines containing "UP" before trying to get the IP address.

ip -4 -br addr show | grep UP | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

Such as:

curl https://gist.githubusercontent.com/zetavg/91c0e2f816518ca88708f7e57870038f/raw/79d3acef54b27eddcf9019bd541e74183eb9352b/ip_addr_show.txt | grep UP | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I've tested that the warning went away for A10. Thanks @zetavg!

@concretevitamin concretevitamin merged commit 9dca02b into skypilot-org:master May 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants