Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Multi-node support for Kubernetes #2609

Merged
merged 10 commits into from
Oct 4, 2023
Merged

[k8s] Multi-node support for Kubernetes #2609

merged 10 commits into from
Oct 4, 2023

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Sep 25, 2023

Adds Kubernetes multi-node support. It was fairly straightforward, since most of the implementation was already done in #2096. Main thing to note is that spreading pods across nodes is set as a soft-constraint (i.e., try to place multi-node pods on different physical nodes. If not possible, ok to put them on the same node.)

Also closes #2628. This is a tricky one, since the problem is still not clear why it happens. See the issue for more details. After spending time investigating it, I have added a randomized sleep after kubectl port-forward is run.

While this is not the solution I wanted, it appears to fix the problem reliably. Ideally I would have wanted to get a signal from kubectl port-forward to know when it is ready, but it emits no signals to indicate it is actually ready (we already wait for it to print the expected output and use nc to verify the port is open).

What's working for multi-node:

  • Basic multi-node launching, tasks and ssh (sky launch --num-nodes 2 -- echo hi)
  • IP address fetching/populating SKYPILOT_NODE_IPS correctly
  • Multi-node GPU jobs (pytorch distributed, tf examples)

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Tested SSH concurrency with the script from [k8s] SSH ProxyCommand script is not concurrency-safe #2628 and running run_command_in_parallel(['ssh','myclus','echo hi'], 20, 20) 10 times.
  • Multi-node smoke tests are uncommented now - ran pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"

@romilbhardwaj romilbhardwaj marked this pull request as ready for review September 29, 2023 23:55
@romilbhardwaj
Copy link
Collaborator Author

This is now ready for a look! Also includes a fix for #2628 - updated the PR description above.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @romilbhardwaj! It is a surprisingly small amount of changes to make it work with multiple nodes. Tested with a GKE cluster and it works smoothly. One minor UX issue:

  1. sky launch --gpus V100:0.5 --num-nodes 3 "hostname -i"
  2. sky status: the resources column shows 3x Kubernetes(2CPU--8GB--1V100, {'V100': 1}), instead of V100:0.5, which is a bit surprising.

sky/templates/kubernetes-port-forward-proxy-command.sh.j2 Outdated Show resolved Hide resolved
Comment on lines 45 to 54
# To avoid errors when many concurrent requests are sent (see https://github.com/skypilot-org/skypilot/issues/2628),
# we add a random delay before establishing the socat connection.
# Empirically, this needs to be at least 1 second. We set this to be random between 1 and 2 seconds.
sleep $(shuf -i 10-20 -n 1 | awk '{printf "%f", $1/10}')

# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:"${local_port}" No newline at end of file
socat - tcp:127.0.0.1:"${local_port}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a random wait time enough or should this be a socat and retry with a random backoff?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it would be nice to have a newline at the EOF.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the socat command succeeds and starts running without any error even though the underlying kubectl port-forward may not be functional. Since there's no signal that I can use to detect if kubectl port-forward is ready, I used random wait which has been sufficient in my testing.

Copy link
Collaborator

@Michaelvll Michaelvll Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, makes sense. I think it should be fine to wait for a random gap. Just curious, if something that checks the socket connection would work for checking if the connection gets setup correctly, like the following:

def _wait_ssh_connection_direct(
ip: str,
ssh_user: str,
ssh_private_key: str,
ssh_control_name: Optional[str] = None,
ssh_proxy_command: Optional[str] = None) -> bool:
del ssh_control_name
assert ssh_proxy_command is None, 'SSH proxy command is not supported.'
try:
with socket.create_connection((ip, 22), timeout=1) as s:
if s.recv(100).startswith(b'SSH'):
return True
except socket.timeout: # this is the most expected exception
pass
except Exception: # pylint: disable=broad-except
pass
command = _ssh_probe_command(ip, ssh_user, ssh_private_key,
ssh_proxy_command)
logger.debug(f'Waiting for SSH to {ip}. Try: '
f'{_shlex_join(command)}')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, it seems we have already done the nc check before this wait. Waiting looks good to me.

@romilbhardwaj
Copy link
Collaborator Author

Thanks for the review @Michaelvll! I've filed the fractional GPU bug in #2655, we should get to that too soon..

@romilbhardwaj romilbhardwaj merged commit 21ee81f into master Oct 4, 2023
18 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_multinode branch October 4, 2023 19:22
jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
* Initial multi-node support

* Add pod anti-affinity

* Fix concurrent SSH for Kubernetes

* lint

* comments

* update readme

* remove lsof dependency

* newline

* Update roadmap in readme
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] SSH ProxyCommand script is not concurrency-safe
2 participants