[k8s] Multi-node support for Kubernetes #2609

romilbhardwaj · 2023-09-25T23:29:31Z

Adds Kubernetes multi-node support. This was fairly straightforward, since most of the implementation was already done in #2096. The main thing to note is that spreading pods across nodes is a soft constraint: the scheduler tries to place the pods of a multi-node cluster on different physical nodes, but if that is not possible, it is OK to put them on the same node.
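
To make the soft constraint concrete, here is a minimal sketch of a preferred (not required) pod anti-affinity stanza, built as a Python dict. The helper name `soft_anti_affinity` and the label key/value used below are hypothetical and not taken from this PR; the field names follow the Kubernetes pod spec.

```python
from typing import Any, Dict


def soft_anti_affinity(label_key: str,
                       label_value: str,
                       weight: int = 100) -> Dict[str, Any]:
    """Build an `affinity` stanza that prefers spreading pods across nodes.

    Pods carrying the label `label_key=label_value` repel each other at the
    node level, but the scheduler may still co-locate them if it has to.
    """
    # Kubernetes only accepts weights in the range 1-100 for preferred terms.
    if not 1 <= weight <= 100:
        raise ValueError('weight must be between 1 and 100')
    return {
        'podAntiAffinity': {
            # "preferred..." is the soft form; it is a hint, not a requirement.
            'preferredDuringSchedulingIgnoredDuringExecution': [{
                'weight': weight,
                'podAffinityTerm': {
                    'labelSelector': {
                        'matchExpressions': [{
                            'key': label_key,
                            'operator': 'In',
                            'values': [label_value],
                        }]
                    },
                    # One topology domain per node, so pods spread over nodes.
                    'topologyKey': 'kubernetes.io/hostname',
                },
            }]
        }
    }
```

Calling `soft_anti_affinity('example-cluster-label', 'mycluster')` gives a stanza that can be placed under `spec.affinity` of each pod. A hard constraint would use `requiredDuringSchedulingIgnoredDuringExecution` instead, which leaves pods pending when there are not enough nodes.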

Also closes #2628. This one is tricky, since it is still not clear why the problem happens; see the issue for more details. After spending time investigating it, I added a randomized sleep after kubectl port-forward is run.

While this is not the solution I wanted, it appears to fix the problem reliably. Ideally, kubectl port-forward would signal when it is ready, but it emits nothing that indicates it is actually ready (we already wait for it to print the expected output and use nc to verify the port is open).
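
The sequence described here (check that the port is open, then wait a random 1 to 2 seconds anyway) can be sketched in Python as follows. This is only an illustration of the idea, not the code in this PR, which is a shell template; the names `port_is_open`, `jittered_delay`, and `wait_then_settle` are hypothetical.

```python
import random
import socket
import time
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def jittered_delay(low_tenths: int = 10, high_tenths: int = 20) -> float:
    """Pick a delay in seconds, in 0.1 s steps, from 1.0 to 2.0 by default."""
    return random.randint(low_tenths, high_tenths) / 10


def wait_then_settle(host: str,
                     port: int,
                     attempts: int = 10,
                     poll_interval: float = 0.1,
                     probe: Callable[[str, int], bool] = port_is_open,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the port accepts connections, then add a random delay.

    An open port does not prove the forwarder is fully functional, so the
    extra jittered sleep spreads out concurrent clients before they connect.
    """
    for _ in range(attempts):
        if probe(host, port):
            sleep(jittered_delay())
            return True
        sleep(poll_interval)
    return False
```

The `probe` and `sleep` parameters exist only so the behavior can be exercised without real sockets or real waiting.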

What's working for multi-node:

- Basic multi-node launching, tasks and SSH (sky launch --num-nodes 2 -- echo hi)
- IP address fetching and populating SKYPILOT_NODE_IPS correctly
- Multi-node GPU jobs (PyTorch distributed, TensorFlow examples)
  - Tested NeMo multi-node training for ~12 hours from [Examples] NeMo distributed training for BERT and GPT3 #2533
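
For context on the SKYPILOT_NODE_IPS item: SkyPilot exposes it to each task as a newline-separated list of node IPs with the head node first, alongside SKYPILOT_NODE_RANK. A minimal sketch of how a task might consume them is below; the helper name `node_layout` is hypothetical, and the default rank of 0 is an assumption for the single-node case.

```python
import os
from typing import List, Mapping, Tuple


def node_layout(
        env: Mapping[str, str] = os.environ) -> Tuple[List[str], int, str]:
    """Return (all node IPs, this node's rank, head node IP)."""
    raw = env.get('SKYPILOT_NODE_IPS', '')
    # One IP per line; ignore blank lines and surrounding whitespace.
    ips = [line.strip() for line in raw.splitlines() if line.strip()]
    if not ips:
        raise RuntimeError('SKYPILOT_NODE_IPS is empty or not set')
    rank = int(env.get('SKYPILOT_NODE_RANK', '0'))
    # The head node is listed first, so it can serve as the rendezvous
    # address for distributed training.
    return ips, rank, ips[0]
```

A distributed launcher could then pass the head IP as its master address and the rank as its node rank.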

Tested (run the relevant ones):

- Code formatting: bash format.sh
- Tested SSH concurrency with the script from [k8s] SSH ProxyCommand script is not concurrency-safe #2628 and running run_command_in_parallel(['ssh','myclus','echo hi'], 20, 20) 10 times.
- Multi-node smoke tests are uncommented now - ran pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"

romilbhardwaj · 2023-09-29T23:56:27Z

This is now ready for a look! Also includes a fix for #2628 - updated the PR description above.

Michaelvll

This is awesome @romilbhardwaj! It takes a surprisingly small number of changes to make it work with multiple nodes. Tested with a GKE cluster and it works smoothly. One minor UX issue:

sky launch --gpus V100:0.5 --num-nodes 3 "hostname -i"
sky status: the resources column shows 3x Kubernetes(2CPU--8GB--1V100, {'V100': 1}), instead of V100:0.5, which is a bit surprising.

Michaelvll · 2023-10-03T16:38:54Z

sky/templates/kubernetes-port-forward-proxy-command.sh.j2

+# To avoid errors when many concurrent requests are sent (see https://github.com/skypilot-org/skypilot/issues/2628),
+# we add a random delay before establishing the socat connection.
+# Empirically, this needs to be at least 1 second. We set this to be random between 1 and 2 seconds.
+sleep $(shuf -i 10-20 -n 1 | awk '{printf "%f", $1/10}')
+
 # Establishes two directional byte streams to handle stdin/stdout between
 # terminal and the jump pod.
 # socat process terminates when port-forward terminates.
-socat - tcp:127.0.0.1:"${local_port}"
+socat - tcp:127.0.0.1:"${local_port}"


Is a random wait time enough, or should this run socat and retry with a random backoff?

Also it would be nice to have a newline at the EOF.

Unfortunately, the socat command succeeds and starts running without any error even though the underlying kubectl port-forward may not be functional. Since there is no signal I can use to detect whether kubectl port-forward is ready, I used a random wait, which has been sufficient in my testing.
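
For reference, the retry-with-random-backoff alternative raised in the review could be sketched as below. It only helps when a failed attempt is observable, which, per the reply above, is not the case for socat here. The function name `retry_with_jitter` and its parameters are hypothetical, and the "full jitter" exponential schedule is one common choice, not something specified in this thread.

```python
import random
import time
from typing import Callable


def retry_with_jitter(attempt: Callable[[], bool],
                      max_tries: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `attempt` until it returns True or `max_tries` is exhausted.

    Between failed tries, sleep for a random time in [0, cap], where cap
    doubles on each try up to `max_delay`, so concurrent clients drift apart.
    """
    for i in range(max_tries):
        if attempt():
            return True
        # No point sleeping after the final failed attempt.
        if i < max_tries - 1:
            cap = min(max_delay, base_delay * (2**i))
            sleep(random.uniform(0, cap))
    return False
```

Here `attempt` would have to wrap a check that can actually fail, such as verifying that an SSH banner arrives through the forwarded port.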

Ahh, makes sense. I think it should be fine to wait for a random gap. Just curious: would something that checks the socket connection work for verifying that the connection is set up correctly, like the following?

skypilot/sky/provision/provisioner.py

Lines 239 to 258 in 029f886

def _wait_ssh_connection_direct(
        ip: str,
        ssh_user: str,
        ssh_private_key: str,
        ssh_control_name: Optional[str] = None,
        ssh_proxy_command: Optional[str] = None) -> bool:
    del ssh_control_name
    assert ssh_proxy_command is None, 'SSH proxy command is not supported.'
    try:
        with socket.create_connection((ip, 22), timeout=1) as s:
            if s.recv(100).startswith(b'SSH'):
                return True
    except socket.timeout:  # this is the most expected exception
        pass
    except Exception:  # pylint: disable=broad-except
        pass
    command = _ssh_probe_command(ip, ssh_user, ssh_private_key,
                                 ssh_proxy_command)
    logger.debug(f'Waiting for SSH to {ip}. Try: '
                 f'{_shlex_join(command)}')

nvm, it seems we have already done the nc check before this wait. Waiting looks good to me.

romilbhardwaj · 2023-10-04T04:57:53Z

Thanks for the review @Michaelvll! I've filed the fractional GPU bug in #2655; we should get to that soon too.

* Initial multi-node support
* Add pod anti-affinity
* Fix concurrent SSH for Kubernetes
* lint
* comments
* update readme
* remove lsof dependency
* newline
* Update roadmap in readme

romilbhardwaj added 4 commits September 25, 2023 16:26

Initial multi-node support

9fea2b7

Add pod anti-affinity

f5e3a3f

Fix concurrent SSH for Kubernetes

9f49d91

lint

e462176

romilbhardwaj marked this pull request as ready for review September 29, 2023 23:55

romilbhardwaj requested a review from Michaelvll September 29, 2023 23:55

romilbhardwaj added 2 commits September 30, 2023 07:42

comments

9a0140c

update readme

359b01f

Michaelvll approved these changes Oct 3, 2023

romilbhardwaj added 3 commits October 3, 2023 18:49

remove lsof dependency

6516cf4

newline

9e84620

Update roadmap in readme

99ef4a5

Merge branch 'master' into k8s_multinode

50050cc

romilbhardwaj merged commit 21ee81f into master Oct 4, 2023
18 checks passed

romilbhardwaj deleted the k8s_multinode branch October 4, 2023 19:22
