-
Notifications
You must be signed in to change notification settings - Fork 511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] Multi-node support for Kubernetes #2609
Conversation
This is now ready for a look! Also includes a fix for #2628 - updated the PR description above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome @romilbhardwaj! It is a surprisingly small amount of changes to make it work with multiple nodes. Tested with a GKE cluster and it works smoothly. One minor UX issue:
sky launch --gpus V100:0.5 --num-nodes 3 "hostname -i"
sky status
: the resources column shows3x Kubernetes(2CPU--8GB--1V100, {'V100': 1})
, instead ofV100:0.5
, which is a bit surprising.
# To avoid errors when many concurrent requests are sent (see https://github.com/skypilot-org/skypilot/issues/2628), | ||
# we add a random delay before establishing the socat connection. | ||
# Empirically, this needs to be at least 1 second. We set this to be random between 1 and 2 seconds. | ||
sleep $(shuf -i 10-20 -n 1 | awk '{printf "%f", $1/10}') | ||
|
||
# Establishes two directional byte streams to handle stdin/stdout between | ||
# terminal and the jump pod. | ||
# socat process terminates when port-forward terminates. | ||
socat - tcp:127.0.0.1:"${local_port}" No newline at end of file | ||
socat - tcp:127.0.0.1:"${local_port}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a random wait time enough or should this be a socat
and retry with a random backoff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it would be nice to have a newline at the EOF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the socat
command succeeds and starts running without any error even though the underlying kubectl port-forward
may not be functional. Since there's no signal that I can use to detect if kubectl port-forward is ready, I used random wait which has been sufficient in my testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, makes sense. I think it should be fine to wait for a random gap. Just curious, if something that checks the socket connection would work for checking if the connection gets setup correctly, like the following:
skypilot/sky/provision/provisioner.py
Lines 239 to 258 in 029f886
def _wait_ssh_connection_direct( | |
ip: str, | |
ssh_user: str, | |
ssh_private_key: str, | |
ssh_control_name: Optional[str] = None, | |
ssh_proxy_command: Optional[str] = None) -> bool: | |
del ssh_control_name | |
assert ssh_proxy_command is None, 'SSH proxy command is not supported.' | |
try: | |
with socket.create_connection((ip, 22), timeout=1) as s: | |
if s.recv(100).startswith(b'SSH'): | |
return True | |
except socket.timeout: # this is the most expected exception | |
pass | |
except Exception: # pylint: disable=broad-except | |
pass | |
command = _ssh_probe_command(ip, ssh_user, ssh_private_key, | |
ssh_proxy_command) | |
logger.debug(f'Waiting for SSH to {ip}. Try: ' | |
f'{_shlex_join(command)}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, it seems we have already done the nc
check before this wait. Waiting looks good to me.
Thanks for the review @Michaelvll! I've filed the fractional GPU bug in #2655, we should get to that too soon.. |
* Initial multi-node support * Add pod anti-affinity * Fix concurrent SSH for Kubernetes * lint * comments * update readme * remove lsof dependency * newline * Update roadmap in readme
Adds Kubernetes multi-node support. It was fairly straightforward, since most of the implementation was already done in #2096. Main thing to note is that spreading pods across nodes is set as a soft-constraint (i.e., try to place multi-node pods on different physical nodes. If not possible, ok to put them on the same node.)
Also closes #2628. This is a tricky one, since the problem is still not clear why it happens. See the issue for more details. After spending time investigating it, I have added a randomized sleep after
kubectl port-forward
is run.While this is not the solution I wanted, it appears to fix the problem reliably. Ideally I would have wanted to get a signal from
kubectl port-forward
to know when it is ready, but it emits no signals to indicate it is actually ready (we already wait for it to print the expected output and usenc
to verify the port is open).What's working for multi-node:
sky launch --num-nodes 2 -- echo hi
)SKYPILOT_NODE_IPS
correctlyTested (run the relevant ones):
bash format.sh
run_command_in_parallel(['ssh','myclus','echo hi'], 20, 20)
10 times.pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"