[Core] Allow ssh access from head to worker nodes #3690

Open
Michaelvll opened this issue Jun 26, 2024 · 3 comments

Comments

@Michaelvll
Collaborator

We currently do not set up SSH access from the head node to the worker nodes, which MPI workloads require.

One way to do so is to set up a separate public/private SSH key pair for each cluster's head and worker nodes.
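
A minimal sketch of the per-cluster key pair idea, for illustration only. The function names and the `sky-cluster-<name>` key path are hypothetical and are not part of SkyPilot; only `ssh-keygen` and the standard `authorized_keys` mechanism are assumed to exist.

```shell
# Hypothetical helpers: none of these names come from SkyPilot.

# Build the (assumed) private-key path for a given cluster name.
cluster_key_path() {
  printf '%s/.ssh/sky-cluster-%s' "$HOME" "$1"
}

# Generate an ed25519 key pair with an empty passphrase for the cluster,
# unless one already exists, and print the private-key path.
generate_cluster_key() {
  local key
  key="$(cluster_key_path "$1")"
  mkdir -p "$(dirname "$key")"
  [ -f "$key" ] || ssh-keygen -q -t ed25519 -N "" -C "sky-cluster-$1" -f "$key"
  printf '%s\n' "$key"
}

# Append a public key to an authorized_keys file only if it is not
# already listed, so the step can be re-run safely on every node.
authorize_key() {
  local pub="$1" auth="$2"
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  chmod 600 "$auth"
  grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}
```

Under these assumptions, the head node would hold the private key, and `authorize_key` would be run on each worker with the matching `.pub` file. How the public key reaches the workers is left open here.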

@asaiacai
Contributor

If you want to ssh from the head to the other workers and have it work for mpirun, it's sufficient to enable ssh-agent. There is no need to set up or copy keys. I have an example of this for running nccl-test with mpirun here, and for convenience I've also pasted here an example of ssh'ing between hosts. It might be sufficient to just include this in the docs/examples?

Andrews-MacBook-Air:skypilot asai$ eval $(ssh-agent -s)
Andrews-MacBook-Air:skypilot asai$ ssh-add ~/.ssh/sky-key
Andrews-MacBook-Air:skypilot asai$ sky launch -c test --num-nodes 2 --cloud gcp 'echo "$SKYPILOT_NODE_IPS"'
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.8
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.12
(head, rank=0, pid=3877) 10.128.0.8
(head, rank=0, pid=3877) 10.128.0.12
Andrews-MacBook-Air:skypilot asai$ ssh test # get onto head node 
(base) gcpuser@test-ebd1-head-op1wrzgz-compute:~$  ssh 10.128.0.12 # ssh to worker via private IP

Right now this doesn't work with sky jobs launch, since the controller doesn't have ssh-agent running. However, it looks like it works the same way if you just run ssh-agent on the job controller:

(sky) Andrews-MacBook-Air:skypilot asai$ sky jobs launch -c test 'echo "$SKYPILOT_NODE_IPS"; sleep 1000000' --num-nodes 2 --cloud gcp
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.42
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.43
(head, rank=0, pid=3762) 10.128.0.42
(head, rank=0, pid=3762) 10.128.0.43
(sky) Andrews-MacBook-Air:skypilot asai$ ssh sky-jobs-controller-ebd16671 # access job controller
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ eval $(ssh-agent -s)
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ ssh-add ~/.ssh/sky-key
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute:~$ ssh test-1 # access head node of job
(base) gcpuser@test-1-ebd1-head-48o88sx0-compute:~$ ssh 10.128.0.43 # access other worker
(base) gcpuser@test-1-ebd1-worker-9udnab2g-compute:~$
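
The manual checks in the transcripts above could be scripted roughly as follows. This is a sketch, not SkyPilot functionality: the function names are hypothetical, and it assumes the script runs on the head node with `SKYPILOT_NODE_IPS` holding one IP per line, as in the output shown above.

```shell
# Hypothetical helpers for use on the head node.

# Join newline-separated IPs (the format of $SKYPILOT_NODE_IPS) into a
# comma-separated list, skipping blank lines.
node_ips_csv() {
  printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | paste -sd, -
}

# Try a non-interactive ssh to every node and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password, so a
# missing agent or key shows up as FAILED rather than hanging.
check_node_ssh() {
  local ip failed=0
  while IFS= read -r ip; do
    [ -n "$ip" ] || continue
    # -n keeps ssh from consuming the loop's stdin.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$ip" true; then
      echo "ok: $ip"
    else
      echo "FAILED: $ip"
      failed=1
    fi
  done <<< "$1"
  return "$failed"
}
```

For example, `check_node_ssh "$SKYPILOT_NODE_IPS"` could run before an MPI job, and `node_ips_csv "$SKYPILOT_NODE_IPS"` yields a list in the comma-separated form that Open MPI's `mpirun -H` accepts.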

@Michaelvll
Collaborator Author

This is awesome! Thanks for mentioning this @asaiacai. ssh-agent should work well in the interactive case, but it might not be sufficient for examples that require SSH access in the run section of the task, as the run section is detached from the ssh connection.

@asaiacai
Contributor

asaiacai commented Jun 27, 2024

@Michaelvll it also works for mpirun tasks defined via run. I just tested this on the latest SkyPilot commit, bd383e912a55f0afbd9cc3c239771dbbf3dcb900, using the same task definition example you have in #3693 but without mounting ~/.ssh/sky-key. The output is shown here.

Note that this probably won't work with sky jobs launch, but it might work if we start ssh-agent on the job controller by default?
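
One possible shape for starting ssh-agent on the job controller by default is an idempotent snippet like the one below. It is only a sketch: the function names are hypothetical, and the default key path `~/.ssh/sky-key` is taken from the transcript earlier in this thread. It does not address how an agent started in one session would become visible to processes launched elsewhere on the controller.

```shell
# Hypothetical helpers; not part of SkyPilot.

# Succeed only if an ssh-agent is reachable through $SSH_AUTH_SOCK.
agent_is_usable() {
  local rc
  [ -n "${SSH_AUTH_SOCK:-}" ] || return 1
  ssh-add -l >/dev/null 2>&1
  rc=$?
  # ssh-add -l exits 0 (keys listed) or 1 (agent has no keys) when an
  # agent answers; any other status means no usable agent.
  [ "$rc" -eq 0 ] || [ "$rc" -eq 1 ]
}

# Start an agent only when none is usable, then load the key.
ensure_agent_with_key() {
  local key="${1:-$HOME/.ssh/sky-key}"
  if ! agent_is_usable; then
    eval "$(ssh-agent -s)" >/dev/null
  fi
  ssh-add "$key"
}
```

Calling `ensure_agent_with_key` from a login shell on the controller would mirror the two manual commands shown in the transcript (`eval $(ssh-agent -s)` followed by `ssh-add ~/.ssh/sky-key`) while avoiding a second agent when one is already running.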
