add AddKeysToAgent for ssh config file and ssh cmd #3985

Conversation

zpoint
Contributor

@zpoint zpoint commented Sep 25, 2024

Addresses and resolves this issue: [Core] Allow ssh access from head to worker nodes

  • Create a test case file: examples/mpirun_test.yaml

  • Add the parameter AddKeysToAgent=yes in sky/provision/provisioner.py and sky/utils/command_runner.py

    • By default, the sky launch and sky exec commands use the ssh command generated from these files. We currently use the ControlMaster/ControlPath/ControlPersist parameters to reuse the ssh connection, which makes it important to add this parameter on the first connection to a new machine. Subsequent connections reuse the first connection without adding the keys to the agent (AddKeysToAgent=yes has no effect on subsequent connections because they reuse the existing connection).
  • Add the parameter AddKeysToAgent yes in sky/backends/backend_utils.py

    • This change adds AddKeysToAgent yes to the ~/.sky/generated/ssh files, so the key is automatically added to the agent when users type ssh myclustername
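
The two changes above can be pictured with a minimal sketch. It is not the code in this PR: `build_ssh_options`, `render_ssh_config_entry`, the `control_dir` argument, and the `300s` persist value are all hypothetical and only illustrate where the option would go, once as a `-o` argument on the ssh command line and once as a keyword in a generated ssh config stanza.

```python
from typing import List, Optional


def build_ssh_options(control_dir: Optional[str] = None,
                      add_keys_to_agent: bool = True) -> List[str]:
    """Build `-o key=value` arguments for an ssh command (hypothetical helper)."""
    options = {}
    if add_keys_to_agent:
        # Only matters on the connection that actually authenticates,
        # i.e. the first one, which creates the control master.
        options['AddKeysToAgent'] = 'yes'
    if control_dir is not None:
        options['ControlMaster'] = 'auto'
        options['ControlPath'] = f'{control_dir}/%C'
        options['ControlPersist'] = '300s'  # assumed value, for illustration
    args: List[str] = []
    for key, value in options.items():
        args.extend(['-o', f'{key}={value}'])
    return args


def render_ssh_config_entry(cluster_name: str, host: str, user: str,
                            key_path: str) -> str:
    """Render a Host stanza for an ssh config file (hypothetical helper)."""
    lines = [
        f'Host {cluster_name}',
        f'  HostName {host}',
        f'  User {user}',
        f'  IdentityFile {key_path}',
        '  AddKeysToAgent yes',
    ]
    return '\n'.join(lines) + '\n'
```

For example, `['ssh'] + build_ssh_options('/tmp/sky_ssh') + ['ubuntu@203.0.113.10']` would give a command whose first connection adds the key to the agent, while later invocations reuse the control socket.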

You can verify by:

ssh-add -l
# can't see the ~/.ssh/sky-key before launching the cluster
sky launch -c mycluster hello_sky.yaml
# After it executes successfully, run this command again
ssh-add -l
# Now, ~/.ssh/sky-key should be printed

More details about the problem investigation can be found here

mpirun test:

This test yaml file is the same as examples/mpirun_test.yaml created in this PR

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -l
3072 xxx zepingguo@ZePingGuos-MacBook-Pro.local (RSA)
sky launch -c mpirun hello_sky_mpirun.yaml
I 09-25 11:39:51 cloud_vm_ray_backend.py:3215] Setup completed.
I 09-25 11:39:55 cloud_vm_ray_backend.py:3319] Job submitted with Job ID: 1
I 09-25 03:39:56 log_lib.py:415] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 2 nodes. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.16.54.161', '172.16.25.125']
(worker1, rank=1, pid=29571, ip=172.16.25.125) worker nodes
(head, rank=0, pid=29933) head node
(head, rank=0, pid=29933) 172.16.54.161,172.16.25.125
(head, rank=0, pid=29933) Warning: Permanently added '172.16.25.125' (ECDSA) to the list of known hosts.
(head, rank=0, pid=29933) mpirun hello from IP 172.16.25.125 172.17.0.1
(head, rank=0, pid=29933) mpirun hello from IP 172.16.54.161 172.17.0.1
INFO: Job finished (status: SUCCEEDED).
Clusters
NAME    LAUNCHED     RESOURCES            STATUS  AUTOSTOP  COMMAND
mpirun  34 secs ago  2x AWS(m6i.2xlarge)  UP      -         sky launch -c mpirun hell...
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -l
3072 xxx zepingguo@ZePingGuos-MacBook-Pro.local (RSA)
2048 xxx /Users/zepingguo/.ssh/sky-key (RSA)

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh mpirun
Warning: Permanently added '44.200.126.97' (ED25519) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================
Last login: Wed Sep 25 03:39:56 2024 from 27.46.67.5

(base) ubuntu@ip-172-16-54-161:~$ ssh 172.16.25.125
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================

sky jobs launch test:

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -D
All identities removed.
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ sky jobs launch -c test 'echo "$SKYPILOT_NODE_IPS"; sleep 1000000' --num-nodes 2 --cloud
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -l
The agent has no identities.
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ sky jobs launch -c test 'echo "$SKYPILOT_NODE_IPS"; sleep 1000000' --num-nodes 2 --cloud aws
Task from command: echo "$SKYPILOT_NODE_IPS"; sleep 1000000
Managed job 'test' will be launched on (estimated):
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 2 nodes. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.16.24.253', '172.16.7.157']
(worker1, rank=1, pid=28764, ip=172.16.7.157) 172.16.24.253
(worker1, rank=1, pid=28764, ip=172.16.7.157) 172.16.7.157
(head, rank=0, pid=29120) 172.16.24.253
(head, rank=0, pid=29120) 172.16.7.157

➜  ~ ssh-add -l
3072 xxx zepingguo@ZePingGuos-MacBook-Pro.local (RSA)
2048 xxx /Users/zepingguo/.ssh/sky-key (RSA)
➜  ~ ssh sky-jobs-controller-b8be8084

Warning: Permanently added '3.238.65.123' (ED25519) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================

(base) ubuntu@ip-172-16-0-167:~$ ssh 172.16.24.253
Warning: Permanently added '172.16.24.253' (ECDSA) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================
(base) ubuntu@ip-172-16-24-253:~$

sky launch test:

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -D
All identities removed.
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -l
The agent has no identities.
(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ sky launch -c test --num-nodes 2 --cloud aws 'echo "$SKYPILOT_NODE_IPS"'
Task from command: echo "$SKYPILOT_NODE_IPS"
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.16.53.88', '172.16.9.47']
(worker1, rank=1, pid=28767, ip=172.16.53.88) 172.16.9.47
(worker1, rank=1, pid=28767, ip=172.16.53.88) 172.16.53.88
(head, rank=0, pid=29121) 172.16.9.47
(head, rank=0, pid=29121) 172.16.53.88
INFO: Job finished (status: SUCCEEDED).

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh-add -l
2048 xxx /Users/zepingguo/.ssh/sky-key (RSA)

(sky) ➜  hello-sky git:(dev/zeping/allow_ssh_access_from_head_to_worker) ✗ ssh test
Warning: Permanently added '34.231.242.51' (ED25519) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================
Last login: Wed Sep 25 04:02:12 2024 from 27.46.67.5

(base) ubuntu@ip-172-16-9-47:~$ ssh 172.16.53.88
Warning: Permanently added '172.16.53.88' (ECDSA) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================
(base) ubuntu@ip-172-16-53-88:~$

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py (Don't have all cloud access required, should be triggered by CI test)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name (Don't have all cloud access required, should be triggered by CI test)
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@Michaelvll Michaelvll left a comment

Awesome! Thanks for figuring out this issue @zpoint! This PR mostly looks good to me.

Does this mean a user will have to start the ssh agent locally first before these changes can take effect? Or do those arguments ensure that the ssh agent will be started automatically locally?

@zpoint
Contributor Author

zpoint commented Sep 25, 2024

Does this mean a user will have to start the ssh agent locally first before these changes can take effect? Or do those arguments ensure that the ssh agent will be started automatically locally?

@Michaelvll
Yes, the user will have to start the ssh agent locally first.

I searched Google, and it says:

On most Linux systems, ssh-agent is automatically configured and run at login

If we want to cover all cases, including those where ssh-agent is not started automatically, I suggest we run a command-line check like:

if ssh-agent is not running, then run eval "$(ssh-agent -s)"
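
As a rough illustration of such a check (an assumption about how it could be done, not the code merged in this PR), the sketch below treats the agent as reachable when `SSH_AUTH_SOCK` is set and points at an existing Unix socket. The function name `ssh_agent_is_reachable` is hypothetical, and the check does not confirm that an agent is actually answering on the socket.

```python
import os
import stat
from typing import Mapping, Optional


def ssh_agent_is_reachable(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if SSH_AUTH_SOCK points at an existing Unix socket."""
    env = os.environ if env is None else env
    sock_path = env.get('SSH_AUTH_SOCK')
    if not sock_path:
        # No agent was exported into this shell session.
        return False
    try:
        return stat.S_ISSOCK(os.stat(sock_path).st_mode)
    except OSError:
        # Stale variable: the socket file is gone (e.g. after `ssh-agent -k`).
        return False
```

A stricter variant could run `ssh-add -l` and inspect its exit status, at the cost of spawning a subprocess.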

@Michaelvll
Collaborator

Michaelvll commented Sep 26, 2024

Thanks @zpoint ! That makes sense. It would be good if we can add some hints in the output if the ssh-agent is not running locally and multi-node is used.

Also, it would be good to make sure that if ssh-agent is not running locally, the newly added argument will not cause any errors : )

@zpoint
Contributor Author

zpoint commented Sep 26, 2024

You're welcome @Michaelvll

If ssh-agent is not running, Sky can still launch and SSH into the cluster. However, the ForwardAgent and AddKeysToAgent parameters won’t work. It doesn’t seem to cause additional errors but appears to fall back to the original version.

(base) [root@cheery-slab-8 ~]# eval "$(ssh-agent -s)"
Agent pid 27902
(base) [root@cheery-slab-8 ~]# ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 27902 killed;
(base) [root@cheery-slab-8 ~]# ssh-add -l
Error connecting to agent: No such file or directory
(base) [root@cheery-slab-8 ~]# conda activate sky
(sky) [root@cheery-slab-8 ~]# cd skypilot/hello-sky/
(sky) [root@cheery-slab-8 hello-sky]# sky launch -c hello_cluster hello_sky.yaml
INFO: Reserved IPs: ['172.16.7.212', '172.16.46.107', '172.16.39.21']
(worker1, rank=1, pid=28885, ip=172.16.39.21) worker nodes
(head, rank=0, pid=29235) head node
(worker2, rank=2, pid=28888, ip=172.16.7.212) worker nodes
INFO: Job finished (status: SUCCEEDED).

(sky) [root@cheery-slab-8 hello-sky]# ssh hello_cluster
Warning: Permanently added '98.81.100.35' (ECDSA) to the list of known hosts.
=============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
=============================================================================

(base) ubuntu@ip-172-16-46-107:~$ ssh root@172.16.39.21
Warning: Permanently added '172.16.39.21' (ECDSA) to the list of known hosts.
root@172.16.39.21: Permission denied (publickey).
(base) ubuntu@ip-172-16-46-107:~$ ssh-add -l
Could not open a connection to your authentication agent.

@zpoint
Contributor Author

zpoint commented Sep 26, 2024

I also added a check function that prints a hint if ssh-agent is not running:

(base) [root@cheery-slab-8 ~]# conda activate sky
(sky) [root@cheery-slab-8 ~]# eval "$(ssh-agent -s)"
Agent pid 5566
(sky) [root@cheery-slab-8 ~]# ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 5566 killed;
(sky) [root@cheery-slab-8 hello-sky]# sky launch -c hello_cluster hello_sky.yaml
ssh-agent is not running, so SSH key forwarding might not work properly. Try starting a new terminal session and manually run `eval "$(ssh-agent -s)"` to launch the ssh-agent and resolve this issue.
Normal workflow can still proceed
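
One way to picture the hint logic is the sketch below. It is an assumption, not the function added to sky/cli.py: `agent_hint` and its parameters are hypothetical, the message text is copied from the output above, and the multi-node condition follows the earlier review suggestion.

```python
from typing import Optional

# Message text taken from the transcript above.
SSH_AGENT_HINT = (
    'ssh-agent is not running, so SSH key forwarding might not work '
    'properly. Try starting a new terminal session and manually run '
    '`eval "$(ssh-agent -s)"` to launch the ssh-agent and resolve this issue.')


def agent_hint(agent_running: bool, num_nodes: int) -> Optional[str]:
    """Return a hint only when it is relevant (hypothetical helper)."""
    if agent_running or num_nodes <= 1:
        # Single-node clusters do not need head-to-worker ssh access.
        return None
    return SSH_AGENT_HINT
```

Returning the message instead of printing it keeps the launch path unaffected: the caller can log it as a warning and continue.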

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the fix @zpoint! I just tested it with sky launch and sky jobs launch and it seems to be working well. We can merge this PR in and fix our docs in a future PR.

@Michaelvll Michaelvll added this pull request to the merge queue Sep 27, 2024
Merged via the queue into skypilot-org:master with commit e6b8d2c Sep 27, 2024
20 checks passed