add AddKeysToAgent for ssh config file and ssh cmd #3985
Conversation
Awesome! Thanks for figuring out this issue @zpoint! This PR mostly looks good to me.
Does this mean a user will have to start the ssh-agent locally before these changes can take effect? Or do these arguments ensure the ssh-agent will be started automatically on the local machine?
@Michaelvll I searched Google; whether the ssh-agent is started automatically depends on the OS and desktop environment.
If we want to cover all cases, including systems that do not start the ssh-agent automatically, I suggest we add a command-line check: if the ssh-agent is not running, run `eval "$(ssh-agent -s)"`.
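As a rough sketch of that check (plain shell, not the change in this PR; it relies on `ssh-add` exiting with status 2 when it cannot reach an agent):

```bash
# Start an ssh-agent only if none is reachable from this shell.
ssh-add -l >/dev/null 2>&1
if [ "$?" -eq 2 ]; then
  eval "$(ssh-agent -s)"   # prints "Agent pid <N>" and exports SSH_AUTH_SOCK
fi
```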
Thanks @zpoint! That makes sense. It would be good if we could add a hint in the output when the ssh-agent is not running locally and multi-node is used. Also, it would be good to make sure that the normal workflow still works when the ssh-agent is not running.
You're welcome @Michaelvll! If the ssh-agent is not running locally, `sky launch` itself still succeeds, but without an agent there is no forwarded key, so SSH from the head node to a worker fails:

    (base) [root@cheery-slab-8 ~]# eval "$(ssh-agent -s)"
    Agent pid 27902
    (base) [root@cheery-slab-8 ~]# ssh-agent -k
    unset SSH_AUTH_SOCK;
    unset SSH_AGENT_PID;
    echo Agent pid 27902 killed;
    (base) [root@cheery-slab-8 ~]# ssh-add -l
    Error connecting to agent: No such file or directory
    (base) [root@cheery-slab-8 ~]# conda activate sky
    (sky) [root@cheery-slab-8 ~]# cd skypilot/hello-sky/
    (sky) [root@cheery-slab-8 hello-sky]# sky launch -c hello_cluster hello_sky.yaml
    INFO: Reserved IPs: ['172.16.7.212', '172.16.46.107', '172.16.39.21']
    (worker1, rank=1, pid=28885, ip=172.16.39.21) worker nodes
    (head, rank=0, pid=29235) head node
    (worker2, rank=2, pid=28888, ip=172.16.7.212) worker nodes
    INFO: Job finished (status: SUCCEEDED).
    (sky) [root@cheery-slab-8 hello-sky]# ssh hello_cluster
    Warning: Permanently added '98.81.100.35' (ECDSA) to the list of known hosts.
    =============================================================================
       __|  __|_  )
       _|  (     /   Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)
      ___|\___|___|
    =============================================================================
    (base) ubuntu@ip-172-16-46-107:~$ ssh root@172.16.39.21
    Warning: Permanently added '172.16.39.21' (ECDSA) to the list of known hosts.
    root@172.16.39.21: Permission denied (publickey).
    (base) ubuntu@ip-172-16-46-107:~$ ssh-add -l
    Could not open a connection to your authentication agent.
I also added a check that prints a hint when the ssh-agent is not running:

    (base) [root@cheery-slab-8 ~]# conda activate sky
    (sky) [root@cheery-slab-8 ~]# eval "$(ssh-agent -s)"
    Agent pid 5566
    (sky) [root@cheery-slab-8 ~]# ssh-agent -k
    unset SSH_AUTH_SOCK;
    unset SSH_AGENT_PID;
    echo Agent pid 5566 killed;
    (sky) [root@cheery-slab-8 hello-sky]# sky launch -c hello_cluster hello_sky.yaml
    ssh-agent is not running, so SSH key forwarding might not work properly. Try starting a new terminal session and manually run `eval "$(ssh-agent -s)"` to launch the ssh-agent and resolve this issue.

The normal workflow can still proceed.
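For reference, a standalone shell sketch of that warn-and-continue check (not the actual SkyPilot code; the warning text is copied from the output above, and the exit-code check relies on `ssh-add` returning 2 when it cannot reach an agent):

```bash
# Warn, but do not abort, when no ssh-agent is reachable.
ssh-add -l >/dev/null 2>&1
if [ "$?" -eq 2 ]; then
  echo 'ssh-agent is not running, so SSH key forwarding might not work properly.' \
    'Try starting a new terminal session and manually run `eval "$(ssh-agent -s)"`' \
    'to launch the ssh-agent and resolve this issue.' >&2
fi
# ...continue with the normal launch workflow...
```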
Thanks for the fix @zpoint! I just tested it with `sky launch` and `sky jobs launch`, and it seems to work well. We can merge this PR and fix our docs in a future PR.
- Addresses and resolves [Core] Allow ssh access from head to worker nodes.
- Create a test case file: `examples/mpirun_test.yaml`.
- Add the parameter `AddKeysToAgent=yes` in `sky/provision/provisioner.py` and `sky/utils/command_runner.py`. The `sky launch` and `sky exec` commands use the SSH command generated from these files by default. We currently use the `ControlMaster`/`ControlPath`/`ControlPersist` parameters to reuse the SSH connection, which makes it important to add this parameter on the first connection to a new machine; subsequent connections reuse the existing connection without adding the keys to the agent, so `AddKeysToAgent=yes` has no effect on them. See the sketch after this list.
- Add the parameter `AddKeysToAgent yes` in `sky/backends/backend_utils.py`, which writes `AddKeysToAgent yes` into the `~/.sky/generated/ssh` config files, so a user's key is automatically added to the agent when they type `ssh myclustername`.
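For illustration only, here is roughly what the first SSH invocation looks like once connection sharing is enabled; the key path, control path, and persist timeout below are placeholders rather than the exact values SkyPilot generates:

```bash
HEAD_IP=98.81.100.35   # head node address from the example above

# First connection: creates the multiplexed control socket and, because of
# AddKeysToAgent=yes, adds the identity to the local ssh-agent.
ssh -i ~/.ssh/sky-key \
    -o AddKeysToAgent=yes \
    -o ControlMaster=auto \
    -o ControlPath="/tmp/sky_ssh_%C" \
    -o ControlPersist=300s \
    ubuntu@"$HEAD_IP" 'echo connected'

# Later invocations that hit the same ControlPath reuse that socket, so
# per-connection options such as AddKeysToAgent=yes have no effect on them.
```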
You can verify by:
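For example, something along these lines (the per-cluster file name `hello_cluster` under `~/.sky/generated/ssh/` is assumed from the example above):

```bash
# The generated per-cluster SSH config should now include AddKeysToAgent.
grep -i AddKeysToAgent ~/.sky/generated/ssh/hello_cluster

# SSH to the head node; with the new option the key gets added to the local agent.
ssh hello_cluster

# Then, on the head node, the agent should list the key and the worker nodes
# should be reachable (compare with the failing log above):
#   ssh-add -l
#   ssh 172.16.39.21
```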
More details about the problem investigation can be found here
mpirun test:
This test YAML file is the same as the `examples/mpirun_test.yaml` created in this PR.

`sky jobs launch` test:

`sky launch` test:
Tested (run the relevant ones):
- `bash format.sh`
- `pytest tests/test_smoke.py` (don't have all the required cloud access; should be triggered by the CI tests)
- `pytest tests/test_smoke.py::test_fill_in_the_name` (don't have all the required cloud access; should be triggered by the CI tests)
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`