Avoid deadlocks when retrieving stdout/stderr via SSH #3787
Conversation
Just one minor comment: since this touches the SSH connection, it might be good to test it in actual production runs before merging.
Agreed - this was also the comment I gave in person to @sphuber: if this breaks, it breaks most of AiiDA's functionality :-D But it would also be great if someone were willing to think through whether the logic is correct, so that we at least have the feeling that we know why it is supposed to work (I am quite convinced that it's OK, but I might be missing something).
Logic seems to check out, assuming that the Paramiko methods are honest and we find out the answer to @greschd's question.
Also, did you benchmark the speed of this approach for small stdout/stderr?
From a quick benchmark on small files, the time seemed to be the same (see #3787: 10kB file transfer: 0.53s [new] vs 0.53s [old]).
What happened to the big difference we were seeing originally, where the new buffer-switching method was even a lot faster than the old naive implementation? Was that a measurement error?
This is still there; again, see the issue (e.g. …
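For reference, a hypothetical micro-benchmark along these lines could look as follows (the computer label and remote path are placeholders, and this assumes the branch with `exec_command_wait_bytes` is checked out; run it e.g. via `verdi run` so that an AiiDA profile is loaded):

```python
import time

from aiida import orm

computer = orm.Computer.get(label='my_computer')  # placeholder label
transport = computer.get_transport()
with transport:  # opens the SSH connection, for SSH transports
    start = time.monotonic()
    retcode, stdout, stderr = transport.exec_command_wait_bytes(
        'cat /path/to/remote/10kB_file')  # placeholder path
    elapsed = time.monotonic() - start
print(f'retcode={retcode}: {len(stdout)} bytes in {elapsed:.2f} s')
```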
Here is some code to test whether it is working for you (remember, of course, to check out this branch):

```python
def get_file_md5(computer_name, remote_file_name):
    """
    Given an AiiDA computer name and a remote file name, return the MD5
    checksum of the file content as a string.
    """
    import hashlib
    from aiida import orm
    from aiida.common.escaping import escape_for_bash

    computer = orm.Computer.get(label=computer_name)
    transport = computer.get_transport()
    # This opens the SSH connection, for SSH transports
    with transport:
        retcode, stdout, stderr = transport.exec_command_wait_bytes(
            "cat {}".format(escape_for_bash(remote_file_name)))
    assert retcode == 0, f'wrong retcode {retcode}; stderr={stderr}, stdout={stdout}'
    assert stderr == b'', f'non-empty stderr: stderr={stderr}, stdout={stdout}'
    return str(hashlib.md5(stdout).hexdigest())

if __name__ == "__main__":
    import sys
    print(get_file_md5(sys.argv[1], sys.argv[2]))
```

Save it into a file, e.g. … (note that in the first case you need to pass the AiiDA computer name). It would be great if you could perform some tests on some SSH remotes you have.
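As a purely illustrative example (the file name below is a placeholder, since the original name is elided above): if you save the script as `get_md5.py`, you can run something like `verdi run get_md5.py <computer_label> /path/to/remote/file`; using `verdi run` also ensures an AiiDA profile is loaded before `orm.Computer.get` is called.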
I have run one of my benchmark suites (558 …
I added you as a collaborator on my fork - once you accept it, you can just push a new commit to my branch.
@giovannipizzi uhm, so, who do you mean by "you"?
@ramirezfranciscof ahah, I had multiple tabs open and I put the comment in the wrong tab/issue :-) I had been wondering where this comment had gone; I knew I had written it but couldn't find it anymore.
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #3787     +/-   ##
==========================================
+ Coverage    80.10%   80.11%   +0.02%
==========================================
  Files          515      515
  Lines        36626    36665      +39
==========================================
+ Hits         29335    29370      +35
- Misses        7291     7295       +4
```
Maybe as a note for the future: for Python 3.6+ there's an alternative to Paramiko, AsyncSSH, which may tie in better with the current architecture of AiiDA and avoid some of the problems encountered with Paramiko (while probably causing others, as usual).
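For the record, here is a minimal sketch of what running a command with separate stdout/stderr looks like in AsyncSSH (this assumes the `asyncssh` package is installed; it illustrates the alternative mentioned above and is not code from this PR):

```python
import asyncio

import asyncssh


async def run_command(host: str, command: str):
    """Run `command` on `host`; return (exit_status, stdout, stderr)."""
    async with asyncssh.connect(host) as conn:
        # conn.run() drains stdout and stderr concurrently, so the
        # alternating-read bookkeeping is handled by the library itself.
        result = await conn.run(command, check=False)
        return result.exit_status, result.stdout, result.stderr


if __name__ == '__main__':
    print(asyncio.run(run_command('localhost', 'echo hello')))
```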
Ok, I optimised the parameters to get reasonably good throughput (I can reach 100MB/s over SSH on the local host, a bit slower than pure SSH+md5 but still reasonably good) - see also the comments and the commit message in my last commit. Therefore, as soon as I get some feedback from @pzarabadip and @adegomme, I can convert this draft into a PR.
Hi @giovannipizzi, thanks for this. It seems quite okay to me. Unfortunately, my access to the metavo machines has ended, so I cannot test it. To be honest, I'm not aware of any other user who is likely to face backward-compatibility issues. I might get access again in the future, and then I'll modify the plugin to reflect these changes (of course, if it is not to be done now).
Thanks @pzarabadip! I suggest that in that case you either apply the fix anyway (and open an issue as a reminder to remove the copy-pasted function at some point), or just open an issue pointing here, but add a limit on …
Thanks @giovannipizzi. I just limited the version and opened an issue for future tracking. Cheers.
Very interesting work! I can probably not provide as deep a technical review as you have already received, but I spotted a couple of typos and had a couple of comments that might be of some use.
Hi,
I can confirm that it seems to be working fine. To avoid breaking support, I will keep the ssh.py file for now, but add a version check at import, to use the aiida-core one for aiida > 1.6. Thanks a lot.
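A hypothetical sketch of such an import-time version check (the module paths and the local fallback name are assumptions about the plugin layout, not code from this thread):

```python
# Prefer aiida-core's SSH transport when it is new enough to contain the
# bytes-based API; otherwise fall back to the copy bundled with the plugin.
from packaging.version import Version

import aiida

if Version(aiida.__version__) > Version('1.6'):
    from aiida.transports.plugins.ssh import SshTransport
else:
    from .ssh import SshTransport  # hypothetical bundled copy
```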
…/aiida-core#3787 should get merged soon
Fixes aiidateam#1525. The simple shift of the line `retval = channel.recv_exit_status()` below `stdout.read()` and `stderr.read()`, while partially improving the situation, is not enough (see discussion in aiidateam#1525). This instead reads both stdout and stderr in chunks, alternating between them. Tests show that this no longer hangs for large files.
Now, I add an `exec_command_wait_bytes` method that actually does the job, and this needs to be implemented by plugins. The two transport plugins implemented in AiiDA already define the new function. I also improved the API for stdin in both, so that these commands can also accept bytes or BytesIO objects, which makes more sense in general.

I tried to keep the calling API backward-compatible, so using e.g. a str/StringIO as stdin still works, and similarly `exec_command_wait` works as before, but gains an optional `encoding` parameter (the default is `utf-8` to preserve the old behaviour, while giving more flexibility to use different encodings; a minimal sketch of this delegation is shown after the commit message below). Tests have been added for these use cases.

However, note that the interface for plugins has changed (plugins now have to define the `exec_command_wait_bytes` function instead). The change is minimal (change the function name and avoid decoding the bytes into strings), but plugins extending the transport functionality should take care of this (and, if they drop `exec_command_wait`, they should depend on AiiDA >= 1.6; otherwise they should copy the implementation from the general Transport class if they want to remain backward-compatible).

Also, I changed a couple of places to call the `_bytes` version directly, to avoid possible issues with strange encodings on the remote computer. Others are left as they are, to avoid too many changes in the code, since until now no one has complained about a weird encoding on their supercomputer. When such an issue arises we'll fix it, and thanks to this PR the code is now ready to treat bytes directly (or to use a custom encoding that, e.g., might need to be defined in the `verdi computer setup` or `verdi computer configure` steps in the future).
I could get ~100MB/s when not using compression and ~33MB/s when using it (the second case is capped by the compression speed; the time is only marginally larger than the one I get if I run, on the command line, `time ssh -C localhost cat /path/to/remote/file | md5sum`). The first case is not as good as running md5 via SSH on the command line (which can be ~3x faster on a 4GB file), but it's still acceptable for most use cases. (Note that here we're speaking of differences measurable only when hundreds of MB are sent over stdout, which should not be the default way of transferring a lot of data anyway.)
Co-authored-by: Francisco Ramirez <francisco.ramirez@epfl.ch>
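To make the backward-compatible layer described in the commit message concrete, here is a minimal sketch of how the str-based method can delegate to the bytes-based one on the generic Transport class (the signature follows the commit message; the actual code in the PR may differ in details):

```python
def exec_command_wait(self, command, stdin=None, encoding='utf-8', **kwargs):
    """Execute `command` and return (retval, stdout, stderr) as str.

    Delegates to the bytes-based variant and decodes the output, so the old
    str-based API keeps working while transport plugins only need to
    implement `exec_command_wait_bytes`.
    """
    retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(
        command, stdin=stdin, **kwargs)
    return retval, stdout_bytes.decode(encoding), stderr_bytes.decode(encoding)
```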
Thanks @ramirezfranciscof, and sorry for the long time it took! If tests pass, I think this is ready to be reviewed!
Great, thanks!
@giovannipizzi you mean merged? Bah, unless you wanted someone else in particular to take a look at it. Also, there is the fact that the branch is a bit outdated now. I'll try to merge develop into this; hopefully it will manage to do it automatically, but if it breaks then I think you might have to rebase or merge locally and manually fix any problems.
Yeah, I meant just a final check + merge. I had rebased it, so the changes are minimal (and the merge seems to have worked).
All good to go then, @giovannipizzi! I don't know if you can/want to write a more descriptive commit message for this; otherwise I can merge with the message in the OP.
Commit message clarified, and migration notes added to the wiki. @pzarabadip @adegomme: as discussed above, this will affect your plugins, so you might want to take note of this and decide if/when to take action.
Thanks @giovannipizzi. I will adapt the plugins to reflect these changes. Cheers.
Now that the [corresponding PR in aiida-core](aiidateam/aiida-core#3787) has been merged, we know that 1.6 still has the old syntax and the new changes will appear in 2.x only, so I suggest changing this check.
Fixes #1525

The simple shift of the line `retval = channel.recv_exit_status()` below `stdout.read()` and `stderr.read()`, while partially improving the situation, is not enough (see discussion in #1525).
This instead reads both stdout and stderr in chunks, alternating between them.
Tests (added) show that this does not hang anymore for large files.
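To make the mechanism concrete, here is a simplified sketch of the alternating chunked read over a Paramiko channel (the function name and polling details are illustrative assumptions; the actual implementation in this PR additionally handles timeouts and other edge cases):

```python
import time


def read_streams_alternating(channel, chunk_size=16384):
    """Drain stdout and stderr of a paramiko.Channel in alternating chunks.

    Reading one stream to EOF before touching the other can deadlock: if the
    remote process fills the buffer of the stream we are not reading, it
    blocks on write while we block on read. Draining both buffers in turn
    keeps the remote process running.
    """
    stdout_chunks, stderr_chunks = [], []
    while True:
        got_data = False
        if channel.recv_ready():  # non-blocking check: stdout has data
            stdout_chunks.append(channel.recv(chunk_size))
            got_data = True
        if channel.recv_stderr_ready():  # non-blocking check: stderr has data
            stderr_chunks.append(channel.recv_stderr(chunk_size))
            got_data = True
        if not got_data:
            if channel.exit_status_ready():
                break  # command finished and both buffers are drained
            time.sleep(0.01)  # avoid busy-waiting while the command runs
    # Only now is it safe to ask for the exit status: nothing is left to
    # drain, so this call cannot deadlock.
    retval = channel.recv_exit_status()
    return retval, b''.join(stdout_chunks), b''.join(stderr_chunks)
```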