Avoid deadlocks when retrieving stdout/stderr via SSH #3787
Conversation
Just one minor comment: since this touches the SSH connection, it might be good to test it in actual production runs before merging.
Agreed - this was also the comment I gave in person to @sphuber: if this breaks, it breaks most of AiiDA's functionality :-D But it would also be great if someone were willing to think through whether the logic is correct, so that we at least have the feeling that we know why it is supposed to work (I am quite convinced that it's OK, but I might be missing something).
Logic seems to check out, assuming that the Paramiko methods are honest and we find out the answer to @greschd's question.
Also, did you benchmark the speed of this approach for small stdout/stderr?
From a quick benchmark on small files, the time seemed to be the same (see #3787: 10kB file transfer: 0.53s [new] vs 0.53s [old]).
What happened to the big difference we were seeing originally, where the new buffer-switching method was even a lot faster than the old naive implementation? Was that a measurement error?
This is still there; again, see the issue (e.g. …
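For reference, a hypothetical micro-benchmark along these lines could look as follows (the computer label and remote path are placeholders, and this assumes the branch with `exec_command_wait_bytes` is checked out; run it e.g. via `verdi run` so that an AiiDA profile is loaded):

```python
import time

from aiida import orm

computer = orm.Computer.get(label='my_computer')  # placeholder label
transport = computer.get_transport()
with transport:  # opens the SSH connection, for SSH transports
    start = time.monotonic()
    retcode, stdout, stderr = transport.exec_command_wait_bytes(
        'cat /path/to/remote/10kB_file')  # placeholder path
    elapsed = time.monotonic() - start
print(f'retcode={retcode}: {len(stdout)} bytes in {elapsed:.2f} s')
```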
Here is some code to test whether it is working for you (remember, of course, to check out this branch):

```python
def get_file_md5(computer_name, remote_file_name):
    """
    Given an AiiDA computer name and a remote file name, return the MD5
    checksum of the file content as a string.
    """
    import hashlib
    from aiida import orm
    from aiida.common.escaping import escape_for_bash

    computer = orm.Computer.get(label=computer_name)
    transport = computer.get_transport()
    # This opens the SSH connection, for SSH transports
    with transport:
        retcode, stdout, stderr = transport.exec_command_wait_bytes(
            "cat {}".format(escape_for_bash(remote_file_name)))
    assert retcode == 0, f'wrong retcode {retcode}; stderr={stderr}, stdout={stdout}'
    assert stderr == b'', f'non-empty stderr: stderr={stderr}, stdout={stdout}'
    return str(hashlib.md5(stdout).hexdigest())

if __name__ == "__main__":
    import sys
    print(get_file_md5(sys.argv[1], sys.argv[2]))
```

Save it into a file, e.g. … (note that in the first case you need to pass the AiiDA computer name). It would be great if you could perform some tests on some SSH remotes you have.
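As a purely illustrative example (the file name below is a placeholder, since the original name is elided above): if you save the script as `get_md5.py`, you can run something like `verdi run get_md5.py <computer_label> /path/to/remote/file`; using `verdi run` also ensures an AiiDA profile is loaded before `orm.Computer.get` is called.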
I have run one of my benchmark suites (558 …
I added you as a collaborator on my fork - once you accept it, you can just push a new commit to my branch.
@giovannipizzi uhm, so, who do you mean by "you"?
@ramirezfranciscof ahah, I had multiple tabs open and I put the comment in the wrong tab/issue :-) I had been wondering where this comment had gone; I knew I had written it but couldn't find it anymore.
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #3787     +/-   ##
==========================================
+ Coverage    80.10%   80.11%   +0.02%
==========================================
  Files          515      515
  Lines        36626    36665      +39
==========================================
+ Hits         29335    29370      +35
- Misses        7291     7295       +4
```
Maybe as a note for the future: for Python 3.6+ there's an alternative to Paramiko, AsyncSSH, which may tie in better with the current architecture of AiiDA and avoid some of the problems encountered with Paramiko (while probably causing others, as usual).
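For the record, here is a minimal sketch of what running a command with separate stdout/stderr looks like in AsyncSSH (this assumes the `asyncssh` package is installed; it illustrates the alternative mentioned above and is not code from this PR):

```python
import asyncio

import asyncssh


async def run_command(host: str, command: str):
    """Run `command` on `host`; return (exit_status, stdout, stderr)."""
    async with asyncssh.connect(host) as conn:
        # conn.run() drains stdout and stderr concurrently, so the
        # alternating-read bookkeeping is handled by the library itself.
        result = await conn.run(command, check=False)
        return result.exit_status, result.stdout, result.stderr


if __name__ == '__main__':
    print(asyncio.run(run_command('localhost', 'echo hello')))
```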
Ok, I optimised the parameters to get reasonably good throughput (I can reach 100MB/s over SSH on the local host, a bit slower than pure SSH+md5 but still reasonably good) - see also the comments and the commit message in my last commit. Therefore, as soon as I get some feedback from @pzarabadip and @adegomme, I can convert this draft into a PR.
Hi @giovannipizzi, thanks for this. It seems quite okay to me. Unfortunately, my access to the metavo machines has ended, so I cannot test it. To be honest, I'm not aware of any other user who is likely to face backward-compatibility issues. I might get access again in the future, and then I'll modify the plugin to reflect these changes (of course, if it is not to be done now).
Thanks @pzarabadip! I suggest that in that case you either apply the fix anyway (and open an issue as a reminder to remove the copy-pasted function at some point), or just open an issue pointing here, but add a limit on …
Thanks @giovannipizzi. I just limited the version and opened an issue for future tracking. Cheers.
Very interesting work! I can probably not provide as deep a technical review as you have already received, but I spotted a couple of typos and had a couple of comments that might be of some use.
Hi,
I can confirm that it seems to be working fine. To avoid breaking support, I will keep the ssh.py file for now, but add a version check at import, to use the aiida-core one for aiida > 1.6. Thanks a lot.
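A hypothetical sketch of such an import-time version check (the module paths and the local fallback name are assumptions about the plugin layout, not code from this thread):

```python
# Prefer aiida-core's SSH transport when it is new enough to contain the
# bytes-based API; otherwise fall back to the copy bundled with the plugin.
from packaging.version import Version

import aiida

if Version(aiida.__version__) > Version('1.6'):
    from aiida.transports.plugins.ssh import SshTransport
else:
    from .ssh import SshTransport  # hypothetical bundled copy
```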
…/aiida-core#3787 should get merged soon
Fixes aiidateam#1525. The simple shift of the line `retval = channel.recv_exit_status()` below `stdout.read()` and `stderr.read()`, while partially improving the situation, is not enough (see discussion in aiidateam#1525). This instead reads both stdout and stderr in chunks, alternating between them. Tests show that this no longer hangs for large files.
Now, I add an `exec_command_wait_bytes` method that actually does the job, and this needs to be implemented by plugins. The two transport plugins implemented in AiiDA already define the new function. I also improved the API for stdin in both, so that these commands can also accept bytes or BytesIO objects, which makes more sense in general.

I tried to keep the calling API backward-compatible, so using e.g. a str/StringIO as stdin still works, and similarly `exec_command_wait` works as before, but gains an optional `encoding` parameter (the default is `utf-8` to preserve the old behaviour, while giving more flexibility to use different encodings; a minimal sketch of this delegation is shown after the commit message below). Tests have been added for these use cases.

However, note that the interface for plugins has changed (plugins now have to define the `exec_command_wait_bytes` function instead). The change is minimal (change the function name and avoid decoding the bytes into strings), but plugins extending the transport functionality should take care of this (and, if they drop `exec_command_wait`, they should depend on AiiDA >= 1.6; otherwise they should copy the implementation from the general Transport class if they want to remain backward-compatible).

Also, I changed a couple of places to call the `_bytes` version directly, to avoid possible issues with strange encodings on the remote computer. Others are left as they are, to avoid too many changes in the code, since until now no one has complained about a weird encoding on their supercomputer. When such an issue arises we'll fix it, and thanks to this PR the code is now ready to treat bytes directly (or to use a custom encoding that, e.g., might need to be defined in the `verdi computer setup` or `verdi computer configure` steps in the future).
I could get ~100MB/s when not using compression and ~33MB/s when using it (the second case is capped by the compression speed; the time is only marginally larger than the one I get if I run, on the command line, `time ssh -C localhost cat /path/to/remote/file | md5sum`). The first case is not as good as running md5 via SSH on the command line (which can be ~3x faster on a 4GB file), but it's still acceptable for most use cases. (Note that here we're speaking of differences measurable only when hundreds of MB are sent over stdout, which should not be the default way of transferring a lot of data anyway.)
Co-authored-by: Francisco Ramirez <francisco.ramirez@epfl.ch>
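To make the backward-compatible layer described in the commit message concrete, here is a minimal sketch of how the str-based method can delegate to the bytes-based one on the generic Transport class (the signature follows the commit message; the actual code in the PR may differ in details):

```python
def exec_command_wait(self, command, stdin=None, encoding='utf-8', **kwargs):
    """Execute `command` and return (retval, stdout, stderr) as str.

    Delegates to the bytes-based variant and decodes the output, so the old
    str-based API keeps working while transport plugins only need to
    implement `exec_command_wait_bytes`.
    """
    retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(
        command, stdin=stdin, **kwargs)
    return retval, stdout_bytes.decode(encoding), stderr_bytes.decode(encoding)
```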
Thanks @ramirezfranciscof, and sorry for the long time it took! If tests pass, I think this is ready to be reviewed!
Great, thanks!
@giovannipizzi you mean merged? Bah, unless you wanted someone else in particular to take a look at it. Also, there is the fact that the branch is a bit outdated now. I'll try to merge develop into this; hopefully it will manage to do it automatically, but if it breaks then I think you might have to rebase or merge locally and manually fix any problems.
Yeah, I meant just a final check + merge. I had rebased it, so the changes are minimal (and the merge seems to have worked).
All good to go then, @giovannipizzi! I don't know if you can/want to write a more descriptive commit message for this; otherwise I can merge with the message in the OP.
Commit message clarified, and migration notes added to the wiki. @pzarabadip @adegomme: as discussed above, this will affect your plugins, so you might want to take note of this and decide if/when to take action.
Thanks @giovannipizzi. I will adapt the plugins to reflect these changes. Cheers.
Now that the [corresponding PR in aiida-core](aiidateam/aiida-core#3787) has been merged, we know that 1.6 still has the old syntax and the new changes will appear in 2.x only, so I suggest changing this check.
Fixes #1525

The simple shift of the line `retval = channel.recv_exit_status()` below `stdout.read()` and `stderr.read()`, while partially improving the situation, is not enough (see discussion in #1525).
This instead reads both stdout and stderr in chunks, alternating between them.
Tests (added) show that this does not hang anymore for large files.
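To make the mechanism concrete, here is a simplified sketch of the alternating chunked read over a Paramiko channel (the function name and polling details are illustrative assumptions; the actual implementation in this PR additionally handles timeouts and other edge cases):

```python
import time


def read_streams_alternating(channel, chunk_size=16384):
    """Drain stdout and stderr of a paramiko.Channel in alternating chunks.

    Reading one stream to EOF before touching the other can deadlock: if the
    remote process fills the buffer of the stream we are not reading, it
    blocks on write while we block on read. Draining both buffers in turn
    keeps the remote process running.
    """
    stdout_chunks, stderr_chunks = [], []
    while True:
        got_data = False
        if channel.recv_ready():  # non-blocking check: stdout has data
            stdout_chunks.append(channel.recv(chunk_size))
            got_data = True
        if channel.recv_stderr_ready():  # non-blocking check: stderr has data
            stderr_chunks.append(channel.recv_stderr(chunk_size))
            got_data = True
        if not got_data:
            if channel.exit_status_ready():
                break  # command finished and both buffers are drained
            time.sleep(0.01)  # avoid busy-waiting while the command runs
    # Only now is it safe to ask for the exit status: nothing is left to
    # drain, so this call cannot deadlock.
    retval = channel.recv_exit_status()
    return retval, b''.join(stdout_chunks), b''.join(stderr_chunks)
```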