-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
verdi computer test
: Add test for login shell being slow
#5845
verdi computer test
: Add test for login shell being slow
#5845
Conversation
By default computers are configured to use a login shell, setting the `use_login_shell` configuration option to `True`, which ensures that the `Transport` always uses the `-l` option for bash when executing a command. This can be important for certain compute environments where the login script contains commands that are necessary for calculation jobs to execute successfully. However, this can incur significant slowdowns if the login script loading is slow. This will affect the throughput of calculation jobs significantly the source of which can be difficult to track down. That is why a test is added to `verdi computer test`. It will execute `whoami` over the transport with and without a login shell and if the ratio exceeds 2, a warning is emitted, suggesting to turn off the option unless strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sphuber, very useful!
some minor comments
finally: | ||
authinfo.set_auth_params(auth_params) | ||
|
||
if timing_true > (timing_false * factor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to consider introducing an absolute threshold for the difference as well (say, 100ms).
If the timing in both cases is very short (e.g. it goes to the local machine, and there are no "slow" commands in the .bashrc), the current approach might be brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. But I have questions about the implementation. My first instinct was to use math.isclose
however this does not work as I expected. If I set relative tolerance of 0.5 (factor of two), it would still mark two very small numbers, say 0.01 and 0.011, as close even for a negligible absolute tolerance of 0.000000001
. The same goes for numpy.isclose
. They effectively consider both tolerances, and as long as one is respected, the numbers are considered close. I think in this case, we want it to fail as soon as either difference (abs or relative) exceeds the tolerance, by itself. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of the absolute threshold was to avoid false positives when the absolute time delta is very small and could be dominated by "noise".
So I would have raised the warning only if both thresholds are exceeded.
Come to think of it, I guess we mostly care about absolute time differences here anyhow, so as long as we have a somewhat reliable timing (via the average of the 3 attempts), one could also go with an absolute time threshold only (as in the verdi
load tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, simplified to just checking the difference is within 100 ms
@ltalirz Thanks for the review. I have addressed the comments. Just not sure about my implementation of the addition of absolute tolerance. And I haven't addressed the comment about including the computer label in the error message as it would require a refactor. If you think this is worth it, happy to do it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sphuber !
looks good to me
Just encountered this in practise for the first time: * Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 3.693061113357544 s versus 3.409668763478597, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failed I understand this is useful, and at least it made me thing of whether I really need a login shell or not, but new users already struggle quite a bit with setting up/configuring computers. So perhaps:
|
Also, after trying the instructions: ❯ verdi computer test eiger
Report: Testing computer<eiger> for user<marnik.bercx@epfl.ch>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... [OK]: 532 jobs found in the queue
* Determining remote user name... [OK]: mbercx
* Creating and deleting temporary file... [OK]
* Checking for possible delay from using login shell... [Failed]: computer is configured to use a login shell, which is slower compared to a normal shell (Command execution time of 4.225524981816609 s versus 3.783755381902059, respectively).
Unless this setting is really necessary, consider disabling it with: verdi computer configure core.local COMPUTER_NAME -n --no-use-login-shell
Warning: 1 out of 6 tests failed
❯ verdi computer configure core.local eiger -n --no-use-login-shell
Critical: Computer eiger has transport of type "core.ssh", not core.local!
❯ verdi computer configure core.ssh eiger -n --no-use-login-shell
Usage: verdi computer configure core.ssh [OPTIONS] COMPUTER
Try 'verdi computer configure core.ssh --help' for help.
Error: Missing option '--safe-interval'. This raises more questions on the interface of the various setup/configure commands, but will try to find the time to address these more holistically, see #4981 |
The |
@sphuber just ran into this again and opened some issues so we don't forget these! |
Fixes #5844
By default computers are configured to use a login shell, setting the
use_login_shell
configuration option toTrue
, which ensures that theTransport
always uses the-l
option for bash when executing a command. This can be important for certain compute environments where the login script contains commands that are necessary for calculation jobs to execute successfully.However, this can incur significant slowdowns if the login script loading is slow. This will affect the throughput of calculation jobs significantly the source of which can be difficult to track down. That is why a test is added to
verdi computer test
. It will executewhoami
over the transport with and without a login shell and if the ratio exceeds 2, a warning is emitted, suggesting to turn off the option unless strictly necessary.