ssh_exchange_identification: Connection closed by remote host (RP v0.40) #1214
IIRC this is because of multiple CUs (and hence processes) running on the same node. If you increase the number of cores per CU, you shouldn't see this issue. I don't exactly remember what the number was (might have been 4 cores per CU), but I remember being able to avoid this issue by using an entire node for 1 CU (=24 cores per CU). |
At least 4 per node. |
Yeah, @vivek-bala is right, we are running out of ssh connections to start CUs when we run too many concurrent CUs per node. But the agent's orte startup methods should actually resolve this already! For that, you should use |
Hey @iparask , just to confirm: the limit is 4 CUs per node (= 6 cores per CU)? |
I was doing 6 CUs per node (4 cores per CU) |
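The CUs-per-node figures quoted above follow from simple division. A minimal sketch, assuming the 24-core nodes mentioned earlier in the thread; `cores_per_cu` is a hypothetical helper for illustration, not part of RP:

```shell
#!/bin/sh
# Hypothetical helper: cores available to each CU when a node's cores are
# split evenly among the CUs placed on it (integer division).
cores_per_cu() {
    echo $(( $1 / $2 ))
}

# Assumption taken from the thread: 24 cores per node.
for cus in 1 4 6; do
    echo "$cus CU(s) per node: $(cores_per_cu 24 "$cus") cores per CU"
done
```

Fewer CUs per node means fewer concurrent ssh-launched processes on that node, which is why raising the cores per CU avoided the error.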
Just to be clear, if this concerns the experiment/aimes stack, then it's not a concern for the 0.45 release (correct me if wrong, please)! |
This is likely also affecting |
In v0.45, ORTE is the default (xsede.comet). The ssh label is xsede.comet_ssh. |
I tried running using
I also get the following error in the
|
Hmm, this can't be |
PS.: the |
It should be indeed the
|
I updated the ORTE installation on comet, and the xsede resource config. Can you please try again? thanks! |
I just tried it again, but I get the following errors. From the JSON file, the pilots enter the From
From
From
|
Thanks Ming, I'll look into it! |
So this is caused by the switch from |
So, given that the new OpenMPI stack is installed on Comet, please confirm whether the example scripts work with 0.45.RC2 (if so, this ticket can be retargeted to the next release to solve problems relating to the experiment/aimes branch code) |
Sure. I'm already doing the testing so I'll let you know |
@andre-merzky, is there anything I can help with? |
nah - but thanks for the kind reminder... :) |
So according to the testing spreadsheet, everything except MPI units ( #1239 ) is working on Comet with the new OpenMPI installation, so let's leave this ticket to get the split-module branch working on XSEDE. |
Ping. Ming reports that on the experiment/aimes branch, ORTE also fails for non-MPI units. This is now blocking Ming's experiments for his paper and for the AIMES Experience one. We may want to have a look at it relatively soon. |
Hey Ming, Matteo - I am not getting jobs through the queue on comet unfortunately. Will stay on it. I assume that this is either a configuration issue, or we hit a process limit when creating the orterun children. The first is hopefully easy to fix, the second will probably mean that we need to switch to ortelib on comet. |
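One way to probe the process-limit hypothesis is to compare the planned number of concurrent children against the per-user process limit on the node. This is only a sketch: `fits_under_limit` is a hypothetical helper, the numbers 256 and 64 are illustrative, and `ulimit -u` is available in bash but not in every POSIX shell, so the sketch falls back to "unknown" when it is missing.

```shell
#!/bin/sh
# Hypothetical check: does a planned number of concurrent child processes,
# plus some headroom, fit under a given per-user process limit?
# Returns 0 if it fits, 1 if it does not, 2 if the limit is not a number.
fits_under_limit() {
    planned=$1
    limit=$2
    headroom=${3:-0}
    case "$limit" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 2 ;;
    esac
    [ $(( planned + headroom )) -le "$limit" ]
}

# Query the current limit; not all shells support "ulimit -u".
limit=$(ulimit -u 2>/dev/null) || limit=unknown

fits_under_limit 256 "$limit" 64
case $? in
    0) echo "256 children (+64 headroom) fit under the limit ($limit)" ;;
    1) echo "256 children (+64 headroom) may exceed the limit ($limit)" ;;
    *) echo "process limit could not be determined" ;;
esac
```

The check would need to run on a compute node, since login and compute nodes can have different limits.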
Hey Andre, I am using the following stack to run my experiments:
I was able to run my experiments successfully on SuperMIC, but they do not work on Comet (ORTE). I have a virtualenv which I source in the pre_exec in order to use radical.synapse. When the units begin |
Install it to |
I would like not to touch the sandbox on which RP runs if possible lest the dependencies of radical.synapse do not mesh well with those of RP. |
Ming, please run the following commands on comet:
module load python
source ~/radical.pilot.sandbox/ve_comet/bin/activate
module use --append /home/amerzky/ompi/modules/
module load openmpi/2017_02_17_6da4dbb
pip install orte_cffi
Let me know if that gives any errors. Once done, please use the
Re synapse: I usually create a separate virtualenv for synapse (
#!/bin/sh
module load python
. $HOME/ve_synapse/bin/activate
radical-synapse-sample $*
and then call that via
or whatever I want to emulate via synapse. |
I can't run |
IIUC, |
Did you get an error actually? Or you didn't try? |
I should have been more clear. I subbed my home directory in place of Andre's. So Vivek's comment addressed my problem. |
I tested for 16 CUs on Comet and this branch works, and am going to submit 256 CUs to see how the branch performs. However, I now get the following problem on Stampede. It seems that Python was not loaded on Stampede
|
Can you please open a new ticket for Stampede and report your stack there? |
Done. See 1276 |
Great. Let us know how things scale on comet, and if we can close this ticket then. Thanks! |
I can run 256 CUs on Comet. We can close this ticket after |
I managed to get up to 1024 CUs on Comet. When will |
This is merged now - thanks for testing! |
When I try to run on Comet, I always get a few units which fail. The failing units give the following error:
While this issue has been addressed in issue #1105, the solution was to upgrade to a newer version of RP. However, I am using the experiment/aimes stack of RADICAL Cybertools to run XSEDE/OSG experiments. Is there a solution to this issue without using a new version of RP?
This is the RP stack I am running.