RP 0.45.RC1 ORTE failure on comet #1218
Comments
@marksantcroos : Mark, I guess this needs a new deployment of ompi. Should I use the same OMPI commit we use for |
Yes, indeed.
That's a safe bet. |
The error is reproducible. Additionally, I see the following text while the job is still waiting in the queue:
|
Sorry for the delay, but the comet batch scheduler doesn't like me right now, so tests are still pending. But if you want to give it a go, please check out the |
I can confirm the examples started working on comet (haven't finished all the examples yet). But I also get the same message as Srinivas posted.
Note that it's labelled as an ERROR but does not lead to cancellation or termination. |
That message has been removed in a different pull request and should be gone in the next release candidate. See https://github.com/radical-cybertools/saga-python/pull/616 |
Ok, all the examples worked on comet from This is surprising. The MPI example worked even though we don't have an mpi4py module (which the example uses) built against the RP OpenMPI. Any ideas how or why? |
Probably because of the module load (which Andre has since removed)? |
Yes, but that would use the system mpi4py, which is probably linked against some MPI library other than the RP OpenMPI. Shouldn't this cause a conflict? I remember running into such a conflict on BW when I was using it. Maybe this isn't the case anymore. |
No, that loaded mpi4py linked against openmpi. |
I don't think it is. Maybe I'm missing something. Please see:
|
Tested it within a CU: cudesc:
stdout:
|
The pilot env should not leak into the CU env, right? So if the pilot pre-exec gets a CU to work, I would consider that a bug, really :/ But maybe the system module works because (a) the dynamic linker finds our openmpi libs suitable, or (b) the openmpi versions are, by chance, sufficiently compatible for our test? |
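One way to check hypothesis (a) is to look at which MPI shared libraries the dynamic linker actually resolves for the mpi4py extension module. The sketch below is not from this thread: it parses `ldd`-style output and reports where the Open MPI libraries (`libmpi`, `libopen-rte`, `libopen-pal`) resolve to. The function name `find_mpi_libs` and the sample paths are hypothetical and only serve as illustration.

```python
import re

# Matches lines such as "libmpi.so.12 => /some/path/libmpi.so.12 (0x...)"
# as well as "libmpi.so.12 => not found".
_LDD_LINE = re.compile(r"\s*(\S+)\s+=>\s+(.+?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$")

# Library name prefixes that belong to an Open MPI installation.
_MPI_NAME = re.compile(r"^lib(mpi|open-rte|open-pal)")


def find_mpi_libs(ldd_output):
    """Return {library name: resolved path or None} for MPI-related entries.

    `ldd_output` is the text printed by `ldd <shared object>`.
    A value of None means the linker reported "not found".
    """
    libs = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if not match:
            continue  # e.g. "linux-vdso.so.1 (0x...)" has no "=>"
        name, target = match.group(1), match.group(2)
        if _MPI_NAME.match(name):
            libs[name] = None if target == "not found" else target
    return libs


if __name__ == "__main__":
    # Hypothetical sample; real output would come from running ldd in the CU.
    sample = (
        "\tlinux-vdso.so.1 (0x00007ffc00000000)\n"
        "\tlibmpi.so.12 => /hypothetical/ompi/lib/libmpi.so.12 (0x00007f0000000000)\n"
        "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000100000)\n"
    )
    for name, path in find_mpi_libs(sample).items():
        print(name, "->", path)
```

Inside a CU, the input could be obtained by running `ldd` on the file that `mpi4py.MPI.__file__` points to and passing its stdout to the function. If the reported path lies in the RP OpenMPI installation rather than the system one, that would support (a).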
Yeah, that seems to be the case here. |
How did you interpret that, actually? I see the script mentions another tag; I consider 6da4dbb the last known good one. |
Oh, I see - my bad I guess... |
That is confusing though: Can you reconfirm which of the three ( |
For recording purposes, the current |
Thanks for the clarification, that helps a lot. So at this point the recommended ompi installation is Thanks! |
Hmm, wait: my understanding from the above was that the mpi example still works after removing the |
Ah. I assumed that it's leaking since I didn't have to do the following in the CUs (in any of the examples):
but maybe |
0.45.RC2 should now use the new OpenMPI installation, please try again with that and let us know how it goes. |
I don't face this issue anymore with RC2. Consider this resolved. |
The CUs start executing and then fail with the following error: