-
Couldn't load subscription status.
- Fork 928
Check for OPAL_PREFIX and set corresponding PMIX_PREFIX if found #3985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 3744967)
|
@rhc54 I'm going to runtime-check it and report back |
| } | ||
| /* check for OPAL_PREFIX */ | ||
| if (NULL != (evar = getenv("OPAL_PREFIX"))) { | ||
| opal_setenv("PMIX_PREFIX", evar, false, &environ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhc54 while I'm checking if this fixes my particular case I have a coment.
This fix is focused on internal PMIx. This may break external pmix in following cases:
- if
OPAL_PREFIXis set - we will setPMIX_PREFIXand this will mislead external pmix - if no
OPAL_PREFIXprovided but user still needs to guide external PMIx - this fix will not help.
A;so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can fix the first case by moving this down into opal/mca/pmix/pmix2x itself (note that you can't put it in the ORTE code as that wouldn't help the direct launch situation). There is no way to fix the other case - the user will just have to provide PMIX_PREFIX in that scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about the second case: that is fine to provide PMIX_PREFIX, the problem is it won't be propagated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second case will indeed be propagated. Today, mpirun harvests all envars starting with OMPI_ and PMIX_ and forwards them. If someone direct launches, then the launcher is responsible for forwarding the environment. Thus, if someone sets PMIX_PREFIX in their environment, it should get to the backend. There is no way we can determine whether or not PMIX_PREFIX is required for the external PMIx installation.
So the only case we have to deal with is the one where OPAL_PREFIX is set and we need to ensure that PMIX_PREFIX corresponds. As you correctly pointed out, this only happens when we are using the embedded version of PMIx.
|
But that's what my issue was all about. We are setting PMIX_INSTALL_PREFIX and it is not propagated. |
…IX to the internal integration code for PMIx so we only do this when running with the embeddied PMIx Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 8f34fa4)
|
@artpol84 I'm not sure what you mean - in master, we definitely pickup and propagate everything starting with PMIX_. Is your concern that this code hasn't moved to the 3.0 branch? FWIW: It is PMIX_PREFIX, not PMIX_INSTALL_PREFIX - are you setting the right variable? The problem this PR solves is that someone setting OPAL_PREFIX expects the embedded PMIx to also adjust its plugin location. So we need to detect OPAL_PREFIX and manually force PMIX_PREFIX in that case. Since OPAL_PREFIX is picked up and propagated, we don't have to worry about doing PMIX_PREFIX separately - we can just locally do the right thing. |
|
Just checked and 3.0 is indeed picking up and propagating all PMIX_ envars |
|
As I updated on #3980 (#3980 (comment)) I agree that auto-set PMIX search path if OPAL one is used is a good idea, it is related to #3980 but it doesn't fix #3980. I don't have concerns - I see at runtime that v3.0.x is not able to work if sources are moved to another location and multiple nodes are used (see #3980 (comment)). I see that As far as
Have you tried to reproduce the problem in your environment? |
|
|
|
Also - propagation to what entity have you tested? To |
Correction: I expect this PR to fix the case where internal PMIx is used. However I have concerns about external one then. |
|
@artpol84 I honestly don't have any more time to spend on this one. I would suggest you do two things:
Those are the only cases we can impact. We cannot do anything for the cases where we direct launch or launch via the other plm components - the system admin will have to configure their environment to forward the necessary envars. |
|
that sounds ok, @rhc54. |
|
I've addressed what was suggested in #3985 (comment). This solves my problem and I believe will solve 99.9% of cases. |
|
@bwbarrett @hppritcha |
this came up recently. I guess what it means is that env var is not checked and could be constructed in the way that may (un)intentionally break the logic. |
|
We ignore those coverity issues. What they mean is that the envar could have been something that causes you to do evil things, and we don't restrict the potential values to avoid that scenario. In these cases, there is no way to know if a value is right or not as the range is any valid path. Technically, we could check the string to see if it points to a valid path - but then you'd get the same warning because you are using the string in a function without first checking to see if it is valid. Can't win - we ignore. |
|
@hppritcha @bwbarrett |
|
You are welcome to push the cherry-pick into my PR - will retain your authorship
…Sent from my iPhone
On Aug 2, 2017, at 10:12 AM, Artem Polyakov ***@***.***> wrote:
@hppritcha @bwbarrett
Ok, I need this PR merged to port #4004.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
lets merge as is. |
|
@hppritcha @bwbarrett Just to be clear, the combination of this PR and #4004 does not fully address the problem. You are welcome to piecemeal the commits if you like, or we can roll them all up. Totally up to you. |
|
@artpol84 and @rhc54, I have to admit it's pretty unclear where we are. #4004 is a commit to master; it doesn't look like there's a corresponding commit to 3.0. I'd really prefer that we have one PR which covers cherry picking whatever we need from master to entirely solve this issue. Continuing to piece meal is really confusing at this point. Pulling the RM approved tag until @artpol84 can update with a plan for getting the issue solved enough that we can release, so that we don't think we've solved a problem we haven't solved. |
|
@bwbarrett I guess it is on @rhc54 side now: #4007 (comment). #4004 depends on this PR, @rhc54 can cherry-pick those commits here if it will be more convenient. |
|
If we will go with one PR - let me know when all the changes are there and I will check at runtime again. |
|
Ralph is traveling today. I spoke with him this morning briefly. He posted PR #4012 for discussion. I think we should wait until tomorrow to merge any more in on this topic. Maybe @rhc54 @artpol84 and I can get on the phone tomorrow and sort though this a bit. @bwbarrett I agree that one PR with all of the necessary commits is what we should wait for for |
|
Dropped in favor of #4052 |
Fixes #3980
Signed-off-by: Ralph Castain rhc@open-mpi.org
(cherry picked from commit 3744967)