
Conversation

@rhc54
Contributor

@rhc54 rhc54 commented Jul 31, 2017

Fixes #3980

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 3744967)
@rhc54 rhc54 added the bug label Jul 31, 2017
@rhc54 rhc54 added this to the v3.0.0 milestone Jul 31, 2017
@rhc54 rhc54 self-assigned this Jul 31, 2017
@rhc54 rhc54 requested a review from artpol84 July 31, 2017 17:44
@artpol84
Contributor

@rhc54 I'm going to runtime-check it and report back

}
/* check for OPAL_PREFIX */
if (NULL != (evar = getenv("OPAL_PREFIX"))) {
opal_setenv("PMIX_PREFIX", evar, false, &environ);
Contributor

@rhc54 while I'm checking whether this fixes my particular case, I have a comment.
This fix is focused on the internal PMIx. It may break an external PMIx in the following cases:

  • if OPAL_PREFIX is set, we will set PMIX_PREFIX, and this will mislead the external PMIx
  • if no OPAL_PREFIX is provided but the user still needs to guide the external PMIx, this fix will not help

Contributor Author

We can fix the first case by moving this down into opal/mca/pmix/pmix2x itself (note that you can't put it in the ORTE code, as that wouldn't help the direct-launch situation). There is no way to fix the other case - the user will just have to provide PMIX_PREFIX in that scenario.

Contributor

About the second case: it is fine to provide PMIX_PREFIX; the problem is it won't be propagated.

Contributor Author

The second case will indeed be propagated. Today, mpirun harvests all envars starting with OMPI_ and PMIX_ and forwards them. If someone direct launches, then the launcher is responsible for forwarding the environment. Thus, if someone sets PMIX_PREFIX in their environment, it should get to the backend. There is no way we can determine whether or not PMIX_PREFIX is required for the external PMIx installation.

So the only case we have to deal with is the one where OPAL_PREFIX is set and we need to ensure that PMIX_PREFIX corresponds. As you correctly pointed out, this only happens when we are using the embedded version of PMIx.

@artpol84
Contributor

artpol84 commented Aug 1, 2017

But that's what my issue was all about. We are setting PMIX_INSTALL_PREFIX and it is not propagated.

Ralph Castain added 2 commits August 1, 2017 15:15
…und"

This reverts commit 3744967.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit e94786f)
…IX to the internal integration code for PMIx so we only do this when running with the embedded PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 8f34fa4)
@rhc54
Contributor Author

rhc54 commented Aug 1, 2017

@artpol84 I'm not sure what you mean - in master, we definitely pick up and propagate everything starting with PMIX_. Is your concern that this code hasn't moved to the 3.0 branch? FWIW: it is PMIX_PREFIX, not PMIX_INSTALL_PREFIX - are you setting the right variable?

The problem this PR solves is that someone setting OPAL_PREFIX expects the embedded PMIx to also adjust its plugin location. So we need to detect OPAL_PREFIX and manually force PMIX_PREFIX in that case. Since OPAL_PREFIX is picked up and propagated, we don't have to worry about doing PMIX_PREFIX separately - we can just locally do the right thing.

@rhc54
Contributor Author

rhc54 commented Aug 1, 2017

Just checked, and 3.0 is indeed picking up and propagating all PMIX_ envars.

@artpol84
Contributor

artpol84 commented Aug 1, 2017

As I updated on #3980 (#3980 (comment)), I agree that auto-setting the PMIx search path when the OPAL one is used is a good idea; it is related to #3980 but it doesn't fix #3980.

I don't have concerns - I see at runtime that v3.0.x is not able to work if the sources are moved to another location and multiple nodes are used (see #3980 (comment)). I see that PMIX_INSTALL_PREFIX is not propagated although it starts with PMIX_.

As far as PMIX_INSTALL_PREFIX:

  • according to PMIx's installdirs component, PMIX_INSTALL_PREFIX is responsible for the install dir prefix (see here); for OMPI, OPAL_PREFIX plays that role (see here). That is the reason I'm using it.
  • I verified at runtime that PMIX_INSTALL_PREFIX fixes the problem for an HNP-only run (i.e. where mpirun is the only ORTE daemon). This is because I'm setting PMIX_INSTALL_PREFIX and mpirun sees it. If I go multi-node, this variable is not propagated and the orteds are unable to find PMIx components.

Have you tried to reproduce the problem in your environment?

@artpol84
Contributor

artpol84 commented Aug 1, 2017

> Just checked and 3.0 is indeed picking up and propagating all PMIX_ envars

I'm going to double-check now.
Can you tell me how you tested that?

@artpol84
Contributor

artpol84 commented Aug 1, 2017

Also - to which entity did you test propagation? To orted or to the application process?
Because the problem I see is the result of orteds missing the PMIX_ prefix info.

@artpol84
Contributor

artpol84 commented Aug 1, 2017

> As I updated on #3980 I agree that auto-set PMIX search path if OPAL one is used is a good idea,
> it is related to #3980 but it doesn't fix #3980.

Correction: I expect this PR to fix the case where the internal PMIx is used. However, I have concerns about the external one then.

@rhc54
Contributor Author

rhc54 commented Aug 1, 2017

@artpol84 I honestly don't have any more time to spend on this one. I would suggest you do two things:

  • change PMIX_PREFIX in the master commit I did to be PMIX_INSTALL_PREFIX since we changed it to that in PMIx (not sure why, but doesn't matter)

  • go into the plm/rsh component and duplicate the code for forwarding OPAL_PREFIX to do the same logic for PMIX_INSTALL_PREFIX. That way if someone sets PMIX_INSTALL_PREFIX and not OPAL_PREFIX, we'll forward it for the rsh case. If someone sets both to different values, my commit won't override their setting.

Those are the only cases we can impact. We cannot do anything for the cases where we direct launch or launch via the other plm components - the system admin will have to configure their environment to forward the necessary envars.

@artpol84
Contributor

artpol84 commented Aug 2, 2017

That sounds ok, @rhc54.
One more question: we have a couple of envars in the installdirs component:
https://github.com/open-mpi/ompi/blob/master/opal/mca/installdirs/env/opal_installdirs_env.c#L64:L80
I guess we need to forward all of them - please correct me if I am wrong.

@artpol84
Contributor

artpol84 commented Aug 2, 2017

I've addressed what was suggested in #3985 (comment). This solves my problem and I believe will solve 99.9% of cases.
I verified that it works both in a Slurm-managed allocation and with rsh/hostfile.
What we may still miss (0.1% of cases) is that a user may have a finer-grained installation:
https://github.com/open-mpi/ompi/blob/master/opal/mca/installdirs/env/opal_installdirs_env.c#L64:L80
and to support this, more envars would need to be accounted for.
With respect to plm/rsh - only OPAL_PREFIX is propagated now, so I can conclude that rsh does not support this finer granularity even from the OMPI perspective.

@artpol84
Contributor

artpol84 commented Aug 2, 2017

@bwbarrett @hppritcha
Once #4004 is merged, I'll port it to v3.0.x, and I will need this to be there to merge cleanly.

@artpol84
Contributor

artpol84 commented Aug 2, 2017

@rhc54

>>>     CID 1416016:  Insecure data handling  (TAINTED_STRING)
>>>     Assigning: "evar" = "getenv("OPAL_PREFIX")", which taints "evar".
81             if (NULL != (evar = getenv("OPAL_PREFIX"))) {
82                 opal_setenv("PMIX_PREFIX", evar, false, &environ);
83             }
84         }

This came up recently. I guess what it means is that the env var is not checked and could be constructed in a way that may (un)intentionally break the logic.
I guess this is not the only place where we use getenv - what is the common policy for that?

@rhc54
Contributor Author

rhc54 commented Aug 2, 2017

We ignore those coverity issues. What they mean is that the envar could have been something that causes you to do evil things, and we don't restrict the potential values to avoid that scenario. In these cases, there is no way to know if a value is right or not as the range is any valid path. Technically, we could check the string to see if it points to a valid path - but then you'd get the same warning because you are using the string in a function without first checking to see if it is valid.

Can't win - we ignore.

@artpol84
Contributor

artpol84 commented Aug 2, 2017

@hppritcha @bwbarrett
Ok, I need this PR merged to port #4004.

@rhc54
Contributor Author

rhc54 commented Aug 2, 2017 via email

@hppritcha
Member

@artpol84 are you going to add #4004 to this PR or would you prefer this be merged in as is?

@artpol84
Contributor

artpol84 commented Aug 2, 2017

Let's merge as is.

@rhc54
Contributor Author

rhc54 commented Aug 3, 2017

@hppritcha @bwbarrett Just to be clear, the combination of this PR and #4004 does not fully address the problem. You are welcome to piecemeal the commits if you like, or we can roll them all up. Totally up to you.

@bwbarrett
Member

bwbarrett commented Aug 3, 2017

@artpol84 and @rhc54, I have to admit it's pretty unclear where we are. #4004 is a commit to master; it doesn't look like there's a corresponding commit to 3.0. I'd really prefer that we have one PR which covers cherry-picking whatever we need from master to entirely solve this issue. Continuing to piecemeal is really confusing at this point.

Pulling the RM approved tag until @artpol84 can update with a plan for getting the issue solved enough that we can release, so that we don't think we've solved a problem we haven't solved.

@artpol84
Contributor

artpol84 commented Aug 3, 2017

@bwbarrett I guess it is on @rhc54's side now: #4007 (comment).

#4004 depends on this PR; @rhc54 can cherry-pick those commits here if that will be more convenient.

@artpol84
Contributor

artpol84 commented Aug 3, 2017

If we go with one PR, let me know when all the changes are there and I will check at runtime again.

@jjhursey
Member

jjhursey commented Aug 3, 2017

Ralph is traveling today. I spoke with him briefly this morning. He posted PR #4012 for discussion. I think we should wait until tomorrow to merge any more in on this topic. Maybe @rhc54, @artpol84, and I can get on the phone tomorrow and sort through this a bit.

@bwbarrett I agree that one PR with all of the necessary commits is what we should wait for on v3.0.x. It shouldn't take long to sort out, and we'll probably have it ready by COB tomorrow. I just want to wait for Ralph before moving forward too much more. I think watching issue #3980 is the best option at the moment.

@rhc54
Contributor Author

rhc54 commented Aug 13, 2017

Dropped in favor of #4052

@rhc54 rhc54 closed this Aug 13, 2017
@rhc54 rhc54 deleted the cmr30/prefix branch March 25, 2018 17:49