plm/rsh: Add chdir option to change directory before orted exec #7092
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
sadly, we used to take care of this in the schizo/singularity component that has since been removed
Unfortunately, not all container runtimes provide a 'tell' (I really wish they did) that the
okay, so let's use the "tell" when it is available - at least singularity users have a chance of running correctly! Others will have to endure the pain of constantly forgetting to set the param and having to retry their app.
@jjhursey I think the problem with using the typical container "tells" (typically We may be able to leverage some hook mechanism for different container runtimes, but I'm not sure how much work that will be to implement.
I don't know what may have changed, but there used to be a simple "flag" in the container we could detect before anything else happens. I believe that is still the current method used, for example, when layering in external libraries. @gvallee Can you help us out here? We need to know when mpirun is in a Singularity container.
I think that is a separate conversation from this PR, but may be worth doing as a separate PR. This PR adds the ability to chdir between the We could introduce a schizo component for Singularity that is active in the (For my own reference) Looking at the old Should we file an issue for someone to pick up to add Singularity discovery in a schizo component?
I guess it just depends on what problem this PR is attempting to solve. If it is for your multi-node test setup, then it is fine as-is since you know you need to set the param. However, if it is attempting to resolve a general user problem, then I fear that it will create as much trouble as it solves. Experience shows that people will forget to set the param, and their job will work its way thru the queue and fail to execute. Then they might remember the param (or more likely file an issue with us that someone will have to answer reminding them about the param), resubmit the job and wait again for it to work thru the queue. I have no issue with this PR - I'm just pointing out that it doesn't solve the general user problem. If that is the objective, then adding infrastructure while kicking the can down the road for someone else to tackle seems rather self-defeating.
My proposal is that we take this PR since it addresses an issue by providing a capability that we didn't have before. Then iterate on a Singularity based discovery solution in a separate ticket/action instead of expanding the scope of this PR for something Singularity specific (that would leverage this change).
Good to merge this one in? Might be good to create that additional issue @jjhursey and link that discussion here.
Currently, there is no mechanism to tell the rsh/ssh launcher to change directories prior to exec'ing the `orted`. This is normally okay, as the `orted` will manage the working directory for the ranks of the user's application. This can be seen here:

However, launching `orted` inside of certain container technologies presents a new problem where the working directory of the user application can disappear during the container exec phase. This is because some container technologies like Singularity mount the current working directory. But when exec'ing the `orted` from the rsh launcher, the current directory has changed to the user's home directory.

Using an `orte_launch_agent`, we can get `orted` to launch inside of a container and illustrate this problem:

Our proposed solution is to introduce an MCA parameter that allows users to change the working directory after `rsh/ssh`, but before the exec of `orted`. This should allow the launch agent to call `singularity exec` from the correct working directory and mount in that directory for the user application.