Can't run executable with FluxExecutor using Parsl, but can run it directly with Flux #5558
-
Up to this point I've been using the Intel compilers and Intel MPI for developing Parsl code that uses Flux via Parsl's FluxExecutor. However, I'm now trying to make it work with the GNU compilers and OpenMPI. I'm using the exact same program and configuration, just swapping the compiler and MPI. Well... Intel+IntelMPI works, but gfortran+OpenMPI fails. First, a bit of basic info. I am installing Flux via a conda environment:
This results in these versions being installed:
The OpenMPI plugins are installed:
And the OpenMPI wrappers point to very recent GNU compilers
Finally, Parsl is the newest available:
The weird part is that if I take Parsl out of the equation, and just start up a Flux instance in my containerized Slurm cluster, it works. Sort of. I mean, it runs and produces the expected output. But it also spews a TON of warnings.
I can issue
-
Well, on the warnings (the easier part):
This error when running under parsl
is the critical one to solve, I guess? This is an early failure in OpenMPI's flux pmix MCA plugin. I think it will fail with that error if the environment variables PMI_FD, PMI_RANK, and PMI_SIZE are not getting through to the MPI task (those are set by Flux). We've recently had other issues (#5460) with those Flux plugins in OpenMPI, but I believe they manifest as a hang with UCX rather than this. The solution to that one is to add the
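One quick way to see whether those PMI_* variables are actually reaching the task environment under the FluxExecutor is a throwaway Bash App along these lines (a sketch, assuming a FluxExecutor config has already been loaded with parsl.load; the app and file names are illustrative):

```python
import parsl
from parsl import bash_app

# Assumes parsl.load(...) has already been called with a FluxExecutor config.

@bash_app
def show_pmi(stdout="show_pmi.out", stderr="show_pmi.err"):
    # Dump whatever PMI_* variables Flux has placed in the task environment.
    # If PMI_FD, PMI_RANK, and PMI_SIZE are missing here, the OpenMPI
    # flux/pmix components cannot complete their bootstrap.
    return "env | grep '^PMI_' || echo 'no PMI_* variables found'"

show_pmi().result()
```

If show_pmi.out comes back without PMI_FD, PMI_RANK, and PMI_SIZE, the variables are being lost somewhere between Flux and the task.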
-
Actually, there isn't anything really different in the way that OpenMPI bootstraps here vs. Intel MPI. Both would have dlopened the Flux libpmi.so and then used the PMI-1 wire protocol over PMI_FD to do their handshake. I'm not sure why it is suddenly failing. As a sanity check you could try running
-
Ah, from #5079 (also regarding a conda install)
-
Thank you so much for all of this great information. I have done two things thus far:
The first one did remove a lot of warnings so that when I now run my executable without using Parsl via
I have also verified that PMI is working from Parsl using this Bash App:
Called like so:
And its output is:
My Parsl Apps that run
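For reference, a PMI check of the kind described above might look roughly like this sketch (not necessarily the exact Bash App used here; the file names are illustrative):

```python
from parsl import bash_app

@bash_app
def pmi_barrier(stdout="pmi_barrier.out", stderr="pmi_barrier.err"):
    # flux pmi is a small test client shipped with flux-core; its barrier
    # subcommand exercises the same PMI handshake an MPI library performs
    # at startup.
    return "flux pmi barrier"
```

If PMI is wired up correctly the app simply completes; otherwise flux pmi reports the failure on stderr.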
-
I have been doing this inside a containerized Slurm cluster because it's easier to control the environment. I'm going to reconfigure for Slurm on our on-prem system and run directly there, without the container, to verify whether the behavior is reproducible.
-
In case it helps (more information is usually better), this is my Parsl config for the FluxExecutor. My containerized Slurm cluster has 3 nodes with 2 cores per node.
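For readers following along, a FluxExecutor-over-Slurm config for that 3-node, 2-cores-per-node layout might look roughly like the sketch below. This is not the actual config from this cluster; the partition name, walltime, and worker_init line are placeholders.

```python
from parsl.config import Config
from parsl.executors import FluxExecutor
from parsl.providers import SlurmProvider
from parsl.launchers import SrunLauncher

config = Config(
    executors=[
        FluxExecutor(
            provider=SlurmProvider(
                partition="slurm",            # placeholder partition name
                nodes_per_block=3,            # 3 nodes in the container cluster
                cores_per_node=2,             # 2 cores per node
                init_blocks=1,
                max_blocks=1,
                walltime="00:30:00",          # placeholder walltime
                launcher=SrunLauncher(),      # srun starts the Flux brokers
                worker_init="source activate chiltepin",  # assumed env activation
            ),
        )
    ]
)
```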
-
Ok.
flux config builtin pmi_library_path
confirms your guess. And, well, setting
export FLUX_PMI_LIBRARY_PATH=/opt/miniconda3/envs/chiltepin/lib/flux/libpmi.so
fixes it! I guess we still don't have an explanation as to why FLUX_PMI_LIBRARY_PATH isn't set. But after setting that in my Parsl App before running
flux pmi --method=libpmi:$FLUX_PMI_LIBRARY_PATH barrier
it works. I added the same export command to the Parsl App that runs my hello MPI executable, and it now works there too, on multiple tasks across multiple nodes. So, yay!
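For completeness, the workaround wired into a Parsl Bash App might look roughly like this sketch (the app name and output files are illustrative; the libpmi.so path is the one reported by flux config builtin pmi_library_path for this particular conda install):

```python
from parsl import bash_app

@bash_app
def pmi_barrier_with_fix(stdout="pmi_fix.out", stderr="pmi_fix.err"):
    # Tell PMI clients (here, flux pmi; in the real app, OpenMPI's flux
    # component) where to find Flux's libpmi.so before running; the same
    # export precedes the hello-MPI executable in the working app.
    return (
        "export FLUX_PMI_LIBRARY_PATH="
        "/opt/miniconda3/envs/chiltepin/lib/flux/libpmi.so && "
        "flux pmi --method=libpmi:$FLUX_PMI_LIBRARY_PATH barrier"
    )
```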