
Conversation

@mvhulten
Contributor

This is needed when GCC+OpenMPI is used. This is tested on JURECA.

With Intel+ParaStationMPI this already worked correctly; there is no effect when compiling with Intel LLVM. It only takes effect if ${SLURM_SRUN} is set, which means it is a Slurm system (as is already assumed).

Resolves: #87
Replaces: #88
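Judging from the description, the change presumably boils down to something along these lines in cmake/BuildParFlow.cmake (a sketch under that assumption, not the exact diff; the `find_program` call is my guess at how `SLURM_SRUN` gets set):

```cmake
# Sketch (assumed, not the exact patch): on Slurm systems, where
# ${SLURM_SRUN} is set, use srun as the MPI launcher so that
# GCC+OpenMPI builds run correctly.
find_program(SLURM_SRUN srun)
if(SLURM_SRUN)
  set(MPIEXEC_EXECUTABLE "${SLURM_SRUN}")
endif()
```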

@mvhulten
Contributor Author

The only additional assumption I can think of is that some system where srun is found requires something other than srun. I find this highly unlikely, but in case certainty is needed, someone should dig into where MPIEXEC_EXECUTABLE is originally set (I couldn't find it).

@mvhulten mvhulten requested a review from kvrigor August 18, 2025 13:45
@mvhulten
Contributor Author

Only here did I find a case where it was set. This is for a specific machine (the same for two other NERSC machines). Instead of if(SLURM_SRUN) we could check whether we are on a JSC machine, but I was hesitant to introduce a machine check in BuildParFlow.cmake, especially if my solution here may be OK.

@mvhulten mvhulten marked this pull request as draft August 22, 2025 14:23
@mvhulten
Contributor Author

mvhulten commented Aug 22, 2025

It's not okay. On Marvin it fails with this message:

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------

whereas master builds and runs without problems on Marvin. Options:

  1. Do a machine ("if on JSC") check (add to build code)
  2. Ask Marvin admins to provide OMPI with Slurm's PMI support (waiting for response)
  3. ...

WDYT?

@kvrigor
Member

kvrigor commented Aug 25, 2025

How about this solution:

  1. Add export MPIEXEC_EXECUTABLE=/path/to/srun in the environment file.
  2. Modify cmake/FindMPIFortran.cmake:
    • Use the MPIEXEC_EXECUTABLE environment setting.
    • If no MPIEXEC_EXECUTABLE has been set, fall back to the autodetected MPIEXEC_EXECUTABLE.
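Step 2 could be sketched as follows, assuming cmake/FindMPIFortran.cmake otherwise relies on CMake's FindMPI autodetection (setting the cache variable before find_package(MPI) makes FindMPI use it; the exact file contents may differ):

```cmake
# Honor a user-provided launcher from the environment; otherwise fall
# back to whatever find_package(MPI) autodetects.
if(DEFINED ENV{MPIEXEC_EXECUTABLE})
  set(MPIEXEC_EXECUTABLE "$ENV{MPIEXEC_EXECUTABLE}" CACHE FILEPATH
      "MPI launcher (taken from environment)" FORCE)
endif()
find_package(MPI REQUIRED)
```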

@mvhulten
Contributor Author

I think I need to

if(JSC)
  set(MPIEXEC_EXECUTABLE "${SLURM_SRUN}")
endif()

in cmake/BuildParFlow.cmake, because Marvin's OpenMPI does not appear to be launchable with srun via any PMI API (even though srun --mpi=list lists pmi2). I'm getting verification from my Marvin admin colleagues about this…

How do I properly check if I'm on JSC hardware?
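(For reference: JSC clusters export a SYSTEMNAME environment variable, e.g. jureca or juwels, so a check could look like the sketch below; both the variable and the name list are assumptions to verify on the target systems, and whether such a check belongs in the build scripts is a separate question.)

```cmake
# Assumed JSC detection via the $SYSTEMNAME environment variable that
# JSC clusters set (e.g. "jureca", "juwels"); verify on the target system.
if(DEFINED ENV{SYSTEMNAME} AND "$ENV{SYSTEMNAME}" MATCHES "jureca|juwels")
  set(JSC TRUE)
endif()
```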

@kvrigor
Member

kvrigor commented Aug 26, 2025

> I think I need to
>
>     if(JSC)
>       set(MPIEXEC_EXECUTABLE "${SLURM_SRUN}")
>     endif()
>
> in cmake/BuildParFlow.cmake, because Marvin's OpenMPI does not appear to be launchable with srun via any PMI API (even though srun --mpi=list lists pmi2). I'm getting verification from my Marvin admin colleagues about this…
>
> How do I properly check if I'm on JSC hardware?

It's best to keep the CMake scripts as system-agnostic as possible. All machine-specific settings should be configured through the environment files (that's why they exist). In this case you can set MPIEXEC_EXECUTABLE in default.2025.env or in the env files themselves, and then let cmake/FindMPIFortran.cmake detect the user-defined MPIEXEC_EXECUTABLE. This way we nicely limit the scope of the TSMP2 CMake scripts to compile-related options.
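The env-file side of this is then a single line; the path below is only a placeholder (use `which srun` on the target machine), and default.2025.env is one candidate location:

```shell
# Machine-specific MPI launcher, picked up at configure time.
# The path is a placeholder; substitute the real srun location.
export MPIEXEC_EXECUTABLE=/usr/bin/srun
```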

@mvhulten
Contributor Author

I've done this in 30bfbcb.

But I cannot reproduce the original problem anymore. I went back to master, did a clean compile, and the parallel Tcl scripts from the test directory all pass. I'm on the same old commits for TSMP2 and ParFlow. I'll look at it again later.



Development

Successfully merging this pull request may close these issues.

Multi-process ParFlow stand-alone fails on GNU stack
