
Enable finer control of MPI calls and nprocs selection logic. #16

Open
@JHopeCollins

Description


These are just a few modifications/enhancements I've been thinking about, largely prompted by a few recent issues, including firedrakeproject/firedrake#4101 (which has also come up before on Slack) and someone trying to run on an HPC system where the MPI launcher was srun rather than mpiexec. I'll lay out a few things that (I think) could be improved, then suggest some possible solutions.

Restrictions/problems with the current implementation

  1. If you try to use MPI "on the inside" with an MPI distribution that doesn't support nested inits, then you just get the cryptic error from MPI, rather than a nice message from mpi-pytest telling you what's wrong and how to fix it. @connorjward has a WIP fix for this here: WIP Emit a helpful error message when trying to run in forking mode #14

  2. If you try to use MPI "on the outside" and there's a mismatch between the number of MPI ranks and the nprocs argument, then you get the helpful error message `_pytest.config.exceptions.UsageError: Attempting to run parallel tests inside an mpiexec call where the requested and provided process counts do not match`. The message is clear, but it means you have to repeat yourself when running parallel tests. Sometimes it would be nice to opt in to the tests being automatically selected. I often find myself doing something a bit unwieldy like:

```shell
N=4; mpiexec -n $N pytest -m "parallel[$N]" tests
```

  3. If you try to use MPI "on the inside" with an MPI distribution that doesn't use mpiexec (say it has srun or something), then the parallel callback here will just fail, with no way to modify it.

  4. If you use MPI "on the inside" and you want ranks other than rank 0 to have more detailed output, you will always be thwarted: the quiet arguments are added after the user's arguments here, so they always take precedence.

  5. The default nprocs is hardcoded here, so there's no way to vary it between projects, invocations, etc.

  6. Each test has to explicitly declare in the code how many processors it can run with. Sometimes it might be useful to specify that a particular test can run with any number of processors (for example, if the test doesn't rely on a specific number of ranks and just checks that something runs successfully in parallel).
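To make the last point concrete, here is a hypothetical sketch of the matching rule described above: a test only runs when the size of COMM_WORLD appears in its declared nprocs, so "run at any size" cannot currently be expressed. The function name is illustrative, not mpi-pytest's actual code.

```python
# Hypothetical sketch: a test only matches when the world size appears
# in its declared nprocs (a single int or a list of ints).  There is
# no value of declared_nprocs that means "any size".
def matches_world_size(declared_nprocs, world_size):
    if isinstance(declared_nprocs, int):
        declared_nprocs = [declared_nprocs]
    return world_size in declared_nprocs
```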

Possible solutions

  1. See Connor's PR.

  2. We could add a command line argument when running MPI "on the outside" to tell mpi-pytest to either fail or skip when it sees tests that use a different number of processes than the size of COMM_WORLD. E.g. if I run with 2 ranks and select skip, only the parallel tests with nprocs=2 (or that have 2 as one of the options) will be run; the others will be skipped.

```shell
mpiexec -np 2 pytest --nprocs-mismatch=<skip,fail>
```
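A minimal sketch of how such an option could behave, assuming the plugin knows each test's declared rank counts (all names here are illustrative, not an existing API):

```python
# Hypothetical sketch of the proposed --nprocs-mismatch handling.
def select_tests(tests, world_size, on_mismatch="fail"):
    """tests maps a test name to its list of supported rank counts.

    on_mismatch="fail" mirrors today's behaviour (error on any
    mismatch); "skip" deselects mismatching tests and carries on.
    """
    selected = {}
    for name, nprocs in tests.items():
        if world_size in nprocs:
            selected[name] = nprocs
        elif on_mismatch == "fail":
            raise RuntimeError(
                f"{name} wants {nprocs} ranks but COMM_WORLD has {world_size}"
            )
        # on_mismatch == "skip": drop the test and carry on
    return selected
```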
  3. We could add a --mpi-executable= command line argument to pytest to specify which launcher to use. We might also need something to specify which flag propagates environment variables, in case the launcher doesn't use the -genv argument like MPICH's mpiexec. E.g. for srun it would be something like:
```shell
pytest -m "parallel[2]" --mpi-executable=srun --mpi-env-flag=--export
```
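A sketch of assembling the launch command from such options; the defaults mirror the current MPICH-style invocation, and everything here is a suggestion rather than mpi-pytest's actual implementation. Note that env-propagation syntax differs between launchers (e.g. srun's --export takes NAME=value pairs), so a single flag may not be enough on its own.

```python
# Hypothetical sketch of building the spawn command from the proposed
# --mpi-executable (and a configurable nprocs flag, if ever needed).
def build_launch_command(nprocs, pytest_args, executable="mpiexec",
                         nprocs_flag="-n"):
    return [executable, nprocs_flag, str(nprocs), "pytest", *pytest_args]
```

For example, `build_launch_command(2, ["tests"], executable="srun")` would produce an srun-based command instead of the hardcoded mpiexec one.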
  4. This could be as simple as adding a command line argument, something like --mpi-quiet=<none,nice,priority>, where none means don't add any quiet arguments, nice means add them before the user's arguments so they don't take priority, and priority means add them after so they override any conflicts (what we do now).
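The three modes could be as simple as an ordering choice. In this sketch, QUIET_ARGS stands in for whatever flags mpi-pytest actually passes to silence ranks other than 0, and we assume later arguments take precedence on conflicts:

```python
# Stand-in for the real quiet flags mpi-pytest adds on ranks != 0.
QUIET_ARGS = ["-q", "--no-header"]

def assemble_pytest_args(user_args, quiet_mode="priority"):
    if quiet_mode == "none":
        return list(user_args)                  # never add quiet flags
    if quiet_mode == "nice":
        return QUIET_ARGS + list(user_args)     # user's args come last, so win
    return list(user_args) + QUIET_ARGS         # "priority": current behaviour
```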

  5. The default nprocs could be made modifiable, either through an environment variable read by mpi-pytest or through another command line argument like --nprocs-default=4.
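One possible resolution order, sketched below: explicit command line value, then an environment variable, then the built-in fallback. Both the option name and the variable name are suggestions from this issue, not an existing API, and the fallback value is a placeholder.

```python
import os

# Hypothetical precedence: CLI option > environment variable > fallback.
def resolve_default_nprocs(cli_value=None, fallback=3):
    if cli_value is not None:
        return int(cli_value)
    env_value = os.environ.get("MPI_PYTEST_DEFAULT_NPROCS")
    if env_value is not None:
        return int(env_value)
    return fallback
```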

  6. This one might be a bit tricky/contentious. The parallel mark could also allow parallel(nprocs=2, any=True) so that a test runs no matter the size of COMM_WORLD. The nprocs argument would still be needed so that running with MPI "on the inside" knows how many processes to spawn. I'm not quite sure of the best way to implement this logic, but it's something to discuss. E.g. if you specify -m "parallel[4]", would a test with parallel(nprocs=2, any=True) be run or not?
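One possible (hypothetical) answer to the selection question, just to anchor the discussion: with any=True the test matches every world size when running "on the outside", while nprocs is still what gets used to spawn when running "on the inside".

```python
# Hypothetical selection rule for parallel(nprocs=..., any=...):
# any=True means the test accepts every COMM_WORLD size.
def should_run(declared_nprocs, any_size, world_size):
    if any_size:
        return True
    counts = ([declared_nprocs] if isinstance(declared_nprocs, int)
              else list(declared_nprocs))
    return world_size in counts
```

Under this rule a parallel(nprocs=2, any=True) test would also run under -m "parallel[4]"; whether that is the desired behaviour is exactly the open question.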

P.S. I'm not attached to any of the names I've suggested here; happy to bikeshed any we decide we do want.
