
Mpi4py detection #29


Open: wants to merge 5 commits into main

Conversation

@mkolopanis

Uses mpi4py to detect the MPI implementation, running it in forked mode. This does not require MPI initialization or finalization in the parent process.
I have also taken the liberty of adding a few extra version string comparisons to find the backend implementation, just in case some other weird Open MPI name pops up.

Unfortunately, mpi4py only returns a free-form string, which then has to be checked to identify the implementation.
I have noticed that MPICH, Open MPI, and Microsoft MPI all like to put their name first in the returned string, so it would be possible to use version.startswith("open") instead of the in check that is currently implemented. However, that is not guaranteed, and if the format ever changes this version should hopefully still work.
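
For reference, here is a minimal sketch of the approach (not the PR's actual code: detect_mpi_implementation and the exact substring checks are illustrative, assuming only that mpi4py is installed):

# Query the version string in a child process so that importing mpi4py
# does not initialise MPI in the parent (the forked mode mentioned above).
import subprocess
import sys

def detect_mpi_implementation():
    # MPI.Get_library_version() returns a free-form string, e.g. starting
    # with "Open MPI v..." or "MPICH Version: ...".
    code = "from mpi4py import MPI; print(MPI.Get_library_version())"
    version = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    ).stdout.lower()

    # Check MPICH first: its version string can embed full install paths,
    # and a path containing "openmpi" would otherwise match Open MPI.
    if "mpich" in version:
        return "mpich"
    if any(name in version for name in ("open mpi", "open-mpi", "openmpi")):
        return "openmpi"
    if "microsoft mpi" in version:
        return "msmpi"
    return "unknown"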

I took the

fixes #27, fixes #28

@mkolopanis
Author

Another thing I have thought of concerning the matching: MPICH includes full paths to executables and libraries in its version string. A user could therefore have MPICH installed in a path that happens to contain one of the openmpi version_str candidates, and get a false positive match on the Open MPI implementation.

One way to guard against that is to do the MPICH comparison first: from my anecdotal tests, Open MPI does not seem to include full library paths in its version string, so it would not suffer the reverse false positive of having mpich somewhere in a path.
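
To make the failure mode concrete, a hypothetical example (the paths and version text are invented for illustration):

# An MPICH build whose install prefix happens to contain "openmpi".
version_str = (
    "MPICH Version: 4.1\n"
    "MPICH Configure: --prefix=/home/user/openmpi-migration/mpich-install"
)
assert "mpich" in version_str.lower()    # the correct match
assert "openmpi" in version_str.lower()  # false positive if checked first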

@connorjward
Collaborator

Another thing I have thought of concerning the matching: MPICH includes full paths to executables and libraries in its version string. A user could therefore have MPICH installed in a path that happens to contain one of the openmpi version_str candidates, and get a false positive match on the Open MPI implementation.

One way to guard against that is to do the MPICH comparison first: from my anecdotal tests, Open MPI does not seem to include full library paths in its version string, so it would not suffer the reverse false positive of having mpich somewhere in a path.

I'm happy to reorder the checks. In general I think we need a better solution for this, though that isn't needed for this PR. I've created an issue: #30.

@miguelcoolchips

I tried this branch with this MFE

import pytest

@pytest.mark.parallel(nprocs=5)  # run in parallel with 5 processes
def test_my_code_on_5_procs():
    print("hello")

and got this error

firedrake@1be300546b10:~/glaciercore$ pytest -v test_mfe.py
==================================== test session starts ====================================
platform linux -- Python 3.11.13, pytest-8.3.4, pluggy-1.6.0 -- /home/firedrake/.pyenv/versions/3.11.13/bin/python3.11
cachedir: .pytest_cache
rootdir: /home/firedrake/glaciercore
configfile: pyproject.toml
plugins: xdist-3.7.0, timeout-2.3.1, split-0.10.0, mpi-pytest-2025.7.0.dev0, cov-6.2.1
collected 1 item

test_mfe.py::test_my_code_on_5_procs FAILED                                            [100%]

========================================== FAILURES ==========================================
__________________________________ test_my_code_on_5_procs __________________________________

args = (), kwargs = {}

    def parallel_callback(*args, **kwargs):
>       subprocess.run(cmd, check=True)

../.pyenv/versions/3.11.13/lib/python3.11/site-packages/pytest_mpi/plugin.py:249:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = None, capture_output = False, timeout = None, check = True, popenargs = (['mpiexec', '--oversubscribe', '-n', '1', '-x', '_PYTEST_MPI_CHILD_PROCESS=1', ...],), kwargs = {}, process = <Popen: returncode: 1 args: ['mpiexec', '--oversubscribe', '-n', '1', '-x', ...>, stdout = None, stderr = None, retcode = 1

    def run(*popenargs,
            input=None, capture_output=False, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.

        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
        or pass capture_output=True to capture both.

        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.

        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.

        There is an optional argument "input", allowing you to
        pass bytes or a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.

        By default, all communication is in bytes, and therefore any "input" should
        be bytes, and the stdout and stderr will be bytes. If in text mode, any
        "input" should be a string, and stdout and stderr will be strings decoded
        according to locale encoding, or by "encoding" if set. Text mode is
        triggered by setting any of text, encoding, errors or universal_newlines.

        The other arguments are the same as for the Popen constructor.
        """
        if input is not None:
            if kwargs.get('stdin') is not None:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE

        if capture_output:
            if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
                raise ValueError('stdout and stderr arguments may not be used '
                                 'with capture_output.')
            kwargs['stdout'] = PIPE
            kwargs['stderr'] = PIPE

        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired as exc:
                process.kill()
                if _mswindows:
                    # Windows accumulates the output in a single blocking
                    # read() call run on child threads, with the timeout
                    # being done in a join() on those threads.  communicate()
                    # _after_ kill() is required to collect that and add it
                    # to the exception.
                    exc.stdout, exc.stderr = process.communicate()
                else:
                    # POSIX _communicate already populated the output so
                    # far into the TimeoutExpired exception.
                    process.wait()
                raise
            except:  # Including KeyboardInterrupt, communicate handled that.
                process.kill()
                # We don't call process.wait() as .__exit__ does that for us.
                raise
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command '['mpiexec', '--oversubscribe', '-n', '1', '-x', '_PYTEST_MPI_CHILD_PROCESS=1', '/home/firedrake/.pyenv/versions/3.11.13/bin/pytest', '--runxfail', '-s', '-q', '/home/firedrake/glaciercore/test_mfe.py::test_my_code_on_5_procs', ':', '-n', '4', '/home/firedrake/.pyenv/versions/3.11.13/bin/pytest', '--runxfail', '-s', '-q', '/home/firedrake/glaciercore/test_mfe.py::test_my_code_on_5_procs', '--tb=no', '--no-summary', '--no-header', '--disable-warnings', '--show-capture=no']' returned non-zero exit status 1.

../.pyenv/versions/3.11.13/lib/python3.11/subprocess.py:571: CalledProcessError
================================== short test summary info ==================================
FAILED test_mfe.py::test_my_code_on_5_procs - subprocess.CalledProcessError: Command '['mpiexec', '--oversubscribe', '-n', '1', '-x', '_PYTEST_MPI_CHILD_PROCESS=1', '/home/firedrake/.pyenv/versions/3.11.13/bin/pytest', '--runxfail', '-s', '-q', '/home/firedrake/glaciercore/test_mfe.py::test_my_code_on_5_procs', ':', '-n', '4', '/home/firedrake/.pyenv/versions/3.11.13/bin/pytest', ...

Running the mpiexec command directly from my terminal did not give any problems:

mpiexec --oversubscribe -n  1 -x _PYTEST_MPI_CHILD_PROCESS=1 /home/firedrake/.pyenv/versions/3.11.13/bin/pytest --runxfail -s -q /home/firedrake/glaciercore/test_mfe.py::test_my_code_on_5_procs : -n 4 /home/firedrake/.pyenv/versions/3.11.13/bin/pytest --runxfail -s -q /home/firedrake/glaciercore/test_mfe.py::test_my_code_on_5_procs --tb=no --no-summary --no-header --disable-warnings --show-capture=no

Is there a way that I can obtain more information about the error from pytest?

@JHopeCollins
Member

Can you rerun with the -s flag? This will prevent pytest from capturing stdout.
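
For example, with the MFE above:

pytest -v -s test_mfe.py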

@connorjward
Collaborator

I tried this branch with this MFE [...] and got this error [...] Is there a way that I can obtain more information about the error from pytest?

Unfortunately I think that this is because some Open MPI distributions don't work when mpiexec is called inside an already-initialised MPI process (which is tricky to avoid, because just running from mpi4py import MPI usually initialises MPI). This was the original reason why Firedrake used to ship with MPICH.

Since it is calling mpiexec --oversubscribe, this PR is working as expected and correctly detecting the MPI implementation.

Very unhelpfully, it seems to work fine on some distributions. For example, on Arch Linux (and in CI!) I can run this without issue.
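
As a small illustration of the import behaviour described above (a sketch using mpi4py's documented rc settings; whether disabling auto-initialisation helps in this situation is untested):

import mpi4py
mpi4py.rc.initialize = False  # must be set before importing mpi4py.MPI
from mpi4py import MPI

# With the default rc settings, the import above would already have
# initialised MPI and this would print True.
print(MPI.Is_initialized())  # False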

@connorjward
Collaborator

Looks good. Thanks!

@miguelcoolchips

Unfortunately I think that this is because some Open MPI distributions don't work when mpiexec is called inside an already-initialised MPI process (which is tricky to avoid, because just running from mpi4py import MPI usually initialises MPI). This was the original reason why Firedrake used to ship with MPICH.

Since it is calling mpiexec --oversubscribe, this PR is working as expected and correctly detecting the MPI implementation.

Very unhelpfully, it seems to work fine on some distributions. For example, on Arch Linux (and in CI!) I can run this without issue.

This was using a docker image with ubuntu:25.10. Which docker image do you recommend I use? archlinux:latest? I will also try switching to MPICH.

@JHopeCollins
Member

This was using a docker image with ubuntu:25.10. Which docker image do you recommend I use? archlinux:latest? I will also try switching to MPICH.

Is there a particular reason that you do not want to use MPI "on the outside"? i.e.

mpiexec -n $n python -m pytest -m parallel[$n] ...
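
For the 5-process MFE above that would be something like (quoting of the marker expression is shell-dependent):

mpiexec -n 5 python -m pytest -m "parallel[5]" test_mfe.py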

@miguelcoolchips

This was using a docker image with ubuntu:25.10. Which docker image do you recommend I use? archlinux:latest? I will also try switching to MPICH.

Is there a particular reason that you do not want to use MPI "on the outside"? i.e.

mpiexec -n $n python -m pytest -m parallel[$n] ...

Some of my tests are hanging indefinitely, so I wanted to explore other ways around it. I think it has to do with how some of pytest's resources are cleaned up.

@connorjward
Collaborator

This was using a docker image with ubuntu:25.10. Which docker image do you recommend I use? archlinux:latest? I will also try switching to MPICH.

Is there a particular reason that you do not want to use MPI "on the outside"? i.e.

mpiexec -n $n python -m pytest -m parallel[$n] ...

Some of my tests are hanging indefinitely, so I wanted to explore other ways around it. I think it has to do with how some of pytest's resources are cleaned up.

This sounds like a slightly different problem, and this PR thread probably isn't the best place to discuss it. Would you be able to create a discussion on the Firedrake repo explaining your issue in more detail? @JHopeCollins and I would be very happy to help you there.


Successfully merging this pull request may close these issues:

Use mpi4py to find installed MPI version?
mpi-pytest fails with firedrakeproject/firedrake:latest docker image