[BUG] OSError: MPI environment variables are not set. #7711

@Mulbetty

Describe the bug
I am using DeepSpeed 0.18.0 with the launcher set to OpenMPI, and I get an error saying that the MPI environment variables are not set.

To Reproduce
Run a command like the following:

deepspeed \
    --hostfile=${HOSTFILE_PATH} \
    --launcher=OPENMPI \
    --launcher_args="-bind-to none -map-by slot --mca pml ob1 --oversubscribe --display-allocation --display-map" \
    --master_addr=${MASTER_ADDR} \
    --master_port=${_M_PORT} \
    --no_ssh_check \
    test.py

test.py can be any simple script. The error looks like this:

Traceback (most recent call last):
  File "/usr/local/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/launcher/runner.py", line 583, in main
    runner = OpenMPIRunner(args, world_info_base64, resource_pool)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/launcher/multinode_runner.py", line 129, in __init__
    super().__init__(args, world_info_base64)
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/launcher/multinode_runner.py", line 23, in __init__
    self.validate_args()
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/launcher/multinode_runner.py", line 145, in validate_args
    self._setup_mpi_environment()
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/launcher/multinode_runner.py", line 160, in _setup_mpi_environment
    raise EnvironmentError("MPI environment variables are not set. "
OSError: MPI environment variables are not set. Ensure you are running the script with an MPI-compatible launcher.

I found a discussion, [#6979-discuss], that describes a problem similar to mine.
If I comment out the call to self._setup_mpi_environment() in validate_args() (deepspeed/launcher/multinode_runner.py, line 145 in the traceback above), the error disappears.

Expected behavior
No error about MPI environment variables.
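
For anyone trying to reproduce this, a small diagnostic may help show where the variables are missing. Open MPI's mpirun exports OMPI_COMM_WORLD_* variables only into the processes it launches, and the traceback shows the check running inside the deepspeed launcher itself, so the variables would plausibly be absent at that point. The sketch below is not DeepSpeed code: the variable list and the helper name missing_mpi_vars are assumptions for illustration, and the exact set that multinode_runner.py checks may differ.

```python
import os

# Variables that Open MPI's mpirun exports into each rank it launches.
# Assumption: the DeepSpeed check looks for variables of this kind;
# the exact list in multinode_runner.py may differ.
OMPI_VARS = (
    "OMPI_COMM_WORLD_RANK",
    "OMPI_COMM_WORLD_LOCAL_RANK",
    "OMPI_COMM_WORLD_SIZE",
)


def missing_mpi_vars(environ=None):
    """Return the names in OMPI_VARS that are absent from the environment."""
    env = os.environ if environ is None else environ
    return [name for name in OMPI_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_mpi_vars()
    if missing:
        print("Not launched by mpirun; missing:", ", ".join(missing))
    else:
        print("Launched by Open MPI, rank", os.environ["OMPI_COMM_WORLD_RANK"])
```

Saved under a hypothetical name such as check_mpi_env.py, this can be run once directly with python and once through mpirun -np 2 python check_mpi_env.py to compare the two environments.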

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: H20 x 16
  • Interconnects (if applicable):
  • Python version: 3.10
  • Any other relevant info about your setup: DeepSpeed 0.18.0

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
MPI

Labels: bug, training
