osc/ucx: internal error in MPI_Get (resource temporarily unavailable) #10639

Open
@s417-lama

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.0rc7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source tarball.

configured with:

../configure CFLAGS="-g" --prefix=${PREFIX} --with-ucx=${UCX_PREFIX} --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal --without-hcoll

Please describe the system on which you are running

  • Operating system/version: Linux 3.10.0-514.26.2.el7.x86_64
  • Computer hardware: Intel Xeon Gold 6154 x 2 (36 cores in total)
  • Network type: InfiniBand EDR 4x (100Gbps)

Details of the problem

MPI_Get causes an internal error under a specific set of conditions (detailed below).

The error I got:

[1660029730.372314] [sca1282:136709:0]           ib_md.c:379  UCX  ERROR ibv_exp_reg_mr(address=0x2afde7ec4000, length=4096, access=0xf) failed: Resource temporarily unavailable
[1660029730.372354] [sca1282:136709:0]          ucp_mm.c:143  UCX  ERROR failed to register address 0x2afde7ec4000 mem_type bit 0x1 length 4096 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1660029730.372365] [sca1282:136709:0]     ucp_request.c:356  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2afde7ec4000 len 4096: Input/output error
[sca1282:136709] ../../../../../opal/mca/common/ucx/common_ucx_wpool.h:376  Error: ucp_get_nbi failed: -3
[sca1282:00000] *** An error occurred in MPI_Get
[sca1282:00000] *** reported by process [2902982657,0]
[sca1282:00000] *** on win ucx window 3
[sca1282:00000] *** MPI_ERR_OTHER: known error not in list
[sca1282:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[sca1282:00000] ***    and MPI will try to terminate your MPI job as well)

Here is minimal code to reproduce this error:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* size_t array_size = (size_t)128 * 1024 * 1024; // OK */
  size_t array_size = (size_t)1024 * 1024 * 1024; // error

  /* size_t block_size = 1024;  // OK */
  /* size_t block_size = 2048;  // OK */
  size_t block_size = 4096;  // error
  /* size_t block_size = 8192;  // error */
  /* size_t block_size = 16384; // error */
  /* size_t block_size = 32768; // OK */
  /* size_t block_size = 65536; // OK */

  size_t local_size = array_size / nranks;

  /* char* buf = (char*)aligned_alloc(2048, array_size); // OK */
  char* buf = (char*)aligned_alloc(4096, array_size); // error

  void* baseptr;
  MPI_Win win;
  MPI_Win_allocate(local_size,
                   1,
                   MPI_INFO_NULL,
                   MPI_COMM_WORLD,
                   &baseptr,
                   &win);
  MPI_Win_lock_all(0, win);

  /* int interleave = 0; // OK */
  int interleave = 1; // error

  if (rank == 0) {
    for (size_t i = 0; i < array_size / block_size; i++) {
      int target_rank;
      size_t target_disp;
      if (interleave) {
        target_rank = i % nranks;
        target_disp = i / nranks * block_size;
      } else {
        target_rank = i * block_size / local_size;
        target_disp = i * block_size - target_rank * local_size;
      }
      if (target_rank != rank) {
        MPI_Get(buf + i * block_size,
                block_size,
                MPI_BYTE,
                target_rank,
                target_disp,
                block_size,
                MPI_BYTE,
                win);
      }
    }

    MPI_Win_flush_all(win);
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);

  free(buf);

  MPI_Finalize();
  return 0;
}

In this code, rank 0 gathers data from all other ranks into a single local array (buf).
An MPI_Get is issued to each target rank at the granularity of block_size.
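For concreteness, the following standalone sketch (no MPI needed; the 2-rank count matches the run command below and is otherwise an arbitrary choice) prints the block-to-target mapping produced by the interleave branch:

#include <stdio.h>
#include <stddef.h>

/* Illustration only: the same index arithmetic as the interleave branch
 * of the reproducer, printed for a hypothetical 2-rank run. */
int main(void) {
  size_t nranks     = 2;     /* matches "mpirun -n 2" below */
  size_t block_size = 4096;  /* the failing block size      */

  for (size_t i = 0; i < 8; i++) {  /* first few blocks are enough to see the pattern */
    size_t target_rank = i % nranks;               /* round-robin over ranks        */
    size_t target_disp = i / nranks * block_size;  /* contiguous within each target */
    printf("block %zu: origin offset %zu -> rank %zu, disp %zu\n",
           i, i * block_size, target_rank, target_disp);
  }
  return 0;
}

Note that under this mapping, consecutive blocks fetched from the same target are block_size * nranks bytes apart in buf, while they are contiguous in the target's window.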

The above error happens under the following conditions:

  • only when the array size (array_size) is large enough (in this case 1 GB)
  • only when the block size (block_size) for each MPI_Get is 4096, 8192, or 16384 bytes
  • only when the local array (buf) is aligned to 4096 bytes
  • only with the interleave policy (interleave=1 means that rank 0 chooses the target rank for each MPI_Get in a round-robin fashion)
  • only when two or more processes are spawned on different nodes (i.e., not intra-node)

Otherwise, the error did not happen in my environment.
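For scale (a hedged observation, not a confirmed diagnosis of the UCX failure): with the failing parameters and 2 ranks, rank 0 issues on the order of a hundred thousand MPI_Get calls before the single MPI_Win_flush_all, each with a 4096-byte origin chunk starting on a 4096-byte boundary. The sketch below just spells out that arithmetic:

#include <stdio.h>
#include <stddef.h>

/* Back-of-the-envelope arithmetic for the failing configuration.
 * How UCX registers the origin chunks internally is an assumption,
 * not something this report confirms. */
int main(void) {
  size_t array_size = (size_t)1024 * 1024 * 1024;  /* 1 GiB, the failing size */
  size_t block_size = 4096;                        /* the failing block size  */
  size_t nranks     = 2;                           /* matches the run below   */

  size_t total_blocks = array_size / block_size;   /* 262144                  */
  size_t local_blocks = total_blocks / nranks;     /* blocks owned by rank 0  */
  size_t gets_issued  = total_blocks - local_blocks;

  printf("MPI_Get calls issued by rank 0 before the flush: %zu\n", gets_issued);
  printf("each origin chunk: %zu bytes, starting on a %zu-byte boundary\n",
         block_size, block_size);
  return 0;
}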

Compile the code (test.c) and run on 2 nodes (1 process/node):

mpicc test.c
mpirun -n 2 -N 1 ./a.out
