Skip to content

sporadic fatal error messages due to critical bug in madvise() hook with OpenIB #4509

Closed
@bingmann

Description

@bingmann

I believe there is a critical bug in the (new?) madvise() hook:

In a program that does lots of Isend()/Irecv() with Wait()/Test(), I
sporadically see something like the following fatal error message:

--------------------------------------------------------------------------
Open MPI intercepted a call to free memory that is still being used by
an ongoing MPI communication.  This usually reflects an error in the
MPI application; it may signify memory corruption.  Open MPI will now
abort your job.

  rcache name:    grdma
  Local host:     fh1n076
  Buffer address: 0x2af75c4fc000
  Buffer size:    2568192
--------------------------------------------------------------------------

The error message is bogus and suspicious, especially since the program never allocates a buffer of that size. Nor does it use that address as a pointer, but the address is contained in some memory area used.

OpenMPI Version: 3.0.0, installed from the source tarball with debug output. Probably affects all versions with the madvise() commits backported.

Running on a Linux HPC cluster, Kernel 3.10.0-693.2.2.el7.x86_64, with an InfiniBand 4X FDR Interconnect, glibc 2.17, gcc 5.2.0.

The error only occurs with openib BTL, with TCP it apparently never occurs, because the grdma rcache module is not used.

I believe the bug affects all programs using asynchronous communication, openib, and varying buffer sizes. It occurs naturally after running the program for some time. I have added a test program triggering the error artificially.

Backtrace and Autopsy

Lots of debugging leads me to believe there is a bug in the way the interception of madvise() clears memory from the rcache grdma, which frees RDMA memory regions.

The fatal error message occurs when _intercept_madvise() is called,
which calls opal_mem_hooks_release_hook(),
which calls mca_rcache_base_mem_cb(),
which contain the fatal error message.

mca_rcache_base_mem_cb() is supposed to free rcache allocations and prints the message when mca_rcache_grdma_invalidate_range() fails.
The deallocation of areas happens by iterating over the memory area tree using mca_rcache_base_vma_iterate(), and calling gc_add() for areas to invalidate.

gc_add() fails if the invalidated area still has reference counts.

The issues is that madvise() is called in my program by the libc's malloc implementation with MADV_DONTNEED to free up regions no longer needed. This occurs at unpredictable times, probably when malloc decides to consolidate free space in the heap.

The fatal error occurs after the following sequences of operations:

  • allocate and send a (large) buffer. this registers the memory area in grdma.
  • free the buffer. after completion of the send the registration remains cached.
  • allocate a smaller buffer. by chance, malloc() reuses the same memory address for the smaller allocation.
  • perform an MPI_Isend() on the smaller buffer. this raises the reference count of the cached larger memory registration.
  • malloc() decides to consolidate free heap memory, calling madvise() on the second part of our memory area.

This triggers the fatal error, because the cached registration of the large memory area is still marked as used.

The fundamental bug, I believe, is that mca_rcache_base_vma_iterate() returns all memory areas overlapping (? did not check) the queried area. Hence, _intercept_madvise() attempts to free all areas that overlap the area in question.

I believe the right behaviour would be to only free areas fully covered by the madvise() call. While this would lead to some areas not being freed, the current state leads to random fatal aborts. Disabling the _intercept_madvise() hook poses a temporary work-around.

Can someone confirm this bug and maybe its solution?
I also (currently) do not have enough experience with the OpenMPI codebase to write a patch.

I have attached a program which triggers the error by artificially calling madvise(). In my real application is the sporadically done from inside the libc. The error only occurs when using OpenIB over a real InfiniBand network, it does not occur when running with shared-memory or TCP.

test_madvise.cpp.txt

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions