Memory leak using CUDA-aware MPI_Send and MPI_Recv for large packets of data #9051

@geohussain

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From the v4.0.5 source tarball, with CUDA-aware support enabled

Please describe the system on which you are running

  • Operating system/version: RHEL 7.7
  • Computer hardware: 8 Tesla V100 GPUs
  • Network type: NVLink

Details of the problem

When I send large packets of data (~1 GB) between GPUs using MPI_Send and MPI_Recv and free the CUDA buffers afterwards, the memory is not released on the GPU and keeps growing in subsequent iterations. The expected behavior is that the GPU memory is freed after sending and receiving large packets of data. The following code reproduces the behavior.

main.cpp

#include <iostream>
#include <cuda_runtime.h>
#include <mpi.h>

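// CUCHK: throw a std::runtime_error carrying the CUDA error name, error
// string, and source location whenever a runtime call fails.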
#define CUCHK(error, msg)                                                   \
    if (error != cudaSuccess) {                                             \
        throw std::runtime_error(                                           \
            std::string(msg) + " with " +                                   \
            std::string(cudaGetErrorName(error)) + std::string(" -> ") +    \
            std::string(cudaGetErrorString(error)) +                        \
            " @" + std::string(__FILE__) + ":" + std::to_string(__LINE__)); \
    }

int main(int argc, char** argv)
{
    /*
     * Initialize MPI
     */
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if (size != 2) {
        if (rank == 0) {
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }
    cudaError_t ier;
    cudaSetDevice(rank);
    ier = cudaGetLastError();
    CUCHK(ier, "failed to set device")

    /*
     * Loop 1 GB
     */
    for (int i=0; i<=100; i++) {
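        // 2^27 doubles x 8 bytes = 1 GiB per message buffer.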
        long int N;
        N = 1 << 27;


        // Allocate memory for A on the CPU
        auto *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to 0.0
        for (int j=0; j<N; j++) {
            A[j] = 0.0;
        }

        double *d_A;
        cudaMalloc(&d_A, N*sizeof(double));
        ier = cudaGetLastError();
        CUCHK(ier, "could not allocate to device")

        cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice);
        ier = cudaGetLastError();
        CUCHK(ier, "could not copy from host to device")

        int tag1 = 10;
        int tag2 = 20;

        int loop_count = 50;

        double start_time, stop_time, elapsed_time;
        start_time = MPI_Wtime();

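        // Ping-pong loop on device pointers: with CUDA-aware MPI, rank 0
        // sends d_A and waits for it back while rank 1 echoes it,
        // loop_count times.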
        for (int j=1; j<=loop_count; j++) {
            if(rank == 0) {
                MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }
            else if(rank == 1) {
                MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        stop_time = MPI_Wtime();
        elapsed_time = stop_time - start_time;

        long int num_B = 8*N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B /(double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) printf("Transfer size (B): %10li, Transfer Time (s): %15.9f, Bandwidth (GB/s): %15.9f\n", num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer);

        cudaFree(d_A);
        ier = cudaGetLastError();
        CUCHK(ier, "could not free device")

        free(A);
    }


    std::cout << "Hello, World!" << std::endl;
    MPI_Finalize();

    return 0;
}
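
One way to confirm the growth from inside the reproducer (rather than via an external tool) is to query free device memory once per outer iteration. This is a sketch of my own, not part of the original reproducer; it is meant to be placed right after the cudaFree(d_A) call above and reuses the existing rank, i, and ier variables:

        // Diagnostic only: report free vs. total device memory each iteration.
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        ier = cudaGetLastError();
        CUCHK(ier, "could not query device memory")
        if (rank == 0) {
            printf("iter %3d: free %8.2f MiB of %8.2f MiB\n",
                   i, free_b / 1048576.0, total_b / 1048576.0);
        }

If the leak is present, the reported free memory shrinks by roughly the message size on every iteration instead of returning to its starting value.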

CMakeLists.txt

cmake_minimum_required(VERSION 3.18)

# set the project name
project(mpi_gpu_buffer LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

find_package(MPI REQUIRED)
find_package(OpenMP REQUIRED)
find_package(Threads REQUIRED)
add_executable(mpi_gpu_buffer main.cpp)

#-----------------------------------------------------------------------------------------------------------------------
#|                                                       CUDA                                                          |
#-----------------------------------------------------------------------------------------------------------------------

enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CUDA_STANDARD_REQUIRED true)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-code arch=compute_70,code=sm_70 -lineinfo")
#set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -G -Xcompiler -rdynamic -lineinfo")
set(CUDA_PROPAGATE_HOST_FLAGS OFF)
set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER})
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
#set(CMAKE_CUDA_ARCHITECTURES 52 61 70)
set(CMAKE_CUDA_ARCHITECTURES 61 70 75)
set(CUDA_LIBRARY CUDA::cudart)

set_property(TARGET mpi_gpu_buffer PROPERTY CUDA_ARCHITECTURES 61 70 75)
target_include_directories(mpi_gpu_buffer PRIVATE
        ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
target_link_libraries(mpi_gpu_buffer
        ${CUDA_LIBRARY}
        ${MPI_CXX_LIBRARIES}
        MPI::MPI_CXX
        OpenMP::OpenMP_CXX)
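
For completeness, a typical way to build and run the reproducer (assuming the CUDA-aware Open MPI wrappers are on the PATH; the build directory name is arbitrary):

mkdir build && cd build
cmake ..
make
mpirun -np 2 ./mpi_gpu_buffer

Device memory usage can be watched from a second terminal with, e.g., nvidia-smi --query-gpu=memory.used --format=csv -l 1.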
