Description
Background information
In the smcuda BTL, when MPI sends occur over recycled virtual addresses (due to a cudaMalloc, MPI_Isend, MPI_Irecv, MPI_Waitall, cudaFree loop), memory leaks arise because stale cuIpcMemHandles on the receiver side are not closed quickly enough (i.e., they only get closed during finalize). Each open cuIpcMemHandle entry tends to consume 4 MB of GPU memory, so as the number of stale entries grows, the memory available for other use shrinks quickly. One way of avoiding this situation is to limit the size of the rcache VMA tree. However, Open MPI 3.0.x seems to have removed the way to control this size through --mca rcache_base_vma_tree_items_min $min_size --mca rcache_base_vma_tree_items_max $max_size
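For clarity, here is a minimal sketch of the loop described above, assuming a CUDA-aware build and exactly two ranks; the 4 MB buffer size and iteration count are illustrative, and this is not the actual ./mpi_bug reproducer:

```c
/* Sketch of the leak-triggering pattern; not the actual ./mpi_bug.
 * Build (paths assumed): mpicc repro.c -o repro -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

#define NBYTES (4 * 1024 * 1024)  /* illustrative buffer size */
#define NITERS 1000               /* illustrative iteration count */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;  /* assumes exactly 2 ranks */

    for (int i = 0; i < NITERS; i++) {
        void *sbuf, *rbuf;
        /* Fresh allocations each iteration recycle virtual addresses */
        cudaMalloc(&sbuf, NBYTES);
        cudaMalloc(&rbuf, NBYTES);

        MPI_Request reqs[2];
        MPI_Irecv(rbuf, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* Freed here, but the receiver's cuIpcMemHandle for this region
         * stays open until finalize, pinning ~4 MB of GPU memory */
        cudaFree(sbuf);
        cudaFree(rbuf);
    }

    MPI_Finalize();
    return 0;
}
```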
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
2.0.x, 3.0.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
Configured with the --with-cuda flag to build CUDA support
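For reference, the build steps looked roughly like this (the CUDA and install paths here are placeholders, not the exact ones used):

```shell
git clone https://github.com/open-mpi/ompi.git
cd ompi && ./autogen.pl
./configure --with-cuda=/usr/local/cuda --prefix=$HOME/ompi-install
make -j8 install
```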
Details of the problem
Is it possible to:
- enable runtime parameters that allow controlling the VMA tree size?
- set the defaults to smaller values so that this memory blow-up does not occur as frequently?
- provide dedicated parameters to control just the rgpusm cache VMA tree size? (One way to check what a build currently exposes is shown below.)
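As a sanity check, this is how the rcache-level MCA parameters of a given build can be listed (standard ompi_info usage; output will vary by version):

```shell
ompi_info --param rcache all --level 9 | grep vma_tree_items
```

If the parameters were removed in 3.0.x as suspected, this should print nothing there, while showing the items_min/items_max entries on 2.0.x.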
For 2.0.x we were able to control this in the following way:
mpirun -np 2 --mca btl_openib_warn_default_gid_prefix 0 \
--mca rcache_base_vma_tree_items_min 64 \
--mca rcache_base_vma_tree_items_max 128 ./mpi_bug