## Description
## Background information
### What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.1.4
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Packaged with Intel OPA 10.10.0.0.445
### Please describe the system on which you are running
Two Xeon systems connected back-to-back, one running RHEL 7.6 and the other RHEL 8.0.
## Details of the problem
I was using OMPI to stress test some minor changes to the OPA PSM library when I discovered that the vader transport appears to be leaking memory-mapped files.

I wrote a bash script to run the OSU micro-benchmarks in a continuous loop, alternating between the PSM2 MTL and the OFI MTL. After a 24-hour run, I ran into "resource exhausted" errors when trying to start new shells, execute shell scripts, etc.

Investigating, I found over 100k shared-memory files in `/dev/shm`, all of the form `vader_segment.<hostname>.<hex number>.<decimal number>`.

It's not clear at this point that these shared-memory files are the cause of the resource exhaustion, but they certainly shouldn't be there!
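For a quick check, something along these lines counts the leftover segments on both nodes (hostnames are the two test machines from the run lines below; `find` is used instead of a glob because 100k+ file names would overflow the shell's argument list):

```shell
# Count leftover vader backing files on each node and show a few examples.
for h in hdsmpriv01 hdsmpriv02; do
    echo "=== ${h} ==="
    ssh "${h}" "find /dev/shm -maxdepth 1 -name 'vader_segment.*' | wc -l"
    ssh "${h}" "find /dev/shm -maxdepth 1 -name 'vader_segment.*' | head -n 3"
done
```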
Sample run lines:
```shell
mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr
mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl psm2 -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr
```
Script that was used to run the benchmarks:
```shell
#!/bin/bash
# mpirun --mca mtl_base_verbose 10 --mca osc pt2pt --allow-run-as-root --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -np 2 -H hdsmpriv01,hdsmpriv02 $PWD/IMB-EXT accumulate 2>&1 | tee a

# MTL selections to alternate between: OFI (psm2/ofi_rxd providers) and native PSM2.
OPTS1="--mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd"
OPTS2="--mca osc pt2pt --mca pml cm --mca mtl psm2"
HOSTS="-H hdsmpriv01,hdsmpriv02"
N=48

# Benchmarks run with exactly two ranks (one per node).
TEST_PAIR=(
    ./mpi/pt2pt/osu_bw
    ./mpi/pt2pt/osu_bibw
    ./mpi/pt2pt/osu_latency_mt
    ./mpi/pt2pt/osu_latency
    ./mpi/one-sided/osu_get_latency
    ./mpi/one-sided/osu_put_latency
    ./mpi/one-sided/osu_cas_latency
    ./mpi/one-sided/osu_get_acc_latency
    ./mpi/one-sided/osu_acc_latency
    ./mpi/one-sided/osu_fop_latency
    ./mpi/one-sided/osu_get_bw
    ./mpi/one-sided/osu_put_bibw
    ./mpi/one-sided/osu_put_bw
)

# Benchmarks run oversubscribed with N ranks across both nodes.
TEST_FULL=(
    ./mpi/pt2pt/osu_mbw_mr
    ./mpi/pt2pt/osu_multi_lat
    ./mpi/startup/osu_init
    ./mpi/startup/osu_hello
    ./mpi/collective/osu_allreduce
    ./mpi/collective/osu_scatter
    ./mpi/collective/osu_iallgatherv
    ./mpi/collective/osu_alltoallv
    ./mpi/collective/osu_ireduce
    ./mpi/collective/osu_alltoall
    ./mpi/collective/osu_igather
    ./mpi/collective/osu_allgatherv
    ./mpi/collective/osu_iallgather
    ./mpi/collective/osu_reduce
    ./mpi/collective/osu_ialltoallv
    ./mpi/collective/osu_ibarrier
    ./mpi/collective/osu_ibcast
    ./mpi/collective/osu_gather
    ./mpi/collective/osu_barrier
    ./mpi/collective/osu_iscatter
    ./mpi/collective/osu_scatterv
    ./mpi/collective/osu_igatherv
    ./mpi/collective/osu_allgather
    ./mpi/collective/osu_ialltoall
    ./mpi/collective/osu_ialltoallw
    ./mpi/collective/osu_reduce_scatter
    ./mpi/collective/osu_iscatterv
    ./mpi/collective/osu_gatherv
    ./mpi/collective/osu_bcast
    ./mpi/collective/osu_iallreduce
)

while true; do
    echo "------------------------"
    date
    echo "------------------------"
    for t in "${TEST_PAIR[@]}"; do
        CMD="mpirun --allow-run-as-root -np 2 ${OPTS1} ${HOSTS} ${t}"
        echo "${CMD}"
        eval ${CMD}
        CMD="mpirun --allow-run-as-root -np 2 ${OPTS2} ${HOSTS} ${t}"
        echo "${CMD}"
        eval ${CMD}
    done
    for t in "${TEST_FULL[@]}"; do
        CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS1} ${HOSTS} ${t}"
        echo "${CMD}"
        eval ${CMD}
        CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS2} ${HOSTS} ${t}"
        echo "${CMD}"
        eval ${CMD}
    done
    sleep 60
done
```
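A companion loop along these lines (not part of the script above; hostnames match the run lines) makes the growth of the leak easy to watch while the benchmarks run:

```shell
#!/bin/bash
# Companion monitor: log the number of vader_segment files in /dev/shm on each
# node once a minute. A count that keeps growing across iterations shows the leak.
while true; do
    for h in hdsmpriv01 hdsmpriv02; do
        n=$(ssh "${h}" "find /dev/shm -maxdepth 1 -name 'vader_segment.*' | wc -l")
        echo "$(date '+%F %T') ${h}: ${n} vader segments"
    done
    sleep 60
done
```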