vader transport appears to leave SHM files laying around after successful termination #7220

Closed
@mwheinz

Description

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Packaged with Intel OPA 10.10.0.0.445

Please describe the system on which you are running

Back-to-back Xeon systems running RHEL 7.6 on one and RHEL 8.0 on the other.


Details of the problem

I was using OMPI to stress test some minor changes to the OPA PSM library when I discovered that the vader transport appears to be leaking memory-mapped files.

I wrote a bash script to run the OSU micro benchmarks in a continuous loop, alternating between the PSM2 MTL and the OFI MTL. After a 24-hour run, I ran into some "resource exhausted" issues when trying to start new shells, execute shell scripts, etc.

Investigating, I found over 100k shared memory files in /dev/shm, all of the form vader_segment.<hostname>.<hex number>.<decimal number>
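For anyone trying to reproduce this, a minimal sketch of one way to count the leftover files follows. The function name `count_vader_segments` is hypothetical, and the sketch assumes the segments sit directly in the given directory (default `/dev/shm`) with the `vader_segment.` prefix described above.

```shell
#!/bin/bash

# Hypothetical helper: count files named vader_segment.* directly inside
# a directory (defaults to /dev/shm). It only inspects the local host.
count_vader_segments() {
    local dir="${1:-/dev/shm}"
    # -maxdepth 1 keeps the search out of subdirectories;
    # tr strips the padding some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name 'vader_segment.*' 2>/dev/null \
        | wc -l | tr -d ' '
}
```

Running `count_vader_segments` on each host before and after a batch of runs should show whether the number keeps growing.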

It's not clear at this point that the shared memory files are the cause of the problems I had, but they certainly shouldn't be there!

Sample run lines:

mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr
mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl psm2 -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr

Script that was used to run the benchmarks:

#!/bin/bash

# mpirun --mca mtl_base_verbose 10 --mca osc pt2pt --allow-run-as-root --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -np 2 -H hdsmpriv01,hdsmpriv02 $PWD/IMB-EXT accumulate 2>&1 | tee a

OPTS1="--mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd"
OPTS2="--mca osc pt2pt --mca pml cm --mca mtl psm2"
HOSTS="-H hdsmpriv01,hdsmpriv02"
N=48

TEST_PAIR=(./mpi/pt2pt/osu_bw
	./mpi/pt2pt/osu_bibw
	./mpi/pt2pt/osu_latency_mt
	./mpi/pt2pt/osu_latency
	./mpi/one-sided/osu_get_latency
	./mpi/one-sided/osu_put_latency
	./mpi/one-sided/osu_cas_latency
	./mpi/one-sided/osu_get_acc_latency
	./mpi/one-sided/osu_acc_latency
	./mpi/one-sided/osu_fop_latency
	./mpi/one-sided/osu_get_bw
	./mpi/one-sided/osu_put_bibw
	./mpi/one-sided/osu_put_bw
)
TEST_FULL=(
	./mpi/pt2pt/osu_mbw_mr
	./mpi/pt2pt/osu_multi_lat
	./mpi/startup/osu_init
	./mpi/startup/osu_hello
	./mpi/collective/osu_allreduce
	./mpi/collective/osu_scatter
	./mpi/collective/osu_iallgatherv
	./mpi/collective/osu_alltoallv
	./mpi/collective/osu_ireduce
	./mpi/collective/osu_alltoall
	./mpi/collective/osu_igather
	./mpi/collective/osu_allgatherv
	./mpi/collective/osu_iallgather
	./mpi/collective/osu_reduce
	./mpi/collective/osu_ialltoallv
	./mpi/collective/osu_ibarrier
	./mpi/collective/osu_ibcast
	./mpi/collective/osu_gather
	./mpi/collective/osu_barrier
	./mpi/collective/osu_iscatter
	./mpi/collective/osu_scatterv
	./mpi/collective/osu_igatherv
	./mpi/collective/osu_allgather
	./mpi/collective/osu_ialltoall
	./mpi/collective/osu_ialltoallw
	./mpi/collective/osu_reduce_scatter
	./mpi/collective/osu_iscatterv
	./mpi/collective/osu_gatherv
	./mpi/collective/osu_bcast
	./mpi/collective/osu_iallreduce)

while true; do
	echo "------------------------"
	date
	echo "------------------------"
	for t in "${TEST_PAIR[@]}"
	do
		CMD="mpirun --allow-run-as-root -np 2 ${OPTS1} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}

		CMD="mpirun --allow-run-as-root -np 2 ${OPTS2} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}
	done
	for t in "${TEST_FULL[@]}"
	do
		CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS1} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}

		CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS2} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}
	done
	sleep 60
done
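To narrow down which runs leave segments behind, the loop above could wrap each `mpirun` in a before/after check. The following is only a sketch under stated assumptions: `list_vader_segments` and `run_and_report_leaks` are hypothetical helpers, not part of Open MPI, and they inspect only the host the script runs on, so the second host would need the same check.

```shell
#!/bin/bash

# Hypothetical helper: list vader_segment.* files in a directory, sorted
# so the output can be compared with comm.
list_vader_segments() {
    local dir="${1:-/dev/shm}"
    find "$dir" -maxdepth 1 -type f -name 'vader_segment.*' 2>/dev/null | sort
}

# Hypothetical helper: run a command, then print every segment file that
# was absent before the command and is still present after it.
# Usage: run_and_report_leaks <dir> <command> [args...]
run_and_report_leaks() {
    local dir="$1"; shift
    local before after status
    before=$(mktemp)
    after=$(mktemp)

    list_vader_segments "$dir" > "$before"
    "$@"
    status=$?
    list_vader_segments "$dir" > "$after"

    # comm -13 prints lines found only in the second (after) listing.
    comm -13 "$before" "$after" | sed 's/^/LEAKED: /'

    rm -f "$before" "$after"
    return "$status"
}
```

In the script, `eval ${CMD}` could then become `run_and_report_leaks /dev/shm eval "${CMD}"`, so that any new `LEAKED:` lines appear right after the command that produced them.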
