
Large message collective performance drops when using coll/han #9062

Open
@hjelmn

Description


Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7.9
  • Computer hardware: Intel(R) Xeon(R) CPU
  • Network type: 50 GigE

Details of the problem

I am working to tune Open MPI on a new system type. By default coll/tuned is selected, and it gives so-so performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     599.60
8                     321.18
16                    481.45
32                    483.63
64                    483.59
128                   567.62
256                   472.35
512                   431.96
1024                  609.19
2048                  288.70
4096                  355.52
8192                  425.21
16384                 546.61
32768                 739.76
65536                1501.53
131072               2027.41
262144               1015.34
524288               1328.23
1048576              2101.48

The large-message latencies look OK, but the small-message latencies are not great.
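
As a next step I plan to check whether coll/tuned's small-message behavior improves when its decision function is overridden. A rough sketch of the kind of run I have in mind (the algorithm ID below is only a placeholder; the numeric IDs differ between releases, so the mapping for this build should be confirmed first with ompi_info --param coll tuned --level 9):

# Sketch: force one specific coll/tuned allreduce algorithm instead of the
# built-in decision function; replace "3" with whichever ID ompi_info reports
# for the algorithm to be tested on this build.
mpirun --mca btl_vader_single_copy_mechanism none \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 3 \
       --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce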

When forcing coll/han, small messages look much better, but at a huge cost to large-message performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core --mca coll_han_priority 100 ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     111.77
8                     112.46
16                    111.98
32                    233.86
64                    198.94
128                   321.43
256                   286.42
512                   212.69
1024                  305.23
2048                  257.34
4096                  332.50
8192                  317.34
16384                 359.07
32768                 432.56
65536                 729.18
131072               1102.87
262144               1801.27
524288               3301.01
1048576              6245.48

Is this expected? Another MPI implementation on the same system gets 74us for small messages (below 1k) and 1400us for 1MB messages.
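
For reference, this is how I would enumerate the tunables the two components expose on this 4.1.1 build, in case the large-message drop in coll/han can already be tuned away; I have not mapped out the relevant parameters yet, so this is just the query, not a claim about specific knobs:

# List all MCA parameters exposed by coll/han and coll/tuned on this build
ompi_info --param coll han --level 9
ompi_info --param coll tuned --level 9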
