Description
OMPI version: v2.1
I was recently investigating PMIx_Get latency with the dstore. I was running on 1 node and observed the latency growing as the PPN count increased. I was using the default binding policy, thinking that it defaults to bind-to core.
The bottleneck was attributed to the thread-shift part:
openpmix/openpmix#665 (comment).
Debugging the scheduler showed that the PMIx service thread was assigned to a different core than the main thread, which was causing the perf issues. You can see on the plot that performance degrades noticeably starting from 4 procs. IIRC, this is because mpirun binds to core for up to 2 processes and to socket beyond that.
Perf confirmed that guess:
- cpu # is enclosed in brackets: [0004];
- pmix_intra_perf[164802] is the main thread;
- pmix_intra_perf[164807/164802] is a service thread.
$ perf sched timehist
...
648540.416283 [0004] pmix_intra_perf[164802] 0.005 0.000 0.005.
648540.416289 [0008] pmix_intra_perf[164807/164802] 0.003 0.000 0.007.
648540.416294 [0004] pmix_intra_perf[164802] 0.004 0.000 0.006.
648540.416299 [0008] pmix_intra_perf[164807/164802] 0.003 0.000 0.006.
...
In the 4 PPN case the threads remained on their CPUs the whole time (cpu4 and cpu8). But starting from 16 PPN they began to actively migrate, which caused more rapid growth:
$ perf sched timehist
...
649086.369911 [0019] pmix_intra_perf[165820/165811] 0.004 0.001 0.016.
649086.369914 [0017] pmix_intra_perf[165811] 0.012 0.000 0.006.
649086.369921 [0019] pmix_intra_perf[165820/165811] 0.001 0.000 0.007.
649086.369925 [0017] pmix_intra_perf[165811] 0.005 0.000 0.005.
649086.369933 [0019] pmix_intra_perf[165820/165811] 0.003 0.000 0.008.
649086.369941 [0023] pmix_intra_perf[165811] 0.006 0.000 0.009.
649086.369948 [0019] pmix_intra_perf[165820/165811] 0.006 0.001 0.008.
649086.369953 [0023] pmix_intra_perf[165811] 0.005 0.000 0.006.
649086.369961 [0019] pmix_intra_perf[165820/165811] 0.004 0.001 0.008.
649086.369966 [0023] pmix_intra_perf[165811] 0.005 0.000 0.007.
649086.369984 [0019] pmix_intra_perf[165820/165811] 0.012 0.009 0.010.
649086.369994 [0027] pmix_intra_perf[165811] 0.016 0.001 0.011.
649086.369999 [0019] pmix_intra_perf[165820/165811] 0.008 0.000 0.007.
649086.370004 [0027] pmix_intra_perf[165811] 0.004 0.000 0.006.
649086.370012 [0019] pmix_intra_perf[165820/165811] 0.004 0.000 0.008.
...
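To quantify the migration seen in the traces above, here is a minimal Python sketch that counts, per thread, how many times consecutive `perf sched timehist` records land on a different CPU. The function name `count_migrations` and the regular expression are my own assumptions, not part of perf or PMIx; the parser assumes the column layout shown in this issue (timestamp, CPU in brackets, then `name[tid]` or `name[tid/pid]`) and task names without spaces.

```python
import re
from collections import defaultdict

# Assumed line layout (taken from the traces in this issue):
#   649086.369941 [0023] pmix_intra_perf[165811] 0.006 0.000 0.009
LINE_RE = re.compile(
    r"^\s*(?P<time>\d+\.\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<task>\S+\[\d+(?:/\d+)?\])"
)


def count_migrations(lines):
    """Return {task: (migration_count, sorted_cpus_seen)}.

    A migration is counted whenever a task's record shows a CPU different
    from the CPU in that task's previous record. Lines that do not match
    the assumed layout (headers, "...", blanks) are skipped.
    """
    last_cpu = {}
    migrations = defaultdict(int)
    cpus_seen = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        task = m.group("task")
        cpu = int(m.group("cpu"))
        cpus_seen[task].add(cpu)
        if task in last_cpu and last_cpu[task] != cpu:
            migrations[task] += 1
        last_cpu[task] = cpu
    return {task: (migrations[task], sorted(cpus)) for task, cpus in cpus_seen.items()}


# A few records copied from the 16 PPN trace in this issue.
sample = [
    "649086.369925 [0017] pmix_intra_perf[165811] 0.005 0.000 0.005",
    "649086.369933 [0019] pmix_intra_perf[165820/165811] 0.003 0.000 0.008",
    "649086.369941 [0023] pmix_intra_perf[165811] 0.006 0.000 0.009",
    "649086.369984 [0019] pmix_intra_perf[165820/165811] 0.012 0.009 0.010",
    "649086.369994 [0027] pmix_intra_perf[165811] 0.016 0.001 0.011",
]

for task, (moves, cpus) in count_migrations(sample).items():
    print(task, "migrations:", moves, "cpus:", cpus)
```

On these sample records the main thread moves across cpu17, cpu23 and cpu27 while the service thread stays on cpu19, which matches what is visible by eye in the trace.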
After forcing bind-to core, performance stabilized (yellow dashed curve):
openpmix/openpmix#665 (comment)
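Since the whole issue came from assuming a binding that was not actually in effect, one way to check from inside a process is to read its affinity mask. The sketch below is Linux-only and uses Python's `os.sched_getaffinity`; the helper names and the `threads_per_core` parameter are hypothetical, and the "core-bound" heuristic is just an assumption that a core-level binding leaves no more CPUs in the mask than one core has hardware threads.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPUs the given process may run on (0 = caller). Linux only."""
    return sorted(os.sched_getaffinity(pid))


def looks_core_bound(cpus, threads_per_core=2):
    """Heuristic (assumption): a core-level binding leaves at most
    `threads_per_core` CPUs in the affinity mask; a socket-level binding
    typically leaves many more."""
    return len(cpus) <= threads_per_core


cpus = allowed_cpus()
print("allowed CPUs:", cpus, "core-bound:", looks_core_bound(cpus))
```

A wide mask here would be consistent with the socket-level binding described above, where the scheduler is free to move the main and service threads between cores.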
I think this is additional input on the impact that the default binding policy may have. The suggestion is to consider this at the next OMPI dev meeting.