Per #6298, the behavior of mpirun --host aaa,bbb changed accidentally
between the v2.1.x and v3.0.x series. A fix was just merged to master in #6493.
Here's what happened:
- v2.0.x: behavior X
- v2.1.x: behavior X
- v3.0.x: switch to behavior Y
- v3.1.x: behavior Y
- v4.0.x: behavior Y
- master (to become v5.0.x): back to behavior X after PR #6493 ("Ensure that nodes are always used in order provided")
The question is: should we put this fix on any of v3.0.x, v3.1.x, and/or v4.0.x?
Summary of behavior change
Behavior X
The ordering of hosts in the --host list matters:
$ mpirun --host aaa,bbb rank_test
aaa: MCW rank 0
bbb: MCW rank 1
$ mpirun --host bbb,aaa rank_test
aaa: MCW rank 1
bbb: MCW rank 0
Behavior Y
The ordering of hosts in the --host list does not matter (note: this behavior was unintentional; the intent was always to honor the ordering of hosts in the --host list):
$ mpirun --host aaa,bbb rank_test
aaa: MCW rank 0
bbb: MCW rank 1
$ mpirun --host bbb,aaa rank_test
aaa: MCW rank 0
bbb: MCW rank 1
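For anyone who wants to reason about the difference without a cluster, here is a minimal Python sketch of the two behaviors. It is a toy model, not Open MPI's actual mapping code. The names assign_ranks and show are hypothetical, it assumes one process per host, and it models behavior Y with sorted(), which is only a stand-in for "some fixed order that ignores the command line" (the examples above show only that the result does not depend on the --host order).

```python
def assign_ranks(hosts, honor_order):
    """Return {host: rank}, assuming one process per host.

    honor_order=True  models behavior X: ranks follow the --host order.
    honor_order=False models behavior Y: the --host order is ignored.
    sorted() is an assumption standing in for whatever fixed order is used.
    """
    ordered = list(hosts) if honor_order else sorted(hosts)
    return {host: rank for rank, host in enumerate(ordered)}


def show(hosts, honor_order):
    """Format the mapping like the rank_test output above, listed by host name."""
    ranks = assign_ranks(hosts, honor_order)
    return [f"{host}: MCW rank {ranks[host]}" for host in sorted(ranks)]


if __name__ == "__main__":
    for label, honor in (("Behavior X", True), ("Behavior Y", False)):
        print(label)
        for hosts in (["aaa", "bbb"], ["bbb", "aaa"]):
            print(f"$ mpirun --host {','.join(hosts)} rank_test")
            print("\n".join(show(hosts, honor)))
```

Under this model, only behavior X gives a different mapping when the list is reversed, which is the difference this issue is about.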
Discussion points
We need to discuss this and decide what to do. Points (in no particular order):
- This is a fairly minor change in behavior.
- Apparently no one noticed this change in behavior between v2.1.x and v3.0.x. It was only discovered recently by @bturrubiates, a Cisco employee (while using Open MPI for other / unrelated testing).
- The fix is probably not worth putting into v3.0.x or v3.1.x.
- But it might be worthwhile to put it into v4.0.x...?
- That being said, even putting it in v4.0.x is at least sorta breaking backwards compatibility. You could squint at this and call it a bug and therefore allow it in. Or you could say that it was effectively the behavior of all the v3.x/v4.x releases, and they're backwards compatible with each other, so we should maintain that behavior in v4.0.x.