Description
This is for the v3.0.x branch.
I have some tests that fail with vader on PPC (they pass on x86, which has stronger memory ordering guarantees). It looks to me like one of the wmb calls has been moved. I don't know much about what vader is doing, but I'm guessing the use of the function mca_btl_vader_fbox_set_header() should boil down to
set data
wmb
set header that says the data is there
but the fbox_set_header function has its wmb() call at the bottom so I think it's probably ending up as
set data
set header that says the data is there
wmb
which wouldn't ensure the data is visible to the reader before the header is.
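To make the ordering I have in mind concrete, here is a minimal sketch in portable C11. It is not vader's code: slot_t, slot_init, slot_publish, slot_try_read and SLOT_BYTES are all hypothetical names made up for illustration, and I'm assuming a C11 release fence is a fair stand-in for the wmb (with an acquire fence as the matching barrier on the reader side).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 64   /* hypothetical payload capacity */

/* Hypothetical mailbox slot: payload plus a header that marks it valid. */
typedef struct {
    unsigned char data[SLOT_BYTES];
    _Atomic uint32_t header;    /* 0 = empty, otherwise payload length */
} slot_t;

static void slot_init(slot_t *s)
{
    atomic_init(&s->header, 0);
}

/* Writer: set data, wmb, then set the header that says the data is there. */
static int slot_publish(slot_t *s, const void *buf, uint32_t len)
{
    if (len == 0 || len > SLOT_BYTES)
        return -1;
    memcpy(s->data, buf, len);                          /* set data */
    atomic_thread_fence(memory_order_release);          /* wmb (assumed equivalent) */
    atomic_store_explicit(&s->header, len,
                          memory_order_relaxed);        /* set header */
    return 0;
}

/* Reader: check the header, then a read barrier, then read the data. */
static uint32_t slot_try_read(slot_t *s, void *out)
{
    uint32_t len = atomic_load_explicit(&s->header, memory_order_relaxed);
    if (len == 0)
        return 0;                                       /* nothing published yet */
    assert(len <= SLOT_BYTES);                          /* sanity check on the header */
    atomic_thread_fence(memory_order_acquire);          /* pairs with the writer's fence */
    memcpy(out, s->data, len);
    return len;
}
```

If the fence in slot_publish were moved below the header store, a weakly ordered CPU such as PPC would be free to make the header visible before the payload, so a reader could see a valid header and still copy stale data.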
I can hit the problem with the "maxsoak.c" testcase linked below, built and run as
mpicc -o x maxsoak.c
mpirun -np 6 -mca pml ob1 -mca btl vader,self ./x
and the testcase will detect corruption.
For me the failure message from the testcase ends up something like
4: Invalid data: Act:525138 Exp:850 Peer:2 Datasize:32 Mult:50
I don't know the maxsoak.c testcase well; it's just something I know we didn't write, so I don't have to go through any special approval process to share that code:
https://gist.github.com/markalle/a1c203297cb6af22a3fb5c24e62b2ba3