Closed
Description
We have been seeing a lot of collective failures in MTT for two days in a row.
From what I'm seeing, all the failures are coming from the collective operations. Unfortunately I am not able to reproduce any of them locally. So I looked into the changes in 1.10, and the only commit that might be related to this issue is open-mpi/ompi-release@640bcf6.
This is what most of the stacks look like; other collective tests fail the same way as well.
[mpi008:15568] *** Process received signal ***
[mpi008:15568] Signal: Segmentation fault (11)
[mpi008:15568] Signal code: Address not mapped (1)
[mpi008:15568] Failing at address: 0x100000030
[mpi008:15569] *** Process received signal ***
[mpi008:15568] [ 0] /lib64/libpthread.so.0[0x3ca080f710]
[mpi008:15568] [ 1] /home/mpiteam/scratches/community/2016-07-26cron/dUiN/installs/WC2g/install/lib/libmpi.so.12(mca_pml_ob1_recv_req_start+0x19e)[0x2aaaaad9aeb3]
[mpi008:15568] [ 2] /home/mpiteam/scratches/community/2016-07-26cron/dUiN/installs/WC2g/install/lib/libmpi.so.12(mca_pml_ob1_irecv+0x318)[0x2aaaaad8e03d]
[mpi008:15568] [ 3] /home/mpiteam/scratches/community/2016-07-26cron/dUiN/installs/WC2g/install/lib/libmpi.so.12(mca_coll_inter_allgather_inter+0x176)[0x2aaaaacac2df]
[mpi008:15568] [ 4] /home/mpiteam/scratches/community/2016-07-26cron/dUiN/installs/WC2g/install/lib/libmpi.so.12(PMPI_Allgather+0x283)[0x2aaaaab4efc8]
[mpi008:15568] [ 5] collective/intercomm/allgather_gap_inter[0x4017a6]
[mpi008:15568] [ 6] collective/intercomm/allgather_gap_inter[0x4014fc]
[mpi008:15568] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3ca041ed1d]
[mpi008:15568] [ 8] collective/intercomm/allgather_gap_inter[0x401259]
[mpi008:15568] *** End of error message ***[mpi008:15569] Signal: Segmentation fault (11)
[mpi008:15569] Signal code: Address not mapped (1)
[mpi008:15569] Failing at address: 0x100000030[mpi008:15569] [ 0]
I would be happy to provide additional information if needed.