Description
Hi,
EDIT: I corrected the SHAs mentioned in this first message, since it originally pointed at the wrong SHA.
Up to commit d3587f5 everything was fine, but
as of commit 390e0bc some of our tests fail with errors like this:
[dockercentos7:18478] opal_datatype_pack.c:203
    Pointer 0xdf6c970 size 9 is outside [0xdf6c880,0xdf6c969] for
    base ptr 0xdf6c880 count 10 and data
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:*
--C---P-D--[---][---] OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---] OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---] OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

[dockercentos7:18478] opal_datatype_unpack.c:135
    Pointer 0xeb57a98 size 9 is outside [0xeb579a8,0xeb57a91] for
    base ptr 0xeb579a8 count 10 and data
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:*
--C---P-D--[---][---] OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---] OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---] OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17
Another example:
[dockercentos7:09967] opal_datatype_pack.c:203
    Pointer 0x8be7d78 size 9 is outside [0x8be4c40,0x8be7d71] for
    base ptr 0x8be4c40 count 525 and data
[dockercentos7:09967] Datatype 0x8ab8650[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT8:* OPAL_BOOL:*
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---] OPAL_BOOL count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---] OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description
-cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---] OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17
[dockercentos7:09967] *** Process received signal ***
[dockercentos7:09967] Signal: Aborted (6)
[dockercentos7:09967] Signal code: (-6)
[dockercentos7:09967] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f355e57d5d0]
[dockercentos7:09967] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f355d5a2207]
[dockercentos7:09967] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f355d5a38f8]
[dockercentos7:09967] [ 3] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_Z15attacheDebuggerv+0x2c5e)[0x41a3ee]
[dockercentos7:09967] [ 4] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x2bd0)[0x7f356bcfd7e0]
[dockercentos7:09967] [ 5] /lib64/libc.so.6(+0x36280)[0x7f355d5a2280]
[dockercentos7:09967] [ 6] /lib64/libc.so.6(__sched_yield+0x7)[0x7f355d64ed47]
[dockercentos7:09967] [ 7] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(opal_progress+0xc0)[0x7f355c1988f0]
[dockercentos7:09967] [ 8] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x187)[0x7f355c1a10a5]
[dockercentos7:09967] [ 9] /opt/openmpi-4.x_debug/lib/libmpi.so.40(+0x5ef27)[0x7f355f164f27]
[dockercentos7:09967] [10] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7f355f164fe9]
[dockercentos7:09967] [11] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xeb)[0x7f355f209957]
[dockercentos7:09967] [12] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x35e)[0x7f355f20b976]
[dockercentos7:09967] [13] /opt/openmpi-4.x_debug/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xa8)[0x7f354b37e42e]
[dockercentos7:09967] [14] /opt/openmpi-4.x_debug/lib/libmpi.so.40(PMPI_Allreduce+0x3c5)[0x7f355f181612]
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_confdefs.h
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_ompi_info_all.txt
All failing tests run with more than one process.
They all show opal_datatype_pack.c:203 and opal_datatype_unpack.c:135, as above.
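To make the "is outside" messages easier to read, here is a small sketch that turns the logged addresses into offsets relative to the base pointer. The helper name `decode_bounds` is hypothetical, and the formula used for the upper bound, `(count - 1) * extent + true_ub`, is only inferred from the numbers in the logs, not taken from the Open MPI source.

```python
def decode_bounds(ptr, base, upper, extent, true_ub, count):
    """Express one logged bounds message as offsets from the base pointer.

    All arguments are copied by hand from the log. The expected upper
    offset below is an assumption inferred from the logged values.
    """
    ptr_offset = ptr - base
    upper_offset = upper - base
    return {
        "ptr_offset": ptr_offset,                  # bytes past the base pointer
        "element_index": ptr_offset // extent,     # which element the pointer falls in
        "offset_in_element": ptr_offset % extent,  # position inside that element
        "upper_offset": upper_offset,
        "upper_matches_layout": upper_offset == (count - 1) * extent + true_ub,
    }


# Values from the first pack message (count 10, extent 24, true_ub 17).
first = decode_bounds(0xdf6c970, 0xdf6c880, 0xdf6c969, 24, 17, 10)
# Values from the second pack message (count 525, same extent and true_ub).
second = decode_bounds(0x8be7d78, 0x8be4c40, 0x8be7d71, 24, 17, 525)

print(first)
print(second)
```

In both cases the flagged pointer sits exactly `count * extent` bytes past the base (240 and 12600), that is, at the start of an element with index `count`, one past the last valid element. The unpack message gives the same offsets as the first pack message. I have not checked what this means in the pack/unpack code itself.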
Note that we are compiling/testing with --enable-debug ...
I do not have an MWE yet, but I wanted to report this as soon as possible so that you are aware of it.
Thanks,
Eric