ompi/v3.x.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 #6932

Closed
@ericch1


Hi,

EDIT: I corrected the SHAs mentioned in this first message, since it originally cited the wrong ones.
Up to commit d3587f5, everything was fine, but
as of commit 390e0bc, some of our tests fail with errors like this:

[dockercentos7:18478] opal_datatype_pack.c:203
	Pointer 0xdf6c970 size 9 is outside [0xdf6c880,0xdf6c969] for
	base ptr 0xdf6c880 count 10 and data 
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:* 
--C---P-D--[---][---]    OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

[dockercentos7:18478] opal_datatype_unpack.c:135
	Pointer 0xeb57a98 size 9 is outside [0xeb579a8,0xeb57a91] for
	base ptr 0xeb579a8 count 10 and data 
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:* 
--C---P-D--[---][---]    OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

Another example:

[dockercentos7:09967] opal_datatype_pack.c:203
	Pointer 0x8be7d78 size 9 is outside [0x8be4c40,0x8be7d71] for
	base ptr 0x8be4c40 count 525 and data 
[dockercentos7:09967] Datatype 0x8ab8650[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT8:* OPAL_BOOL:* 
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_BOOL count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

[dockercentos7:09967] *** Process received signal ***
[dockercentos7:09967] Signal: Aborted (6)
[dockercentos7:09967] Signal code:  (-6)
[dockercentos7:09967] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f355e57d5d0]
[dockercentos7:09967] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f355d5a2207]
[dockercentos7:09967] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f355d5a38f8]
[dockercentos7:09967] [ 3] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_Z15attacheDebuggerv+0x2c5e)[0x41a3ee]
[dockercentos7:09967] [ 4] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x2bd0)[0x7f356bcfd7e0]
[dockercentos7:09967] [ 5] /lib64/libc.so.6(+0x36280)[0x7f355d5a2280]
[dockercentos7:09967] [ 6] /lib64/libc.so.6(__sched_yield+0x7)[0x7f355d64ed47]
[dockercentos7:09967] [ 7] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(opal_progress+0xc0)[0x7f355c1988f0]
[dockercentos7:09967] [ 8] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x187)[0x7f355c1a10a5]
[dockercentos7:09967] [ 9] /opt/openmpi-4.x_debug/lib/libmpi.so.40(+0x5ef27)[0x7f355f164f27]
[dockercentos7:09967] [10] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7f355f164fe9]
[dockercentos7:09967] [11] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xeb)[0x7f355f209957]
[dockercentos7:09967] [12] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x35e)[0x7f355f20b976]
[dockercentos7:09967] [13] /opt/openmpi-4.x_debug/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xa8)[0x7f354b37e42e]
[dockercentos7:09967] [14] /opt/openmpi-4.x_debug/lib/libmpi.so.40(PMPI_Allreduce+0x3c5)[0x7f355f181612]

http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_confdefs.h
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_ompi_info_all.txt

All failing tests run with more than one process.
They all show opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 as above.
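
As a reading aid for the numbers in the dumps above, here is a small sketch that recomputes the reported ranges. The formula `base + (count - 1) * extent + true_ub` is an assumption inferred from the printed values, not taken from the Open MPI source, and `valid_range` is a hypothetical helper name. Under that assumption the upper bound matches all three dumps, and in each case the offending pointer sits exactly `count * extent` bytes past the base pointer.

```python
def valid_range(base, count, extent, true_ub):
    """Return (low, high) as printed in the dumps.

    Assumption: high = base + (count - 1) * extent + true_ub.
    This is inferred from the log values, not from the Open MPI code.
    """
    return base, base + (count - 1) * extent + true_ub


# Values copied from the three dumps: (base ptr, count, reported upper bound, reported pointer)
cases = [
    (0xdf6c880, 10, 0xdf6c969, 0xdf6c970),   # opal_datatype_pack.c:203
    (0xeb579a8, 10, 0xeb57a91, 0xeb57a98),   # opal_datatype_unpack.c:135
    (0x8be4c40, 525, 0x8be7d71, 0x8be7d78),  # second example, pack
]

EXTENT = 24   # "extent 24" in every dump
TRUE_UB = 17  # "true_ub 17" in every dump

for base, count, reported_high, ptr in cases:
    low, high = valid_range(base, count, EXTENT, TRUE_UB)
    print(
        f"base={hex(base)} count={count} "
        f"computed_high={hex(high)} matches_log={high == reported_high} "
        f"ptr_offset={ptr - base} count*extent={count * EXTENT}"
    )
```

This only restates what the logs already show; it says nothing about why the pointer ends up there.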

Note that we are compiling/testing with --enable-debug ...

I do not have an MWE yet, but I wanted to report this as soon as possible so you are aware of it.

Thanks,

Eric
