Skip to content

Conversation

@hppritcha
Copy link
Member

Brings some updates on the datatype engine into the 4.0. Among these the most critical is the partial unpack bug from #8466.

Here are the commits from master that are covered by this PR:
e8ebe13
9901325
ef28e8d
73d64cb
fb07960

It must be noted that this PR does not bring the support for MPI_LONG and MPI_UNSIGNED_LONG in external32, because it would have required to break the ABI (because of the 2 new datatypes #define added).

Unfortunately, I had to import 2 additional commits in order to be able to build and run on an M1: 4f2dde0 and 73aae14.

Fixes #8466.

One of these commits is intentionally not a cherry pick: bot:notacherrypick

@hppritcha hppritcha requested review from bosilca and gpaulsen April 26, 2021 20:57
@gpaulsen
Copy link
Member

Hmm Mellanox (https://dev.azure.com/mlnx-swx/mellanox-ompi/_build/results?buildId=2782&view=logs&j=05ee1795-5e75-566a-c260-61630d0851a3&t=c5df7113-1021-5252-6808-853389e8d517&l=18904) failed with:

../../config/test-driver: line 107:  3193 Aborted                 (core dumped) "$@" > $log_file 2>&1
FAIL: partial

Also check out #8845, as that may resolve this???

@hppritcha
Copy link
Member Author

i'm able to reproduce the fpe on x86_64.

@jsquyres jsquyres changed the title Topic/fix data type issue v4.0.x: Fix "partial" datatype issue Apr 27, 2021
@hppritcha
Copy link
Member Author

looks like I didn't resolve a conflict correctly.

Copy link
Member

@gpaulsen gpaulsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hppritcha I noticed two differences between this PR and PR #8769 on master.

Marking this as Request change, but really just need to look at more closely.

* and keep it hot until the next round.
*/
assert( iov_len_local < opal_datatype_basicDatatypes[pElem->elem.common.type]->size );
COMPUTE_CSUM( iov_ptr, iov_len_local, pConvertor );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition above if ( 0 != iov_len_local ) { does not match the master if after the first assert after the complete_loop label:
fb07960#diff-596161c653313895d0f7a8d253a134205ae515ec3f7a28ac08355e3a7f1278f3R422

&temp );

pConvertor->partial_length = iov_len_local;
iov_len_local = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Master has an additional assert here before the opal_unpack_partial_predefined:
assert( iov_len_local < opal_datatype_basicDatatypes[pElem->elem.common.type]->size );
see: fb07960#diff-596161c653313895d0f7a8d253a134205ae515ec3f7a28ac08355e3a7f1278f3R427

bosilca and others added 8 commits April 28, 2021 14:15
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit e8ebe13)
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
(cherry picked from commit 9901325)
(cherry picked from commit 454d071)
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit ef28e8d)
(cherry picked from commit 4b2ad7f)
This commit fixes the support for heterogeneous environments and
specifically for external32. The root cause was that during the datatype
optimization process types that are contiguous in memory are collapsed
together in order to decrease the number of conversion (or memcpy)
function calls. The resulting type however, does not have the same
conversion rules as the types it replaced, leading to an incorrect (or
absent) conversion in some cases. This patch marks the datatypes where
types have been collapsed during the optimization process with a flag,
allowing the convertor to detect if the optimized type can be used in
heterogeneous setups.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 73d64cb)
(cherry picked from commit 2e583f4)
There was a bug allowing for partial packing of non-data elements (such as loop
and end_loop markers) during the exit condition of a pack/unpack call. This has
basically no meaning. Prevent this bug from happening by making sure the element
point to a data before trying to partially pack it.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 1264b67)
When unpacking a partial predefined element check the boundaries of the
description vector type, and adjust the memory pointer accordingly (to
reflect not only when a single basic type was correctly unpacked, but
also when an entire blocklen has been unpacked).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit fb07960)
(cherry picked from commit 06f3364)

Conflicts:
	test/datatype/Makefile.am
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

bot:notacherrypick
(cherry picked from commit 565d72e)

Conflicts:
	test/datatype/Makefile.am
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha hppritcha force-pushed the topic/fix_data_type_issue branch from 125ae8d to 2fe7de1 Compare April 29, 2021 15:20
@hppritcha
Copy link
Member Author

@gpaulsen repushed

@gpaulsen gpaulsen self-requested a review April 30, 2021 00:47
@gpaulsen
Copy link
Member

gpaulsen commented May 3, 2021

@hppritcha, @bosilca suggested we hold off on merging the v5.0.x version of this PR #8891 (comment) until we know more. I assume we should hold off on this a little while longer as well?

@hppritcha
Copy link
Member Author

sure we can wait

@gpaulsen gpaulsen requested a review from jsquyres May 18, 2021 15:13
@jsquyres jsquyres added this to the v4.0.6 milestone May 18, 2021
@hppritcha hppritcha removed the request for review from jsquyres May 21, 2021 18:15
@gpaulsen gpaulsen merged commit 4ef1243 into open-mpi:v4.0.x May 21, 2021
@hppritcha hppritcha added the NEWS label May 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants