-
Notifications
You must be signed in to change notification settings - Fork 931
v4.0.x: Fix "partial" datatype issue #8859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v4.0.x: Fix "partial" datatype issue #8859
Conversation
|
Hmm Mellanox (https://dev.azure.com/mlnx-swx/mellanox-ompi/_build/results?buildId=2782&view=logs&j=05ee1795-5e75-566a-c260-61630d0851a3&t=c5df7113-1021-5252-6808-853389e8d517&l=18904) failed with: Also check out #8845, as that may resolve this??? |
|
i'm able to reproduce the fpe on x86_64. |
|
looks like I didn't resolve a conflict correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hppritcha I noticed two differences between this PR and PR #8769 on master.
Marking this as Request change, but really just need to look at more closely.
| * and keep it hot until the next round. | ||
| */ | ||
| assert( iov_len_local < opal_datatype_basicDatatypes[pElem->elem.common.type]->size ); | ||
| COMPUTE_CSUM( iov_ptr, iov_len_local, pConvertor ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition above if ( 0 != iov_len_local ) { does not match the master if after the first assert after the complete_loop label:
fb07960#diff-596161c653313895d0f7a8d253a134205ae515ec3f7a28ac08355e3a7f1278f3R422
| &temp ); | ||
|
|
||
| pConvertor->partial_length = iov_len_local; | ||
| iov_len_local = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Master has an additional assert here before the opal_unpack_partial_predefined:
assert( iov_len_local < opal_datatype_basicDatatypes[pElem->elem.common.type]->size );
see: fb07960#diff-596161c653313895d0f7a8d253a134205ae515ec3f7a28ac08355e3a7f1278f3R427
Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit e8ebe13)
This commit fixes the support for heterogeneous environments and specifically for external32. The root cause was that during the datatype optimization process types that are contiguous in memory are collapsed together in order to decrease the number of conversion (or memcpy) function calls. The resulting type however, does not have the same conversion rules as the types it replaced, leading to an incorrect (or absent) conversion in some cases. This patch marks the datatypes where types have been collapsed during the optimization process with a flag, allowing the convertor to detect if the optimized type can be used in heterogeneous setups. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 73d64cb) (cherry picked from commit 2e583f4)
There was a bug allowing for partial packing of non-data elements (such as loop and end_loop markers) during the exit condition of a pack/unpack call. This has basically no meaning. Prevent this bug from happening by making sure the element point to a data before trying to partially pack it. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 1264b67)
When unpacking a partial predefined element check the boundaries of the description vector type, and adjust the memory pointer accordingly (to reflect not only when a single basic type was correctly unpacked, but also when an entire blocklen has been unpacked). Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit fb07960) (cherry picked from commit 06f3364) Conflicts: test/datatype/Makefile.am
Signed-off-by: George Bosilca <bosilca@icl.utk.edu> bot:notacherrypick (cherry picked from commit 565d72e) Conflicts: test/datatype/Makefile.am
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
125ae8d to
2fe7de1
Compare
|
@gpaulsen repushed |
|
@hppritcha, @bosilca suggested we hold off on merging the v5.0.x version of this PR #8891 (comment) until we know more. I assume we should hold off on this a little while longer as well? |
|
sure we can wait |
Brings some updates on the datatype engine into the 4.0. Among these the most critical is the partial unpack bug from #8466.
Here are the commits from master that are covered by this PR:
e8ebe13
9901325
ef28e8d
73d64cb
fb07960
It must be noted that this PR does not bring the support for MPI_LONG and MPI_UNSIGNED_LONG in external32, because it would have required to break the ABI (because of the 2 new datatypes #define added).
Unfortunately, I had to import 2 additional commits in order to be able to build and run on an M1: 4f2dde0 and 73aae14.
Fixes #8466.
One of these commits is intentionally not a cherry pick: bot:notacherrypick