Skip to content

Conversation

@ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@bosilca can you please review this ?

Refs. #6285

correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop

Refs. open-mpi#6285

Thanks Axel Huebl for reporting this

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet ggouaillardet force-pushed the topic/opal_convertor_raw branch from 9aa778f to 0832ab5 Compare January 23, 2019 06:38
@ggouaillardet
Copy link
Contributor Author

@edgargabriel FYI, this fixes the hdf5 data corruption reported in #6285

@edgargabriel
Copy link
Member

@ggouaillardet thank you, I will test it later today. I can however confirm that it is inline with my analysis over the weekend, that the values that ompio received from the convertor_raw function looked incorrect.

@hppritcha
Copy link
Member

bot:lanl:retest

@ax3l
Copy link

ax3l commented Jan 26, 2019

@ggouaillardet @edgargabriel thank you a lot for finding the root source of this and the PR!
Just to understand the implications for users, does this fix only affect I/O or is opal_convertor_raw used somewhere else as well which might affect other components?

@edgargabriel
Copy link
Member

@ax3l if I remember correctly, the convertor_raw function is also used in the osc/rdma component as well

@hppritcha
Copy link
Member

bot:ompi:retest

@hjelmn
Copy link
Member

hjelmn commented Jan 29, 2019

Looks fine to me. Let's go ahead and bring this into master. Please PR to the affected branches and tag @bosilca to review.

@hjelmn hjelmn merged commit ea40d48 into open-mpi:master Jan 29, 2019
@bosilca
Copy link
Member

bosilca commented Jan 30, 2019

For the sake of completeness, the solution proposed here is not really working on all cases. A better approach has been proposed in #6326 .

@hjelmn
Copy link
Member

hjelmn commented Jan 30, 2019

@bosilca Great. Will test with that. Trying to get ARMCI fully tested with osc/rdma and this helped it pass.

@ggouaillardet
Copy link
Contributor Author

thanks @bosilca, I will test the new PR

@hjelmn FWIW, I do not think the derived datatypes used by ARMCI are complex enough (read the description exceeds OMPI_OSC_RDMA_DECODE_MAX = 64 elements) to be impacted by this fix.

@bosilca
Copy link
Member

bosilca commented Jan 30, 2019

The underlying issue can be replicated with a simple vector, as long as the iov_count is smaller than the count of the vector. Moreover, now that I understood the issue I think we might have the same type of problems in the pack/unpack, which is what #6172 is trying to address. Will try to take a look tomorrow.

@ax3l ax3l mentioned this pull request Feb 23, 2019
bosilca pushed a commit to bosilca/ompi that referenced this pull request Feb 27, 2019
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop

Refs. open-mpi#6285

Thanks Axel Huebl for reporting this

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

cherry-pick open-mpi#6295 into 3.1
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
bosilca pushed a commit to bosilca/ompi that referenced this pull request Feb 27, 2019
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop

Refs. open-mpi#6285

Thanks Axel Huebl for reporting this

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

cherry-pick open-mpi#6295 into 3.1
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants