Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Jan 24, 2019

This commit fixes a bug which can lead to incorrect results when using
MPI_OP_NO_OP. The component was not ignoring the source address,
count, and datatype as it should have.

This commit also cleans up some unused/unnecessary code.

References #6275

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

@hjelmn
Copy link
Member Author

hjelmn commented Jan 24, 2019

:bot:retest:ompi:

@ggouaillardet
Copy link
Contributor

@hjelmn I pushed a trivial fix to your commit. Feel free to squash it into yours.

@hppritcha
Copy link
Member

bot:lanl:retest

@hppritcha
Copy link
Member

bot:ompi:retest

@hjelmn hjelmn force-pushed the fix_osc_rdma_issue_6275_mpi_op_no_op_bug branch from 7af907b to cad89f8 Compare January 28, 2019 16:11
@hjelmn hjelmn force-pushed the fix_osc_rdma_issue_6275_mpi_op_no_op_bug branch from cad89f8 to 1aeb91b Compare January 30, 2019 23:00
@hjelmn
Copy link
Member Author

hjelmn commented Jan 30, 2019

@ggouaillardet Cleaned up things a bit and handled some other problems found when attempting to use ARMCI on Crays. Please test this version.

@ibm-ompi
Copy link

ibm-ompi commented Feb 6, 2020

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/81a4fcac04cdd747f4f07bc8dae20c13

@awlauria
Copy link
Contributor

awlauria commented Mar 6, 2020

If nothing else it would be good to get the bug fix in for 5.0.

@awlauria awlauria added this to the v5.0.0 milestone Mar 6, 2020
@hjelmn
Copy link
Member Author

hjelmn commented Mar 6, 2020

wow. completely forgot about this one. will take a look today

@awlauria
Copy link
Contributor

Re-ping. 5.0 branching is targeted for April 30th. If you want this in for 5.0, please target to get it in master by then. Thanks!

@awlauria
Copy link
Contributor

Update - Since this is a bug fix, we don't need it by April 30th to make the 5.0 cut, but the sooner the better. Thanks.

@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/899d8d91e2ad598b59d3fcbfe3241f88

@hjelmn
Copy link
Member Author

hjelmn commented Jun 23, 2020

Will actually finish this up today.

@hjelmn hjelmn force-pushed the fix_osc_rdma_issue_6275_mpi_op_no_op_bug branch from 1aeb91b to f0f3bab Compare June 23, 2020 16:13
@hjelmn
Copy link
Member Author

hjelmn commented Jun 23, 2020

Should be good to go. Will run some tests shortly to verify.

@lanl-ompi
Copy link
Contributor

Can one of the admins verify this patch?

@jsquyres
Copy link
Member

bot:aws:retest

@jsquyres
Copy link
Member

@hjelmn tells me that this is still relevant / just fell off the radar. It should probably go to releases as well (which I interpret to mean v4.0.x and v4.1.x).

It passed a retest 3 days ago, so it's probably good to go.

@jsquyres
Copy link
Member

@bosilca @ggouaillardet ...any other one-sided people: can you review? I can do a functional test on this, but can't really review the code.

@rajachan
Copy link
Member

rajachan commented Nov 2, 2020

@zhngaj can you review this one?

@zhngaj
Copy link
Contributor

zhngaj commented Nov 2, 2020

Sure, please give me an hour or two if possible.

@hjelmn hjelmn force-pushed the fix_osc_rdma_issue_6275_mpi_op_no_op_bug branch from f0f3bab to ec71780 Compare February 5, 2021 17:20
@ibm-ompi
Copy link

ibm-ompi commented Feb 5, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/25baea9bad1c453c6e22190672ef709d

This commit fixes a number of bugs in the handling of derived
datatypes when using MPI__Accumulate, MP_Get_accumulate, and
MPI_Fetch_and_op. The following bugs are fixed:

 - Incorrect results when using MPI_OP_NO_OP with a 0 origin
   count. osc/rdma was not ignoring the source address, count, and
   datatype in the MPI_OP_NO_OP case as is required by the standard.

 - Correctly handle result_buffer=MPI_BOTTOM in
   ompi_osc_rdma_gacc_local().

 - Test result_datatype in order to figure out whether a get is
   performed.

References open-mpi#6275

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@ibm-ompi
Copy link

ibm-ompi commented Feb 5, 2021

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8ba5f583c0c4816cbe59b61c7dfef83c

@ibm-ompi
Copy link

ibm-ompi commented Feb 5, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/eb9eb190c9d2bdaa4a206158f22ee3b6

@hjelmn
Copy link
Member Author

hjelmn commented Feb 5, 2021

@zhngaj Ready for review again. Rebased and all good to go.

@hjelmn
Copy link
Member Author

hjelmn commented Feb 5, 2021

Looks like a build issue from the rebase fixing.

This commit rearranges the accumulate code so that network AMOs can be
used in a larger number of potential situations. This commit adds a
new MCA variable: osc_rdma_network_max_amo. This variable controls the
maximum datatype count where AMOs will be used. The old default for
this support was count == 1. The new default is count == 32.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a resource leak when using network atomics. There
was a leak when the underlying BTL did not support the attempted
atomic.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn hjelmn force-pushed the fix_osc_rdma_issue_6275_mpi_op_no_op_bug branch from ec71780 to 9b2ed1e Compare February 5, 2021 20:31
@ibm-ompi
Copy link

ibm-ompi commented Feb 5, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a405129f92ef05223da08d53104f10ef

@ibm-ompi
Copy link

ibm-ompi commented Feb 5, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/75905cbb59d1f238d71618f85a690e9e

@hjelmn
Copy link
Member Author

hjelmn commented Feb 5, 2021

@jjhursey Odd failures. I see only a couple of warnings (i should fix) but no code errors.

@jjhursey
Copy link
Member

jjhursey commented Feb 8, 2021

I'm looking at the CI failures now. They seem infrastructure related.

@jjhursey
Copy link
Member

jjhursey commented Feb 8, 2021

bot:ibm:retest

@jjhursey
Copy link
Member

jjhursey commented Feb 8, 2021

IBM Ci is looking good now. Sorry for the noise.

@jsquyres
Copy link
Member

jsquyres commented Feb 8, 2021

@zhngaj Nathan says that this is rebased, still necessary, and ready for review.

Can you review in the near future?

Thanks!

@zhngaj
Copy link
Contributor

zhngaj commented Feb 8, 2021

@zhngaj Nathan says that this is rebased, still necessary, and ready for review.

Can you review in the near future?

Thanks!

Sure, will review.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhngaj approved, but is not a committer. I'm approving to effectively proxy @zhngaj's approval.

@hjelmn hjelmn merged commit 4cd8e91 into open-mpi:master Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants