
Conversation

@awlauria (Contributor)

No description provided.

@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch from ab0b49f to 2bb2c16 on February 11, 2021 18:54
@jjhursey jjhursey requested a review from hjelmn February 11, 2021 19:05
@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch from 2bb2c16 to fc6a66d on February 11, 2021 19:21
@hjelmn (Member) commented Feb 11, 2021

See #8469.

@hjelmn (Member) commented Feb 11, 2021

Please review #8469 and let me know if there are still issues. The warning fixes look OK to me.

@awlauria (Contributor, author)

@hjelmn thanks - I didn't see your PR. Merged; I'll rebase this PR if there's anything left.

@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch from fc6a66d to ac93b4f on February 11, 2021 23:19
@awlauria (Contributor, author)

@hjelmn I rebased with your PR + another fix - I found that this test was failing with OMPI_ERR_RMA_RANGE at a relatively low task count (>= 3):
https://scm.nowlab.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/test/mpi/rma/linked_list_bench_lock_all.c

It looks like a timing gap where rank A hasn't quite finished exposing the memory while rank B is looking for it. I attempted to patch it, and the test now runs clean for me at 40+ ranks with --oversubscribe. Can you take a look?
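
A minimal sketch of the pattern that test exercises, under simplifying assumptions: the element type is a stand-in, and a broadcast takes the place of the atomic RMA pointer swap the real test uses to publish the displacement. This is not the verbatim test code.

```c
#include <stdlib.h>
#include <mpi.h>

/* Stand-in for the test's list element; the real test stores a
 * {rank, displacement} pair as the next pointer. */
typedef struct { MPI_Aint next_disp; int value; } llist_elem_t;

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    llist_elem_t *elem = NULL;
    MPI_Aint disp = 0;
    if (0 == rank) {
        /* Expose a new node, then publish its displacement. */
        elem = calloc(1, sizeof(*elem));
        MPI_Win_attach(win, elem, sizeof(*elem));
        MPI_Get_address(elem, &disp);
    }
    /* The real test publishes the displacement with an atomic RMA op;
     * either way, a peer can learn the address immediately after the
     * attach completes locally. */
    MPI_Bcast(&disp, 1, MPI_AINT, 0, MPI_COMM_WORLD);

    if (0 != rank) {
        /* If this lookup hits rank 0's region table before the attach
         * above is recorded there, osc/rdma was failing the operation
         * with OMPI_ERR_RMA_RANGE -- the timing gap described above. */
        MPI_Aint next;
        MPI_Get_accumulate(NULL, 0, MPI_BYTE, &next, 1, MPI_AINT,
                           0, disp, 1, MPI_AINT, MPI_NO_OP, win);
        MPI_Win_flush(0, win);
    }

    MPI_Win_unlock_all(win);
    if (0 == rank) MPI_Win_detach(win, elem);
    MPI_Win_free(&win);
    free(elem);
    MPI_Finalize();
    return 0;
}
```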

@awlauria awlauria changed the title Fix rdma accumulate segv Fix rdma warnings + possible race Feb 11, 2021
@awlauria awlauria changed the title Fix rdma warnings + possible race Fix osc/rdma gcc warnings + possible race Feb 11, 2021
@hjelmn (Member) commented Feb 12, 2021

@awlauria Well, for dynamic memory windows callers must use an external synchronization method - MPI_Barrier, for example. Will look at the test.
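
The synchronization being described, as a sketch with a hypothetical helper (not from the test or from Open MPI):

```c
#include <mpi.h>

/* Hypothetical helper: make an attach known to all ranks before any
 * peer may target the region. MPI_Barrier is one valid external
 * synchronization method; the test instead publishes the pointer with
 * an atomic RMA operation after the attach returns. */
static void attach_and_sync(MPI_Win win, MPI_Comm comm,
                            void *buf, MPI_Aint len)
{
    MPI_Win_attach(win, buf, len); /* expose the new memory             */
    MPI_Barrier(comm);             /* no peer can race ahead of attach  */
}
```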

@hjelmn (Member) commented Feb 12, 2021

Hmm, though in that test the attach should be complete before the pointer is put in the list. I may need to root-cause this.

@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch 3 times, most recently from a07c2da to 9d7af02 on February 12, 2021 02:46
@awlauria (Contributor, author)

@hjelmn thanks - I took another look. I found some spots where the exclusive lock needed tightening around shared variables, specifically when caching the region count. The test now runs clean for me at 100 procs per run on one node - but I'll keep it running for a while to see how it goes.
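
The shape of that tightening, as a hypothetical sketch: all names here are invented, and the actual osc/rdma code uses its own exclusive lock (a network-level RMA lock), not a pthread mutex. The point is only that reads of the cached count must hold the same lock writers hold.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock; /* stands in for the module's exclusive lock */
    int region_count;     /* shared: number of attached regions        */
} win_state_t;

/* Before the fix, the cached count could be read without the lock, so
 * a reader could observe stale state while an attach was in flight. */
static int read_region_count(win_state_t *st)
{
    pthread_mutex_lock(&st->lock);   /* read under the writers' lock */
    int count = st->region_count;
    pthread_mutex_unlock(&st->lock);
    return count;
}

static void add_region(win_state_t *st)
{
    pthread_mutex_lock(&st->lock);
    st->region_count++;              /* update and publish atomically */
    pthread_mutex_unlock(&st->lock);
}
```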

@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch from 817b8bd to f97cb12 on February 12, 2021 03:00
@awlauria (Contributor, author)

@hjelmn I took out the 'retry code' and ran these commits overnight - all 10k+ runs passed.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
There were some instances where the exclusive lock needed some
tightening around the region structure.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
@awlauria awlauria force-pushed the fix_rdma_accumulate_segv branch from f97cb12 to 7856a25 on February 12, 2021 19:46
@jsquyres (Member)

Do we need this on v4.1 as well?

@awlauria (Contributor, author)

If @hjelmn agrees with the change, then yeah I think it should go to the 4.x.x branches.

@hjelmn hjelmn merged commit d3cba3e into open-mpi:master Feb 15, 2021
@awlauria awlauria deleted the fix_rdma_accumulate_segv branch March 17, 2022 17:24