Skip to content

v3.1.x: REF6976 Silent failure of OMPI over OFI with large messages sizes #7003

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 4, 2019
Merged

v3.1.x: REF6976 Silent failure of OMPI over OFI with large messages sizes #7003

merged 1 commit into from
Oct 4, 2019

Conversation

mwheinz
Copy link

@mwheinz mwheinz commented Sep 23, 2019

INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

(cherry-picked from commit 3aca4af)
Change-Id: I65ab130794574393edab4a318365c1c125e50712
Signed-off-by: Michael Heinz michael.william.heinz@intel.com
Reviewed-by: Adam Goldman adam.goldman@intel.com
Reviewed-by: Brendan Cunningham brendan.cunningham@intel.com

@@ -247,13 +247,20 @@ ompi_mtl_ofi_send_start(struct mca_mtl_base_module_t *mtl,
endpoint = ompi_mtl_ofi_get_endpoint(mtl, ompi_proc);

ompi_ret = ompi_mtl_datatype_pack(convertor, &start, &length, &free_after);
if (OMPI_SUCCESS != ompi_ret) return ompi_ret;
if (OPAL_UNLIKELY(OMPI_SUCCESS != ompi_ret)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a very small number of style guidelines (https://github.com/open-mpi/ompi/wiki/CodingStyle), but one of them is: absolutely no tabs whatsoever. Always spaces. 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! okay.

"message too big", false,
length, endpoint->mtl_ofi_module->max_msg_size);
return OMPI_ERROR;
} else if (OPAL_UNLIKELY(MCA_PML_BASE_SEND_SYNCHRONOUS == mode)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why "else if" ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly personal preference - I tend to prefer the form

if (A) { return fail; } else if (B) { return fail; }

to
if (A) { return fail; } if (B) { return fail; }

but that's purely cosmetic. I can change it if you like.

@mwheinz
Copy link
Author

mwheinz commented Sep 23, 2019

Note that this patch is for 3.1.x only. The patch for 4.x/5.x/master needs to be slightly different.

@mwheinz mwheinz requested a review from jsquyres September 23, 2019 18:54
@hppritcha
Copy link
Member

hppritcha commented Sep 23, 2019

@mwheinz first need to open a PR on master then we cherry pick to release branches.
As you noted we probably have to adjust some for different branches.

@mwheinz
Copy link
Author

mwheinz commented Sep 23, 2019

@hppritcha

@mwheinz first need to open a PR on master then we cherry pick to release branches.

Okay.

@mwheinz
Copy link
Author

mwheinz commented Sep 23, 2019

Are there any other issues I should look at before I submit the master branch version?

@gpaulsen
Copy link
Member

@mwheinz, If you can get the v4.0.x PR done soon (presumably a cherry-pick -x from master if possible), we will get it into the next v4.0.2 rc.

@jsquyres
Copy link
Member

Are there any other issues I should look at before I submit the master branch version?

No, I think you're good. I should have mentioned this in my first review: we generally commit stuff to master first, and then git cherry-pick -x ... to get them to release branches (potentially adapting / porting the cherry-pick if necessary). We like to see the (cherry-picked from ...) tagline in the release branch PRs / cherry-picks so that we can track what commits on the release branches came from where.

So once the master one is merged, can you amend this commit with a cherry-picked message (like it would have been if you git cherry-pick -x ...'ed)?

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good code-wise, now; thanks. Just need to get the cherry-pick notice in the commit message once #7004 is merged to master.

@jsquyres jsquyres added this to the v3.1.5 milestone Sep 23, 2019
@jsquyres jsquyres changed the title REF6976 Silent failure of OMPI over OFI with large messages sizes v3.1.x: REF6976 Silent failure of OMPI over OFI with large messages sizes Sep 23, 2019
INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

(cherry-picked from commit 3aca4af)
Change-Id: I65ab130794574393edab4a318365c1c125e50712
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
@mwheinz
Copy link
Author

mwheinz commented Sep 23, 2019

Everything looks good code-wise, now; thanks. Just need to get the cherry-pick notice in the commit message once #7004 is merged to master.

Will do.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Master PR #7004 merged. The cherry-pick message is now good/valid. Good to go.

@jsquyres
Copy link
Member

@mwheinz Do you plan to make a v3.0.x PR? (we usually make v3.0.x, v3.1.x, and v4.0.x PRs unless there's a reason not to go back that far)

@bwbarrett
Copy link
Member

I'd prefer we not merge this into release branches until my comment on the master pull request (#7004) is discussed. I don't think this was the right fix.

@jsquyres
Copy link
Member

jsquyres commented Oct 4, 2019

Good to go for v3.1.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants