Skip to content

Conversation

@devreal
Copy link
Contributor

@devreal devreal commented Jan 2, 2021

This PR adds a new OPAL class opal_cstring_t that represents a reference-counted immutable string. These are used to store and pass info keys and values, which reduces the instances in which strings actually have to copied.

The interface is relatively simple: the opal_cstring_t struct contains two relevant members length and string that contain the number of characters (excl. the null terminator) and the string itself. They are marked const so that they cannot be modified once the object has been created. A new object is created using opal_cstring_create(string) (creates a new cstring with string as its value) or opal_cstring_create_l(string, len) (creates a new cstring with the first len characters of string as its value). The cstring is eventually free'd once a call to OBJ_RELEASE on them finds the object unreferenced.

The function opal_info_get now provides a pointer to a cstring if the key is found, with its reference count incremented. It is thus important that the caller releases the object again once processing has completed. The same holds for the key returned by opal_info_get_nthkey. If another thread removes the key/value pair from the info object the string remains valid until OBJ_RELEASE is called. A new function opal_info_set_cstring allows setting a value directly using a cstring, which avoids unnecessary memory allocation and only causes the cstring's reference count to be incremented.

Some notes

  • The opal_cstring_t struct contains a single character member that is meant to force the compiler to add padding bytes at the end, which we are used to null-terminate the string even if it is not created through opal_cstring_create but through OBJ_NEW and thus no additional memory is allocated. The latter will create an empty, null-terminated string. This works because the flexible array member (FAM) string points to the first byte following the members of the struct but the compiler has to insert padding bytes to ensure alignment as if the FAM was empty. This provides us with 3B (32bit systems) or 7B (64bit systems), respectively, which are sufficient to store the null-terminator for empty strings. The padding bytes will also be used for non-empty strings and are thus factored in during memory allocation in opal_cstring_create.

  • The introduction of const char* info strings had some ripple effects.

    1. I changed the info subscriber interface to using const char* strings (which it should have been from the beginning) and reusing cstring objects where possible (namely when saving old info values).
    2. The PMIx_Get_attribute_string function takes a char* argument but is passed string constants in many places. Digging into it, it doesn't seem to actually modify the string (which it shouldn't anyway if passed string literals). It really should take a const char* argument. Long story short, this currently triggers a warning in dpm_convert about a discarded const qualifier. @rhc54 do you agree with this analysis?
  • The MPI level info handling functions now deal with cstring objects and handle the strings according to their own requirements (C/Fortran). Implementing MPI-4 info string functions on top of this interface should be straight forward without touching the opal layer again.

  • I renamed and moved the functions opal_info_value_to_int and opal_info_value_to_bool to opal_cstring_to_int and opal_cstring_to_bool. They were not exactly info-specific anyway and are imo better suited in the context of the cstring implementation. I did not change opal_string_copy as that operates on char* strings.

  • This PR touches quite a few places that I am not familiar with (all places that process info values) so having some more eyes look at these changes would be greatly appreciated :)

  • This PR supersedes Go back to using opal_string_copy in opal_info_get #7972, which contains the initial discussion on this approach.

@ibm-ompi
Copy link

ibm-ompi commented Jan 2, 2021

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c7c0b6dee7cd6fc661cff61f6d09c935

@ibm-ompi
Copy link

ibm-ompi commented Jan 2, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/067a7731e09c8fd095275e5dfebfd889

@devreal devreal force-pushed the opal-cstring-info branch 2 times, most recently from a7e42a1 to 70b7d7d Compare January 3, 2021 10:15
@ibm-ompi
Copy link

ibm-ompi commented Jan 3, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/76fbd5a8ef087d885cdc943eae75d519

@devreal
Copy link
Contributor Author

devreal commented Jan 3, 2021

I fixed two issues I didn't see in my local setup (in the gpfs integration and with _Static_assert) so now all CI tests pass with the exception of the PGI CI (due to #8325) and the Mellanox CI (which seems to throw an unrelated error).

@rhc54
Copy link
Contributor

rhc54 commented Jan 3, 2021

@rhc54 do you agree with this analysis?

May or may not be true - I'd have to look at it more generally. However, given that this is an MPI_Info problem you are addressing, I would recommend constraining it strictly to that realm and suggest you simply cast the string to silence the warning.

@rhc54
Copy link
Contributor

rhc54 commented Jan 3, 2021

@artemry-mlnx Looks like another issue in the Mellanox CI environment:

ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'hpcx-gcc-stack'

@devreal
Copy link
Contributor Author

devreal commented Jan 3, 2021

I would recommend constraining it strictly to that realm and suggest you simply cast the string to silence the warning.

That is not a solution. In fact, the current implementation was not correct either. Casting away const or ignoring the warning is contradicting the idea of immutable strings. Unless (and until) PMIx_Get_attribute_string takes a const char* argument the only correct fix is to pass a copy of the info string. Fortunately, this is only relevant in error-handling paths in dpm_convert.

It is an unfortunate part of C's history that string literals are of type char[], not const char[] (as they are in C++). I believe that calls such as PMIx_Get_attribute_string("PMIX_PERSONALITY") are invalid in C++ so it really might be worth taking a look at this in PMIx.

@rhc54
Copy link
Contributor

rhc54 commented Jan 3, 2021

I would recommend constraining it strictly to that realm and suggest you simply cast the string to silence the warning.

That is not a solution. In fact, the current implementation was not correct either. Casting away const or ignoring the warning is contradicting the idea of immutable strings. Unless (and until) PMIx_Get_attribute_string takes a const char* argument the only correct fix is to pass a copy of the info string. Fortunately, this is only relevant in error-handling paths in dpm_convert.

It is an unfortunate part of C's history that string literals are of type char[], not const char[] (as they are in C++). I believe that calls such as PMIx_Get_attribute_string("PMIX_PERSONALITY") are invalid in C++ so it really might be worth taking a look at this in PMIx.

There is such a thing as being too anal. I suggest you rethink your philosophy a bit and take a more practical turn. My point here is that OMPI doesn't dictate to PMIx as to when and how to define its APIs. This is true of all the libraries that OMPI uses. So declaring that you don't like something isn't very motivational to those who write/maintain those libraries.

Not everyone feels the need to be a "purist" 😄

@bosilca
Copy link
Member

bosilca commented Jan 3, 2021

This is not about purism but about correctness and consistency, and about the fact that we, the developers' community, learned to do better. Many major libraries (such as the libc) have updated their API for all string manipulation functions to clearly mark the immutability of strings via const char*, because making a function intent clear upfront instead of relying on assumptions or so-called "best practices" leads to better, cleaner and defintively safer code.

There are better practices put forward by the community, it is only constructive to suggest they should be followed more strictly. And for what is worth, even the MPI standard now follows such practices having added const about everywhere it was necessary and made sense. Quite a feat !

@rhc54
Copy link
Contributor

rhc54 commented Jan 3, 2021

Some of us who have dug deeper into this tend not to view it as strongly as your team clearly does. The const qualifier is nothing but a compiler directive which instructs the compiler to (a) generate a warning if the variable is assigned to another non-const variable without a cast, and (b) generate a warning if the variable is altered in that routine. It specifically does not guarantee that the variable is not altered - i.e., the string is not truly immutable.

Compilers don't offer that stronger guarantee for the simple reason that it is unobtainable. The purpose of the directive is to aid developers of that specific package with their internal work. In other words, it is to assist the developers in ensuring that their own code lives up to certain inter-functional directives.

What it cannot do is ensure that any secondary library you call with that variable will behave in a particular manner. The reason is simple: the compiler has no visibility into that library and its layered calls. Thus, it ceases to have any meaning once you transition to a library that is being linked against but lies outside of your own source code.

The question raised here regarded precisely this last situation. OMPI sees a warning because it is passing a const variable to a 3rd party routine that whose parameter is not likewise declared const. The solution is not to ask that library to change their API - you are not likely to get every library that OMPI might call to make such changes just to resolve your internal compiler warnings. Instead, you need to deal with it - after all, you have no guarantee that the variable won't be altered by the library you call, even if they did use a const param because they might very well pass it to some other library that lies outside their control.

This is why even the meaning of the const directive has come into question in the compiler community. Does it mean that the variable cannot be altered at all in the function? Or does it mean only that the value upon return from the routine must be the same as upon entry? For languages such as C, this is an important distinction especially when dealing with strings. Many libraries (including OMPI!) alter const strings while parsing them, restoring the strings to their original value before returning.

I took the time to write all this solely to help educate the community as to why some of us might not feel quite so strongly about the value of const directives. I personally don't care if OMPI wants to ingest all this code to swat the gnat or not - totally up to you. Just please don't suck PMIx down into that black hole. We are a 3rd party library so far as OMPI is concerned - let's keep it that way. If we decide to add const to some API's signature, it will be because we have determined it is best for PMIx to do so - not to silence some warning inside OMPI.

@devreal
Copy link
Contributor Author

devreal commented Jan 4, 2021

@rhc54 You are right that the compiler cannot enforce that a const string is never altered when passed to third-party libraries. However, in C/C++ it is the only way we have to codify a contract as part of the API that states that the variable will remain unchanged. If that third-party breaks that contract and modifies the string (even if it is by ignoring/casting away const and passing it to another library, e.g., libc's strtok) there is nothing I can do but debug for hours and report a bug against that library for not adhering to their own API. I still believe that having such codified contract is better than relying on verbal assertions in the documentation (if any) that are even less enforceable and might change without notice.

Does it mean that the variable cannot be altered at all in the function? Or does it mean only that the value upon return from the routine must be the same as upon entry?

The answer is fairly clear: unless you are willing to jeopardize thread-safety and risk access violations (by writing to write-protected memory such as string literals on some platforms) the variable should not be changed, period.

Many libraries (including OMPI!) alter const strings while parsing them, restoring the strings to their original value before returning.

I'd be happy if you could point out such instances in OMPI. Unsafe behavior should not justify more of it...

Anyway, I was only trying to be helpful and provide a suggestion (won't happen again, I promise). By all means, keep the PMIx API as it is, I have adjusted the OMPI side in the last commit last night.

@jjhursey
Copy link
Member

jjhursey commented Jan 4, 2021

bot:ibm:pgi:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@bosilca
Copy link
Member

bosilca commented Jan 4, 2021

There are so many incorrect statements above that this discussion is doing more of a disservice to the community than anything else. To give a simple example, if a const argument is altered inside a function segfault can ensure if the caller, thrusting the API calls the function with a pointer to a const variable (because they could be placed in a read-only data segment, and thus protected).

As @devreal suggested, if @rhc54 can pinpoint us (and here I am talking the OMPI community not my team) to the places where OMPI internally alters const char* that would be extremely helpful, as this is not only bad practice, but incorrect code.

@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is great, actually. I think it cleans up the info code tremendously, and actually gives us a path to implementing the cleaned-up MPI-4 info semantics, too.

Per one of my comments, it would be nice to also clean up the _IN_ monstrosity for info keys, but that wasn't technically in the scope of this PR.

if (0 == strncmp(iterator->ie_key, OPAL_INFO_SAVE_PREFIX,
strlen(OPAL_INFO_SAVE_PREFIX)))
// If we see an __IN_<key> key but no <key>, decide what to do based on mode.
// If we see an __IN_<key> and a <key>, skip since it'll be handled when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate this whole "prefix" thing, but I realize it's not of this PR's making.

...but now that we have proper and info and string classes, can we do something better?

This is a whole hunk of new/heavily modified code here. Do we have any tests that test what the whole "prefix" thing was doing to know if this new code is working properly? (i.e., preserving user-set values, and exercising the different cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding for having this mechanism is that we allow info subscribers to change the value and are not allowed to hand out a different value in MPI_Info_get. I don't know why this is useful, components that don't support an info value should just ignore it.

@markalle Can you shed some light on why we allow subscribers to change the value? What was the use case for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have dim recollections that this has something to do with distinguishing -- and preserving -- the info values that users set vs. the info values that Open MPI set. I don't remember the exact semantics, but it may have had something to do with what MPI_INFO_GET is supposed to return (or the info that results from MPI_COMM_DUP and the like?) -- i.e., what values the user set vs. the values that Open MPI is actually using.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this actually works correctly on master then: MPI_Info_get will only query the key provided by the user and not check whether there is a backup key prefixed with __IN_. So if we set another value for a key we will actually hand out that new value to the user (my reading of the standard is that this is incorrect behavior). The only place where I can see the prefixed key being used is in opal_info_dup_mode which is called from MPI_Info_dup. There it is used to drop new values and use the backup instead.

Allowing subscriber callbacks to alter an info value is problematic for another reason: assuming we have more than one component subscribing to an info value on the same object (which I don't think we have right now), the value they see depends on the order in which they subscribed (the one set by the user or the one changed by any previous component). That doesn't seem right to me...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this actually works correctly on master then: MPI_Info_get will only query the key provided by the user and not check whether there is a backup key prefixed with __IN_. So if we set another value for a key we will actually hand out that new value to the user (my reading of the standard is that this is incorrect behavior). The only place where I can see the prefixed key being used is in opal_info_dup_mode which is called from MPI_Info_dup. There it is used to drop new values and use the backup instead.

It's quite possible that I'm not remembering correctly.

I think that the intent was to preserve 2 sets of values for cases of conflict... but I don't remember what the precise conflict was -- something to do with what OMPI set vs. what the user set. I don't remember exactly why we had to maintain both values. ☹️ Maybe it had something to do with the imprecise description of INFO_GET (which has since been fixed for MPI-4)...? Something to do with that weird phrasing about "MPI returns the value actually being used", or somesuch...?

Allowing subscriber callbacks to alter an info value is problematic for another reason: assuming we have more than one component subscribing to an info value on the same object (which I don't think we have right now), the value they see depends on the order in which they subscribed (the one set by the user or the one changed by any previous component). That doesn't seem right to me...

Agreed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bosilca Pointed me to §12.2.7 (MPI_Win_set_info and MPI_Win_get_info) which says that the info object retrieved through MPI_Win_get_info should contain the actual values used for hints, not necessarily the ones set through MPI_Win_set_info. So it makes sense that the info subscriber can signal the use of a different value.

However, I still don't understand why we store a backup then. The info object returned by MPI_Win_get_info is a new info object that is duplicated from the info object stored in the window object. It is completely independent of the info object used during window creation and if I MPI_Win_dup the info returned by MPI_Win_get_info I still want to see these values, not some backup. @markalle What am I missing here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed -- we need to figure out what the right behavior is for dup'ing infos, etc.

That being said, I still kinda hate the whole _IN prefix thing. Can we just have storage for 2 strings -- one that the user set and one that Open MPI is using? In many (most?) cases, they will be the same (and could be ref counted, or one of them could even be NULL or something). But more importantly, it would make the code more clear about the explicit choice to save a 2nd value sometimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to figure out whether we actually need this functionality at all. If not, we can remove it and don't have to bother fixing it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to figure out whether we actually need this functionality at all. If not, we can remove it and don't have to bother fixing it...

Sorry: I wasn't disagreeing. All I was trying to say was: if we need the "have 2 set possible values" functionality, then it would be better to ditch the _IN prefix stuff and just explicitly store 2 strings (e.g., in two different members/variables).

@jsquyres
Copy link
Member

jsquyres commented Feb 6, 2021

Oops -- I didn't mean to approve yet 😉 -- there's one or two simple issues to resolve before we can merge, but regardless: in general, I am strongly in favor of this PR.

@devreal devreal force-pushed the opal-cstring-info branch 2 times, most recently from b119733 to 5b35ebc Compare February 16, 2021 17:25
@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/484f472b5bfc9e235f9491b6a9feed3c

@awlauria
Copy link
Contributor

@devreal can you rebase? We can merge this today and probably cherry-pick to v5.0.x. @gpaulsen

devreal added 4 commits March 11, 2021 18:59
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
The key and info strings are now stored as opal_cstring_t objects so
returning them works by incrementing their reference count. The
caller is responsible for releasing the objects eventually. This avoids
many unnecessary string copy operations, which are now only required if
the string is to be modified (e.g., passed to strtok).

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@devreal devreal force-pushed the opal-cstring-info branch from 8fddcb8 to 3c5c132 Compare March 11, 2021 20:14
@devreal
Copy link
Contributor Author

devreal commented Mar 11, 2021

@awlauria I've rebased it. I'd like to get approval from @jsquyres before merge since his first approval was unintentional :)

@awlauria
Copy link
Contributor

Thanks - will wait for @jsquyres 👍

@jsquyres
Copy link
Member

bot:aws:retest

1 similar comment
@jsquyres
Copy link
Member

bot:aws:retest

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@jsquyres
Copy link
Member

This should be cherry-picked to 5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants