Skip to content

doc: Make setting MCA parameter more prominent #11459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wckzhang
Copy link
Contributor

@wckzhang wckzhang commented Mar 2, 2023

Add link to the quick start doc for more user visibility.

@wckzhang
Copy link
Contributor Author

wckzhang commented Mar 3, 2023

bot:aws:retest

@wckzhang
Copy link
Contributor Author

wckzhang commented Mar 3, 2023

bot:ibm:retest

@wckzhang wckzhang force-pushed the visible_mca_param branch 2 times, most recently from 45d88ee to cb0ecbb Compare March 7, 2023 00:36
@wckzhang wckzhang force-pushed the visible_mca_param branch 2 times, most recently from 201fa0d to 2ce5ac7 Compare March 9, 2023 22:27
@wckzhang wckzhang force-pushed the visible_mca_param branch 2 times, most recently from 8872231 to c78df2d Compare March 10, 2023 22:04
@wckzhang
Copy link
Contributor Author

I also added a note in the doc that explains the deprecated --am (AMCA) option.

jsquyres
jsquyres previously approved these changes Mar 22, 2023
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that you verified that --tune a,b,c and --mca mca_param_files x,t,z actually work.

@rhc54
Copy link
Contributor

rhc54 commented Mar 22, 2023

--mca mca_param_files x,t,z

Does not exist - only "tune" is recognized

@wckzhang
Copy link
Contributor Author

I verified the tune section, didn't check the mca param files - I can go verify and modify the PR

@wckzhang
Copy link
Contributor Author

I see in the code "mca_param_files" is a deprecated synonym of "mca_base_var_files"...but it doesn't seem like that works either. I can't see why it's not being picked up from the code itself without spending time debugging it. I'll just remove the mca param files section from the quickstart and create an issue about it

@rhc54
Copy link
Contributor

rhc54 commented Mar 24, 2023

IIRC, it was discussed and decided that "tune" was good enough - there was no value in having multiple names for the exact same thing.

@wckzhang
Copy link
Contributor Author

Issue: #11532

@wckzhang
Copy link
Contributor Author

@rhc54 that's understandable, but shouldn't the parameter be removed from the code and ompi_info?

@rhc54
Copy link
Contributor

rhc54 commented Mar 24, 2023

What code are you referring to? I suspect there is some stale entry in the OMPI code base (probably down in OPAL) from the time when the RTE was an integral part of OMPI. I can find nothing in PRRTE (where those params are now read) that references it.

@wckzhang
Copy link
Contributor Author

In opal/mca/base/mca_base_var.c the param is registered:

    ret = mca_base_var_register("opal", "mca", "base", "param_files",
                                "Path for MCA "
                                "configuration files containing variable values",
                                MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0, OPAL_INFO_LVL_2,
                                MCA_BASE_VAR_SCOPE_READONLY, &mca_base_var_files);

and there is logic in this file for reading this file and such

@wckzhang
Copy link
Contributor Author

Did the entire mca_param_file concept disappear? IE. is the openmpi-mca-params.conf file now ignored? That's a significant change if so

@rhc54
Copy link
Contributor

rhc54 commented Mar 24, 2023

Let's not overgeneralize, please. I didn't say it was gone. What I said was that we had three names for the same thing - "tune", "mca_param_files", and "mca_base_param_files". We also have the system default MCA param file, which is the "openmpi-mca-params.conf", and a user-level default MCA param file.

We consolidated the first three (tune, etc) in the RTE down to just "tune". Those files and the default param files (both system and user) are read by the RTE (using the PMIx code so it extends to direct launched apps) and propagated. We set an envar to indicate that we have done this so that OPAL won't duplicate the effort. This also provides for more scalable launch as we don't have every process opening these files.

The code in OPAL therefore won't do anything if the app is launched via mpirun. It should, however, process the file if it the app is direct launched and the direct launcher and OMPI fail to tell the PMIx library it is an OMPI-based app (in which case the PMIx layer will take care of it). If the schizo/ompi component isn't aliasing the mca_base_param_files variable and/or reading it, then it won't get picked up by PRRTE - and it will also be ignored by OPAL if launched via mpirun.

@wckzhang
Copy link
Contributor Author

Updated and removed the mca_param_files options

This reformats the MCA section to be more inline with the rest
of the sections. It also adds a quickstart section for easier
user visibility of setting MCA parameters. Also verifies
TODO's and incorrect sections as well as adds a brief
note about the deprecated --am option.

Signed-off-by: William Zhang <wilzhang@amazon.com>
@wckzhang wckzhang force-pushed the visible_mca_param branch from 4282547 to 2db3828 Compare March 28, 2023 21:20
@wckzhang
Copy link
Contributor Author

@jsquyres can you take a look again?

@wckzhang
Copy link
Contributor Author

Asking since I made some changes in regards to the mca param files in response to testing + Ralph's comments

@jsquyres jsquyres self-requested a review March 31, 2023 11:47
@jsquyres jsquyres dismissed their stale review March 31, 2023 11:47

Need to re-review based on changes

@jsquyres
Copy link
Member

@wckzhang It sounds like the real answer is to fix schitzo:

If the schizo/ompi component isn't aliasing the mca_base_param_files variable and/or reading it, then it won't get picked up by PRRTE - and it will also be ignored by OPAL if launched via mpirun.

Sounds like it should be a pretty simple change to make (i.e., add aliases for legacy behavior) -- can you do so? I'd be a bit wary of having mca_[base_]param_files "work" (from a user perspective) in v4.x. but then "not work" in v5.x. This option isn't used too often, but I'm pretty sure that there are some real-world users out there who use it.

@rhc54
Copy link
Contributor

rhc54 commented Mar 31, 2023

So a little word of caution. I grok why you want to do this, but here is why we made the change. Imagine the case where a user specifies tune=A.txt, mca_param_files=B.txt, and mca_base_param_files=C.txt. I know that sounds silly, but it actually happens in the wild because the user manually sticks one in their environment, not realizing (or forgetting) that another was set in a login file. Or someone will add one to their login file, forgetting that they already specify the other param elsewhere in the file.

Regardless, you now have to deal with the precedence issue. If the files contain no overlap (i.e., the same param isn't set in more than one file), then you can just aggregate them. However, if they do overlap, then you have to decide what to do about it. You could error out, but that breaks some things that were previously allowed - but if you don't, then the final result depends upon the order in which you process them.

Add in that just declaring an MCA synonym doesn't help - the system doesn't know what to do with the case where someone sets a param by its new name, and also sets the synonym's name (IIRC, it just keeps whichever one it sees last - worth checking).

It's a non-trivial problem. 🤷‍♂️

@wckzhang
Copy link
Contributor Author

wckzhang commented Apr 3, 2023

@jsquyres note that it's not a simple alias change. The format of tune files and mca param files are different, so they can't be aliased

@rhc54
Copy link
Contributor

rhc54 commented Apr 3, 2023

The format of tune files and mca param files are different

Well, if it is any comfort, that statement isn't actually supposed to be true. If you look at the code, you'll see that the old aggregate MCA param file is simply deprecated back to the --tune option, and the format is specified to be param=value.

However, amusingly, neither the prte nor ompi schizo components actually do anything with the official PRTE_CLI_TUNE option. It is silently ignored. The ompi component does look specifically for a tune option which it then processes using some unspecified format that involves -x and -mca lines, but that isn't what the documentation indicates:

File(s) containing PRRTE and PMIx MCA params for tuning DVM and/or application operations.
Parameters in the file will be treated as _generic_ parameters and subject to the
translation rules/uncertainties. See "--help mca" for more information.

Syntax in the file is:

param=value

with one parameter and its associated value per line. Empty lines and lines beginning
with the '#' character are ignored.

There must not be any whitespace characters before or after the '=' in the param/value
specification.

So I guess you can do whatever you want with it from the cmd line - either fix the doc or fix the code.

I have to correct myself as well. The tune option is solely a cmd line one. PMIx looks at an MCA paramsmca_base_param_files and a synonym mca_param_files - with the exact problems mentioned above. I'll need to ferret those out. There is no corresponding param for ompi or prte files that I can find so far as PRRTE is concerned, and it appears that the code for processing such files has been removed from the schizo components (not sure when or why).

We do pass OPAL_SYS_PARAMS_GIVEN and OPAL_USER_PARAMS_GIVEN envars telling OPAL to ignore the system and user default MCA param files. Doesn't affect processing of any mca_base_param_files setting so far as I can see. However, that means mpirun is not processing nor passing any values from those files.

The whole thing seems rather a mess, frankly. I can cleanup PMIx and the prte schizo component - OPAL and the ompi one are yours 😄

@gpaulsen
Copy link
Member

@wckzhang are you looking at this?

@gpaulsen
Copy link
Member

@lrbison agreed to open another issue for mca params and point to this doc issue. Thanks

@wckzhang
Copy link
Contributor Author

There is already an issue opened for this: #11532

@wckzhang
Copy link
Contributor Author

@jsquyres since we have decided against using mca_param_files, I think this PR can be merged now unless you have any issues?

@wenduwan
Copy link
Contributor

@jsquyres Do we still want this change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants