Provide warnings if unknown MCA params used #3

Closed
@ompiteam

It has long been a problem that users can supply incorrect or unknown MCA parameters and, as a result, get incorrect or undesired behavior. For example, a user may misspell an MCA parameter name on the mpirun command line, and OMPI silently ignores it because the misspelled name is never "seen" by the MCA base. That is, the MCA base performs no error checking to detect supplied MCA parameters that do not exist.

While such consistency checking would be extremely helpful to users, it is a fairly difficult problem to solve. Here's a recent mail that I sent on the topic:


I think we all agree that this is something that would be Very Good to have. The reason that it hasn't been done is because I'm not sure how to do it. :-( Actually, more specifically, I can think of several complex ways to do it, but they're all quite unattractive.

The problem is that we don't necessarily have global knowledge of all MCA parameters. Consider this example:

mpirun --mca pls_tm_foo 1 --mca btl_openib_foo 1 -np 4 a.out

These MCA params are going to be visible to three types of processes:

  • mpirun
  • orted
  • a.out (presumably an MPI process)

So how do we tell mpirun and orted that they should ignore the btl_openib MCA parameter, and tell a.out that it should ignore the pls_tm MCA parameter? There are other, similar corner cases (e.g., what if some node doesn't have the openib BTL component, but others do?).

There are a few ways to do this that I can think of:

  1. each app registers frameworks that it is and is not interested in -- assuming that all MCA params follow the prefix rule, we can parse out which params in the environment belong to which framework (ugh) and then find a) any that fall outside of that (e.g., mis-typed frameworks), and b) any that are in the frameworks of interest that do not match registered params. This doesn't handle all corner cases, though (e.g., openib on some nodes but not all).
  2. some entity (mpirun, most likely) does an ompi_info-like "open all frameworks" and can directly check all MCA params right away. This is an abstraction violation because orterun will be opening frameworks that it should have no knowledge of (e.g., MPI frameworks).
  3. some entity (mpirun, most likely) fork/execs ompi_info in a special mode that checks for invalid MCA params in the environment (because it will inherit the params from mpirun). This is nice because then mpirun doesn't have to open all the frameworks, but it's an abstraction violation because orterun doesn't know about ompi_info (different layers).
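To make option 1 concrete, here is a minimal sketch in Python (not Open MPI code). It assumes the prefix rule means the framework name is the text before the first underscore. The function name `classify_params`, the registered parameter `pls_tm_debug`, and the mistyped `blt_openib_foo` are hypothetical and exist only for illustration.

```python
def classify_params(supplied, known_frameworks, interested, registered):
    """Sort supplied MCA parameter names into three buckets.

    supplied         -- parameter names the user gave
    known_frameworks -- every framework this process declared, whether
                        it is interested in it or not
    interested       -- frameworks this process actually opens
    registered       -- parameter names registered by those frameworks
    """
    unknown_framework = []  # case (a): framework prefix matches nothing
    unregistered = []       # case (b): framework of interest, unknown param
    not_ours = []           # framework that some other process type handles

    for name in supplied:
        # Assumed prefix rule: the framework is the first token.
        framework = name.split("_", 1)[0]
        if framework not in known_frameworks:
            unknown_framework.append(name)
        elif framework not in interested:
            not_ours.append(name)
        elif name not in registered:
            unregistered.append(name)
    return unknown_framework, unregistered, not_ours


# mpirun's view of the example command line, plus one mistyped framework.
unknown, unreg, other = classify_params(
    supplied=["pls_tm_foo", "btl_openib_foo", "blt_openib_foo"],
    known_frameworks={"pls", "btl"},
    interested={"pls"},
    registered={"pls_tm_debug"},  # hypothetical registered parameter
)
```

In this sketch only the first two buckets would produce warnings; the third is silently left for whichever process opens that framework. As noted in item 1, this still says nothing about a component that exists on some nodes but not others.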

So the first one is the only one that is actually viable (i.e., doesn't cause an abstraction violation). But it's still clunky, awkward, and doesn't handle all cases. If anyone has any better ideas, I'm all ears...


Since writing the above e-mail, I have had another idea -- address the common case and provide a workaround for the others. Specifically, do not worry about the case where some nodes have component A and others do not. In that scenario, if a user supplies an MCA param for component A, the processes on some nodes will be ok with it (because they have component A), but the processes on other nodes will consider it "unrecognized" (because they do not have component A) and will print a warning/error -- potentially causing the job to fail.

To address this, we can add [yet another] MCA parameter to disable this MCA parameter checking. The default value will be to enable MCA parameter checking, but if a user knows what they're doing, or if they fall into the corner case above, they can disable MCA parameter checking and be "good enough."
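The behavior described here can be sketched roughly as follows, again in Python rather than Open MPI code. The function `check_params` and its `check_enabled` flag are hypothetical stand-ins for the proposed check and for the new MCA parameter that disables it; the issue does not name that parameter.

```python
def check_params(supplied, registered, check_enabled=True):
    """Return one warning per supplied parameter that was never registered.

    check_enabled stands in for the proposed (so far unnamed) MCA
    parameter; checking is on by default, as the proposal suggests.
    """
    if not check_enabled:
        return []
    return [
        f"unrecognized MCA parameter: {name}"
        for name in supplied
        if name not in registered
    ]


supplied = ["btl_openib_foo"]
node_with_openib = {"btl_openib_foo"}  # component present, param registered
node_without_openib = set()            # component absent on this node

ok = check_params(supplied, node_with_openib)
warned = check_params(supplied, node_without_openib)
silenced = check_params(supplied, node_without_openib, check_enabled=False)
```

Here `ok` and `silenced` are empty, while `warned` holds one message: the node lacking the component complains unless the user turns checking off, which is exactly the corner case the workaround is meant for.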

It's not perfect and it certainly doesn't cover all cases, but it does cover today's common case (where all nodes are homogeneous) and would probably be a good step forward.
