try to resync prrte fork master with upstream #54

Closed
wants to merge 157 commits into from

Conversation

hppritcha
Member

No description provided.

rhc54 and others added 30 commits April 3, 2024 06:48
Always default the number of slots to the available cpus
in the topology. Ensure that we always display some form
of the resulting process map, or else we will silently
exit.

Signed-off-by: Ralph Castain <rhc@pmix.org>
It should be `help-hostfile.txt`, not `help-hostfiles.txt`

Signed-off-by: Ralph Castain <rhc@pmix.org>
If we use one cpu from an object, then we will get a NULL
response if we ask for the next object of that type within
the remaining cpuset since not all of the cpus in the object
are still available. This problem resulted from the recent
change to only use available cpus in PRRTE topologies.

So instead scan across the cpus, checking whether each one is
inside the object of interest - if so, then we can bind
to that cpu; if not, then we keep searching.

Signed-off-by: Ralph Castain <rhc@pmix.org>
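The scan described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the PRRTE code: the function name `find_bindable_cpu` is hypothetical, and plain Python sets stand in for hwloc bitmaps.

```python
from typing import Iterable, Optional, Set


def find_bindable_cpu(available: Iterable[int], target_cpus: Set[int]) -> Optional[int]:
    """Return the first available CPU that lies inside the target object.

    `available` holds the CPUs still free for binding; `target_cpus` is the
    full cpuset of the object of interest (for example, one core). Both are
    modeled as integer collections, which is an assumption of this sketch.
    """
    for cpu in sorted(available):
        if cpu in target_cpus:
            return cpu  # free CPU that belongs to the object: bind here
    return None  # nothing usable in this object; the caller keeps searching


# A core that owns CPUs 2 and 3, where CPU 2 has already been used.
core = {2, 3}
free = {0, 1, 3, 5}
chosen = find_bindable_cpu(free, core)
```

Walking the individual CPUs avoids asking for "the next object of that type", which is the lookup that fails once an object is only partially available.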
Only automatically set the display map flag if we are not
launching the job.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Attempt to make it clearer that the binding failed
due to a lack of cpus for the given map/bind
policies.

Signed-off-by: Ralph Castain <rhc@pmix.org>
PRRTE itself no longer requires specific resilience settings.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Add a new cmd line option that corresponds to this
attribute. Add the attribute to the prun payload.
When received, it will be included by default in the
job info for the spawned job. Add query support for it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Homebrew has broken something and I cannot figure
out how to fix it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Changes will need to be made to Open MPI to parse the contents of
the OMPI_MCA_mpi_memory_alloc_kinds environment variable to
determine how to use the user supplied memory-alloc-kinds information.

See section 11.4.3 of the MPI 4.1 standard.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Get takes a (pmix_value_t**), so don't cast it to (void**)

Signed-off-by: Ralph Castain <rhc@pmix.org>
If we haven't requested LSF support, then don't warn
about not finding yp_all - we didn't ask for LSF,
so no need to warn us if support cannot be built.
It will show in the summary at end of configure.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Now that we have a broader group of contributors starting
to show up, we probably need to start paying more attention
to code quality of contributions. Enable devel-check by
default in Git clones that are configured with enable-debug.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Try adding a build using latest Clang

Signed-off-by: Ralph Castain <rhc@pmix.org>
When building against older PMIx

Signed-off-by: Ralph Castain <rhc@pmix.org>
Need to unpack the ctrls object to maintain pack/unpack
ordering. Update the client example to illustrate that
all the modex info for a proc is returned upon first
request for that proc's info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Refs open-mpi/ompi#12540

Signed-off-by: Ralph Castain <rhc@pmix.org>
It has been reported (and confirmed) that building against
one version of PMIx and then running with another version
will cause PRRTE to segfault. This isn't a universal rule.
For example, one can switch v5.0 and master without a
problem. However, switching v5.0 and v4.2 is a definite
segfault.

The root cause of the problem is a change in the layout
of the base pmix_object_t definition. This renders all
PMIx objects binary incompatible when crossing between
the v5 and v4 (and below) series.

Changing the v5 definition back to match v4 is an
overly complex task. The changes were required to
accommodate the new shared memory support that
was introduced in v5.

So instead, we check the runtime version of PMIx against
the build version. If the runtime version is incompatible
with the build version, then we print an explanatory
error message and error out.

Signed-off-by: Ralph Castain <rhc@pmix.org>
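A rough sketch of the build-versus-runtime check described above. It assumes that the only incompatibility is the v4/v5 boundary named in the commit message; the helper names and the error text are hypothetical, and the real check lives in PRRTE's C code.

```python
import sys

# Assumption for this sketch: the object layout changed at the start of v5.
LAYOUT_CHANGE_MAJOR = 5


def parse_major(version: str) -> int:
    """Extract the major number from a string such as '5.0.3' or 'v4.2.9'."""
    return int(version.lstrip("v").split(".")[0])


def versions_compatible(build: str, runtime: str) -> bool:
    """True when both versions sit on the same side of the layout change."""
    build_new = parse_major(build) >= LAYOUT_CHANGE_MAJOR
    runtime_new = parse_major(runtime) >= LAYOUT_CHANGE_MAJOR
    return build_new == runtime_new


def check_or_exit(build: str, runtime: str) -> None:
    """Print an explanation and exit if the two versions cannot be mixed."""
    if not versions_compatible(build, runtime):
        print(
            f"Built against PMIx {build} but running with PMIx {runtime}: "
            "these series are binary incompatible.",
            file=sys.stderr,
        )
        sys.exit(1)
```

Comparing which side of the boundary each version falls on, rather than requiring an exact match, keeps combinations such as v5.0 with a later series usable.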

dd

Signed-off-by: Ralph Castain <rhc@pmix.org>
We had problems in the past with quoted params, but stripping
quotes also has consequences - the best solution is not clear.
For now, let's try going the other way and see how many
problems we encounter.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Fix the issues with the macOS builds so that they work again in GitHub
Actions environments.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
Enables build against v1.11.8 and above.

Signed-off-by: Ralph Castain <rhc@pmix.org>
If we are trying to bind to an HWLOC object type that is not
defined on a given node, then (a) if the binding policy was
specified by user, then error out; and (b) if we are using
a default binding policy, then simply do not bind.

Signed-off-by: Ralph Castain <rhc@pmix.org>
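The two cases in the commit message above amount to a small decision rule, sketched here. The function, the exception class, and the string names for object types are all hypothetical stand-ins for the HWLOC types PRRTE actually uses.

```python
from typing import Optional, Set


class BindingError(Exception):
    """Raised when a user-specified binding target is missing on a node."""


def resolve_binding(target: str, node_types: Set[str], user_specified: bool) -> Optional[str]:
    """Decide what to bind to on one node.

    Returns the target type when the node defines it, and None when a
    default policy names a type the node lacks (meaning: do not bind).
    """
    if target in node_types:
        return target
    if user_specified:
        # (a) the user asked for this explicitly, so fail loudly
        raise BindingError(f"object type {target!r} is not defined on this node")
    # (b) default policy: quietly skip binding
    return None
```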
In some recent Slurm versions, the Slurm runtime is inserting
custom arguments to the PRRTE launcher's `srun` cmd line without
the user being aware of it. In many cases, this may not be a
problem - but in some cases (where the user or the system
admin needs/wants particular cmd line arguments used) this can
cause problems as it happens silently, without the user being
aware of it.

Make this visible when it happens, and provide a mechanism by
which the user/admin can override it. Provide a fairly long
help message explaining what happened and offering advice on
resolution, along with a param for disabling the warning. Add
a param for overriding the "args" param if necessary, along
with a caution as to possible consequences.

Signed-off-by: Ralph Castain <rhc@pmix.org>
RTD is rolling out some changes. Per
https://about.readthedocs.com/blog/2024/07/addons-by-default/, these are the changes we need to make.

Port of open-mpi/ompi#12687

Signed-off-by: Ralph Castain <rhc@pmix.org>
We currently do not support the LTO optimizer
as it is incompatible with our plugin component
architecture. So detect it has been specified
in configure and error out with an explanation.

Includes suggestions from @jsquyres

Signed-off-by: Ralph Castain <rhc@pmix.org>
Break the multi-loop thru loading of param files
that caused us to overwrite values. Defer to the
PMIx pmdl components for obtaining envars and for
checking MCA param overlaps across projects.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Luke Robison <lrbison@amazon.com>
Python 3.12 now warns about invalid escape sequences (such as "\d") in ordinary string literals, which affects regular expressions. Instead, use "r" strings.

Signed-off-by: Ralph Castain <rhc@pmix.org>
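A short illustration of the raw-string fix mentioned above. The pattern and the sample text are made up for this example and do not come from the PRRTE scripts.

```python
import re

# In an ordinary literal, "\d" is not a valid string escape, and Python 3.12
# reports it with a SyntaxWarning. A raw string hands the backslash to the
# regex engine untouched, so no warning is raised.
VERSION_RE = re.compile(r"v(\d+)\.(\d+)")

match = VERSION_RE.search("PRRTE v3.0 docs")
major, minor = match.groups()
```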
The formatting is messed up in places, so try to fix it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
rhc54 and others added 26 commits March 18, 2025 18:15
Cleanup mixing of index vs kernel index when calling
interface matching routines. Sanitize the passing
of the interface argv-array.

Signed-off-by: Ralph Castain <rhc@pmix.org>
PMIx supports forward/set, unset, append, and prepend of
environmental variables. However, PRRTE didn't provide
cmd line parsing support for these operations. PMIx has
been extended to do so - add those options to the schizo
components.

Forward (-x) of envars can be just the envar name (to pick up
the local value and forward it), or can be envar=value to
set the envar to a specific value.

Unset (--unset-env) takes just the name of the envar.

Append (--append-env) takes two arguments:
   * the name of the envar, appended with a "[c]" where
     the 'c' is the character to be used as the separator
     between envar values
   * the value to be appended
So it looks like "--append-env FOO[:] 20"

Prepend (--prepend-env) behaves exactly like append except
it prepends the value to whatever current envar value it finds

Multiple instances of any of these options may be present on
the cmd line. Each instance will have its arguments appended
to the parameter's pmix_cli_item_t's values argv-array.

Fix precedence so that app's env overwrites local environment.

Signed-off-by: Ralph Castain <rhc@pmix.org>
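The four operations described above can be modeled on a plain dictionary, as in the sketch below. The function names are hypothetical, the parsing is not the schizo implementation, and the behavior when the envar is not yet set (the value is stored as-is) is an assumption of this sketch.

```python
import re
from typing import Dict, Tuple

# Matches "NAME[c]", where c is the single separator character.
_SEP_RE = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<sep>.)\]$")


def split_name_sep(spec: str) -> Tuple[str, str]:
    """Split 'FOO[:]' into ('FOO', ':')."""
    m = _SEP_RE.match(spec)
    if m is None:
        raise ValueError(f"expected NAME[c], got {spec!r}")
    return m.group("name"), m.group("sep")


def forward(env: Dict[str, str], spec: str, local: Dict[str, str]) -> None:
    """-x NAME forwards the local value; -x NAME=value sets an explicit one."""
    name, eq, value = spec.partition("=")
    if eq:
        env[name] = value
    elif name in local:
        env[name] = local[name]


def unset(env: Dict[str, str], name: str) -> None:
    """--unset-env NAME removes the variable if present."""
    env.pop(name, None)


def append(env: Dict[str, str], spec: str, value: str) -> None:
    """--append-env NAME[c] value adds value after the current one."""
    name, sep = split_name_sep(spec)
    current = env.get(name)
    env[name] = value if not current else current + sep + value


def prepend(env: Dict[str, str], spec: str, value: str) -> None:
    """--prepend-env NAME[c] value adds value before the current one."""
    name, sep = split_name_sep(spec)
    current = env.get(name)
    env[name] = value if not current else value + sep + current
```

With `env = {"FOO": "10"}`, calling `append(env, "FOO[:]", "20")` mirrors the "--append-env FOO[:] 20" example from the commit message and leaves `FOO` set to `10:20`.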
Use the flags to set the PMIx paths so we can simply
use the standard compiler to test for PMIx capability
flags.

Signed-off-by: Ralph Castain <rhc@pmix.org>
If configure cannot find "pmixcc", then don't attempt
to create the "pcc" link as that creates an infinite
loop when someone attempts to resolve it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Update the PRRTE submodule to track upstream master
with PR, including updating PMIx submodule. Test
the build for integration problems.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Needs a colon at the end.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
PMIx master has deprecated the pmix_show_help_add_dir function,
so remove it for now. Will replace it with in-memory help
messages in a follow-on PR.

Signed-off-by: Ralph Castain <rhc@pmix.org>
It isn't possible to install the environments required
to test every launcher in PRRTE. What we can do, though,
is provide a new configure option "--enable-testbuild-launchers"
that will utilize shim headers to allow the components to
at least build.

Note that we are NOT testing the components - we only
verify that they should build.

Signed-off-by: Ralph Castain <rhc@pmix.org>
We no longer support solaris, so remove references to it in the
configure code. Delete two m4 files that duplicated OAC functions.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Show the build time of the docs. Ported from
open-mpi/ompi#13236

Signed-off-by: Ralph Castain <rhc@pmix.org>
Store the show-help strings in memory, thereby
removing the need to find/read files to generate
the full strings. Only works with PMIx versions
greater than v3.x.

Signed-off-by: Ralph Castain <rhc@pmix.org>
No longer supported

Signed-off-by: Ralph Castain <rhc@pmix.org>
PRRTE now requires Python to build when in a Git clone for building
the show-help in-memory text (prte_show_help_content.c) and the
Sphinx-based documentation pages.

Check for an adequate Python version for these purposes.

Signed-off-by: Ralph Castain <rhc@pmix.org>
We don't support endianness mixes. However, we can hit situations
where the topology is different across the allocation. This isn't
just a case of different chips - for example, if a scheduler is
allocating at the CPU instead of node level, it might allocate
different CPUs on the various nodes. In the eyes of the runtime,
this equates to a hetero node situation since the bitmap within
the topology of each node will differ.

Resolving this required:

  * fix some logic errors when handling hetero nodes so
    we don't hang

  * Add a new "--hetero-nodes" cmd line option to help
    optimize DVM startup in the case where allocation is
    being done by CPU - no point in requesting topology
    from every node in that case, just have each daemon
    send its topology

  * Add a new "prte_hetero_nodes" MCA param so that sys
    admins can declare hetero-nodes in the default param
    file on systems where the scheduler is allocating by
    CPU

Update show-help and RST files to cover the new option.

Signed-off-by: Ralph Castain <rhc@pmix.org>
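To make the "hetero node" condition above concrete, here is a small sketch that treats an allocation as heterogeneous when the nodes' available-CPU sets differ. The function name and the use of frozensets in place of topology bitmaps are assumptions of this illustration, not PRRTE's detection logic.

```python
from typing import Dict, FrozenSet


def is_hetero(node_cpus: Dict[str, FrozenSet[int]]) -> bool:
    """True when at least two nodes expose different sets of available CPUs."""
    return len(set(node_cpus.values())) > 1


# Identical hardware, but the scheduler handed out different CPUs per node.
allocation = {
    "node0": frozenset({0, 1, 2, 3}),
    "node1": frozenset({4, 5, 6, 7}),
}
hetero = is_hetero(allocation)
```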
Need "--hetero-nodes"

Signed-off-by: Ralph Castain <rhc@pmix.org>
Decrease the minimum required Python version to v3.6.
Note that this only applies when building from a git
clone, not a tarball. Ensure we cleanup the show-help
content file when doing "make clean".

Signed-off-by: Ralph Castain <rhc@pmix.org>
PMIx_Notify_event is a non-blocking API, so we have to
"hold" all input data until the callback is received.
This includes the procID of the source, so it cannot
be a local variable.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Correctly implement fwd-env as a runtime-options directive,
marking the former "--fwd-environment" cmd line option as
deprecated. Let the MCA param set the default behavior.
Ensure that child jobs can inherit their parent's setting.
Inherit by default unless the spawn request specifies
otherwise with "noinherit" directive or provides its own
fwd environment directive.

Signed-off-by: Ralph Castain <rhc@pmix.org>
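The inheritance rule described above can be summarized in a few lines. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and falling back to the MCA-param default when "noinherit" is given is an interpretation of the commit message rather than confirmed behavior.

```python
from typing import Optional


def child_fwd_env(
    parent_fwd: bool,
    child_directive: Optional[bool] = None,
    noinherit: bool = False,
    default: bool = False,
) -> bool:
    """Decide whether a spawned job forwards its environment.

    parent_fwd:      the parent's fwd-env setting
    child_directive: the spawn request's own fwd-env directive, if any
    noinherit:       True when the spawn request says "noinherit"
    default:         the default taken from the MCA param (assumed fallback)
    """
    if child_directive is not None:
        return child_directive  # the request's own directive wins
    if noinherit:
        return default  # assumed: ignore the parent, use the default
    return parent_fwd  # inherit by default


# A child inherits from its parent, and a grandchild then inherits from the child.
child = child_fwd_env(parent_fwd=True)
grandchild = child_fwd_env(parent_fwd=child)
```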
Preserve empty lines in the show-help array so
that we retain the author's intended formatting
when displayed.

Signed-off-by: Ralph Castain <rhc@pmix.org>
It looks like the schizo/ompi/schizo-ompi-cli.rstxt grew a reference
to the prrte-rst-content/cli-no-app-prefix.rst file in
f7cc125, but this file was mistakenly
not added to src/docs/prrte-rst-content/Makefile.am's
dist_rst_DATA, even though cli-no-app-prefix.rst was already present
in the source tree.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
Check that we can build OMPI with external copies of PMIx
and PRRTE - ensures that documentation is correct.

Signed-off-by: Ralph Castain <rhc@pmix.org>
Thanks to @sonjahapp for the report!

Signed-off-by: Ralph Castain <rhc@pmix.org>
If a child inherits the fwd environment directive of its parent,
then update the child's attributes as well as forwarding its
environment so that any subsequent grandchildren also inherit
the flag.

Add a user-provided reproducer

Signed-off-by: Ralph Castain <rhc@pmix.org>
We use the pthread_setaffinity_np function if it is found in the standard pthread library. Apparently, however, some folks split the declaration of that function from pthread.h into a separate header, even though they leave the function itself in the pthread library. Go figure.

Port of openpmix/openpmix#3615

Signed-off-by: Ralph Castain <rhc@pmix.org>
github-actions bot commented Jun 2, 2025

Hello! The Git Commit Checker CI bot found a few problems with this PR:

ec3f646: Add v4 news file and adjust CI workflows

  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: f8b7b9c

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Pickup latest changes

Signed-off-by: Ralph Castain <rhc@pmix.org>
github-actions bot commented Jun 4, 2025

Hello! The Git Commit Checker CI bot found a few problems with this PR:

ec3f646: Add v4 news file and adjust CI workflows

  • check_cherry_pick: contains a cherry pick message that refers to a commit that exists, but is in an as-yet unmerged pull request: f8b7b9c

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@hppritcha hppritcha closed this Jun 4, 2025