tracking: locality aware scheduling in Fluxion #1033

Open · 1 task
grondo opened this issue May 16, 2023 · 3 comments

@grondo (Contributor) commented May 16, 2023

This is a tracking issue for locality-aware scheduling of on-node resources (currently only GPUs, I think) in Fluxion.
This is a requirement for our Production Ready System Instance milestone, tracked in this project:

https://github.com/orgs/flux-framework/projects/33

It may be that the desired effect is already achievable by adding some topology information to R or JGF for Fluxion.
We can discuss solutions at that level directly in this issue.
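
For concreteness, here is a rough sketch of what topology-bearing JGF might look like: containment edges tie cores and a gpu to a socket, so a scheduler walking the graph can see which cores share a socket (or NUMA domain) with which GPUs. Field names are modeled on the JGF documents Fluxion consumes, but treat this as illustrative rather than a verified schema:

```python
# Hypothetical JGF fragment with on-node topology: a socket containing
# both a core and a gpu. Illustrative only -- not Fluxion's exact schema.
jgf = {
    "graph": {
        "nodes": [
            {"id": "0", "metadata": {"type": "node", "name": "node0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0"}}},
            {"id": "1", "metadata": {"type": "socket", "name": "socket0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0"}}},
            {"id": "2", "metadata": {"type": "core", "name": "core0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0/core0"}}},
            {"id": "3", "metadata": {"type": "gpu", "name": "gpu0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0/gpu0"}}},
        ],
        "edges": [
            {"source": "0", "target": "1"},  # node contains socket
            {"source": "1", "target": "2"},  # socket contains core
            {"source": "1", "target": "3"},  # socket contains gpu
        ],
    }
}
```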

Notes from the original feature request

Especially for CORAL 2, if a user asks for 2 cpus and 1 gpu, ensure that they're in the same NUMA domain. Knowing where NICs are could also be important.
Possibly related, on CORAL 2, users have asked for the ability to oversubscribe GPUs.
Note: mpibind does a lot of this for us at LLNL.

However, it would be better if Fluxion could assign GPUs that are "near" cores when users ask for 1 GPU per core, etc.
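
To make the request shape concrete: with the flux-core Python bindings, a jobspec asking for 2 cores and 1 GPU per task looks roughly like the sketch below (`myapp` is a placeholder command). Note that nothing in the request expresses "put the gpu in the same NUMA domain as the cores" -- that placement is entirely up to the scheduler, which is the point of this issue:

```python
import os
import flux
from flux.job import JobspecV1, submit

# Sketch: request 2 cores and 1 gpu per task. The jobspec has no field
# for NUMA affinity between the cores and the gpu; the scheduler decides
# which specific ids are assigned.
jobspec = JobspecV1.from_command(
    command=["myapp"], num_tasks=1, cores_per_task=2, gpus_per_task=1
)
jobspec.environment = dict(os.environ)
print(submit(flux.Flux(), jobspec))
```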

Related issues:

@grondo (Contributor, Author) commented May 25, 2023

Some notes from the meeting

  • Currently, node-local topology information is lost to Fluxion because we initialize from the configured R, which has a flat list of core ids and gpu ids per node (see the sketch after this list).
  • @trws noted in the meeting that even if we can recover topology information in Fluxion for locality-aware scheduling of cores and gpus, GPU logical ids may not be consistent between system restarts. We need a way to ensure a logical GPU id on a node is consistent between the scheduler and the job shell (or the mpibind shell plugin, for that matter).
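
For reference, a minimal RV1-style sketch of that flat R (hypothetical values): the `children` idsets carry no socket or NUMA structure, so the resource graph Fluxion initializes from it cannot recover which gpu sits near which cores:

```python
import json

# Minimal RV1-style sketch (hypothetical values). "children" is a flat
# idset of cores and gpus per rank -- no containment structure below the
# node level, so locality is lost.
R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0", "children": {"core": "0-7", "gpu": "0-1"}}],
        "nodelist": ["node0"],
        "starttime": 0.0,
        "expiration": 0.0,
    },
}
print(json.dumps(R, indent=2))
```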

@grondo (Contributor, Author) commented Jul 17, 2023

FYI - a user wants to run 2 jobs on a system, each using 4 cores and 4 GPUs, but they are hitting a roadblock due to this issue. The two jobs each get different GPUs, but the allocated cores are not necessarily adjacent to those GPUs; in fact, the two sets of cores will likely be sequential.

Until we fully resolve this issue, is there some easy way to rewrite the instance R in a batch job to include JGF with the full topology? Then we might be able to get this user going. @milroy @trws?
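
One rough sketch of that kind of rewrite, assuming the RV1 `scheduling` key is where Fluxion picks up a JGF resource graph (file names here are hypothetical, and the JGF would have to be generated separately, e.g. from hwloc):

```python
import json

# Rough sketch: splice a separately generated JGF topology document into
# the instance R under the RV1 "scheduling" key, which (I believe) is
# where Fluxion looks for a resource graph. File names are made up.
with open("R.json") as f:
    R = json.load(f)
with open("topology.jgf.json") as f:
    jgf = json.load(f)

R["scheduling"] = jgf
with open("R-with-topology.json", "w") as f:
    json.dump(R, f)
```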

@grondo (Contributor, Author) commented Sep 26, 2024

This came up again in yesterday's dev meeting.

In production we currently rely on mpibind to choose the "best" mapping of tasks to cores based on a heuristic. However, mpibind is restricted to the cores handed to it by Flux, and in cases where it doesn't have access to the whole node, it can be forced into bad choices.

What we're looking for here is a way to seed Fluxion with the on-node topology for batch jobs so that it is able to do better core selection given, e.g. NUMA node layout, GPU locality, etc.

Any ideas here would be appreciated.

cc: @ryanday36
