tracking: locality aware scheduling in Fluxion #1033

Open · 1 task
grondo opened this issue May 16, 2023 · 3 comments

@grondo (Contributor) commented May 16, 2023

This is a tracking issue for locality-aware scheduling of on-node resources (currently only GPUs, I think) in Fluxion.
This is a requirement for our Production Ready System Instance milestone, tracked in this project:

https://github.com/orgs/flux-framework/projects/33

It may be that the desired effect is already achievable by adding some topology information to R or JGF for Fluxion.
We can discuss solutions at that level directly in this issue.
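
For concreteness, here is a rough sketch of what topology-bearing JGF might look like: containment edges tie cores and a gpu to a socket, so a scheduler walking the graph can see which cores share a socket (or NUMA domain) with which GPUs. Field names are modeled on the JGF documents Fluxion consumes, but treat this as illustrative rather than a verified schema:

```python
# Hypothetical JGF fragment with on-node topology: a socket containing
# both a core and a gpu. Illustrative only -- not Fluxion's exact schema.
jgf = {
    "graph": {
        "nodes": [
            {"id": "0", "metadata": {"type": "node", "name": "node0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0"}}},
            {"id": "1", "metadata": {"type": "socket", "name": "socket0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0"}}},
            {"id": "2", "metadata": {"type": "core", "name": "core0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0/core0"}}},
            {"id": "3", "metadata": {"type": "gpu", "name": "gpu0", "rank": 0,
                                     "paths": {"containment": "/cluster0/node0/socket0/gpu0"}}},
        ],
        "edges": [
            {"source": "0", "target": "1"},  # node contains socket
            {"source": "1", "target": "2"},  # socket contains core
            {"source": "1", "target": "3"},  # socket contains gpu
        ],
    }
}
```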

Notes from the original feature request

Especially for CORAL 2, if a user asks for 2 cpus and 1 gpu, ensure that they're in the same NUMA domain. Knowing where NICs are could also be important.
Possibly related, on CORAL 2, users have asked for the ability to oversubscribe GPUs.
Note: mpibind does a lot of this for us at LLNL.

However, it would be better if Fluxion could assign GPUs that are "near" cores when users ask for 1 GPU per core, etc.
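
To make the request shape concrete: with the flux-core Python bindings, a jobspec asking for 2 cores and 1 GPU per task looks roughly like the sketch below (`myapp` is a placeholder command). Note that nothing in the request expresses "put the gpu in the same NUMA domain as the cores" -- that placement is entirely up to the scheduler, which is the point of this issue:

```python
import os
import flux
from flux.job import JobspecV1, submit

# Sketch: request 2 cores and 1 gpu per task. The jobspec has no field
# for NUMA affinity between the cores and the gpu; the scheduler decides
# which specific ids are assigned.
jobspec = JobspecV1.from_command(
    command=["myapp"], num_tasks=1, cores_per_task=2, gpus_per_task=1
)
jobspec.environment = dict(os.environ)
print(submit(flux.Flux(), jobspec))
```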

Related issues:

@grondo (Contributor, Author) commented May 25, 2023

Some notes from the meeting

  • Currently, node-local topology information is lost to Fluxion because we initialize from the configured R, which has a flat list of core ids and gpu ids per node (see the sketch after this list).
  • @trws noted in the meeting that even if we can recover topology information in Fluxion for locality-aware scheduling of cores and gpus, GPU logical ids may not be consistent between system restarts. We need a way to ensure a logical GPU id on a node is consistent between the scheduler and the job shell (or the mpibind shell plugin, for that matter).
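
For reference, a minimal RV1-style sketch of that flat R (hypothetical values): the `children` idsets carry no socket or NUMA structure, so the resource graph Fluxion initializes from it cannot recover which gpu sits near which cores:

```python
import json

# Minimal RV1-style sketch (hypothetical values). "children" is a flat
# idset of cores and gpus per rank -- no containment structure below the
# node level, so locality is lost.
R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0", "children": {"core": "0-7", "gpu": "0-1"}}],
        "nodelist": ["node0"],
        "starttime": 0.0,
        "expiration": 0.0,
    },
}
print(json.dumps(R, indent=2))
```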

@grondo (Contributor, Author) commented Jul 17, 2023

FYI - a user wants to run 2 jobs on a system, each using 4 cores and 4 GPUs, but they are hitting a roadblock due to this issue. The two jobs each get different GPUs, but the allocated cores are not necessarily adjacent to those GPUs; in fact, the two sets of cores will likely be sequential.

Until we fully resolve this issue, is there some easy way to rewrite the instance R in a batch job to include JGF with the full topology? Then we might be able to get this user going. @milroy @trws?
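
One rough sketch of that kind of rewrite, assuming the RV1 `scheduling` key is where Fluxion picks up a JGF resource graph (file names here are hypothetical, and the JGF would have to be generated separately, e.g. from hwloc):

```python
import json

# Rough sketch: splice a separately generated JGF topology document into
# the instance R under the RV1 "scheduling" key, which (I believe) is
# where Fluxion looks for a resource graph. File names are made up.
with open("R.json") as f:
    R = json.load(f)
with open("topology.jgf.json") as f:
    jgf = json.load(f)

R["scheduling"] = jgf
with open("R-with-topology.json", "w") as f:
    json.dump(R, f)
```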

@grondo (Contributor, Author) commented Sep 26, 2024

This came up again in yesterday's dev meeting.

In production we currently rely on mpibind to choose the "best" mapping of tasks to cores based on a heuristic. However, mpibind is restricted to the cores handed to it by Flux, and in cases where it doesn't have access to the whole node, it can be forced into bad choices.

What we're looking for here is a way to seed Fluxion with the on-node topology for batch jobs so that it is able to do better core selection given, e.g. NUMA node layout, GPU locality, etc.

Any ideas here would be appreciated.

cc: @ryanday36
