tracking: locality aware scheduling in Fluxion #1033
Some notes from the meeting:
FYI - a user wants to run 2 jobs on a system, each using 4 cores and 4 GPUs, but they are hitting a roadblock due to this issue. If they request two jobs, each with 4 CPUs and 4 GPUs, then each job gets different GPUs, but the cores are not necessarily adjacent to the allocated GPUs; in fact, the sets of cores will likely be sequential. Until we fully resolve this issue, is there some easy way to rewrite the instance R in a batch job to include JGF with the full topology? Then we might be able to get this user going. @milroy @trws?
This came up again in yesterday's dev meeting. In production we currently rely on mpibind to choose the "best" mapping of tasks to cores based on a heuristic. However, mpibind is restricted to the cores handed to it by Flux, and in the case where mpibind doesn't have access to the whole node, it has to make bad choices. What we're looking for here is a way to seed Fluxion with the on-node topology for batch jobs so that it is able to do better core selection given, e.g., NUMA node layout, GPU locality, etc. Any ideas here would be appreciated. cc: @ryanday36
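To make the idea concrete, here is a minimal sketch of what a JGF-style graph fragment encoding on-node locality (NUMA domains containing their local cores and GPUs) could look like. This is only an illustration: the field names (`numanode`, `core`, `gpu`, the `containment` edge label) are assumptions modeled loosely on JGF conventions, and are not guaranteed to match Fluxion's actual resource graph schema.

```python
import json

def make_numa_subgraph(numa_id, core_ids, gpu_ids):
    """Build JGF-style vertices and edges for one NUMA domain,
    attaching its local cores and GPUs via containment edges.
    Field names are illustrative, not Fluxion's real schema."""
    numa = f"numa{numa_id}"
    nodes = [{"id": numa, "metadata": {"type": "numanode", "id": numa_id}}]
    edges = []
    for c in core_ids:
        cid = f"core{c}"
        nodes.append({"id": cid, "metadata": {"type": "core", "id": c}})
        edges.append({"source": numa, "target": cid,
                      "metadata": {"name": {"containment": "contains"}}})
    for g in gpu_ids:
        gid = f"gpu{g}"
        nodes.append({"id": gid, "metadata": {"type": "gpu", "id": g}})
        edges.append({"source": numa, "target": gid,
                      "metadata": {"name": {"containment": "contains"}}})
    return nodes, edges

# Hypothetical node: two NUMA domains, each with 4 cores and 1 GPU.
# A scheduler walking this graph can keep allocated cores adjacent
# to the GPU they share a NUMA domain with.
nodes, edges = [], []
for numa_id, (cores, gpus) in enumerate([(range(0, 4), [0]),
                                         (range(4, 8), [1])]):
    n, e = make_numa_subgraph(numa_id, cores, gpus)
    nodes += n
    edges += e
jgf = {"graph": {"nodes": nodes, "edges": edges}}
print(json.dumps(jgf, indent=2))
```

The point is just that once R (or an attached JGF) carries containment edges like these, a match policy has enough information to co-locate cores with the GPUs they are closest to, rather than handing out sequential core IDs.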
This is a tracking issue for handling locality-aware scheduling of on-node resources (currently only GPUs, I think) in Fluxion.
This is a requirement for our Production Ready System Instance milestone, tracked in this project:
https://github.com/orgs/flux-framework/projects/33
It may be that the desired effect is already achievable by adding some topology information to R or JGF for Fluxion.
We can discuss solutions at that level directly in this issue.
Notes from the original feature request:
Related issues: