-o gpu-affinity=per-task choosing 'wrong' gpus on tioga #986
We may need to augment the gpubind shell plugin to use hwloc to assign GPUs to each task. There may also be an issue with the Fluxion scheduler here: it will have to know which GPUs are closest to which CPUs when assigning resources to jobs that share nodes, e.g. within a batch job. For example, if a job asks for 1 task with 2 gpus per task and it is assigned Processor 0, does it get GPUs 4,5? I kind of doubt it. To fix this, we may have to make Fluxion aware of the topology on this system somehow, e.g. by generating JGF to stick in the Rv1.
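For reference, Rv1 allows an optional "scheduling" key that flux-core treats as opaque and that Fluxion can read back. A rough sketch of the idea (the rank, id sets, and JGF payload below are illustrative placeholders, not output from this system):

{
  "version": 1,
  "execution": {
    "R_lite": [{"rank": "0", "children": {"core": "0-63", "gpu": "0-7"}}],
    "nodelist": ["tioga11"],
    "starttime": 0,
    "expiration": 0
  },
  "scheduling": {
    "graph": {
      "nodes": [],
      "edges": []
    }
  }
}

Here the "nodes" and "edges" arrays would carry the JGF vertices and edges recording which GPUs hang off which NUMA groups, which is what would let Fluxion make locality-aware selections on this system.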
Mpibind is getting there. Historically (i.e. in the Slurm plugin version), it does well when one job has all of the resources on the node, but has trouble when multiple jobs are running on a node. I'm not sure yet how well the Flux plugin will do in the same cases. If I run multiple 'flux mini run -n1 ...' commands inside of an instance from a 'flux mini alloc -N1', are those using Fluxion or are they scheduled by sched-simple? They appear to have the same GPU and CPU affinity sets as the 'flux mini run -n4' case.
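One quick way to check which scheduler an instance is using is to list its loaded modules from inside it; seeing sched-simple versus the sched-fluxion-qmanager/sched-fluxion-resource pair answers the question:

$ flux module list | grep sched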
The scheduler inside of a sub-instance is whatever scheduler modules that instance loads (Fluxion if flux-sched is installed there). Plus, sched-simple doesn't support GPUs, so you'd get an error trying to request gpus with it. It would be interesting to see what R looks like for jobs within your allocation. BTW, this is related to another issue as well.
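For jobs already run inside the allocation, R can be pulled with flux job info (the job id below is a placeholder):

$ flux job info <jobid> R
$ flux job info <jobid> R | jq .execution.R_lite    # if jq is available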
It might take a sched change if we aren't encoding the locality of the GPUs yet; we'll have to think about that. Depending on the situation, we might be able to encode the GPUs shown to sched, and selected by it, in a way that doesn't require knowing their absolute number on the final node, but instead indexes them by, say, socket and logical id within the socket. We'll have to look into this. The mpibind plugin should select the local GPUs when it can, but if we're only giving it access to the ones sched selected, that will not help.
Injecting my name here so I get updates (I opened the jira).
Transferred this issue to flux-sched since it is the thing assigning GPUs in this case.
I investigated this behavior with Fluxion's resource-query utility. First, some background. On tioga, lstopo reports:

[milroy1@tioga11:~]$ lstopo
Machine (503GB total)
Package L#0
Group0 L#0
NUMANode L#0 (P#0 125GB)
L3 L#0 (32MB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
HostBridge
PCIBridge
PCI d1:00.0 (Display)
GPU(RSMI) "rsmi4"
[...]

Loading the corresponding XML file into resource-query works as expected:

$ ./resource-query -L test.xml -f hwloc
resource-query> find status=up
[...]
---------------L3cache6[32768:x]
---------------------pu56[1:x]
------------------core56[1:x]
---------------------pu57[1:x]
------------------core57[1:x]
---------------------pu58[1:x]
------------------core58[1:x]
---------------------pu59[1:x]
------------------core59[1:x]
---------------------pu60[1:x]
------------------core60[1:x]
---------------------pu61[1:x]
------------------core61[1:x]
---------------------pu62[1:x]
------------------core62[1:x]
---------------------pu63[1:x]
------------------core63[1:x]
------------------gpu1[1:x]
---------------L3cache7[32768:x]
---------------numanode3[1:x]
------------group3[1:x]
---------socket0[1:x]
---------storage13[52:x]
---------storage14[52:x]
------tioga11[1:x]
---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=up"
INFO: =============================

But there's a lot more detail in the graph (and hence many more resource vertices) than can be represented by a jobspec generated from, e.g., the command line:

{
"resources": [
{
"type": "node",
"count": 1,
"with": [
{
"type": "slot",
"count": 1,
"with": [
{
"type": "core",
"count": 4
},
{
"type": "gpu",
"count": 4
}
],
"label": "task"
}
]
}
],
"tasks": [
{
"command": [
"sleep",
"2"
],
"slot": "task",
"count": {
"per_slot": 1
}
}
],
"attributes": {
"system": {
"duration": 0,
"cwd": "",
"shell": {
"options": {
"rlimit": {
"cpu": -1,
"fsize": -1,
"data": -1,
"stack": -1,
"core": 16384,
"nofile": 128000,
"as": -1,
"rss": -1,
"nproc": 8192
},
"cpu-affinity": "per-task",
"gpu-affinity": "per-task"
}
}
}
},
"version": 1
}

This means that Fluxion won't return a match for that jobspec (converted to YAML for resource-query):

resource-query> match allocate jobspec.yaml
INFO: =============================
INFO: No matching resources found
INFO: JOBID=1
INFO: =============================

However, a jobspec like this will work:

version: 1
resources:
- type: node
  count: 1
  with:
  - type: socket
    count: 1
    with:
    - type: slot
      label: task
      count: 4
      with:
      - type: group
        count: 1
        with:
        - type: cache
          count: 65536
          with:
          - type: gpu
            count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Which produces the following mapping:

resource-query> match allocate jobspec.yaml
------------------gpu4[1:x]
---------------L3cache0[32768:x]
------------------gpu5[1:x]
---------------L3cache1[32768:x]
------------group0[1:x]
------------------gpu2[1:x]
---------------L3cache2[32768:x]
------------------gpu3[1:x]
---------------L3cache3[32768:x]
------------group1[1:x]
------------------gpu6[1:x]
---------------L3cache4[32768:x]
------------------gpu7[1:x]
---------------L3cache5[32768:x]
------------group2[1:x]
------------------gpu0[1:x]
---------------L3cache6[32768:x]
------------------gpu1[1:x]
---------------L3cache7[32768:x]
------------group3[1:x]
---------socket0[1:s]
------tioga11[1:s]
---cluster0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================

Note the mapping: group0 --> GPU4,5, group1 --> GPU2,3, group2 --> GPU6,7, group3 --> GPU0,1.
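For anyone reproducing the resource-query test above: the topology XML can be captured on a compute node with hwloc's lstopo (the output filename is just the one used above):

$ lstopo --of xml test.xml
$ ./resource-query -L test.xml -f hwloc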
To clarify the above: the jobspec I listed requests four "groups" (should they be discovered as sockets?), each with two GPUs, to illustrate the mapping @ryanday36 reported in the first comment. To get the desired resources corresponding to the original request, a jobspec like the following works:

version: 1
resources:
- type: node
  count: 1
  with:
  - type: socket
    count: 1
    with:
    - type: slot
      label: task
      count: 1
      with:
      - type: group
        count: 4
        with:
        - type: cache
          count: 32768
          with:
          - type: gpu
            count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Note the ugliness of having to handle the cache count (the count corresponds to the cache size in KB).
I'll add that my findings don't demonstrate the mapping for an actual job. They strongly suggest that Fluxion will make the correct rank mapping. I'll figure out a way to get the mapping for an actual job ASAP.
@grondo, I think I figured out a way to coerce core and sched to output the mapping we want, with the FLUXION_RESOURCE_OPTIONS environment variable set as shown below:

[milroy1@tioga10]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-format=simple" flux start
[milroy1@tioga10]$ flux submit -c1 -n4 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3AwVzTy
Sep 14 01:09:45.396952 job-manager.err[0]: cray_pals_port_distributor: Error fetching R from shell-counting future: Invalid argument
Sep 14 01:09:45.397108 job-list.err[0]: parse_R: job f3AwVzTy invalid R: '[' or '{' expected near '-'
[milroy1@tioga10]$ flux job info f3AwVzTy R
------------gpu4[1:x]
------------core15[1:x]
------------gpu5[1:x]
---------group0[1:s]
------------gpu2[1:x]
------------core31[1:x]
------------gpu3[1:x]
---------group1[1:s]
------------gpu6[1:x]
------------core47[1:x]
------------gpu7[1:x]
---------group2[1:s]
------------gpu0[1:x]
------------core63[1:x]
------------gpu1[1:x]
---------group3[1:s]
------tioga10[1:s]
---cluster0[1:s] Note the mapping |
It is possible that the task-to-core mapping is not what's desired. A follow-up test very strongly suggests that the setup produces the right mapping (note the match-policy setting used in each run):

[milroy1@tioga11:utilities]$ export HWLOC_COMPONENTS=x86
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=high" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f87cF7wD
Sep 14 10:15:29.606505 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f87cF7wD R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "63", "gpu": "0-1"}}], "nodelist": ["tioga10"], "starttime": 1694711729, "expiration": 4848311729}}
[milroy1@tioga11:utilities]$ exit
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=low" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3K7v2eK
Sep 14 10:16:03.794322 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f3K7v2eK R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}} The two tests individually produce the desired locality-aware mapping. |
Great! I wonder if we can write a shell plugin, activated by an option, to do this.
One more question: this works for a single node, but if a job has multiple nodes I assume we'll need to fetch the topology for each node and load them separately into Fluxion. The topology of a rank can already be fetched from flux-core.
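A possible way to gather per-node topology for the multi-node case, assuming lstopo is installed on the compute nodes (the file paths are illustrative):

$ flux exec -r all sh -c 'lstopo --of xml /tmp/topo-$(hostname).xml'

How those per-rank XML files would then be combined and loaded into Fluxion is exactly the open question here.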
I actually wouldn't go so far as to say it works for a single node. My demo above just illustrates that the mapping can be done, but the jobs themselves fail:

[milroy1@tioga10:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml" flux start
[milroy1@tioga10:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f9rhi9V1
Sep 14 23:41:40.633830 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga10:utilities]$ flux job info f9rhi9V1 eventlog
{"timestamp":1694760100.607444,"name":"submit","context":{"userid":<>,"urgency":16,"flags":0,"version":1}}
{"timestamp":1694760100.6202316,"name":"validate"}
{"timestamp":1694760100.6313109,"name":"depend"}
{"timestamp":1694760100.6313372,"name":"priority","context":{"priority":16}}
{"timestamp":1694760100.6334589,"name":"alloc"}
{"timestamp":1694760100.6334941,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1694760100.6337798,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1694760100.6350386,"name":"exception","context":{"type":"exec","severity":0,"userid":<>,"note":"reading R: R_lite: failed to read target rank list: Invalid argument"}}
{"timestamp":1694760100.636466,"name":"release","context":{"ranks":"all","final":true}}
{"timestamp":1694760100.636601,"name":"free"}
{"timestamp":1694760100.6366169,"name":"clean"}
[milroy1@tioga10:utilities]$ flux resource R
flux-resource: ERROR: Rlist: invalid argument

In this test case, at least, there's a mismatch between the R generated by Fluxion and what flux-core expects.
Sorry, I'm a bit lost here. If we can use this approach, what exactly is failing?
Ah, ok, I see. Fluxion is creating an invalid Rv1 for the jobs:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}

It appears Fluxion is perhaps just missing rank information in the graph? The rest of R looks fine anyway.
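For comparison, a well-formed entry would carry a real broker rank rather than -1, something like the following (hand-edited from the R above, not actual output):

{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}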
Each core
see also https://rzlc.llnl.gov/jira/browse/ELCAP-179
The short version of this, I think, is that cpu-affinity and gpu-affinity assign the lowest numbered CPUs and lowest numbered GPUs to the lowest numbered tasks, but on the El Cap hardware, the lowest numbered CPUs are not "closest" (by bandwidth) to the lowest numbered GPUs. The mapping actually looks like:
Processor 0 : GPUs 4,5
Processor 1 : GPUs 2,3
Processor 2 : GPUs 6,7
Processor 3 : GPUs 0,1
whereas '-o cpu-affinity=per-task -o gpu-affinity=per-task' currently gives:
Processor 0 : GPUs 0,1
Processor 1 : GPUs 2,3
Processor 2 : GPUs 4,5
Processor 3 : GPUs 6,7
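A quick way to see which mapping a given flux-core actually hands out, assuming a recent flux-core (flux run in place of flux mini run) and that the gpu-affinity plugin exports CUDA_VISIBLE_DEVICES (on AMD nodes the relevant variable may differ):

$ flux run -N1 -n4 -c1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task \
    sh -c 'echo "task $FLUX_TASK_RANK: cpus=$(taskset -cp $$ | cut -d: -f2), gpus=$CUDA_VISIBLE_DEVICES"'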