
-o gpu-affinity=per-task choosing 'wrong' gpus on tioga #986

Open
Tracked by #1033
ryanday36 opened this issue Sep 27, 2022 · 15 comments

@ryanday36

see also https://rzlc.llnl.gov/jira/browse/ELCAP-179

The short version of this, I think, is that cpu-affinity and gpu-affinity assign the lowest numbered CPUs and lowest numbered GPUs to the lowest numbered tasks, but on the El Cap hardware, the lowest numbered CPUs are not "closest" (by bandwidth) to the lowest numbered GPUs. The mapping actually looks like:

Processor 0 : GPUs 4,5
Processor 1 : GPUs 2,3
Processor 2 : GPUs 6,7
Processor 3 : GPUs 0,1

whereas '-o cpu-affinity=per-task -o gpu-affinity=per-task' currently gives:

Processor 0 : GPUs 0,1
Processor 1 : GPUs 2,3
Processor 2 : GPUs 4,5
Processor 3 : GPUs 6,7
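The difference between the two tables can be sketched in a few lines of Python. This is purely illustrative: the locality table is hard-coded from the measurements above, and the function names are made up for this sketch.

```python
# CPU->GPU locality reported for El Cap / Tioga nodes (from the table above).
TIOGA_GPU_MAP = {0: [4, 5], 1: [2, 3], 2: [6, 7], 3: [0, 1]}

def naive_assignment(n_procs=4, gpus_per_task=2):
    """Current behavior: lowest-numbered CPUs get lowest-numbered GPUs."""
    return {p: [p * gpus_per_task + i for i in range(gpus_per_task)]
            for p in range(n_procs)}

def locality_assignment(n_procs=4):
    """Desired behavior: follow the measured CPU-GPU bandwidth topology."""
    return {p: TIOGA_GPU_MAP[p] for p in range(n_procs)}

if __name__ == "__main__":
    print("naive:   ", naive_assignment())    # Processor 0 -> GPUs 0,1
    print("locality:", locality_assignment()) # Processor 0 -> GPUs 4,5
```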

@grondo
Contributor

grondo commented Sep 27, 2022

We may need to augment the gpubind shell plugin to use hwloc to assign GPUs to each task with -o gpu-affinity=per-task. Alternatively, I wonder if mpibind would "just work" here?

There also may be an issue with the Fluxion scheduler here. It will have to know which GPUs are closest to which CPUs when assigning resources to jobs that share nodes, e.g. within a batch job. For example, if a job asks for 1 task with 2 GPUs per task and it is assigned Processor 0, does it get GPUs 4,5? I kind of doubt it. To fix that, we may have to make Fluxion aware of the topology on this system somehow, e.g. by generating JGF to stick in the Rv1 .scheduling key. If you can validate that this is a problem, then we should open a separate issue in flux-sched and strategize there.

@ryanday36
Author

Mpibind is getting there. Historically (i.e. in the Slurm plugin version), it does well when one job has all of the resources on the node, but has trouble when multiple jobs are running on a node. I'm not sure yet how well the Flux plugin will do with the same cases. If I run multiple 'flux mini run -n1 ...' commands inside of an instance from a 'flux mini alloc -N1', are those using fluxion or are they scheduled by sched-simple? They appear to have the same GPU and CPU affinity sets as the 'flux mini run -n4' case.

@grondo
Contributor

grondo commented Sep 27, 2022

The scheduler inside of a flux mini alloc or flux mini batch should still be Fluxion whenever flux-sched is installed. You can check with flux module list | grep sched. The only difference is that the configured scheduler "policy" will be the default instead of whatever the system policy is, e.g. one of the exclusive node policies "hinodex" or "lonodex".

Plus sched-simple doesn't support GPUs, so you'd get an error trying to request gpus with --gpus-per-task.

It would be interesting to see what R looks like for jobs within your flux mini alloc session when they request a single processor and 1 or 2 GPUs.

BTW, related to -o gpu-affinity=per-task @trws is our hwloc expert I think, and he may be able to suggest how to fix or replace the gpubind shell plugin so it selects the correct GPU from the set of available GPUs. (However, this assumes that the scheduler has chosen the correct GPUs to assign to the job)

@trws
Member

trws commented Sep 27, 2022

It might take a sched change if we aren't encoding the locality of the GPUs yet, we'll have to think about that. Depending on what the situation is, we might be able to encode the GPUs shown to sched and selected by it in a way that we don't have to know their number on the final node, but index them by say socket and logical id off of the socket? Will have to look into this. The mpibind plugin should select the local GPUs when it can, but if we're only giving it access to the ones sched selected that will not help.

@jjellio

jjellio commented Sep 28, 2022

Injecting my name here so I get updates (I opened the jira)

@grondo grondo transferred this issue from flux-framework/flux-core Oct 26, 2022
@grondo
Contributor

grondo commented Oct 26, 2022

Transferred this issue to flux-sched since it is the thing assigning GPUs in this case.

@milroy
Member

milroy commented Aug 9, 2023

I investigated the behavior of hwloc on the Tioga system to see if and how it can generate an XML that can be loaded by the Fluxion resource-query utility. With resource-query I tested whether Fluxion can generate a mapping that takes into account CPU-GPU locality.

First some background. On Tioga, lstopo (based on hwloc 2.1.9) returns a warning indicating that it is ignoring an invalid hardware topology. The resulting XML file has the GPUs hanging off by themselves underneath the node. To detect the topology correctly, you need to set HWLOC_COMPONENTS=x86 in the environment:

[milroy1@tioga11:~]$ lstopo
Machine (503GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 125GB)
      L3 L#0 (32MB)
        L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
        L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
        L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
        L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
        L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
        L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
        L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
        HostBridge
          PCIBridge
            PCI d1:00.0 (Display)
              GPU(RSMI) "rsmi4"
[...]

Loading the corresponding XML file works as expected with resource-query:

$ ./resource-query -L test.xml -f hwloc
resource-query> find status=up
[...]
      ---------------L3cache6[32768:x]
      ---------------------pu56[1:x]
      ------------------core56[1:x]
      ---------------------pu57[1:x]
      ------------------core57[1:x]
      ---------------------pu58[1:x]
      ------------------core58[1:x]
      ---------------------pu59[1:x]
      ------------------core59[1:x]
      ---------------------pu60[1:x]
      ------------------core60[1:x]
      ---------------------pu61[1:x]
      ------------------core61[1:x]
      ---------------------pu62[1:x]
      ------------------core62[1:x]
      ---------------------pu63[1:x]
      ------------------core63[1:x]
      ------------------gpu1[1:x]
      ---------------L3cache7[32768:x]
      ---------------numanode3[1:x]
      ------------group3[1:x]
      ---------socket0[1:x]
      ---------storage13[52:x]
      ---------storage14[52:x]
      ------tioga11[1:x]
      ---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=up"
INFO: =============================

But there's a lot more detail in the graph (and hence more resource vertices) than can be represented by a jobspec generated from, e.g., flux run -N1 -c4 -n1 -g4 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 2:

{
  "resources": [
    {
      "type": "node",
      "count": 1,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": 4
            },
            {
              "type": "gpu",
              "count": 4
            }
          ],
          "label": "task"
        }
      ]
    }
  ],
  "tasks": [
    {
      "command": [
        "sleep",
        "2"
      ],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    }
  ],
  "attributes": {
    "system": {
      "duration": 0,
      "cwd": "",
      "shell": {
        "options": {
          "rlimit": {
            "cpu": -1,
            "fsize": -1,
            "data": -1,
            "stack": -1,
            "core": 16384,
            "nofile": 128000,
            "as": -1,
            "rss": -1,
            "nproc": 8192
          },
          "cpu-affinity": "per-task",
          "gpu-affinity": "per-task"
        }
      }
    }
  },
  "version": 1
}

This means that Fluxion won't return a match for that jobspec (converted here to YAML):

resource-query> match allocate jobspec.yaml
INFO: =============================
INFO: No matching resources found
INFO: JOBID=1
INFO: =============================

However, a jobspec like this will work:

version: 1
resources:
  - type: node
    count: 1
    with:
     - type: socket
       count: 1
       with:
        - type: slot
          label: task
          count: 4
          with:
           - type: group
             count: 1
             with:
              - type: cache
                count: 65536
                with:
                 - type: gpu
                   count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Which produces the following mapping:

resource-query> match allocate jobspec.yaml
      ------------------gpu4[1:x]
      ---------------L3cache0[32768:x]
      ------------------gpu5[1:x]
      ---------------L3cache1[32768:x]
      ------------group0[1:x]
      ------------------gpu2[1:x]
      ---------------L3cache2[32768:x]
      ------------------gpu3[1:x]
      ---------------L3cache3[32768:x]
      ------------group1[1:x]
      ------------------gpu6[1:x]
      ---------------L3cache4[32768:x]
      ------------------gpu7[1:x]
      ---------------L3cache5[32768:x]
      ------------group2[1:x]
      ------------------gpu0[1:x]
      ---------------L3cache6[32768:x]
      ------------------gpu1[1:x]
      ---------------L3cache7[32768:x]
      ------------group3[1:x]
      ---------socket0[1:s]
      ------tioga11[1:s]
      ---cluster0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================

Note the mapping group0 --> GPU4,5, group1 --> GPU2,3, group2 --> GPU6,7, group3 --> GPU0,1.
(Edited to fix the jobspec.)

@milroy
Member

milroy commented Aug 9, 2023

To clarify the above: the jobspec I listed requests four "groups" (should they be discovered as sockets?), each with two GPUs, to illustrate the mapping @ryanday36 reported in the first comment. To get the desired resources corresponding to the flux run command above, I'd need this:

version: 1
resources:
  - type: node
    count: 1
    with:
     - type: socket
       count: 1
       with:
        - type: slot
          label: task
          count: 1
          with:
           - type: group
             count: 4
             with:
              - type: cache
                count: 32768
                with:
                 - type: gpu
                   count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Note the ugliness with handling the cache count.
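Hand-writing these deeply nested "with" chains is error-prone, so here is a small Python sketch that builds the resource section of the jobspec above programmatically. The with_chain helper is purely illustrative (it is not a Fluxion or flux-core API); the counts mirror the YAML above.

```python
import json

def with_chain(levels):
    """Nest each resource dict under the previous one via the jobspec
    "with" key. `levels` is ordered outermost-first. Illustrative helper
    only -- not part of any Flux API."""
    for outer, inner in zip(levels, levels[1:]):
        outer["with"] = [inner]
    return levels[0]

# Resource section of the locality-aware jobspec shown above.
resources = [with_chain([
    {"type": "node", "count": 1},
    {"type": "socket", "count": 1},
    {"type": "slot", "label": "task", "count": 1},
    {"type": "group", "count": 4},
    {"type": "cache", "count": 32768},  # the awkward cache count, in KB
    {"type": "gpu", "count": 1},
])]

print(json.dumps(resources, indent=2))
```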

@milroy
Member

milroy commented Aug 9, 2023

I'll add that my findings don't demonstrate the mapping for an actual job. They strongly suggest that Fluxion will make the correct rank mapping. I'll figure out a way to get the mapping for an actual job ASAP.

@milroy
Member

milroy commented Sep 14, 2023

@grondo, I think I figured out a way to coerce core and sched to output the mapping we want. With the environment variable HWLOC_COMPONENTS=x86 set, I generated an XML of tioga10 (lstopo --of xml tioga_node.xml). Then I can get Fluxion to load a resource graph based on that XML, passing in an allowlist so the generated resource graph has the locality embedded in its topology. I selected the simple match-format for legibility (which is what causes the parsing errors below):

[milroy1@tioga10]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-format=simple" flux start
[milroy1@tioga10]$ flux submit -c1 -n4 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3AwVzTy
Sep 14 01:09:45.396952 job-manager.err[0]: cray_pals_port_distributor: Error fetching R from shell-counting future: Invalid argument
Sep 14 01:09:45.397108 job-list.err[0]: parse_R: job f3AwVzTy invalid R: '[' or '{' expected near '-'
[milroy1@tioga10]$ flux job info f3AwVzTy R
      ------------gpu4[1:x]
      ------------core15[1:x]
      ------------gpu5[1:x]
      ---------group0[1:s]
      ------------gpu2[1:x]
      ------------core31[1:x]
      ------------gpu3[1:x]
      ---------group1[1:s]
      ------------gpu6[1:x]
      ------------core47[1:x]
      ------------gpu7[1:x]
      ---------group2[1:s]
      ------------gpu0[1:x]
      ------------core63[1:x]
      ------------gpu1[1:x]
      ---------group3[1:s]
      ------tioga10[1:s]
      ---cluster0[1:s]

Note the mapping core15 --> gpu4,5 (in group 0, i.e. processor 0) and the entries that follow, which appear to respect the true physical locality.

@milroy
Member

milroy commented Sep 14, 2023

It is possible that the task-to-core mapping is not what's desired. A follow-up test very strongly suggests that the setup produces the right mapping (note the match-policy=high and match-policy=low):

[milroy1@tioga11:utilities]$ export HWLOC_COMPONENTS=x86
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=high" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f87cF7wD
Sep 14 10:15:29.606505 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f87cF7wD R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "63", "gpu": "0-1"}}], "nodelist": ["tioga10"], "starttime": 1694711729, "expiration": 4848311729}}
[milroy1@tioga11:utilities]$ exit
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=low" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3K7v2eK
Sep 14 10:16:03.794322 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f3K7v2eK R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}

The two tests individually produce the desired locality-aware mapping.
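The children sets in those two R fragments can be pulled apart with a short Python sketch. The idset helper is a simplified stand-in for flux-core's idset code, handling only the forms that appear here; note that the leading "-" in rank "-1" is exactly the kind of string such a parser rejects.

```python
import json

# R fragments returned by the match-policy=high and match-policy=low runs above.
R_high = '{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "63", "gpu": "0-1"}}], "nodelist": ["tioga10"], "starttime": 1694711729, "expiration": 4848311729}}'
R_low = '{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}'

def idset(s):
    """Expand a simple Rv1 idset string like "4-5" or "0" to a list of ints.
    Simplified stand-in for flux-core's idset parser."""
    out = []
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        if lo == "":  # a leading "-" (as in rank "-1") is not a valid idset
            raise ValueError(f"invalid idset: {s!r}")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

for name, R in [("high", R_high), ("low", R_low)]:
    children = json.loads(R)["execution"]["R_lite"][0]["children"]
    print(name, "core", idset(children["core"]), "gpu", idset(children["gpu"]))
```

With match-policy=low this yields core 0 with GPUs 4,5 and with match-policy=high core 63 with GPUs 0,1, matching the locality table from the first comment.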

@grondo
Contributor

grondo commented Sep 14, 2023

Great! I wonder if we can write a shell plugin, activated by an -o option, to dump the topology and set the environment variable on behalf of users before launching the broker. I can try to do that a bit later.

@grondo
Contributor

grondo commented Sep 14, 2023

One more question: This works for a single node, but if a job has multiple nodes I assume we'll need to fetch the topology for each node and load them separately into Fluxion.

The topology of a rank can currently be fetched via the resource.topo-get RPC. The job shell uses this to fetch a copy of the hwloc topology from the enclosing instance without needing to call hwloc_topology_load() which is very expensive. As a next step, we may want to add another option to Fluxion to fetch these XMLs from every rank directly via an RPC, instead of having to collect them into a filesystem location. Maybe this can be done via the config file instead of an environment variable since we now have a --conf=CONFIG option in flux alloc and flux batch.

@milroy
Member

milroy commented Sep 15, 2023

One more question: This works for a single node, but if a job has multiple nodes I assume we'll need to fetch the topology for each node and load them separately into Fluxion.

I actually wouldn't go so far as to say it works for a single node. My demo above just illustrates that the mapping can be done, but the jobs themselves fail:

[milroy1@tioga10:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml" flux start
[milroy1@tioga10:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f9rhi9V1
Sep 14 23:41:40.633830 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga10:utilities]$ flux job info f9rhi9V1 eventlog
{"timestamp":1694760100.607444,"name":"submit","context":{"userid":<>,"urgency":16,"flags":0,"version":1}}
{"timestamp":1694760100.6202316,"name":"validate"}
{"timestamp":1694760100.6313109,"name":"depend"}
{"timestamp":1694760100.6313372,"name":"priority","context":{"priority":16}}
{"timestamp":1694760100.6334589,"name":"alloc"}
{"timestamp":1694760100.6334941,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1694760100.6337798,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1694760100.6350386,"name":"exception","context":{"type":"exec","severity":0,"userid":<>,"note":"reading R: R_lite: failed to read target rank list: Invalid argument"}}
{"timestamp":1694760100.636466,"name":"release","context":{"ranks":"all","final":true}}
{"timestamp":1694760100.636601,"name":"free"}
{"timestamp":1694760100.6366169,"name":"clean"}
[milroy1@tioga10:utilities]$ flux resource R
flux-resource: ERROR: Rlist: invalid argument

In this test case at least, there's a mismatch between the hwloc reader and rv1exec, which I think is causing the "failed to read target rank list" error.

As a next step, we may want to add another option to Fluxion to fetch these XMLs from every rank directly via an RPC, instead of having to collect them into a filesystem location.

Sorry, I'm a bit lost here. If we can use the resource.topo-get RPC, why do we need to fetch the XMLs from each rank? Or is it that the enclosing instance won't have sufficient topology information in this case, so we need Fluxion to build its resource graph from XML that hwloc generates on each node, fetched via RPC?

@grondo
Contributor

grondo commented Sep 15, 2023

In this test case at least there's a mismatch between the hwloc reader and rv1exec which I think is causing the "failed to read target rank list" error.

Ah, ok, I see. Fluxion is creating an invalid Rv1 for the jobs:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}

It appears Fluxion is perhaps just missing rank information in the graph? The rest of R looks fine anyway.
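A minimal sanity check along these lines catches the bad rank. This is purely illustrative; the real validation lives in flux-core's Rlist/idset code.

```python
import json

def validate_rlite(R_json):
    """Flag R_lite entries whose rank idset is invalid (e.g. negative).
    Illustrative only -- not the actual flux-core validation logic."""
    errors = []
    for entry in json.loads(R_json)["execution"]["R_lite"]:
        rank = entry.get("rank", "")
        if rank.startswith("-"):
            errors.append(f"invalid rank idset: {rank!r}")
    return errors

# The invalid R produced by Fluxion above (trimmed to the relevant fields).
bad_R = '{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 0, "expiration": 0}}'
print(validate_rlite(bad_R))  # the rank "-1" trips the check
```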

Sorry, I'm a bit lost here. If we can use the resource.topo-get RPC why do we need to fetch the XMLs from each rank?

Each core resource module only keeps the hwloc XML of its local resources. That is, there is not a way to fetch the XML for all ranks in the job with a single RPC. For now I was thinking Fluxion could send an RPC to each rank to collect the XML. Of course, as a stopgap we could perhaps have a shell plugin do this and write the XML to the job's TMPDIR, but it would be more efficient to have Fluxion do this directly. (This is just one idea of many though...)
