flux-core filters out an allocated GPU #3375

dongahn · 2020-11-26T05:17:53Z

Top level

rzansel16{dahn}28: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 4 --bind=none --smpiargs="-disable_gpu_hooks" ./blueos_3_ppc64le_ib_p9/bin/flux start

From a proxy

rzansel61{dahn}27: flux mini alloc -n4 -N4 -c20 -g3
2020-11-26T05:05:44.912932Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu2
2020-11-26T05:05:46.531492Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu2
2020-11-26T05:05:46.532046Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu2
2020-11-26T05:05:47.439508Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu2
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       80        8
 allocated      0        0        0
      down      0        0        0
rzansel16{dahn}23: echo $CUDA_VISIBLE_DEVICES
0,1,2

From the top level

flux job info fBAY1gobV R
{"version": 1, "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-19", "gpu": "0-2"}}], "nodelist": ["rzansel[16,18,47,49]"], "starttime": 1606367143, "expiration": 1606971943}}

Apparently, the top-level scheduler creates the correct resource set, but the nested instance fails to discover one of the GPUs.
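
The "missing resources" message can be read as a set difference between what R assigns and what the nested instance discovers. Below is a minimal Python sketch of that comparison, not flux-core's actual verify code. The helper names are made up for illustration, and the discovered set "0-1" is an assumption based on `flux resource list` reporting 8 GPUs across 4 nodes.

```python
def parse_idset(text):
    """Expand an idset such as "0-2" or "0,2-3" into a set of integers."""
    ids = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def missing_ids(expected, discovered):
    """Return the sorted ids present in `expected` but absent from `discovered`."""
    return sorted(parse_idset(expected) - parse_idset(discovered))


if __name__ == "__main__":
    # R from the top level assigns gpu "0-2" on every rank.
    # Assumption: the nested instance discovered only gpu 0 and 1.
    print(missing_ids("0-2", "0-1"))  # prints [2], reported as "gpu2"
```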

dongahn · 2020-11-26T05:26:52Z

Interestingly enough, the nested allocation seems to miss a GPU when it is given 20 or fewer cores, where 20 is the number of cores on a socket.

rzansel61{dahn}32: flux mini alloc -n4 -N4 -c21 -g3
2020-11-26T05:22:49.937768Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu3
2020-11-26T05:22:51.539299Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu3
2020-11-26T05:22:51.541808Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu3
2020-11-26T05:22:52.452309Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu3
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       84       12
 allocated      0        0        0
      down      0        0        0
exit
rzansel61{dahn}34: flux mini alloc -n4 -N4 -c18 -g3
2020-11-26T05:23:35.357347Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu[2-3]
2020-11-26T05:23:36.964271Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu[2-3]
2020-11-26T05:23:36.970715Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu[2-3]
2020-11-26T05:23:37.891596Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu[2-3]
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       72        8
 allocated      0        0        0
      down      0        0        0

dongahn · 2020-11-26T06:16:34Z

I think I found the problem. With the process binding done at the top level, it appears that one socket is filtered out for the nested instance, and a GPU attached to that socket is filtered out along with it.

@grondo or @SteVwonder: do you think it is possible to not filter the socket when a GPU on it is allocated (that is, even if no core has been allocated from that socket)?

rzansel61{dahn}25: flux mini alloc -n1 -c20 -g3
2020-11-26T06:09:56.587278Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu[2-3]
node visited
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
rzansel49{dahn}21: exit
exit
[detached: session exiting]
rzansel61{dahn}26: flux mini alloc -n1 -c21 -g3
2020-11-26T06:10:14.683571Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu3
node visited
numanode visited
socket visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3

grondo · 2020-11-26T15:43:42Z

I fear I am not an hwloc expert. Currently we call hwloc_topology_restrict() with the current allowed cpuset and pass no flags. The docs state:

Topology topology is modified so as to remove all objects that are not included (or partially included) in the CPU set cpuset. All objects CPU and node sets are restricted accordingly.

Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?
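
To make the two behaviors concrete, here is a toy Python model, not hwloc itself. It prunes a made-up two-socket topology the way the quoted documentation describes, and with `adapt_io=True` it mimics the documented intent of HWLOC_RESTRICT_FLAG_ADAPT_IO by moving I/O objects to the nearest surviving ancestor. The `Node` class, the function names, and the layout (20 CPUs and 2 GPUs per socket) are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "machine", "socket", or "gpu"
    name: str
    cpus: frozenset = frozenset()  # CPUs covered; empty for I/O objects
    children: list = field(default_factory=list)


def io_descendants(node):
    """Collect every GPU at or below `node`, in depth-first order."""
    if node.kind == "gpu":
        return [node]
    found = []
    for child in node.children:
        found.extend(io_descendants(child))
    return found


def restrict(node, allowed, adapt_io=False):
    """Return (kept, orphans): the pruned copy of `node` (or None if it is
    removed) and the I/O objects that must be adopted by an ancestor."""
    if node.kind == "gpu":
        return node, []            # the fate of I/O is decided by its parent
    remaining = node.cpus & allowed
    if not remaining:
        # The CPU object is removed; its I/O goes with it unless adapt_io.
        return None, (io_descendants(node) if adapt_io else [])
    children = []
    for child in node.children:
        kept, orphans = restrict(child, allowed, adapt_io)
        if kept is not None:
            children.append(kept)
        children.extend(orphans)   # adopt I/O from removed children
    return Node(node.kind, node.name, remaining, children), []


def example_machine():
    """Hypothetical node: 2 sockets, each with 20 CPUs and 2 GPUs."""
    s0 = Node("socket", "socket0", frozenset(range(0, 20)),
              [Node("gpu", "gpu0"), Node("gpu", "gpu1")])
    s1 = Node("socket", "socket1", frozenset(range(20, 40)),
              [Node("gpu", "gpu2"), Node("gpu", "gpu3")])
    return Node("machine", "node", frozenset(range(40)), [s0, s1])


if __name__ == "__main__":
    bound = frozenset(range(20))   # process bound to socket0 only
    plain, _ = restrict(example_machine(), bound)
    adapted, _ = restrict(example_machine(), bound, adapt_io=True)
    print([g.name for g in io_descendants(plain)])    # ['gpu0', 'gpu1']
    print([g.name for g in io_descendants(adapted)])  # ['gpu0', 'gpu1', 'gpu2', 'gpu3']
```

In the model, a binding that covers only one socket loses the other socket's GPUs unless they are re-parented, which is the effect being discussed here.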

dongahn · 2020-11-26T20:34:00Z

I just confirmed that in this case, flux-core doesn't export the missing GPU in hwloc mode.

rzansel61{dahn}106: cat nest.form.xml | grep -i coproc
            <info name="CoProcType" value="CUDA"/>
            <info name="CoProcType" value="CUDA"/>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
    <page_type size="65536" count="0"/>
    <page_type size="2097152" count="0"/>
    <page_type size="1073741824" count="0"/>
    <info name="PlatformName" value="PowerNV"/>
    <info name="PlatformModel" value="PowerNV 8335-GTW"/>
    <info name="Backend" value="Linux"/>
    <info name="LinuxCgroup" value="/allocation_599328"/>
    <info name="OSName" value="Linux"/>
    <info name="OSRelease" value="4.14.0-115.21.2.1chaos.ch6a.ppc64le"/>
    <info name="OSVersion" value="#1 SMP Fri May 22 11:01:06 PDT 2020"/>
    <info name="HostName" value="rzansel18"/>
    <info name="Architecture" value="ppc64le"/>
    <info name="hwlocVersion" value="1.11.10"/>
    <info name="ProcessName" value="broker"/>
    <object type="NUMANode" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100" local_memory="137166913536">
      <page_type size="65536" count="2093001"/>
      <page_type size="2097152" count="0"/>
      <page_type size="1073741824" count="0"/>
      <object type="Package" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
        <info name="CPUModel" value="POWER9, altivec supported"/>
        <info name="CPURevision" value="2.1 (pvr 004e 1201)"/>
        <object type="Core" os_index="2140" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
          <object type="PU" os_index="172" cpuset="0x00001000,,,,,0x0" complete_cpuset="0x00001000,,,,,0x0" online_cpuset="0x00001000,,,,,0x0" allowed_cpuset="0x00001000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="173" cpuset="0x00002000,,,,,0x0" complete_cpuset="0x00002000,,,,,0x0" online_cpuset="0x00002000,,,,,0x0" allowed_cpuset="0x00002000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="174" cpuset="0x00004000,,,,,0x0" complete_cpuset="0x00004000,,,,,0x0" online_cpuset="0x00004000,,,,,0x0" allowed_cpuset="0x00004000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="175" cpuset="0x00008000,,,,,0x0" complete_cpuset="0x00008000,,,,,0x0" online_cpuset="0x00008000,,,,,0x0" allowed_cpuset="0x00008000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
        </object>
      </object>
      <object type="Bridge" os_index="9" bridge_type="0-1" depth="0" bridge_pci="0033:[00-01]">
        <object type="PCIDev" os_index="53481472" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.0" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi2" osdev_type="2">
            <info name="Address" value="20:00:15:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d2"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_2" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd2"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xc1"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd2"/>
          </object>
        </object>
        <object type="PCIDev" os_index="53481473" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.1" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi3" osdev_type="2">
            <info name="Address" value="20:00:1d:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d3"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_3" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd3"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xeb"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd3"/>
          </object>
        </object>
      </object>
      <object type="Bridge" os_index="11" bridge_type="0-1" depth="0" bridge_pci="0035:[00-09]">
        <object type="PCIDev" os_index="55586816" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:03:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="card3" osdev_type="1"/>
          <object type="OSDev" name="renderD130" osdev_type="1"/>
          <object type="OSDev" name="cuda1" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml2" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618037406"/>
            <info name="NVIDIAUUID" value="GPU-20e492d3-d7e0-c6a3-08c7-edbd8ca6065e"/>
          </object>
        </object>
        <object type="PCIDev" os_index="55590912" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="renderD131" osdev_type="1"/>
          <object type="OSDev" name="card4" osdev_type="1"/>
          <object type="OSDev" name="cuda2" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml3" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618038035"/>
            <info name="NVIDIAUUID" value="GPU-0ec95f76-d8a3-8fc9-866f-c3bd783e1484"/>
          </object>
        </object>
      </object>
    </object>
  </object>
</topology>
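
A note on reading the cpuset values in the XML above: hwloc writes a bitmap as comma-separated 32-bit hexadecimal words, most significant first, and an empty field stands for a zero word. The Python sketch below decodes such a string. The helper name `decode_hwloc_bitmap` is made up for illustration and is not part of hwloc or flux-core, and it handles finite bitmaps only.

```python
def decode_hwloc_bitmap(text):
    """Return the sorted bit indexes set in an hwloc bitmap string."""
    bits = []
    # The last field holds bits 0-31, the one before it bits 32-63, and so on.
    for position, word in enumerate(reversed(text.split(","))):
        value = int(word, 16) if word else 0  # empty field means zero
        for bit in range(32):
            if value & (1 << bit):
                bits.append(position * 32 + bit)
    return bits


if __name__ == "__main__":
    # Machine cpuset from the XML dump above.
    print(decode_hwloc_bitmap("0x0000f000,,,,,0x0"))  # [172, 173, 174, 175]
```

The decoded set matches the os_index values of the four PU objects in the dump, so this broker's topology was restricted to a single core.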

dongahn · 2020-11-26T20:35:07Z

@grondo wrote: "Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?"

Neither flag seems to work.

dongahn · 2020-11-28T20:13:02Z

Making the flag change at both restrict call sites seems to include the GPU, but with an hwloc warning:

rzansel61{dahn}37: git diff
diff --git a/src/common/librlist/rhwloc.c b/src/common/librlist/rhwloc.c
index da5c92278..da1d54848 100644
--- a/src/common/librlist/rhwloc.c
+++ b/src/common/librlist/rhwloc.c
@@ -81,7 +81,7 @@ hwloc_topology_t rhwloc_local_topology_load (void)
     if (!(rset = hwloc_bitmap_alloc ())
         || (hwloc_get_cpubind (topo, rset, HWLOC_CPUBIND_PROCESS) < 0))
         goto err;
-    if (hwloc_topology_restrict (topo, rset, 0) < 0)
+    if (hwloc_topology_restrict (topo, rset, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         goto err;
     hwloc_bitmap_free (rset);
     return (topo);
diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 4537dcdac..01621d1e3 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -32,7 +32,8 @@ struct shell_affinity {
  */
 static int topology_restrict (hwloc_topology_t topo, hwloc_cpuset_t set)
 {
-    if (hwloc_topology_restrict (topo, set, 0) < 0)
+    if (hwloc_topology_restrict (topo, set, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         return (-1);
     return (0);
 }

rzansel32{dahn}25: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 1 --bind=none --smpiargs="-disable_gpu_hooks" bin/flux start
node visited
numanode visited
socket visited
gpu visited: 0
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
flux mini alloc -N 1 -n1 -c 1 -g3 flux resource list
0.092s: flux-shell[0]: Jobspec does not contain data-staging attributes. No staging necessary.
****************************************************************************
* hwloc has encountered an out-of-order XML topology load.
* Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
* was inserted after object HostBridge with none and none.
* The error occured in hwloc 1.11.10 inside process `broker', while
* the input XML was generated by hwloc 1.11.10 inside process `broker'.
* Please check that your input topology XML file is valid.
****************************************************************************
2020-11-28T20:10:18.367473Z resource.err[0]: verify: rank 0 (rzansel32) missing resources: core0,gpu[1-3]
node visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
     STATE NNODES   NCORES    NGPUS
      free      1        1        3
 allocated      0        0        0
      down      0        0        0

dongahn · 2020-11-28T23:24:56Z

hwloc has encountered an out-of-order XML topology load.
Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
was inserted after object HostBridge with none and none.
The error occured in hwloc 1.11.10 inside process `broker', while
the input XML was generated by hwloc 1.11.10 inside process `broker'.
Please check that your input topology XML file is valid.

It turns out hwloc prints this message when it loads a restricted hwloc XML that was created with HWLOC_RESTRICT_FLAG_ADAPT_IO. While the flag seems to solve the problem of missing GPUs, it didn't feel right to print this error message in exchange. I did confirm that the change doesn't cause any test failures, though.

We probably want to bounce this off the hwloc team first before doing anything with it.

SteVwonder · 2020-12-21T04:21:15Z

@dongahn: after re-reading flux-framework/flux-sched#658, it looks like this is only a limitation of hwloc 1.x (at least for the case documented in that issue). I wonder if your use case would also be handled correctly by hwloc 2.x+.

dongahn transferred this issue from flux-framework/flux-sched Nov 26, 2020

dongahn changed the title ~~nested hwloc reader miss one GPU~~ flux-core filters out an allocated GPU Nov 26, 2020

dongahn mentioned this issue Nov 28, 2020: Add resource Id remapping support flux-framework/flux-sched#773 (Merged)

garlick mentioned this issue Nov 30, 2020: resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu[2-3] #3374 (Closed)

dongahn mentioned this issue Dec 1, 2020: Document the limitations of hwloc reader flux-framework/flux-docs#79 (Open)

dongahn mentioned this issue Dec 19, 2020: Hwloc 1.x detection of GPUs depends on CPU binding flux-framework/flux-sched#658 (Closed)

dongahn mentioned this issue May 20, 2021: resource: prints missing gpu in nested instances #3668 (Closed)
