flux-core filters out an allocated GPU #3375

Open
dongahn opened this issue Nov 26, 2020 · 8 comments

Comments

@dongahn
Member

dongahn commented Nov 26, 2020

Top level

rzansel16{dahn}28: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 4 --bind=none --smpiargs="-disable_gpu_hooks" ./blueos_3_ppc64le_ib_p9/bin/flux start

From a proxy

rzansel61{dahn}27: flux mini alloc -n4 -N4 -c20 -g3
2020-11-26T05:05:44.912932Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu2
2020-11-26T05:05:46.531492Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu2
2020-11-26T05:05:46.532046Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu2
2020-11-26T05:05:47.439508Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu2
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       80        8
 allocated      0        0        0
      down      0        0        0
rzansel16{dahn}23: echo $CUDA_VISIBLE_DEVICES
0,1,2

From the top level

flux job info fBAY1gobV R
{"version": 1, "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-19", "gpu": "0-2"}}], "nodelist": ["rzansel[16,18,47,49]"], "starttime": 1606367143, "expiration": 1606971943}}

Apparently, the top-level scheduler creates the correct resource set, but the nested instance fails to discover one GPU on each rank.
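To make the mismatch concrete, the R object above can be checked by hand: ranks "0-3" with gpu "0-2" means 4 x 3 = 12 GPUs were granted, while `flux resource list` in the nested instance reports 8. Below is a minimal sketch of that arithmetic. `idset_count` is a hypothetical helper written for this comment, not a flux-core API, and it assumes only the plain comma/range form seen in R_lite (no brackets).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: count the ids named by a plain idset string such as
 * "0-2" or "0,2-3". Returns -1 if the string is malformed. */
int idset_count(const char *s)
{
    int total = 0;

    assert(s != NULL);
    while (*s) {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi;

        if (end == s || lo < 0)
            return -1;              /* no digits, or a negative id */
        hi = lo;
        if (*end == '-') {          /* range form "lo-hi" */
            const char *p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        total += (int)(hi - lo + 1);
        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;          /* trailing comma */
        } else if (*end != '\0') {
            return -1;              /* unexpected character */
        }
        s = end;
    }
    return total;
}
```

With the values from R, `idset_count("0-3") * idset_count("0-19")` gives the 80 cores that the nested instance does report, and `idset_count("0-3") * idset_count("0-2")` gives the 12 GPUs it should report.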

@dongahn
Member Author

dongahn commented Nov 26, 2020

Interestingly enough, the nested allocation seems to miss the GPU when it is allocated 20 or fewer cores per node, 20 being the number of cores on a socket.

rzansel61{dahn}32: flux mini alloc -n4 -N4 -c21 -g3
2020-11-26T05:22:49.937768Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu3
2020-11-26T05:22:51.539299Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu3
2020-11-26T05:22:51.541808Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu3
2020-11-26T05:22:52.452309Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu3
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       84       12
 allocated      0        0        0
      down      0        0        0
exit
rzansel61{dahn}34: flux mini alloc -n4 -N4 -c18 -g3
2020-11-26T05:23:35.357347Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu[2-3]
2020-11-26T05:23:36.964271Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu[2-3]
2020-11-26T05:23:36.970715Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu[2-3]
2020-11-26T05:23:37.891596Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu[2-3]
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       72        8
 allocated      0        0        0
      down      0        0        0

@dongahn
Member Author

dongahn commented Nov 26, 2020

I think I found the problem. With the process binding done at the top level, it appears that one socket is filtered out for the nested instance, and a GPU attached to that socket is filtered out along with it.

@grondo or @SteVwonder: do you think it is possible to not filter the socket when a GPU on it is allocated (that is, even if no core has been allocated from that socket)?

rzansel61{dahn}25: flux mini alloc -n1 -c20 -g3
2020-11-26T06:09:56.587278Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu[2-3]
node visited
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
rzansel49{dahn}21: exit
exit
[detached: session exiting]
rzansel61{dahn}26: flux mini alloc -n1 -c21 -g3
2020-11-26T06:10:14.683571Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu3
node visited
numanode visited
socket visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
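The filtering described in this comment can be modeled with a small toy, independent of hwloc. This is only a sketch under stated assumptions: two sockets per node, 20 cores and 2 GPUs per socket, GPUs numbered socket by socket, and the scheduler handing out the lowest-numbered cores first. All names and constants here are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed node layout (illustrative constants, not read from hwloc). */
enum { NSOCKETS = 2, CORES_PER_SOCKET = 20, GPUS_PER_SOCKET = 2 };

/* Bitmask with cores 0..n-1 set. */
uint64_t first_n_cores(int n)
{
    assert(n >= 0 && n <= NSOCKETS * CORES_PER_SOCKET);
    return n == 0 ? 0 : (UINT64_MAX >> (64 - n));
}

/* Bitmask of the cores that belong to the given socket. */
uint64_t socket_cores(int socket)
{
    assert(socket >= 0 && socket < NSOCKETS);
    return first_n_cores(CORES_PER_SOCKET) << (socket * CORES_PER_SOCKET);
}

/* Count the allocated GPUs that remain visible when every socket without
 * an allowed core is dropped together with the GPUs beneath it. */
int surviving_gpus(uint64_t allowed_cores, unsigned allocated_gpus)
{
    int count = 0;

    for (int gpu = 0; gpu < NSOCKETS * GPUS_PER_SOCKET; gpu++) {
        int socket = gpu / GPUS_PER_SOCKET;
        int allocated = (allocated_gpus >> gpu) & 1u;
        int socket_kept = (allowed_cores & socket_cores(socket)) != 0;

        if (allocated && socket_kept)
            count++;
    }
    return count;
}
```

With GPUs 0-2 allocated (mask `0x7`), the model keeps 2 GPUs per node for 18 or 20 cores and 3 GPUs per node for 21 cores. Multiplied by 4 nodes, that is 8, 8 and 12, which agrees with the NGPUS column in the `flux resource list` outputs earlier in this thread.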

@grondo
Contributor

grondo commented Nov 26, 2020

I fear I am not an hwloc expert. Currently we call hwloc_topology_restrict() with the current allowed cpuset and pass no flags. The docs state:

Topology topology is modified so as to remove all objects that are not included (or partially included) in the CPU set cpuset. All objects CPU and node sets are restricted accordingly.

Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?

dongahn transferred this issue from flux-framework/flux-sched Nov 26, 2020
dongahn changed the title from "nested hwloc reader miss one GPU" to "flux-core filters out an allocated GPU" Nov 26, 2020
@dongahn
Member Author

dongahn commented Nov 26, 2020

I just confirmed that in this case, flux-core doesn't export the missing GPU in hwloc mode.

rzansel61{dahn}106: cat nest.form.xml | grep -i coproc
            <info name="CoProcType" value="CUDA"/>
            <info name="CoProcType" value="CUDA"/>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
    <page_type size="65536" count="0"/>
    <page_type size="2097152" count="0"/>
    <page_type size="1073741824" count="0"/>
    <info name="PlatformName" value="PowerNV"/>
    <info name="PlatformModel" value="PowerNV 8335-GTW"/>
    <info name="Backend" value="Linux"/>
    <info name="LinuxCgroup" value="/allocation_599328"/>
    <info name="OSName" value="Linux"/>
    <info name="OSRelease" value="4.14.0-115.21.2.1chaos.ch6a.ppc64le"/>
    <info name="OSVersion" value="#1 SMP Fri May 22 11:01:06 PDT 2020"/>
    <info name="HostName" value="rzansel18"/>
    <info name="Architecture" value="ppc64le"/>
    <info name="hwlocVersion" value="1.11.10"/>
    <info name="ProcessName" value="broker"/>
    <object type="NUMANode" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100" local_memory="137166913536">
      <page_type size="65536" count="2093001"/>
      <page_type size="2097152" count="0"/>
      <page_type size="1073741824" count="0"/>
      <object type="Package" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
        <info name="CPUModel" value="POWER9, altivec supported"/>
        <info name="CPURevision" value="2.1 (pvr 004e 1201)"/>
        <object type="Core" os_index="2140" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
          <object type="PU" os_index="172" cpuset="0x00001000,,,,,0x0" complete_cpuset="0x00001000,,,,,0x0" online_cpuset="0x00001000,,,,,0x0" allowed_cpuset="0x00001000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="173" cpuset="0x00002000,,,,,0x0" complete_cpuset="0x00002000,,,,,0x0" online_cpuset="0x00002000,,,,,0x0" allowed_cpuset="0x00002000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="174" cpuset="0x00004000,,,,,0x0" complete_cpuset="0x00004000,,,,,0x0" online_cpuset="0x00004000,,,,,0x0" allowed_cpuset="0x00004000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="175" cpuset="0x00008000,,,,,0x0" complete_cpuset="0x00008000,,,,,0x0" online_cpuset="0x00008000,,,,,0x0" allowed_cpuset="0x00008000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
        </object>
      </object>
      <object type="Bridge" os_index="9" bridge_type="0-1" depth="0" bridge_pci="0033:[00-01]">
        <object type="PCIDev" os_index="53481472" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.0" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi2" osdev_type="2">
            <info name="Address" value="20:00:15:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d2"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_2" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd2"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xc1"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd2"/>
          </object>
        </object>
        <object type="PCIDev" os_index="53481473" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.1" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi3" osdev_type="2">
            <info name="Address" value="20:00:1d:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d3"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_3" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd3"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xeb"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd3"/>
          </object>
        </object>
      </object>
      <object type="Bridge" os_index="11" bridge_type="0-1" depth="0" bridge_pci="0035:[00-09]">
        <object type="PCIDev" os_index="55586816" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:03:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="card3" osdev_type="1"/>
          <object type="OSDev" name="renderD130" osdev_type="1"/>
          <object type="OSDev" name="cuda1" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml2" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618037406"/>
            <info name="NVIDIAUUID" value="GPU-20e492d3-d7e0-c6a3-08c7-edbd8ca6065e"/>
          </object>
        </object>
        <object type="PCIDev" os_index="55590912" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="renderD131" osdev_type="1"/>
          <object type="OSDev" name="card4" osdev_type="1"/>
          <object type="OSDev" name="cuda2" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml3" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618038035"/>
            <info name="NVIDIAUUID" value="GPU-0ec95f76-d8a3-8fc9-866f-c3bd783e1484"/>
          </object>
        </object>
      </object>
    </object>
  </object>
</topology>
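For anyone reading the cpuset strings in the dump above: they appear to be comma-separated 32-bit hex words with the most significant word first and empty fields standing for zero, which is consistent with the PU entries (os_index 172 carries `0x00001000,,,,,0x0`, i.e. bit 12 of word 5, and 5 * 32 + 12 = 172). A minimal decoding sketch under that assumption follows. `cpuset_decode` is a hypothetical helper, not an hwloc function, and it does not handle infinite sets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: decode a finite hwloc-style cpuset string.
 * Returns the number of set bits and stores the lowest and highest set
 * index in *first and *last (-1 if the set is empty).
 * Returns -1 if the string is malformed. */
int cpuset_decode(const char *s, int *first, int *last)
{
    int nwords = 1;
    int count = 0;

    assert(s != NULL && first != NULL && last != NULL);
    for (const char *p = s; *p; p++)
        if (*p == ',')
            nwords++;
    *first = *last = -1;

    /* Words are listed most significant first. */
    for (int word = nwords - 1; word >= 0; word--) {
        unsigned long bits = 0;

        if (*s != ',' && *s != '\0') {      /* empty field means zero */
            char *end;
            bits = strtoul(s, &end, 16);
            if (end == s || bits > 0xffffffffUL)
                return -1;
            s = end;
        }
        for (int b = 31; b >= 0; b--) {
            if (bits & (1UL << b)) {
                int idx = word * 32 + b;
                if (*last < 0)
                    *last = idx;            /* first bit seen is the highest */
                *first = idx;               /* last bit seen is the lowest */
                count++;
            }
        }
        if (*s == ',')
            s++;
        else if (*s != '\0')
            return -1;
    }
    return count;
}
```

Applied to the Machine cpuset `0x0000f000,,,,,0x0`, this yields 4 PUs spanning indices 172 to 175, matching the four PU objects under the single remaining Core.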

@dongahn
Member Author

dongahn commented Nov 26, 2020

> Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?

Neither flag seems to work.

@dongahn
Member Author

dongahn commented Nov 28, 2020

Making the flag change at both restrict call sites seems to include the GPUs, but hwloc prints a warning:

rzansel61{dahn}37: git diff
diff --git a/src/common/librlist/rhwloc.c b/src/common/librlist/rhwloc.c
index da5c92278..da1d54848 100644
--- a/src/common/librlist/rhwloc.c
+++ b/src/common/librlist/rhwloc.c
@@ -81,7 +81,7 @@ hwloc_topology_t rhwloc_local_topology_load (void)
     if (!(rset = hwloc_bitmap_alloc ())
         || (hwloc_get_cpubind (topo, rset, HWLOC_CPUBIND_PROCESS) < 0))
         goto err;
-    if (hwloc_topology_restrict (topo, rset, 0) < 0)
+    if (hwloc_topology_restrict (topo, rset, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         goto err;
     hwloc_bitmap_free (rset);
     return (topo);
diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 4537dcdac..01621d1e3 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -32,7 +32,8 @@ struct shell_affinity {
  */
 static int topology_restrict (hwloc_topology_t topo, hwloc_cpuset_t set)
 {
-    if (hwloc_topology_restrict (topo, set, 0) < 0)
+    if (hwloc_topology_restrict (topo, set, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         return (-1);
     return (0);
 }
rzansel32{dahn}25: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 1 --bind=none --smpiargs="-disable_gpu_hooks" bin/flux start
node visited
numanode visited
socket visited
gpu visited: 0
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
flux mini alloc -N 1 -n1 -c 1 -g3 flux resource list
0.092s: flux-shell[0]: Jobspec does not contain data-staging attributes. No staging necessary.
****************************************************************************
* hwloc has encountered an out-of-order XML topology load.
* Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
* was inserted after object HostBridge with none and none.
* The error occured in hwloc 1.11.10 inside process `broker', while
* the input XML was generated by hwloc 1.11.10 inside process `broker'.
* Please check that your input topology XML file is valid.
****************************************************************************
2020-11-28T20:10:18.367473Z resource.err[0]: verify: rank 0 (rzansel32) missing resources: core0,gpu[1-3]
node visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
     STATE NNODES   NCORES    NGPUS
      free      1        1        3
 allocated      0        0        0
      down      0        0        0

@dongahn
Member Author

dongahn commented Nov 28, 2020


* hwloc has encountered an out-of-order XML topology load.
* Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
* was inserted after object HostBridge with none and none.
* The error occured in hwloc 1.11.10 inside process `broker', while
* the input XML was generated by hwloc 1.11.10 inside process `broker'.
* Please check that your input topology XML file is valid.

It turns out hwloc prints this message when it loads a restricted hwloc XML created with HWLOC_RESTRICT_FLAG_ADAPT_IO. While the flag seems to solve the problem of the missing GPUs, it doesn't feel right to print this error message in exchange. I did confirm that this change doesn't cause any test failures, though.

We probably want to bounce this off the hwloc team before doing anything with it.

@SteVwonder
Member

@dongahn: after re-reading flux-framework/flux-sched#658, it looks like this is only a limitation of hwloc 1.x (at least for the case documented in that issue). I wonder if your use case would also be handled correctly by hwloc 2.x+.
