flux-core filters out an allocated GPU #3375
Comments
Interestingly enough, the nested allocation seems to miss the GPU when it is allocated 20 or fewer cores, which is equal to the number of cores on a socket.
I think I found the problem. With the process binding done at the top level, it appears that one socket is filtered out for the nested instance in such a way that one GPU is also filtered out. @grondo or @SteVwonder: do you think it is possible to not filter the socket when a GPU on it is allocated (that is, even if no core has been allocated from that socket)?
rzansel61{dahn}25: flux mini alloc -n1 -c20 -g3
2020-11-26T06:09:56.587278Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu[2-3]
node visited
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
rzansel49{dahn}21: exit
exit
[detached: session exiting]
rzansel61{dahn}26: flux mini alloc -n1 -c21 -g3
2020-11-26T06:10:14.683571Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu3
node visited
numanode visited
socket visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
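The filtering described above can be sketched as a toy model. This is only an illustration under stated assumptions: `toy_socket`, `toy_surviving_gpus`, and the `keep_gpu_sockets` switch are hypothetical names, not hwloc or flux-core APIs, and the two-socket layout with 20 cores and two GPUs per socket is assumed to resemble the node in the logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model; these are not hwloc or flux-core structures. */
struct toy_socket {
    uint64_t coremask;  /* bit i set => core i belongs to this socket */
    uint32_t gpumask;   /* bit g set => GPU g is attached to this socket */
};

/*
 * Return the subset of `allocated_gpus` that is still discoverable after
 * the topology is restricted to the cores in `allowed`.
 *
 * A socket with no allowed core is dropped together with its GPUs.
 * If `keep_gpu_sockets` is true, a socket hosting an allocated GPU is
 * kept anyway (the behaviour asked about in the comment above).
 */
uint32_t toy_surviving_gpus (const struct toy_socket *s, size_t n,
                             uint64_t allowed, uint32_t allocated_gpus,
                             bool keep_gpu_sockets)
{
    uint32_t found = 0;

    assert (s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool has_core = (s[i].coremask & allowed) != 0;
        bool has_alloc_gpu = (s[i].gpumask & allocated_gpus) != 0;

        if (has_core || (keep_gpu_sockets && has_alloc_gpu))
            found |= s[i].gpumask & allocated_gpus;
    }
    return found;
}
```

In this model, with cores allowed only on the second socket and GPUs 1-3 allocated, only GPUs 2 and 3 survive. Allowing a single core on the first socket, or enabling `keep_gpu_sockets`, brings GPU 1 back.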
I fear I am not an hwloc expert. Currently we call
Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during
I just confirmed that in this case, flux-core doesn't export the missing GPU in hwloc mode.
rzansel61{dahn}106: cat nest.form.xml | grep -i coproc
<info name="CoProcType" value="CUDA"/>
<info name="CoProcType" value="CUDA"/>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
<object type="Machine" os_index="0" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
<page_type size="65536" count="0"/>
<page_type size="2097152" count="0"/>
<page_type size="1073741824" count="0"/>
<info name="PlatformName" value="PowerNV"/>
<info name="PlatformModel" value="PowerNV 8335-GTW"/>
<info name="Backend" value="Linux"/>
<info name="LinuxCgroup" value="/allocation_599328"/>
<info name="OSName" value="Linux"/>
<info name="OSRelease" value="4.14.0-115.21.2.1chaos.ch6a.ppc64le"/>
<info name="OSVersion" value="#1 SMP Fri May 22 11:01:06 PDT 2020"/>
<info name="HostName" value="rzansel18"/>
<info name="Architecture" value="ppc64le"/>
<info name="hwlocVersion" value="1.11.10"/>
<info name="ProcessName" value="broker"/>
<object type="NUMANode" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100" local_memory="137166913536">
<page_type size="65536" count="2093001"/>
<page_type size="2097152" count="0"/>
<page_type size="1073741824" count="0"/>
<object type="Package" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
<info name="CPUModel" value="POWER9, altivec supported"/>
<info name="CPURevision" value="2.1 (pvr 004e 1201)"/>
<object type="Core" os_index="2140" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
<object type="PU" os_index="172" cpuset="0x00001000,,,,,0x0" complete_cpuset="0x00001000,,,,,0x0" online_cpuset="0x00001000,,,,,0x0" allowed_cpuset="0x00001000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
<object type="PU" os_index="173" cpuset="0x00002000,,,,,0x0" complete_cpuset="0x00002000,,,,,0x0" online_cpuset="0x00002000,,,,,0x0" allowed_cpuset="0x00002000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
<object type="PU" os_index="174" cpuset="0x00004000,,,,,0x0" complete_cpuset="0x00004000,,,,,0x0" online_cpuset="0x00004000,,,,,0x0" allowed_cpuset="0x00004000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
<object type="PU" os_index="175" cpuset="0x00008000,,,,,0x0" complete_cpuset="0x00008000,,,,,0x0" online_cpuset="0x00008000,,,,,0x0" allowed_cpuset="0x00008000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
</object>
</object>
<object type="Bridge" os_index="9" bridge_type="0-1" depth="0" bridge_pci="0033:[00-01]">
<object type="PCIDev" os_index="53481472" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.0" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
<info name="PCIVendor" value="Mellanox Technologies"/>
<info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
<object type="OSDev" name="hsi2" osdev_type="2">
<info name="Address" value="20:00:15:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d2"/>
<info name="Port" value="1"/>
</object>
<object type="OSDev" name="mlx5_2" osdev_type="3">
<info name="NodeGUID" value="ec0d:9a03:00ca:bbd2"/>
<info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
<info name="Port1State" value="4"/>
<info name="Port1LID" value="0xc1"/>
<info name="Port1LMC" value="0"/>
<info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd2"/>
</object>
</object>
<object type="PCIDev" os_index="53481473" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.1" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
<info name="PCIVendor" value="Mellanox Technologies"/>
<info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
<object type="OSDev" name="hsi3" osdev_type="2">
<info name="Address" value="20:00:1d:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d3"/>
<info name="Port" value="1"/>
</object>
<object type="OSDev" name="mlx5_3" osdev_type="3">
<info name="NodeGUID" value="ec0d:9a03:00ca:bbd3"/>
<info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
<info name="Port1State" value="4"/>
<info name="Port1LID" value="0xeb"/>
<info name="Port1LMC" value="0"/>
<info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd3"/>
</object>
</object>
</object>
<object type="Bridge" os_index="11" bridge_type="0-1" depth="0" bridge_pci="0035:[00-09]">
<object type="PCIDev" os_index="55586816" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:03:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
<info name="PCIVendor" value="NVIDIA Corporation"/>
<info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
<object type="OSDev" name="card3" osdev_type="1"/>
<object type="OSDev" name="renderD130" osdev_type="1"/>
<object type="OSDev" name="cuda1" osdev_type="5">
<info name="CoProcType" value="CUDA"/>
<info name="Backend" value="CUDA"/>
<info name="GPUVendor" value="NVIDIA Corporation"/>
<info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
<info name="CUDAGlobalMemorySize" value="16515072"/>
<info name="CUDAL2CacheSize" value="6144"/>
<info name="CUDAMultiProcessors" value="80"/>
<info name="CUDACoresPerMP" value="64"/>
<info name="CUDASharedMemorySizePerMP" value="48"/>
</object>
<object type="OSDev" name="nvml2" osdev_type="1">
<info name="Backend" value="NVML"/>
<info name="GPUVendor" value="NVIDIA Corporation"/>
<info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
<info name="NVIDIASerial" value="0320618037406"/>
<info name="NVIDIAUUID" value="GPU-20e492d3-d7e0-c6a3-08c7-edbd8ca6065e"/>
</object>
</object>
<object type="PCIDev" os_index="55590912" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
<info name="PCIVendor" value="NVIDIA Corporation"/>
<info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
<object type="OSDev" name="renderD131" osdev_type="1"/>
<object type="OSDev" name="card4" osdev_type="1"/>
<object type="OSDev" name="cuda2" osdev_type="5">
<info name="CoProcType" value="CUDA"/>
<info name="Backend" value="CUDA"/>
<info name="GPUVendor" value="NVIDIA Corporation"/>
<info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
<info name="CUDAGlobalMemorySize" value="16515072"/>
<info name="CUDAL2CacheSize" value="6144"/>
<info name="CUDAMultiProcessors" value="80"/>
<info name="CUDACoresPerMP" value="64"/>
<info name="CUDASharedMemorySizePerMP" value="48"/>
</object>
<object type="OSDev" name="nvml3" osdev_type="1">
<info name="Backend" value="NVML"/>
<info name="GPUVendor" value="NVIDIA Corporation"/>
<info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
<info name="NVIDIASerial" value="0320618038035"/>
<info name="NVIDIAUUID" value="GPU-0ec95f76-d8a3-8fc9-866f-c3bd783e1484"/>
</object>
</object>
</object>
</object>
</object>
</topology>
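The `grep` check above can also be done programmatically, for example in a regression test. A minimal sketch, assuming a plain substring match on the exported XML is good enough (it is not an XML parser); `count_occurrences` and `count_cuda_coprocs` are hypothetical helpers, not part of flux-core or hwloc:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of `needle` in `haystack`. */
size_t count_occurrences (const char *haystack, const char *needle)
{
    size_t count = 0;
    size_t len;

    assert (haystack != NULL && needle != NULL);
    len = strlen (needle);
    if (len == 0)
        return 0;
    for (const char *p = strstr (haystack, needle); p != NULL;
         p = strstr (p + len, needle))
        count++;
    return count;
}

/* Number of CUDA coprocessor entries found in an hwloc XML export. */
size_t count_cuda_coprocs (const char *xml)
{
    return count_occurrences (xml,
                              "<info name=\"CoProcType\" value=\"CUDA\"/>");
}
```

Applied to the export above, this counts two CUDA entries, while the nested instance was allocated three GPUs.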
Neither flag seems to work.
Making the flag change for both
rzansel61{dahn}37: git diff
diff --git a/src/common/librlist/rhwloc.c b/src/common/librlist/rhwloc.c
index da5c92278..da1d54848 100644
--- a/src/common/librlist/rhwloc.c
+++ b/src/common/librlist/rhwloc.c
@@ -81,7 +81,7 @@ hwloc_topology_t rhwloc_local_topology_load (void)
     if (!(rset = hwloc_bitmap_alloc ())
         || (hwloc_get_cpubind (topo, rset, HWLOC_CPUBIND_PROCESS) < 0))
         goto err;
-    if (hwloc_topology_restrict (topo, rset, 0) < 0)
+    if (hwloc_topology_restrict (topo, rset, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         goto err;
     hwloc_bitmap_free (rset);
     return (topo);
diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 4537dcdac..01621d1e3 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -32,7 +32,8 @@ struct shell_affinity {
  */
 static int topology_restrict (hwloc_topology_t topo, hwloc_cpuset_t set)
 {
-    if (hwloc_topology_restrict (topo, set, 0) < 0)
+    if (hwloc_topology_restrict (topo, set, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         return (-1);
     return (0);
 }
rzansel32{dahn}25: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 1 --bind=none --smpiargs="-disable_gpu_hooks" bin/flux start
node visited
numanode visited
socket visited
gpu visited: 0
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
flux mini alloc -N 1 -n1 -c 1 -g3 flux resource list
0.092s: flux-shell[0]: Jobspec does not contain data-staging attributes. No staging necessary.
****************************************************************************
* hwloc has encountered an out-of-order XML topology load.
* Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
* was inserted after object HostBridge with none and none.
* The error occured in hwloc 1.11.10 inside process `broker', while
* the input XML was generated by hwloc 1.11.10 inside process `broker'.
* Please check that your input topology XML file is valid.
****************************************************************************
2020-11-28T20:10:18.367473Z resource.err[0]: verify: rank 0 (rzansel32) missing resources: core0,gpu[1-3]
node visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
    STATE NNODES NCORES NGPUS
     free      1      1     3
allocated      0      0     0
     down      0      0     0
It turned out We probably want to bounce this to the hwloc team first before doing anything with this.
@dongahn: after re-reading flux-framework/flux-sched#658, it looks like this is only a limitation of hwloc 1.x (at least for the case documented in that issue). I wonder if your use case would also be handled correctly by hwloc 2.x+.
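One way to tell which hwloc generation produced a given XML export is to read its `hwlocVersion` info value (`1.11.10` in the topology above). A minimal sketch, where `parse_hwloc_major` is a hypothetical helper and not an hwloc function; whether 2.x actually handles this case is still an open question here:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the major version from a string such as "1.11.10",
 * or -1 if the string does not start with a non-negative integer.
 */
int parse_hwloc_major (const char *version)
{
    int major = -1;

    assert (version != NULL);
    if (sscanf (version, "%d", &major) != 1 || major < 0)
        return -1;
    return major;
}
```

A caller could use `parse_hwloc_major (value) >= 2` to decide whether a topology came from the 1.x series discussed in this thread.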
Top level
From a proxy
From the top level
Apparently, the top-level scheduler creates the correct resource set, but one of the nested instances couldn't discover one GPU.