Closed
Description
@bosilca the distgraph_test_4 test from the ibm test suite can hang depending on the machine topology.
With 4 MPI tasks, it works fine on a server with 2 sockets / 12 cores / 24 threads, but it hangs on my VM with 1 socket / 4 cores / 4 threads.
The topology inlined below can be used to reproduce the issue:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
<object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
<object type="Package" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
<object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
<object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
<object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
<object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
</object>
</object>
</topology>
Run as is:
$ mpirun -np 4 ./distgraph_test_4
pass!!
Run with the simple topology above (saved as smp.xml):
$ mpirun --mca hwloc_base_topo_file smp.xml -np 4 ./distgraph_test_4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering =========
HANG
But it works fine when the treematch module is disabled:
$ mpirun --mca hwloc_base_topo_file smp.xml --mca topo ^treematch -np 4 ./distgraph_test_4
pass!!
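The runs above can be scripted with a timeout, so a hang is reported as a result instead of a stuck terminal. This is only a minimal sketch: `classify_run` and the 60-second default are hypothetical choices of mine, and it assumes the test prints `pass!!` on success as in the output above.

```python
import subprocess
import sys


def classify_run(cmd, timeout_s=60):
    """Run cmd and return 'pass', 'fail', or 'hang'.

    'hang' means the command did not finish within timeout_s seconds.
    Note: on timeout only the launched process (e.g. mpirun) is killed;
    orphaned MPI ranks may still need manual cleanup.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "hang"
    # Assumption: a successful run exits with 0 and prints "pass!!".
    if result.returncode == 0 and "pass!!" in result.stdout:
        return "pass"
    return "fail"


# The two topology-file runs from this report, expressed as argument lists.
WITH_TREEMATCH = [
    "mpirun", "--mca", "hwloc_base_topo_file", "smp.xml",
    "-np", "4", "./distgraph_test_4",
]
WITHOUT_TREEMATCH = [
    "mpirun", "--mca", "hwloc_base_topo_file", "smp.xml",
    "--mca", "topo", "^treematch",
    "-np", "4", "./distgraph_test_4",
]
```

Calling `classify_run(WITH_TREEMATCH)` and `classify_run(WITHOUT_TREEMATCH)` from the directory that holds `smp.xml` and the test binary would then be expected to give `hang` and `pass` respectively, if the behaviour matches what is described here.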
Note: I had to push a5440ad so that the treematch module uses the topology file (otherwise hwloc uses the topology file, but treematch uses the topology of the node).
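To double-check which topology a file actually describes, the XML can be summarized independently of Open MPI. The sketch below is not part of hwloc or treematch; `summarize_topology` is a hypothetical helper, `SAMPLE` is a trimmed copy of the topology above (only `type`, `os_index` and `cpuset` kept), and it assumes each `cpuset` is a single hexadecimal word, which holds for this 4-PU file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the topology from this report (attributes reduced).
SAMPLE = """<topology>
<object type="Machine" os_index="0" cpuset="0x0000000f">
<object type="Package" os_index="0" cpuset="0x0000000f">
<object type="PU" os_index="0" cpuset="0x00000001"/>
<object type="PU" os_index="1" cpuset="0x00000002"/>
<object type="PU" os_index="2" cpuset="0x00000004"/>
<object type="PU" os_index="3" cpuset="0x00000008"/>
</object>
</object>
</topology>"""


def summarize_topology(xml_text):
    """Count PU objects and compare their cpusets with the Machine cpuset."""
    root = ET.fromstring(xml_text)
    machine = root.find("object[@type='Machine']")
    pus = [obj for obj in root.iter("object") if obj.get("type") == "PU"]
    pu_union = 0
    for pu in pus:
        # Assumption: cpuset is one hex word such as "0x00000004".
        pu_union |= int(pu.get("cpuset"), 16)
    return {
        "pu_count": len(pus),
        "machine_cpuset": int(machine.get("cpuset"), 16),
        "pu_union": pu_union,
    }


summary = summarize_topology(SAMPLE)
print(summary)
```

For this file the PU count is 4, which matches the `nb_processing units = 4` line in the treematch output, and the union of the PU cpusets equals the Machine cpuset.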
Can you please have a look at this?