treematch hang #1183

@ggouaillardet

@bosilca the distgraph_test_4 test from the ibm test suite might hang depending on the machine topology.
With 4 MPI tasks, it works fine on a server with 2 sockets / 12 cores / 24 threads, but it hangs on my VM with 1 socket / 4 cores / 4 threads.

The topology below (saved as smp.xml) can be used to reproduce the issue:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
    <object type="Package" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
      <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
    </object>
  </object>
</topology>
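To make the shape of this topology explicit, here is a minimal Python sketch that walks the XML and counts objects per level. It uses only the standard library, not hwloc itself. The embedded XML is an abbreviated copy of the file above (only `type`, `os_index` and `cpuset` are kept), and `summarize` is a hypothetical helper name introduced for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated copy of the topology above: only type, os_index and cpuset kept.
SMP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f">
    <object type="Package" os_index="0" cpuset="0x0000000f">
      <object type="PU" os_index="0" cpuset="0x00000001"/>
      <object type="PU" os_index="1" cpuset="0x00000002"/>
      <object type="PU" os_index="2" cpuset="0x00000004"/>
      <object type="PU" os_index="3" cpuset="0x00000008"/>
    </object>
  </object>
</topology>
"""


def summarize(xml_bytes):
    """Return (object counts per type, depth per type, PU masks cover Machine)."""
    root = ET.fromstring(xml_bytes)
    counts, depths = Counter(), {}
    pu_mask = 0

    def walk(node, depth):
        nonlocal pu_mask
        for child in node.findall("object"):
            kind = child.get("type")
            counts[kind] += 1
            depths.setdefault(kind, depth)
            if kind == "PU":
                # Accumulate the union of all PU cpusets.
                pu_mask |= int(child.get("cpuset"), 16)
            walk(child, depth + 1)

    walk(root, 0)
    machine_mask = int(root.find("object").get("cpuset"), 16)
    return counts, depths, pu_mask == machine_mask


if __name__ == "__main__":
    counts, depths, covered = summarize(SMP_XML)
    for kind, depth in sorted(depths.items(), key=lambda kv: kv[1]):
        print(f"depth {depth}: {counts[kind]} x {kind}")
    print("PU cpusets cover the Machine cpuset:", covered)
```

The tree has three levels (one Machine, one Package, four PUs) with no Core or cache level between the Package and the PUs. Whether that flat shape is what triggers the loop in treematch is not established here.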

Run as is:

$ mpirun -np 4 ./distgraph_test_4
pass!!

Run with the simple topology above:

$ mpirun --mca hwloc_base_topo_file smp.xml -np 4 ./distgraph_test_4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 

HANG

But it works fine when the treematch module is excluded:

$ mpirun --mca hwloc_base_topo_file smp.xml --mca topo ^treematch -np 4 ./distgraph_test_4
pass!!
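The three runs above can be scripted with a timeout, so that the hang is reported as a result instead of blocking the terminal. This is only a sketch: the command lines are copied from this report, while `run_with_timeout`, `reproduce` and the 60-second limit are assumptions made for illustration. Note that killing `mpirun` on timeout may not clean up every MPI task.

```python
import subprocess

# Command lines copied from the report; smp.xml is the topology file above.
BASE = ["-np", "4", "./distgraph_test_4"]
COMMANDS = {
    "as is": ["mpirun"] + BASE,
    "smp.xml": ["mpirun", "--mca", "hwloc_base_topo_file", "smp.xml"] + BASE,
    "smp.xml, no treematch": [
        "mpirun", "--mca", "hwloc_base_topo_file", "smp.xml",
        "--mca", "topo", "^treematch",
    ] + BASE,
}


def run_with_timeout(cmd, seconds):
    """Run cmd and classify the outcome as 'pass', 'fail' or 'hang'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if proc.returncode == 0 else "fail"


def reproduce(seconds=60):
    """Run each variant in turn and print its outcome (assumed 60 s limit)."""
    for label, cmd in COMMANDS.items():
        print(f"{label}: {run_with_timeout(cmd, seconds)}")
```

Calling `reproduce()` from the directory that contains `distgraph_test_4` and `smp.xml` runs the three variants one after another.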

Note that I had to push a5440ad so that the treematch module uses the topology file (otherwise hwloc uses the topology file, but treematch uses the topology of the node).

Can you please have a look at this?
