Description
Hello, I have a question about the design of the UCC topology. The topo structs are defined in ucc_topo.h:
typedef struct ucc_context_topo {
    ucc_proc_info_t *procs;
    ucc_rank_t       n_procs;
    ucc_rank_t       nnodes;
    ucc_rank_t       min_ppn;       /*< smallest ppn value across the nodes
                                        spanned by the group of processes
                                        defined by ucc_addr_storage_t */
    ucc_rank_t       max_ppn;       /*< biggest ppn across the nodes */
    ucc_rank_t       max_n_sockets; /*< max number of different sockets
                                        on a node */
    uint32_t         sock_bound;    /*< global flag, 1 if processes are bound
                                        to sockets */
    ucc_rank_t       max_n_numas;   /*< max number of different numa domains
                                        on a node */
    uint32_t         numa_bound;    /*< global flag, 1 if processes are bound
                                        to numa nodes */
} ucc_context_topo_t;
typedef struct ucc_addr_storage ucc_addr_storage_t;
/* This topo structure is initialized over a SUBSET of processes
from ucc_context_topo_t.
For example, if ucc_context_t is global then address exchange
is performed during ucc_context_create and we have ctx wide
ucc_addr_storage_t. So, we init ucc_context_topo_t on ucc_context.
Then, ucc_team is a subset of ucc_context mapped via team->ctx_map.
It represents a subset of ranks and we can initialize ucc_topo_t
for that subset, ie for a team. */
typedef struct ucc_topo {
    ucc_context_topo_t *topo;  /*< Cached pointer of the ctx topo */
    ucc_sbgp_t  sbgps[UCC_SBGP_LAST]; /*< LOCAL sbgps initialized on demand */
    ucc_sbgp_t *all_sockets;   /*< array of socket sbgps, init on demand */
    int         n_sockets;
    ucc_sbgp_t *all_numas;     /*< array of numa sbgps, init on demand */
    int         n_numas;
    ucc_rank_t  node_leader_rank_id; /*< defines which rank on a node will be
                                         node leader. Similar to local node rank.
                                         currently set to 0, can be selected differently
                                         in the future */
    ucc_rank_t  node_leader_rank; /*< actual rank of the node leader in the original
                                      (ucc_team) ranking */
    ucc_subset_t set;          /*< subset of procs from the ucc_context_topo.
                                   for ucc_team topo it is team->ctx_map */
    ucc_rank_t  min_ppn;       /*< min ppn across the nodes for a team */
    ucc_rank_t  max_ppn;       /*< max ppn across the nodes for a team */
    ucc_rank_t  min_socket_size; /*< min number of processes on a socket,
                                     across all nodes of a team */
    ucc_rank_t  max_socket_size; /*< max number of processes on a socket,
                                     across all nodes of a team */
    ucc_rank_t  min_numa_size; /*< min number of processes on a numa,
                                   across all nodes of a team */
    ucc_rank_t  max_numa_size; /*< max number of processes on a numa,
                                   across all nodes of a team */
} ucc_topo_t;
These structs mostly describe node-level items: processes, sockets, and NUMA domains. What are they designed for? Also, where can I find the intra-node link info (NVLink, PCIe, etc.) and the inter-node network info (RDMA, TCP/IP, etc.)?
Thanks in advance for your reply.