Why does UCC consistently insist on using software to simulate PUT/GET operations? #1232

@TroyMitchell911

I'm running two systems on a single chip. They share the same DRAM, and each has its own dedicated network interface card for wireup. I'm modifying a POSIX template to explore shared memory across nodes: I changed it to INTERNODE to enable cross-node use, and I implemented my own memory allocation functions. pt2pt currently works without problems. However, when I tested allreduce, I found that all RDMA operations (PUT/GET) were being emulated in software using AM. This appears to be decided by UCC, since UCX reports that the RMA lane is available. Why is this happening? Here is the information I can provide.
The UCX log (including debug output I added):

[1764656842.337472] [a:31893:0]           mpool.c:281  UCX  DEBUG mpool tl_ucp_req_mp: allocated chunk 0x2ae862cb44 of 6228 bytes with 8 elements
[1764656842.337930] [a:31893:0]          ucp_ep.c:408  UCX  DEBUG created ep 0x3f93ede000 to <no debug data> from api call
[1764656842.338350] [a:31893:0]          ucp_ep.c:2954 UCX  WARN    Lane 0 iface flags: PUT_SHORT=YES PUT_BCOPY=YES GET_SHORT=YES GET_BCOPY=YES
[1764656842.338376] [a:31893:0]          ucp_ep.c:2963 UCX  WARN    *************put_short: 4294967295, iface->put_max_short: 4294967295*********
[1764656842.338401] [a:31893:0]          ucp_ep.c:3011 UCX  WARN    RMA lane 0: max_put_short=4294967295, max_get_short=4294967295
[1764656842.338421] [a:31893:0]      ucp_worker.c:1892 UCX  WARN    !!!!!!!!!!!!!!!!!!!!!!!!rma_emul: 0!!!!!!!!!!!!!!!!!!!, rma_lanes_map = 1
[1764656842.338442] [a:31893:0]      ucp_worker.c:1905 UCX  INFO    UCC_UCP_CONTEXT intra-node cfg#0 tag(mytest/memory)  rma(mytest/memory)  amo(mytest/memory)  am(mytest/memory)

And I got:

[1764656842.339034] [a:31893:0]   +----------------------------------+-------------------------------------------------------------+
[1764656842.339054] [a:31893:0]   | UCC_UCP_CONTEXT intra-node cfg#0 | remote memory write by ucp_put* from host memory to host    |
[1764656842.339063] [a:31893:0]   +----------------------------------+-----------------------------------------------+-------------+
[1764656842.339072] [a:31893:0]   |                           0..inf | software emulation                            | mytest/memory |
[1764656842.339079] [a:31893:0]   +----------------------------------+-----------------------------------------------+-------------+

I suspect this is related to the intra-node label printed here. However, I have set the transport up as inter-node, and the two hosts communicate normally (if the configuration were really intra-node, MPI+UCC+UCX would not be able to connect the two hosts).
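For anyone trying to reproduce the table above: as far as I understand, this protocol-selection output is what UCX prints when `UCX_PROTO_INFO` is enabled. Below is a minimal sketch of the environment I would expect to produce it. The launch line is only an assumption (Open MPI syntax, a hypothetical second host `b`, and a hypothetical test binary `./allreduce_test`), so it is left commented out.

```shell
# Print UCX debug messages, including endpoint and lane configuration.
export UCX_LOG_LEVEL=debug
# Ask UCX to print the protocol-selection tables (e.g. "software emulation").
export UCX_PROTO_INFO=y
# Raise UCC verbosity so that collective/TL selection is visible as well.
export UCC_LOG_LEVEL=debug

# Hypothetical launch line (Open MPI syntax; host "b" and the binary name
# are placeholders, not taken from the actual setup):
# mpirun -np 2 --host a,b \
#     --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
#     -x UCX_LOG_LEVEL -x UCX_PROTO_INFO -x UCC_LOG_LEVEL \
#     ./allreduce_test
```

With these settings, the `rma(...)` entry in the endpoint configuration line and the `ucp_put*` rows of the table can be compared directly.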

Can anyone offer some advice?
