Issues when debugging NCCL2 distributed training #10499

Closed
@typhoonzero

Below are errors when running tests based on #10349

NOTE: with NCCL_IB_DISABLE=1 set, the job runs fine, and the RDMA test tools (udaddy, ib_write_bw) pass.

case 1:

settings:

export NCCL_IB_CUDA_SUPPORT=0
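
To make the two configurations easier to compare, here is a minimal sketch of launching the same job under each one. The helper `nccl_env` and the `train.py` entry point are hypothetical names for illustration. `NCCL_DEBUG=INFO` is an assumption added here so that NCCL prints the INFO trace lines, and it is not part of the settings reported above.

```python
import os


def nccl_env(overrides, base=None):
    """Return a copy of the environment with the given NCCL settings applied.

    Values are converted to strings because environment variables must be strings.
    """
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


# Case 1 from this report: the IB transport stays enabled.
CASE_1 = {"NCCL_IB_CUDA_SUPPORT": 0, "NCCL_DEBUG": "INFO"}

# Workaround from the note: disable the IB transport so NCCL falls back to sockets.
SOCKET_ONLY = {"NCCL_IB_DISABLE": 1, "NCCL_DEBUG": "INFO"}

# Hypothetical usage (train.py is a placeholder for the real training script):
#   import subprocess
#   subprocess.run(["python", "train.py"], env=nccl_env(CASE_1), check=True)
```

Running both variants on both servers with otherwise identical settings should show whether the failure is tied to the IB path.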

error log:

server 1:

instance-fjhn02ur:454:484 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
instance-fjhn02ur:454:484 [0] INFO transport/net_socket.cu:101 -> 2
instance-fjhn02ur:454:484 [0] INFO include/net.h:28 -> 2 [Net]
instance-fjhn02ur:454:484 [0] INFO bootstrap.cu:76 -> 2 [bthread]

instance-fjhn02ur:454:587 [0] include/socket.h:196 WARN Connection closed by remote peer
instance-fjhn02ur:454:587 [0] INFO transport/net_socket.cu:149 -> 2
instance-fjhn02ur:454:587 [0] INFO include/net.h:31 -> 2 [Net]
instance-fjhn02ur:454:587 [0] INFO include/net.h:45 -> 2 [Net]
instance-fjhn02ur:454:587 [0] INFO bootstrap.cu:189 -> 2
instance-fjhn02ur:454:587 [0] INFO init.cu:404 -> 2
instance-fjhn02ur:454:587 [0] INFO init.cu:517 -> 2
instance-fjhn02ur:454:587 [0] INFO misc/group.cu:70 -> 2 [Async thread]

instance-fjhn02ur:454:588 [1] include/socket.h:196 WARN Connection closed by remote peer
instance-fjhn02ur:454:588 [1] INFO transport/net_socket.cu:149 -> 2
instance-fjhn02ur:454:588 [1] INFO include/net.h:31 -> 2 [Net]
instance-fjhn02ur:454:588 [1] INFO include/net.h:45 -> 2 [Net]
instance-fjhn02ur:454:588 [1] INFO bootstrap.cu:189 -> 2
instance-fjhn02ur:454:588 [1] INFO init.cu:404 -> 2
instance-fjhn02ur:454:588 [1] INFO init.cu:517 -> 2
instance-fjhn02ur:454:588 [1] INFO misc/group.cu:70 -> 2 [Async thread]
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
  what():  unhandled system error at [/paddle/paddle/fluid/platform/nccl_helper.h:54]

server 2:

instance-wh4a6pq2:178:215 [0] include/socket.h:194 WARN Call to recv failed : Connection reset by peer
instance-wh4a6pq2:178:215 [0] INFO transport/net_ib.cu:537 -> 2
instance-wh4a6pq2:178:215 [0] INFO transport/net_ib.cu:610 -> 2
instance-wh4a6pq2:178:215 [0] INFO include/net.h:30 -> 2 [Net]
instance-wh4a6pq2:178:215 [0] INFO include/net.h:38 -> 2 [Net]
instance-wh4a6pq2:178:215 [0] INFO bootstrap.cu:169 -> 2
instance-wh4a6pq2:178:215 [0] INFO init.cu:400 -> 2
instance-wh4a6pq2:178:215 [0] INFO init.cu:517 -> 2
instance-wh4a6pq2:178:215 [0] INFO misc/group.cu:70 -> 2 [Async thread]

instance-wh4a6pq2:178:216 [1] include/socket.h:214 WARN Call to write failed : Connection reset by peer
instance-wh4a6pq2:178:216 [1] INFO transport/net_ib.cu:430 -> 2
instance-wh4a6pq2:178:216 [1] INFO include/net.h:28 -> 2 [Net]
instance-wh4a6pq2:178:216 [1] INFO bootstrap.cu:168 -> 2
instance-wh4a6pq2:178:216 [1] INFO init.cu:400 -> 2
instance-wh4a6pq2:178:216 [1] INFO init.cu:517 -> 2
instance-wh4a6pq2:178:216 [1] INFO misc/group.cu:70 -> 2 [Async thread]
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
  what():  unhandled system error at [/paddle/paddle/fluid/platform/nccl_helper.h:54]
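
The traces above are easier to compare across servers when grouped by thread. The following is a rough sketch of a parser for that, assuming the line layout seen in these logs (`host:pid:tid [device] ...`). The function name `summarize` and the regular expressions are my own and are not part of NCCL or Paddle.

```python
import re

# "host:pid:tid [device] rest-of-line", as seen in the logs in this report.
PREFIX = re.compile(
    r"^(?P<host>[^\s:]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] (?P<rest>.*)$"
)
# "file:line WARN message"
WARN = re.compile(r"^(?P<loc>\S+:\d+) WARN (?P<msg>.*)$")
# "INFO file:line -> code", optionally followed by a tag such as "[Net]"
INFO = re.compile(r"^INFO (?P<loc>\S+:\d+) -> (?P<code>\d+)")


def summarize(log_text):
    """Group NCCL log lines by (host, thread id, device).

    Returns a dict mapping each key to its WARN messages and the
    list of source locations the error propagated through.
    """
    threads = {}
    for line in log_text.splitlines():
        m = PREFIX.match(line.strip())
        if not m:
            continue  # skip blank lines and non-NCCL output
        key = (m["host"], int(m["tid"]), int(m["dev"]))
        entry = threads.setdefault(key, {"warnings": [], "frames": []})
        warn = WARN.match(m["rest"])
        if warn:
            entry["warnings"].append((warn["loc"], warn["msg"]))
            continue
        info = INFO.match(m["rest"])
        if info:
            entry["frames"].append(info["loc"])
    return threads
```

Applied to the logs in this report, the first frame after each warning is in transport/net_socket.cu on server 1 and in transport/net_ib.cu on server 2.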
