Closed
Description
Below are the errors I get when running tests based on #10349.
NOTE: if NCCL_IB_DISABLE=1 is set, the job runs fine, and the RDMA test tools (udaddy, ib_write_bw) pass.
case 1:
settings:
export NCCL_IB_CUDA_SUPPORT=0
error log:
server 1:
instance-fjhn02ur:454:484 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
instance-fjhn02ur:454:484 [0] INFO transport/net_socket.cu:101 -> 2
instance-fjhn02ur:454:484 [0] INFO include/net.h:28 -> 2 [Net]
instance-fjhn02ur:454:484 [0] INFO bootstrap.cu:76 -> 2 [bthread]
instance-fjhn02ur:454:587 [0] include/socket.h:196 WARN Connection closed by remote peer
instance-fjhn02ur:454:587 [0] INFO transport/net_socket.cu:149 -> 2
instance-fjhn02ur:454:587 [0] INFO include/net.h:31 -> 2 [Net]
instance-fjhn02ur:454:587 [0] INFO include/net.h:45 -> 2 [Net]
instance-fjhn02ur:454:587 [0] INFO bootstrap.cu:189 -> 2
instance-fjhn02ur:454:587 [0] INFO init.cu:404 -> 2
instance-fjhn02ur:454:587 [0] INFO init.cu:517 -> 2
instance-fjhn02ur:454:587 [0] INFO misc/group.cu:70 -> 2 [Async thread]
instance-fjhn02ur:454:588 [1] include/socket.h:196 WARN Connection closed by remote peer
instance-fjhn02ur:454:588 [1] INFO transport/net_socket.cu:149 -> 2
instance-fjhn02ur:454:588 [1] INFO include/net.h:31 -> 2 [Net]
instance-fjhn02ur:454:588 [1] INFO include/net.h:45 -> 2 [Net]
instance-fjhn02ur:454:588 [1] INFO bootstrap.cu:189 -> 2
instance-fjhn02ur:454:588 [1] INFO init.cu:404 -> 2
instance-fjhn02ur:454:588 [1] INFO init.cu:517 -> 2
instance-fjhn02ur:454:588 [1] INFO misc/group.cu:70 -> 2 [Async thread]
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): unhandled system error at [/paddle/paddle/fluid/platform/nccl_helper.h:54]
server 2:
instance-wh4a6pq2:178:215 [0] include/socket.h:194 WARN Call to recv failed : Connection reset by peer
instance-wh4a6pq2:178:215 [0] INFO transport/net_ib.cu:537 -> 2
instance-wh4a6pq2:178:215 [0] INFO transport/net_ib.cu:610 -> 2
instance-wh4a6pq2:178:215 [0] INFO include/net.h:30 -> 2 [Net]
instance-wh4a6pq2:178:215 [0] INFO include/net.h:38 -> 2 [Net]
instance-wh4a6pq2:178:215 [0] INFO bootstrap.cu:169 -> 2
instance-wh4a6pq2:178:215 [0] INFO init.cu:400 -> 2
instance-wh4a6pq2:178:215 [0] INFO init.cu:517 -> 2
instance-wh4a6pq2:178:215 [0] INFO misc/group.cu:70 -> 2 [Async thread]
instance-wh4a6pq2:178:216 [1] include/socket.h:214 WARN Call to write failed : Connection reset by peer
instance-wh4a6pq2:178:216 [1] INFO transport/net_ib.cu:430 -> 2
instance-wh4a6pq2:178:216 [1] INFO include/net.h:28 -> 2 [Net]
instance-wh4a6pq2:178:216 [1] INFO bootstrap.cu:168 -> 2
instance-wh4a6pq2:178:216 [1] INFO init.cu:400 -> 2
instance-wh4a6pq2:178:216 [1] INFO init.cu:517 -> 2
instance-wh4a6pq2:178:216 [1] INFO misc/group.cu:70 -> 2 [Async thread]
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): unhandled system error at [/paddle/paddle/fluid/platform/nccl_helper.h:54]
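To compare the two servers more easily, the WARN lines can be pulled out of logs like the ones above. The following is only a minimal sketch, assuming every WARN line has the shape `<host>:<pid>:<tid> [<dev>] <file>:<line> WARN <message>` seen in this report; `extract_warnings` and `SAMPLE` are hypothetical names used for illustration and are not part of NCCL or Paddle.

```python
import re

# Assumed line shape, taken from the logs in this report:
#   <host>:<pid>:<tid> [<dev>] <file>:<line> WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) WARN (?P<msg>.*)$"
)


def extract_warnings(log_text):
    """Return one dict per WARN line, in order of appearance (hypothetical helper)."""
    warnings = []
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            warnings.append(match.groupdict())
    return warnings


# A few lines copied from the logs above, used as sample input.
SAMPLE = """\
instance-fjhn02ur:454:484 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
instance-fjhn02ur:454:484 [0] INFO transport/net_socket.cu:101 -> 2
instance-wh4a6pq2:178:215 [0] include/socket.h:194 WARN Call to recv failed : Connection reset by peer
"""

if __name__ == "__main__":
    for w in extract_warnings(SAMPLE):
        print(w["host"], w["src"], w["msg"])
```

INFO lines are skipped because they do not carry the `<file>:<line> WARN` part, so only the socket-level failures remain.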