Description
- Setup: DGX with 8x V100 32GB, CUDA 12.4, node "dgx1v-loki-23" in dlcluster
- 1 node, 2 processes
- Reproducer:
docker run \
--rm --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
--cap-add=SYS_PTRACE --privileged \
--device=/dev/infiniband \
--gpus all \
gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest \
/bin/bash -c 'mpirun -np 2 build/test_multidevice --gtest_filter=*Gather/UCC*'
- Error:
[1730117228.178968] [dgx1v-loki-23:3000 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7f65a8000000 len:16777216 err:201
- Sometimes the run segfaults instead of printing this error.
- UCX version:
# API headers version: 1.18.0, Git branch 'master', revision 9da106a
- UCC version=1.4.0 revision 2bb2b73