Problem Description
I am aware that dmabuf is currently experimental.
Using HSA_ENABLE_IPC_MODE_LEGACY=0 on a kernel without KFD patches currently causes PyTorch DDP to underperform severely.
Every device-to-device copy is accompanied by a ~500 ms stall on all devices.
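To put a number on the stall outside of the full DDP example, a bare collective can be timed per iteration. The script below is only a sketch, not something from the original report: it assumes the stall also shows up on a plain `all_reduce` (which DDP uses for gradient synchronization), and the names `summarize`, `stall_threshold_ms`, the 100 ms threshold, and the buffer size are all hypothetical choices for illustration.

```python
import os
import time


def summarize(durations_ms, stall_threshold_ms=100.0):
    """Return (median, max, count of samples at or above the threshold)."""
    ordered = sorted(durations_ms)
    median = ordered[len(ordered) // 2]
    stalls = sum(1 for d in ordered if d >= stall_threshold_ms)
    return median, ordered[-1], stalls


def main():
    # Imported here so the helper above stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm builds

    buf = torch.ones(1 << 20, device="cuda")  # assumed size: 4 MiB of float32
    durations = []
    for _ in range(20):
        torch.cuda.synchronize()
        start = time.perf_counter()
        dist.all_reduce(buf)
        torch.cuda.synchronize()
        durations.append((time.perf_counter() - start) * 1000.0)

    median, worst, stalls = summarize(durations)
    print(f"rank {dist.get_rank()}: median {median:.2f} ms, "
          f"max {worst:.2f} ms, {stalls} iterations >= 100 ms")
    dist.destroy_process_group()


# Only run the GPU part when launched through torchrun.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Saved under a file name of your choice and launched with the same `torchrun --nproc_per_node=2` arguments, it prints one summary line per rank. If the stall described above affects every transfer, most iterations should land above the threshold.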
Operating System
Ubuntu 24.04
CPU
AMD EPYC 7552
GPU
3x MI100
ROCm Version
ROCm 6.3.0
ROCm Component
rccl
Steps to Reproduce
As a test case I am using the PyTorch DDP example from https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series, run with:
torchrun --nnode=1 --node_rank=0 --nproc_per_node=2 multigpu_torchrun.py --batch_size 8 100 10
I tested upstream kernels 6.6.64 and 6.12.8 with CONFIG_HSA_AMD_P2P and CONFIG_DMABUF_MOVE_NOTIFY enabled.
The ROCm bandwidth test shows good P2P performance.
The ROCm Validation Suite shows device PCIe P2P to be working.
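For an A/B comparison, the torchrun command above can be run once per IPC mode and timed. The helper below is a rough sketch under two assumptions that are not confirmed in this report: that setting HSA_ENABLE_IPC_MODE_LEGACY=1 selects the legacy (non-dmabuf) IPC path, and that wall-clock time of the whole run is a good enough proxy for the slowdown. The names `build_env` and `timed_run` are made up for illustration.

```python
import os
import subprocess
import time

IPC_ENV_VAR = "HSA_ENABLE_IPC_MODE_LEGACY"

# Command line copied from the reproduction steps.
TORCHRUN_CMD = [
    "torchrun", "--nnode=1", "--node_rank=0", "--nproc_per_node=2",
    "multigpu_torchrun.py", "--batch_size", "8", "100", "10",
]


def build_env(legacy_mode, base_env=None):
    """Copy the environment and set the IPC mode variable (0 or 1)."""
    env = dict(os.environ if base_env is None else base_env)
    env[IPC_ENV_VAR] = str(legacy_mode)
    return env


def timed_run(legacy_mode, cmd=TORCHRUN_CMD):
    """Run the command under the given IPC mode; return (exit code, seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, env=build_env(legacy_mode), check=False)
    return result.returncode, time.perf_counter() - start


# Only launch when the tutorial script is in the current directory.
if __name__ == "__main__" and os.path.exists("multigpu_torchrun.py"):
    for mode in (1, 0):  # assumption: 1 = legacy IPC, 0 = dmabuf IPC
        code, seconds = timed_run(mode)
        print(f"{IPC_ENV_VAR}={mode}: exit code {code}, {seconds:.1f} s")
```

Run from the ddp-tutorial-series directory, this prints one line per mode, which makes the difference between the two IPC paths easy to attach to the report.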
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response