
[Issue]: Very low performance when dmabuf is used #281

Open
@IMbackK

Description

Problem Description

I am aware dmabuf is currently experimental.

Using HSA_ENABLE_IPC_MODE_LEGACY=0 on a kernel with no KFD patches currently causes PyTorch DDP to underperform severely.

Every device-to-device copy is accompanied by a ~500 ms stall on all devices.
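
A stall of this size can be confirmed by timing individual copies. The following is a minimal sketch, not the measurement used for this report: `time_calls` and `count_stalls` are hypothetical helper names, the 100 ms threshold is an arbitrary cut-off, and the copy itself is passed in as a callable. On a real system the callable would perform the device-to-device copy and then synchronize (for example with `torch.cuda.synchronize()`) so that the wall-clock time covers the whole copy.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], None], repeats: int = 10) -> List[float]:
    """Return the wall-clock duration in milliseconds of each call to fn."""
    durations_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def count_stalls(durations_ms: List[float], threshold_ms: float = 100.0) -> int:
    """Count calls that took longer than threshold_ms (an arbitrary cut-off)."""
    return sum(1 for d in durations_ms if d > threshold_ms)


if __name__ == "__main__":
    # Placeholder workload; replace it with the real copy plus a synchronize.
    def fake_copy() -> None:
        time.sleep(0.001)

    samples = time_calls(fake_copy, repeats=5)
    print(f"max {max(samples):.1f} ms, stalls: {count_stalls(samples)}")
```

Comparing the stall count with HSA_ENABLE_IPC_MODE_LEGACY=0 against the default setting should show whether the stalls are tied to the dmabuf path.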

Operating System

Ubuntu 24.04

CPU

AMD EPYC 7552

GPU

3x MI100

ROCm Version

ROCm 6.3.0

ROCm Component

rccl

Steps to Reproduce

As a test case I am using the PyTorch DDP example from https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series, run with:

torchrun --nnode=1 --node_rank=0 --nproc_per_node=2 multigpu_torchrun.py --batch_size 8 100 10
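
The command above does not show the environment variable. A minimal sketch of one way to launch the same run with HSA_ENABLE_IPC_MODE_LEGACY=0, assuming the variable is set in the environment of the launching process and that `multigpu_torchrun.py` is in the current directory; `build_repro` and `launch` are hypothetical helper names.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_repro(base_env: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Return the torchrun argv from this report and a copy of base_env
    with HSA_ENABLE_IPC_MODE_LEGACY set to 0."""
    argv = [
        "torchrun", "--nnode=1", "--node_rank=0", "--nproc_per_node=2",
        "multigpu_torchrun.py", "--batch_size", "8", "100", "10",
    ]
    env = dict(base_env)  # copy so the caller's mapping is not modified
    env["HSA_ENABLE_IPC_MODE_LEGACY"] = "0"
    return argv, env


def launch() -> None:
    """Run the reproduction; raises CalledProcessError on a non-zero exit."""
    argv, env = build_repro(dict(os.environ))
    subprocess.run(argv, env=env, check=True)
```

Calling `launch()` starts the run; `build_repro` is kept separate so the command and environment can be inspected without starting anything.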

I tested upstream kernels 6.6.64 and 6.12.8 with CONFIG_HSA_AMD_P2P and CONFIG_DMABUF_MOVE_NOTIFY enabled.
The ROCm bandwidth test shows good P2P performance.
The ROCm Validation Suite shows device PCIe P2P to be working.
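
To double-check that both options are enabled in a given kernel build, the config text can be parsed. This is a minimal sketch, assuming the config is available as plain text (commonly `/boot/config-<release>`, though the location varies by distribution); `enabled_options` is a hypothetical helper name.

```python
from typing import Dict, Iterable

REQUIRED = ("CONFIG_HSA_AMD_P2P", "CONFIG_DMABUF_MOVE_NOTIFY")


def enabled_options(config_text: str, names: Iterable[str] = REQUIRED) -> Dict[str, bool]:
    """Map each option name to True if the config text sets it to y or m."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key] = value
    return {name: values.get(name) in ("y", "m") for name in names}
```

Passing the contents of the running kernel's config file to `enabled_options` returns a dictionary with one boolean per option; both should be `True` for the setup described here.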

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
