multi gpu error #278

@Linda6521

Description

When training with multiple GPUs, I get the following error:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.
[2025-12-14 12:17:48,931] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1916410 closing signal SIGTERM
[2025-12-14 12:17:48,948] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1916407) of binary: /root/miniconda3/envs/openstereo/bin/python3.8

The command I used is:

export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:23456 tools/train.py --dist_mode --cfg_file /root/OpenStereo/cfgs/cfnet/cfnet_kitti15.yaml

What could be causing this?
