Best practices for multi-GPU training? #304

@gkaissis

Description

Congratulations on the new release! Could you provide official recommendations for training on multiple GPUs, ideally with a full example? The examples in the repo fail due to unset environment variables, and I am not sure which integration (DataParallel, DistributedDataParallel, etc.) to use. The official PyTorch documentation is thorough, but not exactly intuitive for someone who just wants to run a model quickly. My use-case is training on a large dataset on a single machine with two or more GPUs with opacus. I believe this is what most users want to do, so an end-to-end tutorial would be very useful.

I also tried Lightning, but it fails when more than one GPU is set. Is this anticipated, or am I doing something wrong? Your official example doesn't include multi-GPU support, AFAIK.

Thank you very much!

Metadata

Labels: documentation (Improvements or additions to documentation)
