Best practices for multi-GPU training? #304

@gkaissis

Description

Congratulations on the new release! Could you provide official recommendations for training on multiple GPUs, ideally with a full example? The examples in the repo fail due to unset environment variables, and I am not sure which integration (DataParallel, DistributedDataParallel, etc.) to use. The official PyTorch documentation is thorough, but not exactly intuitive for someone who just wants to run a model quickly. My use-case is training on a large dataset on a single machine with two or more GPUs with opacus. I believe this is what most users want to do, so an end-to-end tutorial would be very useful.

I also tried Lightning, but it fails when more than one GPU is set. Is this anticipated, or am I doing something wrong? Your official example doesn't include multi-GPU support, AFAIK.

Thank you very much!

Metadata

Labels: documentation (Improvements or additions to documentation)
