System Info
Hello there,
I'm fine-tuning a Llama 3 model from HuggingFace with PEFT and BitsAndBytes. Interestingly, when I wrap the model with DDP, training ends up taking more VRAM on the master GPU. Even more interestingly, the master GPU's VRAM usage grows with the number of GPUs. Do you see any reason why this could happen?
Reproduction
Not easy to reproduce.
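One thing worth checking (an assumption on my side, not confirmed from the report above): with a bitsandbytes-quantized model, each DDP rank needs the model pinned to its own GPU via an explicit `device_map`; if every rank falls back to the default placement, all processes can end up allocating weights on `cuda:0`, which would make GPU-0 VRAM grow with the number of processes. A minimal sketch of that per-rank placement (model id and quantization settings are placeholders):

```python
import os


def per_rank_device_map():
    """Device map that pins the whole model to this DDP rank's GPU.

    Assumption: without an explicit per-rank device_map, every DDP
    process may load the quantized weights onto cuda:0, so GPU-0
    VRAM would grow with the number of processes -- matching the
    symptom described above.
    """
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # An empty-string key maps the entire model to one device.
    return {"": local_rank}


# Sketch of how this would be passed to from_pretrained
# (names below are placeholders, not taken from the report):
#
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B",                       # placeholder id
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map=per_rank_device_map(),  # pin weights to this rank's GPU
# )
```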
Expected behavior
VRAM consumption per GPU is constant with respect to the number of DDP processes.