System Info
Hello there,
I'm fine-tuning a Llama 3 model from HuggingFace with PEFT and BitsAndBytes. Interestingly, when I wrap the model with DDP, training ends up taking more VRAM on the master GPU. Even more interestingly, the master GPU's VRAM usage grows with the number of GPUs. Do you see any reason why this could happen?
Reproduction
Not easy to reproduce.
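One thing worth checking (an assumption on my side, not confirmed from the report above): with a bitsandbytes-quantized model, each DDP rank needs the model pinned to its own GPU via an explicit `device_map`; if every rank falls back to the default placement, all processes can end up allocating weights on `cuda:0`, which would make GPU-0 VRAM grow with the number of processes. A minimal sketch of that per-rank placement (model id and quantization settings are placeholders):

```python
import os


def per_rank_device_map():
    """Device map that pins the whole model to this DDP rank's GPU.

    Assumption: without an explicit per-rank device_map, every DDP
    process may load the quantized weights onto cuda:0, so GPU-0
    VRAM would grow with the number of processes -- matching the
    symptom described above.
    """
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # An empty-string key maps the entire model to one device.
    return {"": local_rank}


# Sketch of how this would be passed to from_pretrained
# (names below are placeholders, not taken from the report):
#
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B",                       # placeholder id
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map=per_rank_device_map(),  # pin weights to this rank's GPU
# )
```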
Expected behavior
VRAM consumption per GPU is constant with respect to the number of DDP processes.