Description
In `train_text_to_image_lora.py`, I notice that the LoRA parameters are extracted into an `AttnProcsLayers` class (line 518):

```python
lora_layers = AttnProcsLayers(unet.attn_processors)
```
Only `lora_layers` is wrapped by `DistributedDataParallel` in the following code (line 670):

```python
lora_layers, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    lora_layers, optimizer, train_dataloader, lr_scheduler
)
```
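If I understand correctly, when the script is launched with more than one process, `accelerator.prepare` wraps the model it is given in `DistributedDataParallel`, so after this call `lora_layers` is a DDP wrapper around the original `AttnProcsLayers` instance. A quick way to confirm this (my own snippet, not from the script):

```python
# Hypothetical check, not part of the script: inspect what accelerator.prepare returned.
from torch.nn.parallel import DistributedDataParallel

print(type(lora_layers))
# I expect this to print True when running on multiple GPUs:
print(isinstance(lora_layers, DistributedDataParallel))
```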
During training, `lora_layers` does not seem to be used explicitly; only `unet` is called in the forward pass (line 776):

```python
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
```
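As far as I can tell, this still trains because the two objects hold the same parameter tensors. Here is the quick check I used to convince myself (my own snippet, not from the script; I expect it to print `True`):

```python
# Sanity check (hypothetical): the parameters held by lora_layers are the very same
# tensor objects as the LoRA parameters inside the unet's attention processors.
lora_ids = {id(p) for p in accelerator.unwrap_model(lora_layers).parameters()}
proc_ids = {id(p) for proc in unet.attn_processors.values() for p in proc.parameters()}
print(lora_ids == proc_ids)
```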
My question is: when training on multiple GPUs or multiple machines, will the gradients be correctly averaged across all processes with this setup?
It is true that in each process the gradients are backpropagated into `unet.attn_processors`, and since those are the same parameter objects that `lora_layers` holds (see the check above), the `optimizer` can still update the weights. However, the forward pass actually goes through `unet.attn_processors`, not through the wrapped `lora_layers`, so can the gradients still be correctly averaged? From here, it seems that a DDP-wrapped module has a different `forward` from the original module's forward.
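To make the concern concrete, here is a toy sketch of the pattern I am asking about (my own construction, not from the repo), assuming a process group has already been initialized, e.g. by launching with `torchrun` and the `gloo` backend on CPU:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# `proc` plays the role of unet.attn_processors, `holder` the role of lora_layers:
# both refer to the very same parameter tensors, but only the container is DDP-wrapped.
proc = torch.nn.Linear(4, 4)
holder = DDP(torch.nn.ModuleList([proc]))

x = torch.randn(2, 4)
loss = proc(x).sum()   # the forward pass bypasses the DDP wrapper entirely
loss.backward()        # are proc.weight.grad and proc.bias.grad averaged across ranks here?
```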
I am not very familiar with the `torch.nn.parallel.DistributedDataParallel` wrapper, and I worry that the current code in `train_text_to_image_lora.py` may end up with different LoRA weights in different processes (if the gradients fail to be synchronized across processes).
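For reference, this is the check I was planning to run after a few optimizer steps to see empirically whether the weights drift apart between processes (`check_lora_sync` is a hypothetical helper of my own, not part of the script):

```python
def check_lora_sync(lora_layers, accelerator):
    # Take one LoRA parameter as a probe and gather it from every process.
    probe = next(accelerator.unwrap_model(lora_layers).parameters()).detach().float()
    gathered = accelerator.gather(probe.flatten()[None, :])  # shape: (num_processes, numel)
    if accelerator.is_main_process:
        max_diff = (gathered - gathered[0]).abs().max().item()
        print(f"max per-element difference across processes: {max_diff}")
```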
I hope to find some help here. Thank you.