
[BUG] train_text_to_image_lora.py does not support multi-node or multi-GPU training #4046

Closed
@WindVChen

Description

In train_text_to_image_lora.py, I notice that the LoRA parameters are extracted into an AttnProcsLayers instance at line 518:

```python
lora_layers = AttnProcsLayers(unet.attn_processors)
```
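
For context, my rough understanding of what AttnProcsLayers does (a simplified sketch only, not the actual diffusers implementation) is that it just gathers the attention-processor modules, which hold the LoRA weights, into one nn.Module so they can be handed to the optimizer and to accelerator.prepare:

```python
import torch.nn as nn

class SimpleAttnProcsLayers(nn.Module):
    """Hypothetical, simplified stand-in for AttnProcsLayers: collect the
    attention processors (which contain the LoRA weights) into one module."""

    def __init__(self, attn_processors: dict):
        super().__init__()
        # unet.attn_processors maps layer names to processor modules;
        # registering them in a ModuleList exposes their parameters.
        self.layers = nn.ModuleList(attn_processors.values())
```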

Only these lora_layers are then wrapped by DistributedDataParallel via accelerator.prepare at line 670:

```python
lora_layers, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    lora_layers, optimizer, train_dataloader, lr_scheduler
)
```

During training, however, the lora_layers object is never called directly; the forward pass goes through the unet (line 776):

```python
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
```

My question is: when training on multiple GPUs or multiple machines, will the gradients be correctly averaged across all processes with this setup?

It is true that in each process the gradients are backpropagated into unet.attn_processors, and since lora_layers shares those parameters, the optimizer can still update the weights. However, the forward pass actually runs through unet.attn_processors rather than through the wrapped lora_layers, so can the gradients really be averaged correctly? From what I can tell, a DDP-wrapped module's forward does extra work (setting up gradient synchronization) that the original module's forward does not.
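
To make my concern concrete, here is a tiny toy example (entirely hypothetical, not the training script) of what I suspect happens: gradient averaging is driven by the DDP wrapper's forward, so a backward that starts from the raw module's forward may leave the gradients local to each rank. Run with e.g. `torchrun --nproc_per_node=2 toy_ddp_check.py`:

```python
# toy_ddp_check.py -- a minimal sketch of my concern, not the training script.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")

torch.manual_seed(0)
model = nn.Linear(4, 1)
ddp_model = DDP(model)  # parameters are broadcast from rank 0 here

torch.manual_seed(dist.get_rank())  # make the per-rank inputs differ
x = torch.randn(8, 4)

# Forward through the wrapper: DDP averages the gradients across ranks
# during backward(), so every rank should print the same value.
ddp_model(x).sum().backward()
print(f"rank {dist.get_rank()} wrapped forward grad: {model.weight.grad.sum().item():.6f}")

model.zero_grad()

# Forward through the raw module (analogous to calling unet(...) while only
# lora_layers is wrapped): I suspect the gradients stay local here, so the
# printed values would differ between ranks.
model(x).sum().backward()
print(f"rank {dist.get_rank()} raw forward grad: {model.weight.grad.sum().item():.6f}")
```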

I am not very familiar with the torch.nn.parallel.DistributedDataParallel wrapper, and I worry that the current code in train_text_to_image_lora.py could end up with different LoRA weights in different processes (if the gradients fail to be synchronized across them).
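
If it helps, this is the kind of sanity check I was thinking of adding to the training loop (a hypothetical helper, assuming the process group has already been initialized by accelerate): it gathers every rank's copy of the LoRA parameters and reports whether they still match.

```python
import torch
import torch.distributed as dist

def lora_weights_in_sync(lora_layers, atol=1e-6):
    """Return True if all ranks hold (numerically) identical LoRA weights.
    Must be called on every rank, since all_gather is a collective op."""
    # Flatten this rank's LoRA parameters into a single vector.
    local = torch.cat([p.detach().flatten().float() for p in lora_layers.parameters()])
    # Collect every rank's copy and compare each against rank 0's.
    gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return all(torch.allclose(g, gathered[0], atol=atol) for g in gathered)

# e.g. every few hundred steps inside the loop (called on all ranks):
# if not lora_weights_in_sync(lora_layers):
#     print(f"rank {dist.get_rank()}: LoRA weights have diverged across processes")
```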

Hope to find some help here, thank you.
