
[LoRA] allow loras to be loaded with low_cpu_mem_usage. #9510

Merged
32 commits merged into main from low-cpu-mem-usage-lora on Oct 9, 2024

Conversation

@sayakpaul (Member) commented Sep 24, 2024

What does this PR do?

huggingface/peft#1961 added the ability to set low_cpu_mem_usage while loading LoRAs. This can be quite helpful in speeding up the loading of LoRAs that are large and have many layers.

#8953 is a good example where this feature could be beneficial.

Benchmarking code
from diffusers import DiffusionPipeline
import torch
import time
import fire


def main(ckpt_id: str, lora_id: str, low_cpu_mem_usage: bool = False):
    pipeline = DiffusionPipeline.from_pretrained(ckpt_id, torch_dtype=torch.bfloat16).to("cuda")

    for i in range(10):
        # Time only the LoRA loading step; unloading happens outside the timed region.
        start_time = time.time()
        pipeline.load_lora_weights(lora_id, low_cpu_mem_usage=low_cpu_mem_usage)
        end_time = time.time()
        pipeline.unload_lora_weights()
        elapsed_time = end_time - start_time

        print(f"Iteration {i + 1}: Load Lora weights took {elapsed_time:.6f} seconds with {low_cpu_mem_usage=}")


if __name__ == "__main__":
    fire.Fire(main)
Results:
Iteration 1: Load Lora weights took 13.924374 seconds with low_cpu_mem_usage=True
Iteration 2: Load Lora weights took 1.621597 seconds with low_cpu_mem_usage=True
Iteration 3: Load Lora weights took 1.612010 seconds with low_cpu_mem_usage=True
Iteration 4: Load Lora weights took 1.670260 seconds with low_cpu_mem_usage=True
Iteration 5: Load Lora weights took 1.664858 seconds with low_cpu_mem_usage=True
Iteration 6: Load Lora weights took 1.482521 seconds with low_cpu_mem_usage=True
Iteration 7: Load Lora weights took 1.633697 seconds with low_cpu_mem_usage=True
Iteration 8: Load Lora weights took 1.593326 seconds with low_cpu_mem_usage=True
Iteration 9: Load Lora weights took 1.503672 seconds with low_cpu_mem_usage=True
Iteration 10: Load Lora weights took 1.566633 seconds with low_cpu_mem_usage=True

Iteration 1: Load Lora weights took 33.370373 seconds with low_cpu_mem_usage=False
Iteration 2: Load Lora weights took 3.937800 seconds with low_cpu_mem_usage=False
Iteration 3: Load Lora weights took 4.364943 seconds with low_cpu_mem_usage=False
Iteration 4: Load Lora weights took 4.303800 seconds with low_cpu_mem_usage=False
Iteration 5: Load Lora weights took 4.154818 seconds with low_cpu_mem_usage=False
Iteration 6: Load Lora weights took 3.869319 seconds with low_cpu_mem_usage=False
Iteration 7: Load Lora weights took 4.153911 seconds with low_cpu_mem_usage=False
Iteration 8: Load Lora weights took 4.275074 seconds with low_cpu_mem_usage=False
Iteration 9: Load Lora weights took 4.395445 seconds with low_cpu_mem_usage=False
Iteration 10: Load Lora weights took 4.071344 seconds with low_cpu_mem_usage=False

The feature currently requires the user to install peft and transformers from source. So, I suggest we wait until both libraries have made stable releases before merging this PR.
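(For illustration, here is a minimal sketch of the kind of runtime guard this implies — a hypothetical helper, not the check diffusers actually ships; the minimum versions are the ones this thread later settles on, peft 0.13.1 and transformers 4.45.2:)

from importlib.metadata import version

from packaging.version import parse

# Minimum versions discussed later in this PR.
MIN_PEFT_VERSION = "0.13.1"
MIN_TRANSFORMERS_VERSION = "4.45.2"


def supports_low_cpu_mem_usage_for_lora() -> bool:
    """Return True if the installed peft and transformers are new enough for low_cpu_mem_usage."""
    return parse(version("peft")) >= parse(MIN_PEFT_VERSION) and parse(
        version("transformers")
    ) >= parse(MIN_TRANSFORMERS_VERSION)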

Once Ben reviews the PR, I will request a review from Yiyi.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member)

> I think we need to have a low_cpu_mem_usage flag in the load_adapter() method too: https://github.com/huggingface/diffusers/blob/28f9d84549c0b1d24ef00d69a4c723f3a11cffb6/src/diffusers/loaders/lora_pipeline.py#L371C30-L371C42
>
> If so, then this PR would be contingent on that. We could, however, use a combination of inject_adapter_in_model() and set_peft_model_state_dict() to mimic the same thing, I assume. I personally wouldn't prefer that because load_adapter() has been around for a while in diffusers.

Could you please clarify that bit? On the PEFT side, we have low_cpu_mem_usage on load_adapter but that's not the method being used here (just has the same name), right? Is this method coming from transformers (i.e. here)?
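(For reference, a rough sketch of the inject_adapter_in_model() / set_peft_model_state_dict() combination mentioned in the quoted description — a toy, self-contained example assuming peft>=0.13, where both functions accept low_cpu_mem_usage; it is not what this PR ends up doing:)

import torch
from peft import LoraConfig, inject_adapter_in_model, set_peft_model_state_dict

# Toy base model standing in for e.g. a transformer block.
base_model = torch.nn.Sequential(torch.nn.Linear(8, 8))

# Hypothetical LoRA config targeting the single linear layer (named "0" in the Sequential).
lora_config = LoraConfig(r=4, target_modules=["0"])

# 1. Inject empty LoRA layers; with low_cpu_mem_usage=True they are created on the meta
#    device instead of being materialized and initialized right away.
model = inject_adapter_in_model(lora_config, base_model, adapter_name="default", low_cpu_mem_usage=True)

# 2. Copy the checkpoint tensors into those layers. In practice this dict comes from the
#    LoRA file; it is built by hand here only to keep the sketch self-contained.
lora_state_dict = {
    "0.lora_A.weight": torch.zeros(4, 8),
    "0.lora_B.weight": torch.zeros(8, 4),
}
set_peft_model_state_dict(model, lora_state_dict, adapter_name="default", low_cpu_mem_usage=True)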

@sayakpaul (Member, Author)

> that's not the method being used here (just has the same name), right? Is this method coming from transformers (i.e. here)?

Yes, that is correct.

BenjaminBossan added a commit to BenjaminBossan/transformers that referenced this pull request Sep 26, 2024
PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.
@sayakpaul (Member, Author) commented Sep 27, 2024

@BenjaminBossan when I used:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights(
    "TheLastBen/The_Hound", 
    weight_name="sandor_clegane_single_layer.safetensors", 
    low_cpu_mem_usage=True
)

prompt = "sandor clegane drinking in a pub"
image = pipe(
    prompt=prompt,
    num_inference_steps=30,
    width=1024,
    generator=torch.manual_seed(42),
    height=1024,
).images[0]
image.save("sandor.png")

It leads to:

Error trace
  File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 98, in forward
    hidden_states = gate * self.proj_out(hidden_states)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 585, in forward
    result = result + lora_B(lora_A(dropout(x))) * scaling
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I investigated this a bit and confirmed that the LoRA params are kept on CPU, which causes this failure. With low_cpu_mem_usage=False, the LoRA parameters are on the expected device ("cuda" in the above example).

I further investigated why the tests added in this PR don't fail. That is because the tensors of the state dict we supply to set_peft_model_state_dict() (here) are already on the desired device. When I forcibly moved them to CPU and ran the tests on a GPU, the tests failed with the same error.
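(A quick way to confirm this — a small diagnostic sketch, assuming the FluxPipeline snippet above has already been run with low_cpu_mem_usage=True:)

# List the devices of the injected LoRA parameters on the transformer after loading.
lora_devices = {p.device for n, p in pipe.transformer.named_parameters() if "lora_" in n}
print(lora_devices)  # with the bug: {device(type='cpu')} although the pipeline is on "cuda"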

Possible to look into this?

@BenjaminBossan (Member) commented Sep 27, 2024

Hmm, I could not reproduce the issue :-/ I had to change the code slightly due to memory constraints:

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0: "24GB", 1: "20GB"},
)
# after loading the LoRA adapter {p.device for p in pipe.transformer.parameters()} returns:
# {device(type='cuda', index=0)}

Could this be the reason why it works for me?

I also tried this with a normal PEFT model that I moved to CUDA and then loaded with low_cpu_mem_usage and it worked.

@sayakpaul (Member, Author)

Yeah, you need to be on the exact same setup to replicate this. We cannot assume people will call load_lora_weights() only in one specific manner.

You can perhaps use an SD LoRA, for example the one used in def test_a1111(self).

@BenjaminBossan (Member)

Quick update, I couldn't reproduce with that model:

import torch
from diffusers import FluxPipeline, StableDiffusionPipeline

generator = torch.Generator().manual_seed(0)
pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/Counterfeit-V2.5", safety_checker=None).to("cuda")
lora_model_id = "hf-internal-testing/civitai-light-shadow-lora"
lora_filename = "light_and_shadow.safetensors"
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename, low_cpu_mem_usage=True)
images = pipe(
    "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
).images

@sayakpaul (Member, Author) commented Sep 28, 2024

@BenjaminBossan here's a minimal reproduction:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("sayakpaul/tiny-flux-pipeline-with-lora", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights(
    "sayakpaul/tiny-flux-pipeline-with-lora", weight_name="pytorch_lora_weights.bin", low_cpu_mem_usage=True
)

prompt = "sandor clegane drinking in a pub"
image = pipe(prompt=prompt, num_inference_steps=30).images[0]
image.save("sandor.png")

I am on peft:main.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Sep 30, 2024
See: huggingface/diffusers#9510 (comment)

Right now, the low_cpu_mem_usage=True option does not consolidate the
devices. E.g. when the model is on GPU and the state_dict on CPU, the
adapter weight will be on CPU after loading, when it should be GPU. This
fix ensures that the devices are consolidated.
@BenjaminBossan (Member)

Okay, got it now, thanks for the memory-friendly reproducer.

Indeed, if the LoRA weights on the model are on meta device, the device will be taken from the state_dict, not the base layer. I worked on a fix: huggingface/peft#2113

For the time being, you could add this snippet and it should fix the issue:

    if low_cpu_mem_usage:
        # Move any adapter weights that ended up on CPU/meta to the device of their base layer.
        for module in model.modules():
            if hasattr(module, "_move_adapter_to_device_of_base_layer"):
                module._move_adapter_to_device_of_base_layer(adapter_name)

BenjaminBossan added a commit to huggingface/peft that referenced this pull request Oct 2, 2024
ArthurZucker pushed a commit to BenjaminBossan/transformers that referenced this pull request Oct 3, 2024
ArthurZucker pushed a commit to huggingface/transformers that referenced this pull request Oct 3, 2024
[PEFT] Support low_cpu_mem_usage for PEFT loading (#33725)
@sayakpaul (Member, Author)

@BenjaminBossan could you give this a review?

@sayakpaul (Member, Author) commented Oct 8, 2024

@BenjaminBossan thanks!

> Docstrings still contain TODO

Yeah, that will be resolved after Yiyi's approval.

> Since PEFT version v0.13.1 is now released, the min PEFT version should be updated accordingly.

That's done.

> In case you plan on no longer supporting older PEFT and transformers versions in the future: I would add a TODO comment to all those version checks that they can be removed once support for those older versions is dropped. If you plan to support them indefinitely, ignore this comment.

For now, we can ignore it.

> An entry to the diffusers PEFT docs would be nice to have, especially since the name of the argument is not really intuitive.

Done.

@sayakpaul (Member, Author)

@yiyixuxu could you review this PR?

After the approval, I will add docs and request a review from Steven.

@yiyixuxu (Collaborator) left a comment

Thanks! I left one comment; otherwise the PR looks good to me.

Review comment on tests/lora/utils.py (outdated, resolved)
@sayakpaul (Member, Author)

Thanks, @yiyixuxu!

I have taken care of the TODOs in the docs too. @stevhliu could you review the related changes?

@stevhliu (Member) left a comment

Thanks! Just one minor change that needs to be propagated :)

Review comments (resolved) on:
docs/source/en/tutorials/using_peft_for_inference.md
src/diffusers/loaders/lora_pipeline.py (multiple comments)
src/diffusers/loaders/unet.py
@sayakpaul merged commit 31058cd into main on Oct 9, 2024
18 checks passed
@sayakpaul deleted the low-cpu-mem-usage-lora branch on October 9, 2024 05:27
leisuzz pushed a commit to leisuzz/diffusers that referenced this pull request Oct 11, 2024
[LoRA] allow loras to be loaded with low_cpu_mem_usage (#9510)

* allow loras to be loaded with low_cpu_mem_usage.

* add flux support but note https://github.com/huggingface/diffusers/pull/9510#issuecomment-2378316687

* low_cpu_mem_usage.

* fix-copies

* fix-copies again

* tests

* _LOW_CPU_MEM_USAGE_DEFAULT_LORA

* _peft_version default.

* version checks.

* version check.

* version check.

* version check.

* require peft 0.13.1.

* explicitly specify low_cpu_mem_usage=False.

* docs.

* transformers version 4.45.2.

* update

* fix

* empty

* better name initialize_dummy_state_dict.

* doc todos.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* style

* fix-copies

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
NielsRogge pushed a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Oct 22, 2024