[LoRA] allow loras to be loaded with low_cpu_mem_usage. #9510
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Could you please clarify that bit? On the PEFT side, we have
Yes, that is correct.
PEFT added support for low_cpu_mem_usage=True when loading adapters in huggingface/peft#1961. This feature is now available when installing PEFT v0.13.0. With this PR, this option is also supported when loading PEFT adapters directly into transformers models. Additionally, with this PR, huggingface/diffusers#9510 will be unblocked, which implements this option in diffusers.
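For context, here is a minimal sketch of how the option is used on the PEFT and transformers side. It is not taken from this thread: the adapter id is a placeholder, and it assumes peft>=0.13.0 plus a transformers release that includes the PR described above.

```python
# Hedged sketch: "some-user/some-lora-adapter" is a placeholder, not a real repo.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Option 1: load the adapter through PEFT without first materializing it on CPU.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
peft_model = PeftModel.from_pretrained(
    base, "some-user/some-lora-adapter", low_cpu_mem_usage=True
)

# Option 2: load the adapter directly into a transformers model (PeftAdapterMixin).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model.load_adapter("some-user/some-lora-adapter", low_cpu_mem_usage=True)
```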
@BenjaminBossan when I used:

import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights(
"TheLastBen/The_Hound",
weight_name="sandor_clegane_single_layer.safetensors",
low_cpu_mem_usage=True
)
prompt = "sandor clegane drinking in a pub"
image = pipe(
prompt=prompt,
num_inference_steps=30,
width=1024,
generator=torch.manual_seed(42),
height=1024,
).images[0]
image.save("sandor.png") It leads to: Error trace File "/home/sayak/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 98, in forward
hidden_states = gate * self.proj_out(hidden_states)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 585, in forward
result = result + lora_B(lora_A(dropout(x))) * scaling
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I investigated this a bit and confirmed that the LoRA params are kept on CPU, which causes this failure. In the case of low_cpu_mem_usage=False, this doesn't happen. I further investigated why the tests added in this PR don't fail: the state dict we're supplying there is already on the same device as the model. Possible to look into this?
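For illustration (this check is not part of the original report), the device mismatch can be confirmed on the reproducer above by inspecting the devices of the injected LoRA parameters:

```python
# With the failure described above, the set also contains device(type='cpu'),
# even though the base weights of pipe.transformer live on cuda:0.
lora_param_devices = {
    param.device
    for name, param in pipe.transformer.named_parameters()
    if "lora_" in name
}
print(lora_param_devices)
```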
Hmm, I could not reproduce the issue :-/ I had to change the code slightly due to memory constraints:

pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0: "24GB", 1: "20GB"},
)
# after loading the LoRA adapter {p.device for p in pipe.transformer.parameters()} returns:
# {device(type='cuda', index=0)}

Could this be the reason why it works for me? I also tried this with a normal PEFT model that I moved to CUDA and then loaded with low_cpu_mem_usage=True.
Yeah, you need to be on the exact same setup to replicate this. We cannot assume people will use device_map="balanced". You can perhaps use an SD LoRA:

diffusers/tests/lora/test_lora_layers_sd.py, line 329 (at commit 534848c)
Quick update, I couldn't reproduce with that model:

import torch
from diffusers import FluxPipeline, StableDiffusionPipeline
generator = torch.Generator().manual_seed(0)
pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/Counterfeit-V2.5", safety_checker=None).to("cuda")
lora_model_id = "hf-internal-testing/civitai-light-shadow-lora"
lora_filename = "light_and_shadow.safetensors"
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename, low_cpu_mem_usage=True)
images = pipe(
"masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
).images
@BenjaminBossan here's a minimal reproduction:

import torch
from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained("sayakpaul/tiny-flux-pipeline-with-lora", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights(
"sayakpaul/tiny-flux-pipeline-with-lora", weight_name="pytorch_lora_weights.bin", low_cpu_mem_usage=True
)
prompt = "sandor clegane drinking in a pub"
image = pipe(prompt=prompt, num_inference_steps=30).images[0]
image.save("sandor.png") I am on |
See: huggingface/diffusers#9510 (comment). Right now, the low_cpu_mem_usage=True option does not consolidate the devices. E.g. when the model is on GPU and the state_dict is on CPU, the adapter weight will be on CPU after loading, when it should be on GPU. This fix ensures that the devices are consolidated.
Okay, got it now, thanks for the memory-friendly reproducer. Indeed, if the LoRA weights on the model are on the meta device, the device will be taken from the state_dict, which in this case is on CPU. For the time being, you could add this snippet and it should fix the issue:

if low_cpu_mem_usage:
    for module in model.modules():
        if hasattr(module, "_move_adapter_to_device_of_base_layer"):
            module._move_adapter_to_device_of_base_layer(adapter_name)
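Applied at the pipeline level, the same workaround might look like the sketch below. It assumes the LoRA was loaded without an explicit adapter_name, in which case diffusers names it "default_0"; adjust the name if you passed your own.

```python
# Hedged, user-level version of the snippet above, run after
# pipe.load_lora_weights(..., low_cpu_mem_usage=True).
adapter_name = "default_0"  # assumed default name for the first unnamed adapter
for module in pipe.transformer.modules():
    if hasattr(module, "_move_adapter_to_device_of_base_layer"):
        module._move_adapter_to_device_of_base_layer(adapter_name)
```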
…3725)
* [PEFT] Support low_cpu_mem_usage for PEFT loading
* Fix typo
@BenjaminBossan could you give this a review?
@BenjaminBossan thanks!
Yeah, will be resolved after Yiyi's approval.
That's done.
For now, we can ignore it.
Done.
@yiyixuxu could you review this PR? After the approval, I will add docs and request a review from Steven.
Thanks! I left one comment; otherwise the PR looks good to me.
Thanks! Just one minor change that needs to be propagated :)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
…9510)
* allow loras to be loaded with low_cpu_mem_usage.
* add flux support but note https://github.com/huggingface/diffusers/pull/9510#issuecomment-2378316687
* low_cpu_mem_usage.
* fix-copies
* fix-copies again
* tests
* _LOW_CPU_MEM_USAGE_DEFAULT_LORA
* _peft_version default.
* version checks.
* version check.
* version check.
* version check.
* require peft 0.13.1.
* explicitly specify low_cpu_mem_usage=False.
* docs.
* transformers version 4.45.2.
* update
* fix
* empty
* better name initialize_dummy_state_dict.
* doc todos.
* Apply suggestions from code review
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* style
* fix-copies
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
What does this PR do?
huggingface/peft#1961 added the ability to set low_cpu_mem_usage while loading LoRAs. This can be quite helpful in speeding up the loading of LoRAs that are large and have many layers. #8953 is a good example where this feature could be beneficial.
Benchmarking code
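The collapsed benchmarking details are not reproduced here; a rough sketch of how such a load-time comparison could be set up is shown below. It is not the author's actual benchmark script and simply reuses the Flux checkpoint and LoRA from the reproducer earlier in the thread; timings will vary by machine.

```python
import time

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

for low_cpu_mem_usage in (False, True):
    start = time.perf_counter()
    pipe.load_lora_weights(
        "TheLastBen/The_Hound",
        weight_name="sandor_clegane_single_layer.safetensors",
        low_cpu_mem_usage=low_cpu_mem_usage,
    )
    print(f"low_cpu_mem_usage={low_cpu_mem_usage}: {time.perf_counter() - start:.2f}s")
    pipe.unload_lora_weights()  # remove the adapter before timing the next run
```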
Results:
The feature currently needs the user to install peft and transformers from source. So, I suggest we wait until both libraries have made stable releases before merging this PR. Once Ben reviews the PR, I will request a review from Yiyi.
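A minimal sketch of the kind of version gating this implies is shown below. The helper name is made up and the minimum versions simply mirror the commit notes above (peft 0.13.1, transformers 4.45.2); it is not the actual diffusers implementation.

```python
# Illustrative version gate, not the real diffusers helper.
from packaging import version

import peft
import transformers


def lora_low_cpu_mem_usage_supported() -> bool:
    return version.parse(peft.__version__) >= version.parse("0.13.1") and version.parse(
        transformers.__version__
    ) >= version.parse("4.45.2")


low_cpu_mem_usage = lora_low_cpu_mem_usage_supported()
```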