Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow empty initialization of adapter weight #1961

Conversation

BenjaminBossan
Copy link
Member

@BenjaminBossan BenjaminBossan commented Jul 26, 2024

This PR allows to initialize the adapter weights as empty, i.e. on meta device.

Description

Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later.

In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature.

Progress

  • unit tests
  • for all tests to pass, FIX: Bug that prevents BOFT from loading multiple adapters #2068 needs to be merged first
  • documentation
  • make it work on CPU
  • the argument can be passed to all relevant methods
  • the original diffusers issue is addressed
  • adapters other than LoRA have been changed (should be trivial for LoKr etc. but probably not for prefix-tuning, which could be done in a separate PR)
  • adjust PeftMixedModel
  • Renaming: It's now called low_cpu_mem_usage (for context, first name was init_empty).

Diffusers issue

I ran the script from the original diffusers issue to confirm that this helps. On my machine, loading the LoRA adapter takes 2 sec. When I change the default to low_cpu_mem_usage=True, it goes down to 0.7 sec, which is in line with the previous diffusers results. Therefore, this PR indeed allows to get back to the old speed. What is missing is a PR on the diffusers side is:

  • Update all calls to inject_adapter_in_model to use low_cpu_mem_usage=True.
  • Update all calls to set_peft_model_state_dictto also use low_cpu_mem_usage=True
  • Only do this when the PEFT version is >0.12.0 (i.e. next release version).

This PR allows to initialize the adpater weights as empty, i.e. on meta
device. It is still work in progress.

Why would this be useful? For PEFT training, it is indeed not useful, as
we need the real weights in order to train the model. However, when
loading a trained PEFT adapter, it is unnecessary to initialize the
adapters for real, as we override them with the loaded weights later.

In the grand scheme of things, loading the base model will typically be
much slower, but if the user loads, say, dozens of adapters, the
overhead could add up. Of course, besides loading the model, this has no
performance impact and is thus not a high priority feature.

A lot is still missing from this PR:

- unit tests
- documentation
- can the argument be passed to all relevant methods yet?
- adapters other than LoRA also need to be changed
@BenjaminBossan
Copy link
Member Author

BenjaminBossan commented Jul 26, 2024

Here are the results from a first test (warm cache, M2 SSD):

import gc
import os
import time
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"
device = "cuda"  # cpu fails
dtype = torch.bfloat16
torch.manual_seed(0)
inputs = torch.arange(10).view(-1, 1).to(device)
path = f"/tmp/peft/test-empty-weights/{model_id.replace('/', '--')}"

# first save a peft adapter
if not os.path.exists(f"{path}/adapter_model.safetensors"):
    print("adapter does not exist yet, creating it")
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
    # choose high rank to make the effect stronger
    config = LoraConfig(r=64, init_lora_weights=False, target_modules="all-linear")
    model = get_peft_model(model, config).eval()
    with torch.inference_mode():
        outputs_before = model(inputs)
    model.save_pretrained(path)
    # save outputs
    torch.save(outputs_before, f"{path}/outputs_before.pt")

    del model
    torch.cuda.empty_cache()
    gc.collect()
else:
    print("adapter exists, loading")
    outputs_before = torch.load(f"{path}/outputs_before.pt")

# load without init_empty:
tic = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
toc = time.perf_counter()

print(f"loading the base model: {toc - tic:.2f}s")

tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=False)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_not_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_not_empty.logits)

print(f"loading LoRA without init_empty: {toc - tic:.2f}s")
del model
torch.cuda.empty_cache()
gc.collect()

# load with init_empty:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=True)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_empty.logits)

print(f"loading LoRA with init_empty: {toc - tic:.2f}s")
del model
torch.cuda.empty_cache()
gc.collect()
loading the base model: 3.17s
loading LoRA without init_empty: 1.21s
loading LoRA with init_empty: 0.23s

@charchit7
Copy link

Hey @BenjaminBossan, to work on this PR, do we need Mac to test? I could see other PRs mentioning speed-up for Mac M2. I am interested in completing it by this weekend if a Mac specifically is not required.

@BenjaminBossan
Copy link
Member Author

I am interested in completing it by this weekend

I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.

to work on this PR, do we need Mac to test? I could see other PRs mentioning speed-up for Mac M2

No, Mac is not needed. M2 was referring to M.2 SSDs, I should have been more precise :)

@charchit7
Copy link

I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.

-- Sure, @BenjaminBossan, could you please tell me, if I had to start on this, how I should approach it so that I can acquire deeper PEFT knowledge? Happy to read/work on it.

@BenjaminBossan
Copy link
Member Author

could you please tell me, if I had to start on this, how I should approach it so that I can acquire deeper PEFT knowledge? Happy to read/work on it.

It's hard to tell how to best get started. The code base is big but that doesn't mean you need to know it in its entirety to get up to speed (even I don't know everything). Learning by doing is probably going to work best, but I think it can be easy to get stuck, so you may want to start with a smaller PR if you notice that.

For this PR, I have listed the TODOs at the top. It would probably be best to start with a unit test to ensure that the feature is working as intended and later changes don't break it. The unit tests don't need to be all comprehensive, that can be added later in the PR.

For reference, this is the script that I used to measure the times. It could be adapted for a unit test:

import gc
import os
import time
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"
device = "cuda"  # cpu fails
dtype = torch.bfloat16
torch.manual_seed(0)
inputs = torch.arange(10).view(-1, 1).to(device)
path = f"/tmp/peft/test-empty-weights/{model_id.replace('/', '--')}"

# first save a peft adapter
if not os.path.exists(f"{path}/adapter_model.safetensors"):
    print("adapter does not exist yet, creating it")
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
    # choose high rank to make the effect stronger
    config = LoraConfig(r=64, init_lora_weights=False, target_modules="all-linear")
    model = get_peft_model(model, config).eval()
    with torch.inference_mode():
        outputs_before = model(inputs)
    model.save_pretrained(path)
    # save outputs
    torch.save(outputs_before, f"{path}/outputs_before.pt")

    del model
    torch.cuda.empty_cache()
    gc.collect()
else:
    print("adapter exists, loading")
    outputs_before = torch.load(f"{path}/outputs_before.pt")

# load without init_empty:
tic = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
toc = time.perf_counter()

print(f"loading the base model: {toc - tic:.2f}s")

tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=False)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_not_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_not_empty.logits)

print(f"loading LoRA without init_empty: {toc - tic:.2f}s")
del model
torch.cuda.empty_cache()
gc.collect()

# load with init_empty:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=True)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_empty.logits)

print(f"loading LoRA with init_empty: {toc - tic:.2f}s")
del model
torch.cuda.empty_cache()
gc.collect()

Of course, the base model needs to be much smaller than Llama 8B and I'm not sure how flaky the measured times will be.

@charchit7
Copy link

Thank you for the detailed response. I'll look into it.

@BenjaminBossan
Copy link
Member Author

@charchit7 Do you still plan on working on this?

@charchit7
Copy link

Hey @BenjaminBossan, sorry, I was held up with work. Would you mind if I took some more time on this?
Do you have any deadline by which you would want this to close?

@BenjaminBossan
Copy link
Member Author

Would you mind if I took some more time on this?
Do you have any deadline by which you would want this to close?

No, I don't mind, just let me know if you don't have the time to work on this. The reason why I asked is because there might be someone else who might be able to work on this, but I want to avoid duplicate work.

@charchit7
Copy link

charchit7 commented Aug 7, 2024

Sure @BenjaminBossan, I'll start on this today. If again stuck somewhere will let you know for sure. Thank you for being patient with me.

@BenjaminBossan
Copy link
Member Author

Thanks a lot. Again, don't feel pressured, there is no deadline. I was just asking as we had contributors in the past who wanted to work on something but then ghosted -- which is fine, just let us know so that the work is not blocked.

@charchit7
Copy link

charchit7 commented Aug 8, 2024

Hey @BenjaminBossan I have started working on the problem, first thing I wanted to try was benchmark timings by running the runs which you posted in huggingface/diffusers#8953 (comment)

import time

import torch
from diffusers import StableDiffusionXLPipeline

try:
    import peft
    print("peft is installed")
    print(peft.__version__)
    print(peft.__path__)
except ImportError:
    print("peft is not installed")

device = 'cuda'
model_path = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline.from_pretrained(
    model_path, torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to('cuda')

t1 = time.time()
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", weight_name="pytorch_lora_weights.safetensors")
print(f"load lcm lora weights time: {time.time()- t1:.2f}s")

Memory consumption for the lora + model
image

First Test :

My configuration:

  • A10G 24gb vram.
  • peft : 0.12.0
  • Diffusers : '0.30.0.dev0'
  • Average time in seconds for 5 runs ~4.811

Second Test :

My configuration:

  • A10G 24gb vram
  • peft : 0.12.0
  • Diffusers : 0.30.0 ( not the dev version)
  • Average time in seconds for 5 runs ~ 4.2075

Third Test ( for this I installed peft using your branch (allow-empty-init-of-adapter-weights)

My configuration :

  • A10G 24GB vram
  • Diffusers : 0.30.0
  • Peft 0.12.0
  • Average time in seconds for 5 runs ~ 5.24s

Please let me know if you want me to test any similar runs. Next, I will try the above code you gave (#1961 (comment)). I have your branch installed, so I am able to see the init_empty: 'bool' = False argument you updated.

@BenjaminBossan
Copy link
Member Author

Thanks for running some tests. Regarding the third test, it won't have any effect right now, even with this branch, because diffusers would also need to be changed. Basically, we'd have to ensure that diffusers passes init_empty=True. Alternatively, you could edit your local PEFT code to make this the default just for testing purposes.

With the code from my comment above, you should, however, see the speedup when using this branch.

@charchit7
Copy link

Ahh, yeah, I missed it. Thanks for pointing that out.

@charchit7
Copy link

I ran the above code using your branch with the same system config as the third test.
These are the results. Do you think 4.35s is decent?

adapter does not exist yet, creating it
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [01:56<00:00, 29.15s/it]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Loading checkpoint shards: 100%|██████████| 4/4 [01:53<00:00, 28.38s/it]
loading the base model: 113.76s
loading LoRA without init_empty: 5.64s
Loading checkpoint shards: 100%|██████████| 4/4 [01:37<00:00, 24.28s/it]
loading LoRA with init_empty: 4.35s

Also, In modelling_utils.py we have,

if is_torch_version(">=", "1.9.0"):
    _LOW_CPU_MEM_USAGE_DEFAULT = True
else:
    _LOW_CPU_MEM_USAGE_DEFAULT = False

Shall we keep default False for both the case?

@BenjaminBossan
Copy link
Member Author

These are the results. Do you think 4.35s is decent?

4.35 vs 5.64 isn't really a big difference. My guess is that this is i/o bound on your machine and the saved compute is only ~1sec, similar to what I found. You could try running this a few times to see if that changes anything. Regardless of that, we can still proceed with the PR though.

Shall we keep default False for both the case?

Are you referring to diffusers? Let's keep that part untouched for now. On the PEFT side, let's also keep the default to False. Maybe there is no downside to switching the default to True for load_adapter, but I'd have to think it through fully before making the switch. Backwards compatibility is very important.

@charchit7
Copy link

4.35 vs 5.64 isn't really a big difference. My guess is that this is i/o bound on your machine, and the saved compute is only ~1sec, similar to what I found.

Yeah, that makes sense.

You could try running this a few times to see if that changes anything.

Regarding this: I was running on a notebook, if I keep running that cell, it speeds up but adds up the memory for each run +7gb.

I'll start with the unit-test. PR.

@charchit7
Copy link

Just a reminder, @BenjaminBossan, that I will work on this. Sorry for the delay.

@BenjaminBossan
Copy link
Member Author

Feel free to ask any questions that may come up @charchit7.

@nachoal
Copy link

nachoal commented Sep 5, 2024

Hey @BenjaminBossan any tips on speeding up loading a flux lora weight on an API with: pipe.load_lora_weights(extracted_folder, weight_name="flux_train_latentgen.safetensors", adapter_name="main_lora") ?

I'm seeing avg load times of 40 seconds just on this step per api call run ( each user has a different lora), is there a performance opt that you recommend to speed this up?

Thank you.

@BenjaminBossan
Copy link
Member Author

@nachoal It's hard to say with the given information. Could you please create a separate issue and provide as much information as possible so that I can try to reproduce the error?

@nachoal
Copy link

nachoal commented Sep 8, 2024

@nachoal It's hard to say with the given information. Could you please create a separate issue and provide as much information as possible so that I can try to reproduce the error?

Thanks @BenjaminBossan, I just created an issue with more information here: #2055

@charchit7
Copy link

Hey, @BenjaminBossan I am not able to get time to work on this. Sadly getting caught up in multiple threads, gonna keep this open for anyone else if they want to continue this thread else will complete this as I get time. Again, apologies for keeping this open for long.

@BenjaminBossan
Copy link
Member Author

Thanks for letting me know @charchit7. I'll pick this PR back up then.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan changed the title [WIP] Allow empty initialization of adapter weight Allow empty initialization of adapter weight Sep 18, 2024
@BenjaminBossan BenjaminBossan changed the title Allow empty initialization of adapter weight ENH: Allow empty initialization of adapter weight Sep 18, 2024
@BenjaminBossan BenjaminBossan marked this pull request as ready for review September 18, 2024 13:15
@BenjaminBossan
Copy link
Member Author

@sayakpaul I hope this is not too much to ask for a review. With this PR in PEFT, we can restore the initial speed of loading LoRAS in diffusers as described in the initial comment.

When it comes to the review, a lot of files have been touched but the majority consists just of the same small change to each PEFT method to account for the new argument. The core logic of this PR is actually in just these 3 lines:

https://github.com/huggingface/peft/pull/1961/files#diff-95787fcc5356568929e383449ade2c1bac5da43deccab17318652b62ed171ae7R162-R164

The rest is just passing the argument along, documentation, and tests.

Currently, this feature is strictly opt-in. However, for the use case of loading adapters, I think it should be safe to make this the default (i.e. switch to opt-out). My current plan is to leave this feature as opt-in for the upcoming release with an announcement that the default will be switched and calling for users to test it. Then, for the next release after that, I would toggle the default unless users report errors.

@sayakpaul
Copy link
Member

@BenjaminBossan of course not! I will review it tomorrow, first thing my time.

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for shipping this feature! I have mostly left some minor comments.

My question is do we want to call it low_cpu_mem_usage instead of init_empty? The former is used throughout transformers and diffusers.


assert model.linear.lora_A["default"].weight.device.type == "meta"
set_peft_model_state_dict(model, peft_state_dict, init_empty=True)
model.linear.lora_A["default"].weight.device.type == "cpu"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have the asserts in the example codebase? No strong opinions, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's have them within print()s and comment the expected values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added # should be True, not sure if that's what you meant.


### Loading adapter weights is slow

Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. To speed up the loading time, you can pass the `init_empty=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make sense to add a note on why there's an overhead on loading adapter layers (because they are first initialized and then they are populated with the pre-trained params).

@@ -210,6 +210,8 @@ def inject_adapter_in_model(
The input model where the adapter will be injected.
adapter_name (`str`, `optional`, defaults to `"default"`):
The name of the adapter to be injected, if not provided, the default adapter name is used ("default").
init_empty (`bool`, `optional``, defaults to `False`):
Create empty adapter weights on meta device. Useful to speed up the process.
Copy link
Member

@sayakpaul sayakpaul Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Create empty adapter weights on meta device. Useful to speed up the process.
Create empty adapter weights on meta device. Useful to speed up the loading process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no change in the suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad. See now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I fixed all instances.

src/peft/mixed_model.py Show resolved Hide resolved
src/peft/peft_model.py Outdated Show resolved Hide resolved
@@ -801,6 +812,9 @@ def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optio
continue
if adapter_name not in adapter_layer:
continue
if any(p.device == meta for p in adapter_layer.parameters()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, do we have to handle the "else" part of this branch?

Comment on lines 457 to 458
if init_empty:
load_result = model.load_state_dict(peft_model_state_dict, strict=False, assign=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own knowledge:

Can you walk me through a scenario when set_peft_model_state_dict() would be required to call with init_empty=True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You replied yourself below :)

# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
remapped_dict[new_key] = val.to(device)
errors = set_peft_model_state_dict(model, remapped_dict, init_empty=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay. So, set_peft_model_state_dict() has to be called with init_empty=True if the injection (inject_adapter_in_model()) was made with init_empty=True. I think that should be okay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. I guess we could try to automatically detect it in set_peft_model_state_dict to remove the argument, but I think since it's a very low level API, it's okay to be explicit here (and maybe there is some edge case where users don't want it).

Comment on lines 1448 to 1449
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go away?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, done.


# also test injecting directly
del model
model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do this for a diffusers model class as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would require a big rewrite of the test without really improving the test coverage. Unless you think there is a specific reason why this would not work on diffusion models, I'd rather not do that. Since we plan to support this feature on the diffusers side, which will surely include tests, I think this will be covered there? WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we would expose another argument for low_cpu_mem_usage for loading LoRAs. Would prefer to keep the peft level atomic feature tests within peft, itself.

If the test suite needs a big rewrite, we could just create a Tester class with a specific model class from diffusers and write a simple test case like this one. Will that work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added dedicated stable diffusion tests. These tests should more or less reflect how the code in diffusers would look like so as to use this new feature.

fbace51

Most notably, renaming the argument.
Copy link
Member Author

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the helpful comments, I think they should be addressed now.

When it comes to the new name, I agree that low_cpu_mem_usage corresponds mostly to what is implemented here. Personally, I find the name a bit confusing, but for consistency I agree it's the correct choice.


assert model.linear.lora_A["default"].weight.device.type == "meta"
set_peft_model_state_dict(model, peft_state_dict, init_empty=True)
model.linear.lora_A["default"].weight.device.type == "cpu"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -210,6 +210,8 @@ def inject_adapter_in_model(
The input model where the adapter will be injected.
adapter_name (`str`, `optional`, defaults to `"default"`):
The name of the adapter to be injected, if not provided, the default adapter name is used ("default").
init_empty (`bool`, `optional``, defaults to `False`):
Create empty adapter weights on meta device. Useful to speed up the process.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no change in the suggestion.

src/peft/mixed_model.py Show resolved Hide resolved
src/peft/peft_model.py Outdated Show resolved Hide resolved
@@ -1125,14 +1149,18 @@ def load_adapter(
raise ValueError("Cannot set a prompt learning adapter to trainable when loading pretrained adapter.")
else:
peft_config.inference_mode = not is_trainable
self.add_adapter(adapter_name, peft_config)
self.add_adapter(adapter_name, peft_config, init_empty=init_empty)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad hoc, I can't think of a use case, but there might be. I think the description in the docstring of add_adapter, "Don't use this option when creating a new PEFT adapter for training", is sufficient here, as users would have to opt-in to use the option.

src/peft/tuners/boft/model.py Show resolved Hide resolved
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
remapped_dict[new_key] = val.to(device)
errors = set_peft_model_state_dict(model, remapped_dict, init_empty=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. I guess we could try to automatically detect it in set_peft_model_state_dict to remove the argument, but I think since it's a very low level API, it's okay to be explicit here (and maybe there is some edge case where users don't want it).

Comment on lines 457 to 458
if init_empty:
load_result = model.load_state_dict(peft_model_state_dict, strict=False, assign=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You replied yourself below :)

Comment on lines 1448 to 1449
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, done.


# also test injecting directly
del model
model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would require a big rewrite of the test without really improving the test coverage. Unless you think there is a specific reason why this would not work on diffusion models, I'd rather not do that. Since we plan to support this feature on the diffusers side, which will surely include tests, I think this will be covered there? WDYT?

Copy link
Member Author

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional feedback, I think your concerns have been addressed.


# also test injecting directly
del model
model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added dedicated stable diffusion tests. These tests should more or less reflect how the code in diffusers would look like so as to use this new feature.

fbace51

@@ -210,6 +210,8 @@ def inject_adapter_in_model(
The input model where the adapter will be injected.
adapter_name (`str`, `optional`, defaults to `"default"`):
The name of the adapter to be injected, if not provided, the default adapter name is used ("default").
init_empty (`bool`, `optional``, defaults to `False`):
Create empty adapter weights on meta device. Useful to speed up the process.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I fixed all instances.


assert model.linear.lora_A["default"].weight.device.type == "meta"
set_peft_model_state_dict(model, peft_state_dict, init_empty=True)
model.linear.lora_A["default"].weight.device.type == "cpu"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added # should be True, not sure if that's what you meant.

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for shipping this. LGTM!

docs/source/developer_guides/low_level_api.md Outdated Show resolved Hide resolved
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@charchit7
Copy link

Thanks again, @BenjaminBossan, sorry for holding this up for long time.

@BenjaminBossan BenjaminBossan merged commit af275d2 into huggingface:main Sep 23, 2024
14 checks passed
@BenjaminBossan BenjaminBossan deleted the allow-empty-init-of-adapter-weights branch September 23, 2024 09:13
yaswanth19 pushed a commit to yaswanth19/peft that referenced this pull request Sep 23, 2024
This PR allows to initialize the adpater weights as empty, i.e. on meta
device, by passing low_cpu_mem_usage=True.

Why would this be useful? For PEFT training, it is indeed not useful, as
we need the real weights in order to train the model. However, when
loading a trained PEFT adapter, it is unnecessary to initialize the
adapters for real, as we override them with the loaded weights later.

In the grand scheme of things, loading the base model will typically be
much slower, but if the user loads, say, dozens of adapters, the
overhead could add up. Of course, besides loading the model, this has no
performance impact and is thus not a high priority feature.

For the time being, this is completely opt in. However, it should be safe to
make this default for loading adapters. Therefore, in the future we may change
the default there.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
BenjaminBossan added a commit to BenjaminBossan/transformers that referenced this pull request Sep 26, 2024
PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.
ArthurZucker pushed a commit to BenjaminBossan/transformers that referenced this pull request Oct 3, 2024
PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.
ArthurZucker pushed a commit to huggingface/transformers that referenced this pull request Oct 3, 2024
…3725)

* [PEFT] Support low_cpu_mem_usage for PEFT loading

PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.

* Fix typo
NielsRogge pushed a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024
…ggingface#33725)

* [PEFT] Support low_cpu_mem_usage for PEFT loading

PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.

* Fix typo
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…ggingface#33725)

* [PEFT] Support low_cpu_mem_usage for PEFT loading

PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.

* Fix typo
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
…ggingface#33725)

* [PEFT] Support low_cpu_mem_usage for PEFT loading

PEFT added support for low_cpu_mem_usage=True when loading adapters in
huggingface/peft#1961. This feature is now
available when installing PEFT v0.13.0. With this PR, this option is
also supported when loading PEFT adapters directly into transformers
models.

Additionally, with this PR,
huggingface/diffusers#9510 will be unblocked,
which implements this option in diffusers.

* Fix typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants