PiSSA, OLoRA: Delete initial adapter after conversion instead of the active adapter #1933
Conversation
Resolves huggingface#1860

As discussed in that issue, it's not user friendly to delete the default adapter of a PiSSA/OLoRA model after calling `save_pretrained` with weight conversion. Instead, it is much more intuitive to delete the initial adapter, since it is loaded inside the method and not by the user, so it's really an implementation detail.

Apart from this, I made the following related changes:

- Put everything in a `try ... finally` to ensure that the initial adapter does not hang around if there is an error (thus not hogging memory).
- Renamed `initial_adapter` to `initial_adapter_name`, to make it clear that this is the name and not the adapter itself.

For this reason, the diff looks much bigger than it actually is; most lines are actually identical.
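For context, here is a rough sketch of the user-facing flow this PR touches. The model name, paths, and hyperparameters are only examples, and the recipe is paraphrased from the PEFT docs rather than taken from this PR, so treat it as an illustration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example base model and directories; any causal LM works the same way.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(init_lora_weights="pissa", r=8, lora_alpha=16, task_type="CAUSAL_LM")
peft_model = get_peft_model(base, config)

# Save the initial (untrained) PiSSA adapter for later weight conversion.
# The saved config needs init_lora_weights=True, see the check further down in this PR.
peft_model.peft_config["default"].init_lora_weights = True
peft_model.save_pretrained("pissa-init")
peft_model.peft_config["default"].init_lora_weights = "pissa"

# ... train the adapter here ...

# Save with conversion to a plain LoRA checkpoint. After this PR, the trained
# "default" adapter stays loaded and only the temporarily loaded initial adapter
# is deleted again.
peft_model.save_pretrained(
    "pissa-converted",
    path_initial_model_for_weight_conversion="pissa-init",
)
```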
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Again tagging @fxmeng and @tokenizer-decode as this affects the code they contributed.
Makes so much more sense. Just one thing that I did not realize before: this changes peak GPU mem usage, right? Or does…
Yes, I had the same concerns with regard to memory. That's why I decided to delete the init adapter and not leave it hanging around. Of course, this still doesn't change the fact that we'll have a bit of a peak for the time when both adapters are loaded at the same time. However, I think in the grand scheme of things it shouldn't matter much. Peak memory should be affected much more during training, when it needs to account for hidden states etc. Unless I'm missing something, if there is enough memory to train the model, there should be enough to temporarily load a second LoRA adapter. We can still think about ways of preventing this, but it's not really related to this PR, so I'd do that in a separate one.
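To put a number on that transient peak, one could wrap the converting save in CUDA allocator stats, roughly like this (a sketch, assuming a trained `peft_model` on a CUDA device and the example paths from the snippet above):

```python
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()

# The initial adapter is loaded temporarily inside this call and deleted afterwards.
peft_model.save_pretrained(
    "pissa-converted",
    path_initial_model_for_weight_conversion="pissa-init",
)

peak = torch.cuda.max_memory_allocated()
after = torch.cuda.memory_allocated()
print(f"before: {before / 2**30:.2f} GiB")
print(f"peak during save: {peak / 2**30:.2f} GiB")
print(f"after: {after / 2**30:.2f} GiB  # should be back to roughly the 'before' value")
```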
It would be great to have a reproducer for this.
I really like to fit as much data as I possibly can into my RAM at a given time. Assume I'm training at 39.5 GB of my 40 GB of GPU memory. If the trainer frees the memory as soon as training is done, that would not be an issue and we can load the new adapter. But if not, that could potentially OOM. It depends on the trainer in this case.
Sure. Would you like me to open an issue here or in the transformers repo?
It could be a transformers issue or also an accelerate issue (or even PyTorch), depending on what exactly is happening, so I can't say for sure. But it sounds like something that needs addressing.
src/peft/peft_model.py (outdated)
adapter_name=initial_adapter_name,
)
if any(
    str(self.peft_config[initial_adapter_name].init_lora_weights).lower().startswith(prefix)
Let's assign `str(self.peft_config[initial_adapter_name].init_lora_weights).lower().startswith(prefix)` to a variable and use it. Easier to read.
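Something along these lines, for illustration (the variable name and the `prefixes` iterable are placeholders, not the actual code in this PR):

```python
# Hypothetical refactor: name the inlined expression before using it in the check.
lora_init_str = str(self.peft_config[initial_adapter_name].init_lora_weights).lower()
if any(lora_init_str.startswith(prefix) for prefix in prefixes):
    raise ValueError(
        "The `init_lora_weights` parameter of the initial adapter should be set to `True`. "
        "Otherwise, `self.load_adapter` will subtract the decomposed values again based on the residual model."
    )
```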
Done
raise ValueError(
    "The `init_lora_weights` parameter of the initial adapter should be set to `True`. "
    "Otherwise, `self.load_adapter` will subtract the decomposed values again based on the residual model."
)
initial_adapter_name = os.path.basename(path_initial_model_for_weight_conversion)
Just for sanity: is it 100 percent okay to assume that it's not against users' interests to always delete the initial adapter, rather than letting them choose which adapter to delete?
Yes, there is pretty much no reason why users would want to load the initial model and delete the trained model. I pondered keeping both, but users could be surprised by the extra memory required, which could, for instance, cause evals that are run after training and saving the model to OOM.
The new behavior is documented in the release notes: https://github.com/huggingface/peft/releases/tag/untagged-f3a302cd5b4f9d08a46b
(Release planned later today)
peft_config_keys_before = list(peft_model.peft_config.keys())
peft_model.save_pretrained(
    tmp_path / "pissa-model-converted", path_initial_model_for_weight_conversion=tmp_path / "init-model"
)
peft_config_keys_after = list(peft_model.peft_config.keys())
assert peft_config_keys_before == peft_config_keys_after
For my understanding: why would this hold, given that we're deleting an adapter?
To do the weight conversion, we have to temporarily load the initial adapter (i.e. from before it was trained). Before this PR, we would keep that initial adapter around but delete the actually trained adapter. This is surprising for users, who would rather keep the trained adapter around, e.g. to perform evals (see the linked issue #1860). With this PR, instead we keep the trained adapter around and delete the initial adapter, which was only really needed for the conversion step.
As such, I added the assert that the loaded adapters are the same before and after saving, which is also true when saving any other kind of model.
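In other words, the converting save now roughly follows this shape (a simplified sketch with placeholder helper names, not the actual peft implementation):

```python
import os

def save_with_weight_conversion(self, save_directory, path_initial_model_for_weight_conversion):
    # Sketch only; `convert_and_save` is a placeholder for the real conversion logic.
    initial_adapter_name = os.path.basename(path_initial_model_for_weight_conversion)
    try:
        # Temporarily load the initial (untrained) adapter next to the trained one.
        self.load_adapter(path_initial_model_for_weight_conversion, adapter_name=initial_adapter_name)
        # Express the trained weights as a delta against the initial adapter and save that.
        output_state_dict = self.convert_and_save(save_directory, initial_adapter_name)
    finally:
        # Always remove the temporarily loaded initial adapter, even if conversion fails,
        # so it does not linger in memory; the trained adapter stays loaded.
        self.delete_adapter(initial_adapter_name)
    return output_state_dict
```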
Thanks! Left a comment and some clarification questions.
Merging despite Windows CI failing. As shown in #1947, Windows CI is failing for an unrelated reason.