Introduce AdaDoRA implementation using both DoRA and AdaLoRA together #2969
Conversation
Adds weight-decomposed adaptation to AdaLoRA, introducing a learnable magnitude vector and DoRA-scaled forward/merge paths. Propagates configuration to layers and model, updates rank allocation to skip magnitude params, and recalculates magnitudes after pruning/masking. Introduces a GLUE finetuning script that wires AdaLoRA/DoRA, computes total steps for scheduling, and updates rank allocation via a training callback. Improves flexibility by supporting DoRA in AdaLoRA, safer merging, and clearer training integration; unmerging for DoRA remains unsupported.
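The DoRA-scaled forward path described above can be sketched with toy tensors. This is a minimal illustration of the m/||W+ΔW|| rescaling, not the actual PEFT code; names like `mag_norm_scale` mirror the DoRA formulation.

```python
import torch

# toy dimensions: batch=2, in_features=16, out_features=8
W = torch.randn(8, 16)            # frozen base weight
lora_weight = torch.randn(8, 16)  # low-rank delta, i.e. B @ (A * E) * scaling / ranknum
magnitude = torch.rand(8) + 0.5   # learnable magnitude vector m, one entry per output dim

# DoRA scale: m / ||W + dW||, computed per output dimension (row-wise L2 norm)
weight_norm = (W + lora_weight).norm(p=2, dim=1)
mag_norm_scale = (magnitude / weight_norm).view(1, -1)

x = torch.randn(2, 16)
base_result = x @ W.T
lora_result = x @ lora_weight.T

# DoRA rescales the combined output by m / ||W + dW||
result = mag_norm_scale * (base_result + lora_result)

# equivalent to applying the rescaled weight directly
alt = x @ ((magnitude / weight_norm).unsqueeze(1) * (W + lora_weight)).T
assert torch.allclose(result, alt, atol=1e-5)
```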
- Added `AdaDoraLinearLayer` for handling AdaLoRA's 3-factor decomposition.
- Introduced `AdaDoraLinearVariant` for managing variant-specific operations.
- Updated `AdaLoraLayer` to support DoRA integration and manage variants.
- Enhanced `SVDLinear` to utilize the new variant pattern for merging and unmerging.
- Implemented tests for `AdaDoraLinearLayer` and `AdaDoraLinearVariant` to ensure functionality.
- Updated `AdaLoraModel` to handle magnitude updates after rank pruning.
- Added `uv.lock` file for dependency management.
…tation after rank pruning
BenjaminBossan
left a comment
Thanks for opening a new PR to add DoRA support to AdaLoRA. I did a first review pass focused on the general design of the implementation; I haven't done a detailed check of the DoRA logic or the example yet, as it makes more sense to finalize the integration first. Please check my comments.
I also saw that there are unrelated changes in the PR, like changing the capitalization of comments or adding the uv lock file. Could you please revert all those changes? Finally, please run `make style` to ensure the linter is happy.
Moreover, before the PR can be merged, the docs should be updated, but we can leave this for later.
```python
self.ipt[n] = torch.zeros_like(p)
self.exp_avg_ipt[n] = torch.zeros_like(p)
self.exp_avg_unc[n] = torch.zeros_like(p)
# skip update if no gradient (can happen at first step or for frozen params)
```
Is this something new that arose with AdaDoRA?
Yes! This isn't strictly new to AdaDoRA, but it arose while training on various tasks; it's actually a defensive fix I added while working on the integration.
The issue can occur in vanilla AdaLoRA too: at the very first training step, gradients haven't been computed yet, so `p.grad` is `None`. Calling `(p * p.grad).abs()` would then crash.
That said, I noticed it specifically while testing AdaDoRA because the magnitude vectors added another edge case to watch for. I figured a simple None check here is cheap insurance and makes the code more robust overall.
Happy to add a more detailed comment if you think it would help clarify!
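A minimal sketch of the failure mode and the guard, with toy tensors. The helper name `update_importance` is hypothetical; the real update lives in AdaLoRA's rank allocator.

```python
import torch

# a fresh parameter has no gradient until the first backward pass
p = torch.nn.Parameter(torch.randn(4, 4))
assert p.grad is None

# importance-style update guarded against the missing gradient
def update_importance(p):
    if p.grad is None:
        # skip update if no gradient (can happen at first step or for frozen params)
        return None
    return (p * p.grad).abs()

# without the guard, (p * p.grad) would raise a TypeError at step 0
assert update_importance(p) is None

# after a backward pass, the sensitivity is computed normally
loss = (p ** 2).sum()
loss.backward()
assert update_importance(p).shape == (4, 4)
```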
Thanks for the details. Yes, if you could provide a small reproducer for this error, ideally with normal AdaLoRA, that would be great.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
- Merge `lora_magnitude_vector` check with main conditional (with parentheses)
- Remove internal exports (`AdaDoraLinearLayer`, `AdaDoraLinearVariant`)
- Remove obsolete `test_adalora_use_dora_raises` test
- Move `run_glue.py` to `examples/adalora_adadora/` with README
The method was a no-op (just `pass`), since resetting the magnitude after pruning disabled DoRA by making the ratio m/||W+ΔW|| = 1. Removed the dead code for cleanliness.
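The no-op can be seen directly with toy tensors: if the magnitude is reset to the current weight norm, the DoRA scale collapses to 1.

```python
import torch

W = torch.randn(8, 16)   # base weight
dW = torch.randn(8, 16)  # adapter delta
weight_norm = (W + dW).norm(p=2, dim=1)

# resetting the magnitude to the current weight norm after pruning ...
magnitude = weight_norm.clone()

# ... collapses the DoRA ratio m / ||W + dW|| to exactly 1, so the
# rescaling becomes a no-op and DoRA is effectively disabled
scale = magnitude / weight_norm
assert torch.allclose(scale, torch.ones(8))
```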
Per reviewer feedback: variant resolution now happens inside the base class after `_move_adapter_to_device_of_base_layer`, ensuring correct ordering. `SVDLinear` just overrides `resolve_lora_variant()` to return the AdaDoRA variant.
Per reviewer feedback: consistent with the LoRA DoRA implementation. Added a `_compute_lora_weight` helper to compute `B @ (A * E) * scaling / ranknum` externally instead of inside `get_weight_norm`.
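A sketch of what such a helper might look like (hypothetical standalone signature, following AdaLoRA's SVD parametrization):

```python
import torch

def compute_lora_weight(lora_A, lora_E, lora_B, scaling, ranknum, eps=1e-5):
    """Delta weight for AdaLoRA's SVD parametrization: B @ (A * E) * scaling / ranknum.

    lora_A: (r, in_features) right singular vectors
    lora_E: (r, 1) singular values, scaling the rows of A elementwise
    lora_B: (out_features, r) left singular vectors
    """
    return lora_B @ (lora_A * lora_E) * scaling / (ranknum + eps)

# toy shapes: r=4, in_features=16, out_features=8
A = torch.randn(4, 16)
E = torch.randn(4, 1)
B = torch.randn(8, 4)
w = compute_lora_weight(A, E, B, scaling=2.0, ranknum=4)
assert w.shape == (8, 16)
```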
BenjaminBossan
left a comment
Thanks for the updates. I did another review and still found a few areas for improvement, please check my comments.
```python
mag_norm_scale = (magnitude / weight_norm).view(1, -1)

# compute adapter output: x @ (A*E).T @ B.T * scaling / ranknum
lora_result = (x @ (lora_A * lora_E).T @ lora_B.T) * (scaling / (ranknum + 1e-5))
```
Ah, I forgot about the nn.Parameter thing, in that case just disregard what I said. This may cause issues with FSDP but it's a bit of an edge case.
Something else: For calculating the lora_result, can we not re-use the lora_weight that was already computed above?
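The suggested reuse is mathematically equivalent, as this toy-tensor sketch shows (illustrative shapes, not the PR's actual code):

```python
import torch

# toy shapes: batch=2, in_features=16, out_features=8, r=4
x = torch.randn(2, 16)
lora_A = torch.randn(4, 16)
lora_E = torch.randn(4, 1)
lora_B = torch.randn(8, 4)
scaling, ranknum = 2.0, 4

# delta weight computed once up front
lora_weight = lora_B @ (lora_A * lora_E) * (scaling / (ranknum + 1e-5))

# adapter output via the individual factors ...
result_factors = (x @ (lora_A * lora_E).T @ lora_B.T) * (scaling / (ranknum + 1e-5))

# ... matches re-using the precomputed delta weight
result_reused = x @ lora_weight.T
assert torch.allclose(result_factors, result_reused, atol=1e-5)
```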
```python
"lora_E",
"lora_embedding_A",
"lora_embedding_B",
"lora_magnitude_vector",
```
Let's remove `lora_magnitude_vector` here, since normal AdaLoRA doesn't have this. Instead, the DoRA variant should add it during its init. See how it's done in LoRA:
peft/src/peft/tuners/lora/variants.py
Lines 135 to 138 in 261366d
```python
def init(module: Linear, adapter_name: str, **kwargs: Any) -> None:
    if not module.lora_magnitude_vector:
        # first dora layer being added, add lora_magnitude_vector to the list of learnable parameters
        module.adapter_layer_names = module.adapter_layer_names[:] + ("lora_magnitude_vector",)
```
```python
self.lora_dropout[adapter_name] = lora_dropout_layer
# Actual trainable parameters
# Right singular vectors
# actual trainable parameters
```
Please undo these kinds of changes. There are more of those below.
```python
self._move_adapter_to_device_of_base_layer(adapter_name)

# initialize variant if needed (e.g., DoRA) - must happen after device move
if use_dora:
```
This check can be removed; `if lora_variant is not None` is enough.
Thanks for adding tests, but let's use the existing test harness instead of introducing completely new tests. As a start, we can add AdaDoRA to the decoder model test matrix, check how it's done here:
peft/tests/test_decoder_models.py
Lines 88 to 94 in 261366d
```python
AdaLoraConfig,
{
    "task_type": "CAUSAL_LM",
    "target_modules": None,
    "total_step": 1,
},
),
```
After adding an entry for AdaDoRA, please ensure that the corresponding tests pass (something like `pytest tests/test_decoder_models.py -k "adalora"` should do the trick).
As for the tests in this file, I think they are all covered by existing tests and can thus be removed. You can take a look at the printed test coverage report to see if there are gaps in the tests when it comes to the newly added code.
```md
@@ -0,0 +1,34 @@
# AdaDoRA (AdaLoRA + DoRA) Example

This example demonstrates how to use **AdaDoRA** - a combination of AdaLoRA's adaptive rank allocation with DoRA's weight-decomposition approach for parameter-efficient fine-tuning.
```
Let's link the papers.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
This pull request introduces support for AdaDoRA (AdaLoRA with DoRA) to the codebase, enabling the use of DoRA (Weight-Decomposed Low-Rank Adaptation) within the AdaLoRA framework. The main changes add new modules and configuration options to support this, update the initialization and management of adapter layers, and provide a new GLUE finetuning script that demonstrates AdaLoRA/DoRA usage.
Key changes include:
AdaDoRA (DoRA for AdaLoRA) Support
- Added the `AdaDoraLinearLayer` class in `adalora/dora.py`, implementing DoRA's magnitude vector and forward logic for AdaLoRA's SVD decomposition, including proper handling of singular values and rank pruning.
- Updated `adalora/__init__.py` to export `AdaDoraLinearLayer` and `AdaDoraLinearVariant` for external use.
- Updated `adalora/config.py` to allow `use_dora` (DoRA) in AdaLoRA by removing the previous error, enabling AdaDoRA configuration.

Adapter Layer and Variant Management
- Updated `AdaLoraLayer` in `adalora/layer.py` to include `lora_magnitude_vector` in `adapter_layer_names` for DoRA support, add a `lora_variant` dictionary to manage per-adapter variants (e.g., AdaDoRA), and honor the `use_dora` flag by resolving the correct variant, including initialization of the AdaDoRA variant if requested.

Utilities and Imports
- Imported `LoraVariant` into `adalora/layer.py` to support adapter variants.

Example and Usage
- Added `scripts/run_glue.py`, which demonstrates finetuning for GLUE tasks with AdaLoRA and AdaDoRA, including argument parsing for DoRA options, correct PEFT configuration, and integration of the new AdaLoRA/DoRA callback.

These changes collectively enable AdaLoRA models to leverage DoRA-style adaptation by extending the `dora.py` implementation in the LoRA module, with full support in both the core library and example scripts.
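Putting it together, enabling AdaDoRA from user code might look like the following sketch. The values are illustrative, `use_dora=True` on `AdaLoraConfig` is the option this PR introduces, and the model/checkpoint choice is only an example.

```python
from transformers import AutoModelForSequenceClassification
from peft import AdaLoraConfig, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

config = AdaLoraConfig(
    task_type="SEQ_CLS",
    init_r=12,        # initial rank before adaptive pruning
    target_r=8,       # average target rank after allocation
    total_step=1000,  # total training steps, used by the rank schedule
    use_dora=True,    # new in this PR: DoRA-style magnitude decomposition
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```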