Description
Feature request
Paper: https://arxiv.org/abs/2502.01235 (ICML 2025 Oral Presentation)
Reference code: https://github.com/YuanheZ/LoRA-One
Content Overview
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately, applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, for which linear convergence (as well as generalization guarantees) is established, and for which incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. In addition, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation.
Main Contributions
We theoretically prove:
- standard LoRA aligns with the top-r singular subspace of the first-step full gradient (see the alignment sketch below);
- LoRA achieves fast linear convergence, in both optimization and generalization, if the adapters are initialized from the best rank-r approximation of the first-step full gradient.
Grounded in our theory, we establish an optimal gradient-based initialization and clarify the suboptimality of previous gradient-based methods such as LoRA-GA and LoRA-SB. Our method is supported by performance improvements across a wide range of instruction-following, math, and code benchmarks.
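To make the alignment statement concrete, here is a small self-contained sketch (not from the LoRA-One codebase) of how one could check it numerically: take the top-r left singular subspace of the one-step full gradient and measure the principal angles against the column space of the learned B adapter. The function names and the choice of returning the smallest cosine are illustrative assumptions.

```python
import torch

def top_r_left_subspace(grad: torch.Tensor, r: int) -> torch.Tensor:
    """Orthonormal basis of the top-r left singular subspace of the gradient."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :r]                                   # (out_features, r)

def subspace_alignment(B: torch.Tensor, U_r: torch.Tensor) -> float:
    """Smallest cosine of the principal angles between span(B) and span(U_r).

    A value close to 1.0 means the adapter column space has aligned with the
    top-r singular subspace of the first-step full gradient.
    """
    Q, _ = torch.linalg.qr(B)                         # orthonormalize span(B)
    cosines = torch.linalg.svdvals(U_r.T @ Q)         # cosines of principal angles
    return cosines.min().item()
```

Tracking this quantity during training of a randomly initialized LoRA adapter is one way to observe the alignment behavior described in the first bullet.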
Algorithmic Overview
For each weight matrix, we first compute the one-step full fine-tuning gradient $G = \nabla_W \mathcal{L}(W_0)$ and take its best rank-$r$ approximation $G_r = U_r S_r V_r^\top$ via SVD. Initializing the adapters from these singular factors so that $B_0 A_0 = -\eta\, G_r$ is equivalent to performing one best rank-$r$ full gradient descent step under full fine-tuning with learning rate $\eta$.
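A minimal sketch of this initialization in PyTorch, assuming the plain parametrization $W = W_0 + BA$ (ignoring PEFT's extra lora_alpha/r scaling) and a symbolic step size eta; the exact scaling and sign conventions in the official LoRA-One code may differ:

```python
import torch

def lora_one_init(grad: torch.Tensor, r: int, eta: float = 1e-2):
    """Spectral initialization from the one-step full fine-tuning gradient.

    grad: dL/dW at the pretrained weight W_0, shape (out_features, in_features).
    Returns (B, A) with B @ A = -eta * (best rank-r approximation of grad),
    so adding B @ A to W_0 performs one truncated full-gradient step.
    """
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    sqrt_S = torch.diag(S[:r].sqrt())
    B = -(eta ** 0.5) * U[:, :r] @ sqrt_S             # (out_features, r)
    A = (eta ** 0.5) * sqrt_S @ Vh[:r, :]             # (r, in_features)
    return B, A
```

Splitting the step size as sqrt(eta) between the two factors keeps B and A on a comparable scale; their product is exactly -eta times the rank-r truncated gradient, so the adapter starts inside the target singular subspaces rather than having to align to them during training.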
Experiments
Benchmark results on natural language understanding, mathematical reasoning, and code generation are reported in the paper and the reference repository above.
Your contribution
The code implementation is similar to PiSSA and LoRA-GA. The core idea is to replace the randomly initialized LoRA adapters with matrices obtained from an SVD. The one additional requirement is the first-step full gradient computation, which has been implemented via a custom PEFT version in LoRA-GA. Any suggestions or guidance on this would be welcome.
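For a rough sense of how this could sit on top of PEFT, here is a hedged sketch rather than a working implementation: it assumes the current LoraLayer layout with base_layer and lora_A/lora_B ModuleDicts on Linear layers, a model whose forward returns a .loss, and it naively stores every base-weight gradient at once, whereas a memory-efficient variant (as in LoRA-GA's custom PEFT) would be needed in practice.

```python
import torch

def collect_first_step_grads(peft_model, batch):
    """One forward/backward pass with the frozen base weights temporarily
    requiring grad, so the full fine-tuning gradient can be read off."""
    for _, module in peft_model.named_modules():
        if hasattr(module, "base_layer"):             # LoRA-wrapped layers
            module.base_layer.weight.requires_grad_(True)
    loss = peft_model(**batch).loss                   # placeholder supervised loss
    loss.backward()
    grads = {}
    for name, module in peft_model.named_modules():
        if hasattr(module, "base_layer") and module.base_layer.weight.grad is not None:
            grads[name] = module.base_layer.weight.grad.detach().clone()
            module.base_layer.weight.grad = None
            module.base_layer.weight.requires_grad_(False)   # re-freeze
    peft_model.zero_grad()
    return grads

@torch.no_grad()
def apply_spectral_init(peft_model, grads, r, eta=1e-2, adapter="default"):
    """Overwrite the random LoRA init with SVD factors of the one-step gradient.

    Note: PEFT multiplies B @ A by lora_alpha / r at forward time; that factor
    is ignored here and would have to be folded into eta.
    """
    for name, module in peft_model.named_modules():
        if hasattr(module, "lora_A") and name in grads:   # assumes Linear LoRA layers
            U, S, Vh = torch.linalg.svd(grads[name], full_matrices=False)
            sqrt_S = torch.diag(S[:r].sqrt())
            module.lora_B[adapter].weight.copy_(-(eta ** 0.5) * U[:, :r] @ sqrt_S)
            module.lora_A[adapter].weight.copy_((eta ** 0.5) * sqrt_S @ Vh[:r, :])
```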