
[FEAT] Integrate LoRA-One into PEFT #2882

@YuanheZ

Description


Feature request

Paper: https://arxiv.org/abs/2502.01235 (ICML 2025 Oral Presentation)
Reference code: https://github.com/YuanheZ/LoRA-One


Content Overview

This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters from the one-step full gradient, subspace alignment can be achieved immediately, for both linear and nonlinear models. Building on this theory, we propose a theory-driven algorithm, LoRA-One, for which we establish linear convergence (as well as generalization) guarantees, and show that incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Our theory also reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation.


Main Contributions

We theoretically prove:

  • standard LoRA aligns with the top-$r$ singular subspace of the first-step full fine-tuning gradient;
  • LoRA achieves fast linear convergence, in both optimization and generalization, when initialized with the best rank-$r$ approximation of the first-step full gradient.

Grounded in this theory, we derive the optimal gradient-based initialization and clarify the suboptimality of previous gradient-based methods such as LoRA-GA and LoRA-SB. Our method is supported by performance improvements across a wide range of instruction-following, math, and code benchmarks.


Algorithmic Overview

For each weight matrix, we first compute the gradient $\nabla_{W} L$ under full fine-tuning on a batch of data and perform an SVD on $-\nabla_{W} L$ to obtain $U$, $\Sigma$, $V$; we then initialize LoRA via

$$\mathbf{A}_{0}=\frac{1}{\sqrt{\gamma}}\, U_{[:,:r]}\, \mathrm{diag}\big(\sqrt{\Sigma_{[:r]}}\big)\,,\quad \mathbf{B}_{0}=\frac{1}{\sqrt{\gamma}}\, \mathrm{diag}\big(\sqrt{\Sigma_{[:r]}}\big)\, V_{[:,:r]}^{\top}\,,\quad W_{\text{adapted}} = W_{\text{pre}}+\frac{\alpha}{\sqrt{r}}\,\mathbf{A}_{0} \mathbf{B}_{0}\,,$$

so that $\mathbf{A}_{0}\mathbf{B}_{0}$ equals $\frac{1}{\gamma}$ times the best rank-$r$ approximation of $-\nabla_{W} L$. This is equivalent to performing one best rank-$r$ gradient descent step under full fine-tuning with learning rate $\frac{\alpha}{\gamma\sqrt{r}}$ at initialization. The SVD is implemented with randomized SVD, which is highly efficient.
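
To make this concrete, here is a minimal sketch of the initialization in PyTorch, assuming the gradient for one weight matrix is available as a plain tensor; the function name `lora_one_init` is hypothetical, and `torch.svd_lowrank` stands in for the randomized SVD mentioned above.

```python
import torch

def lora_one_init(grad: torch.Tensor, r: int, gamma: float):
    """Sketch of the LoRA-One spectral initialization (hypothetical helper).

    Given the one-step full fine-tuning gradient `grad` of a weight matrix,
    returns (A_0, B_0) such that A_0 @ B_0 is (1/gamma) times the best
    rank-r approximation of -grad.
    """
    # Randomized SVD of the negative gradient: -grad ≈ U diag(S) V^T.
    U, S, V = torch.svd_lowrank(-grad, q=r)
    sqrt_s = torch.sqrt(S[:r])
    scale = 1.0 / gamma ** 0.5
    A0 = scale * U[:, :r] * sqrt_s                 # (d_out, r); column i scaled by sqrt(sigma_i)
    B0 = scale * sqrt_s.unsqueeze(1) * V[:, :r].T  # (r, d_in);  row i scaled by sqrt(sigma_i)
    return A0, B0
```

With this initialization, $W_{\text{pre}} + \frac{\alpha}{\sqrt{r}}\mathbf{A}_{0}\mathbf{B}_{0}$ reproduces the rank-$r$ gradient step with learning rate $\frac{\alpha}{\gamma\sqrt{r}}$ described above. Note that PEFT stores the factors as `lora_A` ($r \times d_{in}$) and `lora_B` ($d_{out} \times r$) with update $\Delta W \propto$ `lora_B @ lora_A`, so $\mathbf{A}_0$ here maps to `lora_B` and $\mathbf{B}_0$ to `lora_A`.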


Experiments


Your contribution

The code implementation is similar to PiSSA and LoRA-GA: the core idea is to replace the randomly initialized LoRA adapters with matrices obtained from the SVD. One additional requirement is the first-step full gradient computation, which has been implemented via a custom PEFT version in LoRA-GA. Any suggestions or guidance on this are welcome.
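
For the gradient computation, a minimal sketch (not the LoRA-GA implementation; the function name and `target_modules` handling here are hypothetical) could temporarily unfreeze the targeted base weights, run one forward/backward pass on a calibration batch, and cache the gradients before the adapters are attached:

```python
import torch

def estimate_full_gradients(model, batch, target_modules=("q_proj", "v_proj")):
    """Hypothetical sketch: collect the one-step full fine-tuning gradients
    of the targeted base weights before LoRA adapters are injected.

    Assumes an HF-style model whose forward pass returns an object with a
    `.loss` attribute when given the batch as keyword arguments.
    """
    targeted = []
    for name, module in model.named_modules():
        if name.split(".")[-1] in target_modules and hasattr(module, "weight"):
            module.weight.requires_grad_(True)  # base weights are usually frozen
            targeted.append((name, module.weight))

    model.zero_grad(set_to_none=True)
    loss = model(**batch).loss
    loss.backward()

    named_grads = {}
    for name, weight in targeted:
        named_grads[name] = weight.grad.detach().clone()
        weight.grad = None
        weight.requires_grad_(False)  # re-freeze after caching
    return named_grads
```

Each cached gradient could then be passed to the initialization sketch above to build the adapter matrices for the corresponding module.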
