SPLoRA

Abstract

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with both a mix of benign and malicious data and purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs.
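The pruning idea in the abstract can be illustrated with a toy sketch. This is not the code in splora.py and it does not implement E-DIEM: plain cosine similarity is used as a stand-in metric, and the per-layer "safety reference" vectors, the threshold, the decision rule (prune when the score falls below the threshold), and every function name below are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (stand-in for E-DIEM)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_layers_to_prune(lora_updates, safety_refs, threshold=0.0, score=cosine):
    """Return names of layers whose score against the reference is below threshold.

    lora_updates and safety_refs are hypothetical dicts mapping a layer name
    to a flattened weight vector; the real repository may store these differently.
    """
    return [
        name
        for name, delta in lora_updates.items()
        if score(delta, safety_refs[name]) < threshold
    ]


def prune(lora_updates, to_prune):
    """Drop the selected LoRA layers so the base weights apply unchanged there."""
    return {name: delta for name, delta in lora_updates.items() if name not in to_prune}


# Toy data: three layers with 2-dimensional flattened updates (made up).
lora_updates = {"layer0": [1.0, 0.0], "layer1": [-1.0, 0.5], "layer2": [0.2, 0.9]}
safety_refs = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0], "layer2": [0.0, 1.0]}

pruned = select_layers_to_prune(lora_updates, safety_refs)
kept = prune(lora_updates, pruned)
print("pruned:", pruned)
print("kept:", sorted(kept))
```

Because the metric is passed in through the score parameter, a different similarity function could be swapped in without touching the selection logic. Removing whole adapter layers rather than rescaling them is also consistent with the abstract's claim of reduced inference overhead, since a pruned layer adds no extra computation.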

Eval command

python splora.py

Folders and files

README.md
llama2_DS+PD_pruning.ipynb
llama2_alpaca_pd_lora.py
llama2_diagsum_pd_lora.py
similarity.py
splora.py
visualization.ipynb

AoShuang92/SPLoRA
