🔥 Must-read papers for harmful fine-tuning attacks/defenses for LLMs.
💫 Continuously update on a weekly basis. (last update: 2025/10/18)
🔥 Good news: 7 harmful fine-tuning related papers are accpeted by NeurIPS2024
💫 We have updated our survey, including the discussion on the 17 ICLR2025 new submissions.
🔥 We update a slide to introduce harmful fine-tuning attacks/defenses. Check out the slide here.
🔥 Good news: 12 harmful fine-tuning related papers are accpeted by ICLR2025.
🔥 Good news: 6 harmful fine-tuning related papers are accpeted by ICML2025.
🔥 Chef Recommendation: Risk of harmful fine-tuning attack can be more more prounounced with jailbreak tuning and for larger scale models.
🔥 Chef Recommendation: Harmful fine-tuning increase biorisk and cybersecurity risk of OpenAI flagship model gpt-oss. Check out the recent OpenAI technical report.
🔥 We collected all the related ICLR2026 submission. Please use ctrl+f to search for ICLR2026 submission if interested..
- Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
-
[2023/10/4] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]
-
[2023/10/5] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR 2024 [paper] [code]
-
[2023/10/5] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]
-
[2023/10/22] Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases arXiv [paper]
-
[2023/10/31] Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b SeT LLM workshop@ ICLR 2024 [paper]
-
[2023/10/31] BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B arXiv [paper]
-
[2023/11/9] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]
-
[2023/12/21] Exploiting Novel GPT-4 APIs arXiv [paper]
-
[2024/4/1] What's in your" safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]
-
[2024/6/28] Covert malicious finetuning: Challenges in safeguarding llm adaptation ICML2024 [paper]
-
[2024/07/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]
-
[2024/08/06] Scaling Trends for Data Poisoning in LLMs AAAI25-AIA [paper] [code]
-
[2024/10/01] Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Modelss arXiv [paper] [code]
-
[2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]
-
[2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]
-
[2025/01/29] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation arXiv [paper] [code]
-
[2025/02/03] The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models arXiv [paper]
-
[2025/02/20] Fundamental Limitations in Defending LLM Finetuning APIs arXiv [paper]
-
[2025/02/26] No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data arXiv [paper]
-
[2025/03/05] Emergent Misalignment:Narrow finetuning can produce broadly misaligned LLMs arXiv [paper]
-
[2025/05/1] Tongue-Tied: Breaking LLMs Safety Through New Language Learning CALCS [paper]
-
[2025/05/11] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety ICML2025 [paper] [code]
-
[2025/05/11] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? arXiv [paper]
-
[2025/05/11] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability arXiv [paper] [code]
-
[2025/05/22] Finetuning-Activated Backdoors in LLMs arXiv [paper] [code]
-
[2025/07/15] Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility arXiv [paper] [code]
-
[2025/07/15] ESTIMATING WORST-CASE FRONTIER RISKS OF OPEN-WEIGHT LLMS OpenAI technical report [paper]
-
[2025/08/19] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation arXiv [paper] [code]
-
[2025/9/30] Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents arXiv [paper] [code]
-
[2025/10/01] Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach arXiv [paper]
-
[2025/10/08] Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs ICLR2026 Submission [paper]
-
[2025/10/08] TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning ICLR2026 Submission [paper]
- [2025/8/8] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs arXiv [paper] [code]
-
[2024/2/2] Vaccine: Perturbation-aware alignment for large language model aginst harmful fine-tuning NeurIPS2024 [paper] [code]
-
[2024/5/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]
-
[2024/5/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation arXiv [paper] [code] [Openreview]
-
[2024/8/1] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [paper] [code]
-
[2024/9/3] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 [paper] [code] [Openreview]
-
[2024/9/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion model) [paper]
-
[2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 [Openreview]
-
[2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]
-
[2024/10/13] Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy NeurIPS2024 workshop SafeGenAi [paper]
-
[2025/01/19] On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment arXiv [paper] [code]
-
[2025/02/07] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond arXiv [paper]
-
[2025/05/07] Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization arXiv [paper]
-
[2025/05/18] Self-Destructive Language Model arXiv [paper]
-
[2025/05/22] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning arXiv [paper] [code]
-
[2025/05/22] Model Immunization from a Condition Number Perspective ICML2025 [paper] [code]
-
[2025/06/02] Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning ICML2025 [paper] [code]
-
[2025/06/04] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning ICML2025 [paper] [code]
-
[2025/06/05] Locking Open Weight Models with Spectral Deformation ICML2025 Workshop TAIG [paper]
-
[2025/06/18] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning arxiv [paper] [code]
-
[2025/07/22] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning NeurIPS2025 [paper]
-
[2025/08/28] TOKEN BUNCHER: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning arxiv [paper] [code]
-
[2025/09/06] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs arxiv [paper]
-
[2025/10/08] Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence ICLR2026 submission [paper]
-
[2023/8/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]
-
[2023/9/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]
-
[2024/2/3] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]
-
[2024/2/7] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]
-
[2024/2/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]
-
[2024/2/28] Keeping llms aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]
-
[2024/5/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]
-
[2024/6/10] Safety alignment should be made more than just a few tokens deep ICLR2025 [paper] [code] [Openriew]
-
[2024/6/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 [paper] [Openreview]
-
[2024/8/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 [Openreview] [paper]
-
[2024/8/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 [Openreview] [paper]
-
[2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 [Openreview]
-
[2024/10/05] Safety Alignment Shouldn't Be Complicated preprint [Openreview]
-
[2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 [paper] [Openreview]
-
[2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 [paper] [Openreview]
-
[2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS 2024 Workshop on Safe Generative AI [paper]
-
[2024/12/19] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response arXiv [paper]
-
[2025/02/28] Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs arXiv [paper]
-
[2025/03/03] Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness arXiv [paper]
-
[2025/03/24] LookAhead Tuning: Safer Language Models via Partial Answer Previews arXiv [paper] [code]
-
[2025/04/12] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function arXiv [paper] [code]
-
[2025/04/14] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? arXiv [paper]
-
[2025/05/22] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization ICLR2026 submission [paper] [code]
-
[2025/05/22] Shape it Up! Restoring LLM Safety during Finetuning arXiv [paper]
-
[2025/05/23] Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives arXiv [paper]
-
[2025/05/29] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA arXiv [paper]
-
[2025/06/09] When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment arXiv [paper] [code]
-
[2025/06/09] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation arXiv [paper]
-
[2025/06/10] AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin arXiv [paper] [code]
-
[2025/07/25] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment arXiv [paper] [code]
-
[2025/08/04] Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization arXiv [paper]
-
[2025/08/17] Rethinking Safety in LLM Fine-tuning: An Optimization Perspective COLM2025 [paper]
-
[2025/08/18] Gradient Surgery for Safe LLM Fine-Tuning arXiv [paper] [code]
-
[2025/08/23] Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks arXiv [paper] [code]
-
[2025/09/08] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint arXiv [paper]
-
[2025/09/26] Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment ICLR2026 submission [paper] [code]
-
[2025/10/08] GradShield: Alignment Preserving Finetuning ICLR2026 Submission [paper]
-
[2025/10/08] SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance–Diversity Data Selection ICLR2026 Submission [paper]
-
[2025/10/08] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space ICLR2026 Submission [paper]
-
[2025/10/08] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function ICLR2026 Submission [paper]
-
[2023/11/02] Making Harmful Behaviors Unlearnable for Large Language Models ACL2024 [paper]
-
[2024/2/19] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic ACL2024 [paper] [code]
-
[2024/3/8] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]
-
[2024/5/15] A safety realignment framework via subspace-oriented model fusion for large language models KBS [paper] [code]
-
[2024/5/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]
-
[2024/5/27] Safe lora: the silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]
-
[2024/8/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning ICML2025 [paper]
-
[2024/10/05] Locking Down the Finetuned LLMs Safety preprint [Openreview]
-
[2024/10/05] Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models ICLR2025 [Openreview] [code]
-
[2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models preprint [Openreview]
-
[2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]
-
[2024/12/17] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI2025 [paper] [code]
-
[2024/12/30] Enhancing AI Safety Through the Fusion of Low Rank Adapters arXiv [paper]
-
[2025/02/01] Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation NeurIPS2025 [paper] [repo]
-
[2025/02/24] Safety Misalignment Against Large Language Models NDSS2025 [paper] [repo]
-
[2025/03/06] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging ICLR2025 (short paper) [paper] [repo]
-
[2025/04/13] Alleviating the Fear of Losing Alignment in LLM Fine-tuning S&P2025 [paper] [repo]
-
[2025/05/17] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets ICML2025 [paper] [repo]
-
[2025/05/29] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets ICML2025 [paper] [repo]
-
[2025/06/21] Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs arxiv [paper] [repo]
-
[2025/07/01] LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL2025 [paper]
-
[2025/08/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks arXiv [paper]
-
[2025/09/08] MoGUV 2: Toward a Higher Pareto Frontier Between Model Usability and Security arXiv [paper]
-
[2025/10/08] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks ICLR2026 Submission [paper]
-
[2025/10/08] Surgical Safety Repair: A Parameter-Isolated Approach to Correcting Harmful Fine-tuning ICLR2026 Submission [paper]
-
[2025/11/25] Safe and Effective Post-Fine-tuning Alignment in Large Language Models KBS [paper]
- [2024/5/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
- [2024/5/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
- [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs preprint [Openreview]
- [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 [Openreview] [Code]
- [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]
- [2025/2/3] Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities arXiv [paper]
- [2025/3/24] Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models arXiv [paper]
- [2025/5/20] Safety Subspaces are Not Distinct: A Fine-Tuning Case Study ICLR20206 submission [paper] [Code]
- [2025/6/30] Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning ICML 2025 R2-FM Workshop [paper]
- [2025/8/08] In-Training Defenses against Emergent Misalignment in Language Models arXiv [paper] [Code]
- [2024/9/19] Defending against Reverse Preference Attacks is Difficult arXiv [paper] [code]
- [2025/5/31] SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning arXiv [paper] [code]
-
[2024/6/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 [paper] [Openreview]
-
[2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]
-
[2025/10/08] TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering ICLR2026 Submission [paper]
If you find this repository useful, please cite our paper:
@article{huang2024harmful,
title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
journal={arXiv preprint arXiv:2409.18169},
year={2024}
}
If you discover any papers that are suitable but not included, please contact Tiansheng Huang (thuang374@gatech.edu).
Please kindly 🌟star🌟 our repository if you find it helpful!