In this paper, we investigate three questions:
- Do style patterns affect LLM safety?
- How do safety vulnerabilities emerge during superficial style alignment?
- How can we mitigate these risks during the alignment process?
Please note that we are unable to release the jailbreak datasets and model outputs due to safety considerations.
However, all results can be easily reproduced using the provided code.
This folder contains the code implementation for the first research question.
We recommend the following workflow:
- Dataset Preparation: Run
Setup.ipynbto prepare the jailbreak datasets. - Jailbreak Execution & Evaluation: Use
jailbreak.pyto run jailbreak attacks andevaluate.pyto evaluate model responses. - Additional Features: Compute attention difference with
attention.pyand entropy withuncertainty.py. - Result Analysis: Use
Analysis.ipynbto analyze results and generate the relevant figures.
This folder contains the code implementation for the second research question.
We recommend the following workflow:
- Dataset Preparation: Run
Setup.ipynbto prepare the fine-tuning datasets. - Instruction Tuning: Fine-tune models using LLaMA-Factory [1].
- Evaluation: Use
evaluate.pyto assess the safety and utility of the fine-tuned models. - Analysis: Use
Analysis.ipynbto analyze results and generate the relevant figures.
This folder contains the code implementation for the third research question.
We recommend the following workflow:
- Dataset Preparation: Run
Setup.ipynbto prepare the fine-tuning datasets. - Instruction Tuning: Fine-tune models using LLaMA-Factory [1].
- Evaluation: Use
evaluate.pyto assess the safety and utility of the fine-tuned models. - Analysis: Use
Analysis.ipynbto analyze results and generate the relevant figures.
[1] Zheng, Yaowei, et al. "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models." ACL 2024.
@article{xiao2025style,
title={When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment},
author={Xiao, Yuxin and Tonekaboni, Sana and Gerych, Walter and Suriyakumar, Vinith and Ghassemi, Marzyeh},
journal={arXiv preprint arXiv:2506.07452},
year={2025}
}