When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

In this paper, we investigate three questions:

Do style patterns affect LLM safety?
How do safety vulnerabilities emerge during superficial style alignment?
How can we mitigate these risks during the alignment process?

Please note that we are unable to release the jailbreak datasets and model outputs due to safety considerations.
However, all results can be easily reproduced using the provided code.

InflatedASR

This folder (1_InflatedASR in the repository) contains the code implementation for the first research question.
We recommend the following workflow:

Dataset Preparation: Run Setup.ipynb to prepare the jailbreak datasets.
Jailbreak Execution & Evaluation: Use jailbreak.py to run jailbreak attacks and evaluate.py to evaluate model responses.
Additional Features: Compute attention difference with attention.py and entropy with uncertainty.py.
Result Analysis: Use Analysis.ipynb to analyze results and generate the relevant figures.
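
As a rough illustration of the two quantities this workflow revolves around, the sketch below computes an attack success rate (ASR) from per-response judgments and the Shannon entropy of a next-token distribution. It is not taken from evaluate.py or uncertainty.py: the function names, the boolean judgment format, and the choice of token-level Shannon entropy are all assumptions made for illustration, and the scripts in this folder may define these quantities differently.

```python
import math


def attack_success_rate(judgments):
    """Fraction of responses judged harmful (True means the attack succeeded).

    Assumption: each response has already been labeled True/False by a judge.
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    return sum(bool(j) for j in judgments) / len(judgments)


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    # Zero-probability tokens contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    if not distributions:
        raise ValueError("distributions must be non-empty")
    return sum(token_entropy(d) for d in distributions) / len(distributions)
```

For example, attack_success_rate([True, False, True, True]) gives 0.75, and a uniform distribution over two tokens has entropy ln 2.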

StyleAlignment

This folder (2_StyleAlignment in the repository) contains the code implementation for the second research question.
We recommend the following workflow:

Dataset Preparation: Run Setup.ipynb to prepare the fine-tuning datasets.
Instruction Tuning: Fine-tune models using LLaMA-Factory [1].
Evaluation: Use evaluate.py to assess the safety and utility of the fine-tuned models.
Analysis: Use Analysis.ipynb to analyze results and generate the relevant figures.
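
To make the hand-off between dataset preparation and instruction tuning more concrete, here is a minimal sketch of writing instruction-response pairs in the Alpaca-style JSON layout that LLaMA-Factory can read, together with a dataset_info.json entry that registers the file. Whether Setup.ipynb emits exactly this layout is an assumption. The helper names and the dataset name style_sft_example are hypothetical and do not come from this repository.

```python
import json
from pathlib import Path


def to_alpaca_records(pairs):
    """Convert (instruction, response) pairs into Alpaca-style records."""
    return [
        {"instruction": instruction, "input": "", "output": response}
        for instruction, response in pairs
    ]


def write_dataset(pairs, out_dir, name="style_sft_example"):
    """Write the records and register them in dataset_info.json.

    `name` is a hypothetical dataset name used only for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data_file = out_dir / f"{name}.json"
    data_file.write_text(
        json.dumps(to_alpaca_records(pairs), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )

    # Keep any datasets that are already registered in the same directory.
    info_file = out_dir / "dataset_info.json"
    info = json.loads(info_file.read_text(encoding="utf-8")) if info_file.exists() else {}
    info[name] = {"file_name": data_file.name}
    info_file.write_text(json.dumps(info, indent=2), encoding="utf-8")
    return data_file, info_file
```

The registered name can then be referenced from a LLaMA-Factory training configuration. The exact configuration used for the paper's experiments is not stated here.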

SafeStyle

This folder (3_SafeStyle in the repository) contains the code implementation for the third research question.
We recommend the following workflow:

Dataset Preparation: Run Setup.ipynb to prepare the fine-tuning datasets.
Instruction Tuning: Fine-tune models using LLaMA-Factory [1].
Evaluation: Use evaluate.py to assess the safety and utility of the fine-tuned models.
Analysis: Use Analysis.ipynb to analyze results and generate the relevant figures.

[1] Zheng, Yaowei, et al. "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models." ACL 2024.

Citation

@article{xiao2025style,
  title={When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment},
  author={Xiao, Yuxin and Tonekaboni, Sana and Gerych, Walter and Suriyakumar, Vinith and Ghassemi, Marzyeh},
  journal={arXiv preprint arXiv:2506.07452},
  year={2025}
}
