AIM-Intelligence/MisalignmentBench
MisalignmentBench

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park

📄 Paper: arXiv:2508.04196

Abstract

Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

Key Findings

  • 76% overall vulnerability rate across five frontier LLMs
  • 10 distinct attack scenarios targeting different psychological vulnerabilities
  • GPT-4.1: 90% susceptibility rate (highest)
  • Claude-4-Sonnet: 40% susceptibility rate (most resistant)
  • Sophisticated reasoning can become attack vectors rather than protective mechanisms
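Until the framework is released, the aggregate numbers above can be understood with a minimal sketch of how per-model and overall vulnerability rates might be tallied from scenario outcomes. The model names and boolean results below are placeholders for illustration, not the paper's actual data or API:

```python
# Hypothetical per-scenario outcomes (True = misaligned behavior elicited).
# These values are illustrative placeholders, NOT the paper's results.
results = {
    "model_a": [True, True, False, True, True, True, True, True, True, False],
    "model_b": [False, True, False, True, False, True, False, False, True, False],
}

def vulnerability_rate(outcomes):
    """Fraction of scenarios in which the model exhibited misalignment."""
    return sum(outcomes) / len(outcomes)

# Per-model rate over the 10 scenarios, and the overall rate pooled
# across every (model, scenario) trial.
per_model = {name: vulnerability_rate(o) for name, o in results.items()}
overall = vulnerability_rate([x for o in results.values() for x in o])

print(per_model)            # per-model fractions, e.g. {'model_a': 0.8, ...}
print(f"overall: {overall:.0%}")
```

The paper's headline 76% figure is a pooled rate of this kind, computed over all five models and all ten scenarios.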

Repository Status

🚧 Codebase coming soon 🚧

The automated evaluation framework and scenario implementations will be released following paper publication.

Citation

@article{panpatil2025misalignmentbench,
  title={Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models},
  author={Panpatil, Siddhant and Dingeto, Hiskias and Park, Haon},
  journal={arXiv preprint arXiv:2508.04196},
  year={2025}
}
