📹 For a clearer demo video, please click the link: CNFinBench.mov
📰 Media Coverage
- 👉 CNFinBench Released: A New Benchmark for Financial LLM Safety Evaluation
- 👉 From Capabilities to Agents: How CNFinBench Evaluates Compliance and Safety of Financial LLMs
- 👉 Shanghai Government: Financial LLM Evaluation System 2.0 Released in Shanghai
- 👉 21st Century Business Herald: Industry Standards Upgraded! 2025 Financial LLM Evaluation System Officially Released in Shanghai
- 👉 China News Service: 2025 Financial LLM Evaluation System Successfully Launched in Shanghai
- 👉 International Finance News: 2025 Financial LLM Evaluation System
- 👉 Xinhua Finance: 2025 Financial LLM Evaluation System
- 👉 China Securities Journal: 2025 Financial LLM Evaluation System
📄 Academic Release
- Beyond Knowledge to Agency: Evaluating Expertise, Autonomy, and Integrity in Finance with CNFinBench
🔗 https://arxiv.org/abs/2512.09506
🌐 Online Leaderboard
- 🔥 Live leaderboard and model submission: https://cnfinbench.opencompass.org.cn/home
CNFinBench is a comprehensive benchmark for evaluating large language models and agentic systems in high-stakes financial scenarios.
Unlike traditional textbook-style financial QA benchmarks, CNFinBench targets real-world deployment risks introduced by high-privilege financial agents, and systematically evaluates models along three orthogonal axes:
- Expertise – professional financial knowledge and reasoning
- Autonomy – multi-step planning, tool use, and agent execution
- Integrity – safety, compliance, and robustness under adversarial interaction
CNFinBench spans 29 fine-grained tasks, grounded in certified regulatory corpora, real financial workflows, and multi-turn adversarial attack scenarios.
CNFinBench decomposes financial intelligence into a three-dimensional evaluation space:
- Expertise
  - Financial Knowledge Mastery
  - Complex Logic Composition
  - Contextual Analysis Resilience
- Autonomy
  - End-to-End Execution (Intent → Plan → Tool → Verification)
  - Strategic Planning & Reasoning
  - Meta-cognitive Reliability
- Integrity
  - Immediate Risk Interception
  - Compliance Persistence
  - Dynamic Adversarial Orchestration
To quantify behavioral compliance degradation, CNFinBench introduces a multi-dimensional, severity-aware safety metric (sketched below) that:
- Tracks violation escalation across dialogue rounds
- Supports interpretable rule-level deduction logs
- Reveals collapse rhythms under different attack strategies
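As a rough illustration of how a severity-aware, rule-level deduction score could be computed, the Python sketch below deducts severity-weighted penalties round by round and keeps an interpretable log. The rule categories, severity weights, and 100-point scale are illustrative assumptions, not the paper's exact formulation:

```python
from dataclasses import dataclass

# Illustrative rule categories and severity weights -- assumptions for this
# sketch, not the weights used by CNFinBench itself.
SEVERITY = {
    "minor_disclosure": 1.0,
    "noncompliant_advice": 3.0,
    "regulatory_violation": 5.0,
}

@dataclass
class Violation:
    round_idx: int  # dialogue round in which the violation occurred (0-based)
    rule: str       # rule category that was breached

def severity_aware_score(violations, n_rounds, base=100.0):
    """Deduct severity-weighted penalties round by round.

    Returns the final score, a per-round score trajectory, and a
    rule-level deduction log, which together expose *when* and *why*
    compliance degrades over the dialogue.
    """
    score, trajectory, log = base, [], []
    for r in range(n_rounds):
        for v in (v for v in violations if v.round_idx == r):
            penalty = SEVERITY.get(v.rule, 1.0)
            score -= penalty
            log.append(f"round {r}: -{penalty} ({v.rule})")
        trajectory.append(max(score, 0.0))  # clamp the reported score at 0
    return {"score": max(score, 0.0), "trajectory": trajectory, "log": log}

# Example: a model that holds out for two rounds, then escalates.
vs = [Violation(2, "noncompliant_advice"), Violation(3, "regulatory_violation")]
print(severity_aware_score(vs, n_rounds=4))
# {'score': 92.0, 'trajectory': [100.0, 100.0, 97.0, 92.0], 'log': [...]}
```

Printing the trajectory makes the collapse rhythm visible: the score stays flat while the model resists and drops in the rounds where an attack lands.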
CNFinBench adopts a multi-stage data generation pipeline that combines:
- LLM-assisted synthesis – for scalable question generation
- Expert authoring & validation – to ensure domain accuracy and risk coverage
- Interaction & safety task design – simulating trust boundaries and real agent execution chains
- Task-aware rubric annotation – enabling interpretable model evaluation across 3 axes
CNFinBench is deployed on a fully automated evaluation platform built on OpenCompass, supporting:
- Unified evaluation of open-source & closed-source models
- Task-aware rubrics with LLM-as-Judge protocols
- Real-time leaderboard updates
- Dynamic task and model integration
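As a hedged sketch of what a rubric-based LLM-as-Judge call might look like: the rubric text and scoring scale below are placeholders, the actual per-task rubrics and judge protocol are defined in the platform's OpenCompass configuration, and the `your_judge_*` values mirror the placeholder flags used in the quickstart further down:

```python
from openai import OpenAI

# Placeholder endpoint/model names -- substitute your real credentials.
client = OpenAI(api_key="your_judge_api_key", base_url="your_judge_base_url")

# A placeholder rubric; the platform defines task-aware rubrics per subtask.
RUBRIC = (
    "Score the model answer on a 0-5 scale:\n"
    "5 = correct, compliant, fully addresses the question\n"
    "3 = partially correct or incomplete\n"
    "0 = incorrect or non-compliant\n"
    "Reply with the integer score only."
)

def judge(question: str, answer: str, judge_model: str = "your_judge_model") -> int:
    """Ask the judge model to grade one answer against the rubric."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nModel answer:\n{answer}"},
        ],
        temperature=0,  # deterministic grading
    )
    # Assumes the judge follows the rubric's "integer only" instruction.
    return int(resp.choices[0].message.content.strip())
```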
🔗 Visit the leaderboard:
👉 https://cnfinbench.opencompass.org.cn/home
The current leaderboard is updated in real time on the platform. Benchmark scope at a glance:
- 29 subtasks across Expertise / Autonomy / Integrity
- 11,947 single-turn QA instances
- 321 multi-turn adversarial dialogues (4 rounds each)
- 22 evaluated models (open-source, closed-source, finance-tuned)
For multi-turn adversarial dialogue evaluation, please refer to the detailed guides in the multi-turn directory:
- 📖 English Guide: multi-turn/README.md
- 📖 中文指南: multi-turn/README_CN.md
The multi-turn evaluation pipeline includes:
1. Generate multi-turn dialogue tests using the scripts in multi-turn/pred/
2. Merge output files using multi-turn/pred/merge.py
3. Evaluate results using the scripts in multi-turn/judge/
For detailed evaluation script documentation:
- 📖 English: multi-turn/judge/README_EN.md
- 📖 中文: multi-turn/judge/README.md
1. Install dependencies:

```bash
cd multi-turn
pip install -r requirements.txt
```

2. Generate multi-turn dialogues:

```bash
cd pred
python main.py --data-dir ../data --output-dir ../output --model-name your_model_name \
    --attack-api-key your_attack_api_key --attack-base-url your_attack_base_url \
    --attack-model-name your_attack_model --defense-api-key your_defense_api_key \
    --defense-base-url your_defense_base_url --defense-model-name your_defense_model
```

3. Merge output files:

```bash
python merge.py --output-dir ../output
```

4. Evaluate results:

```bash
cd ..
python -m judge.evaluate --output-dir ./output \
    --judge-api-key your_judge_api_key --judge-base-url your_judge_base_url \
    --judge-model-name your_judge_model
```
For more detailed instructions and examples, please see the multi-turn README.
If you use CNFinBench, please cite:
```bibtex
@misc{ding2025cnfinbenchbenchmarksafetycompliance,
  title={CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance},
  author={Jinru Ding and Chao Ding and Wenrao Pang and Boyi Xiao and Zhiqiang Liu and Pengcheng Chen and Jiayuan Chen and Tiantian Yuan and Junming Guan and Yidong Jiang and Dawei Cheng and Jie Xu},
  year={2025},
  eprint={2512.09506},
  archivePrefix={arXiv},
  primaryClass={cs.CE},
  url={https://arxiv.org/abs/2512.09506},
}
```


