
CNFinBench: Evaluating Expertise, Autonomy, and Integrity in Finance

Language: English | 中文

Paper | Leaderboard


📹 For a clearer demo video, please click the link: Video

CNFinBench.mov

📣 News & Announcements

📰 Media Coverage

📄 Academic Release

🌐 Online Leaderboard


📖 What is CNFinBench?

CNFinBench is a comprehensive benchmark for evaluating large language models and agentic systems in high-stakes financial scenarios.

Unlike traditional textbook-style financial QA benchmarks, CNFinBench targets real-world deployment risks introduced by high-privilege financial agents, and systematically evaluates models along three orthogonal axes:

  • Expertise – professional financial knowledge and reasoning
  • Autonomy – multi-step planning, tool use, and agent execution
  • Integrity – safety, compliance, and robustness under adversarial interaction

CNFinBench spans 29 fine-grained tasks, grounded in certified regulatory corpora, real financial workflows, and multi-turn adversarial attack scenarios.


🧩 Task Taxonomy

CNFinBench decomposes financial intelligence into a three-dimensional evaluation space:

📚 Expertise – Financial Capability

  • Financial Knowledge Mastery
  • Complex Logic Composition
  • Contextual Analysis Resilience

⚒️ Autonomy – Agentic Execution

  • End-to-End Execution (Intent → Plan → Tool → Verification)
  • Strategic Planning & Reasoning
  • Meta-cognitive Reliability

🥷🏻 Integrity – Safety & Compliance

  • Immediate Risk Interception
  • Compliance Persistence
  • Dynamic Adversarial Orchestration

🔐 Multi-turn Safety Evaluation & HICS

To quantify behavioral compliance degradation, CNFinBench introduces:

Harmful Instruction Compliance Score (HICS)

  • Multi-dimensional, severity-aware safety metric
  • Tracks violation escalation across dialogue rounds
  • Supports interpretable rule-level deduction logs
  • Reveals collapse rhythms (how and when compliance breaks down) under different attack strategies; a toy scoring sketch follows this list
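
To make the deduction mechanics concrete, here is a toy Python sketch of a severity-aware, round-weighted deduction score in the spirit of HICS. The rule IDs, severity weights, and aggregation formula are all hypothetical illustrations, not the metric's actual definition (see the paper for that).

    # Toy sketch of a HICS-style score. All names, weights, and the
    # aggregation rule below are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Violation:
        rule_id: str      # which compliance rule was broken
        severity: float   # severity weight in (0, 1]

    def hics_like_score(rounds: list[list[Violation]]) -> float:
        """Deduct severity-weighted penalties per dialogue round,
        weighting later rounds more to capture escalation, and log
        each deduction for rule-level interpretability."""
        score = 1.0
        for i, violations in enumerate(rounds, start=1):
            round_weight = i / len(rounds)  # later rounds count more
            for v in violations:
                deduction = round_weight * v.severity
                print(f"round {i}: rule {v.rule_id} -> -{deduction:.3f}")
                score -= deduction
        return max(score, 0.0)

    # Example: a 4-round adversarial dialogue that collapses in round 4.
    print(hics_like_score([[], [], [Violation("R-07", 0.4)], [Violation("R-07", 0.8)]]))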

🏗 How is CNFinBench constructed?

CNFinBench adopts a multi-stage data generation pipeline that combines:

  • LLM-assisted synthesis – for scalable question generation
  • Expert authoring & validation – to ensure domain accuracy and risk coverage
  • Interaction & safety task design – simulating trust boundaries and real agent execution chains
  • Task-aware rubric annotation – enabling interpretable model evaluation across all three axes (an illustrative rubric entry follows this list)
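
For illustration, a task-aware rubric entry might look like the structure below. Every key, criterion, and point value here is a hypothetical example, not the benchmark's actual annotation schema.

    # Hypothetical rubric annotation for one Expertise subtask; the real
    # schema used by CNFinBench may differ.
    rubric = {
        "task": "Complex Logic Composition",
        "axis": "Expertise",
        "criteria": [
            {"id": "C1", "description": "Cites the applicable regulatory clause", "points": 2},
            {"id": "C2", "description": "Performs the required calculation correctly", "points": 2},
            {"id": "C3", "description": "States assumptions and caveats explicitly", "points": 1},
        ],
        "max_points": 5,
    }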

🏆 Leaderboard & Evaluation Platform

CNFinBench is deployed on a fully automated evaluation platform built on OpenCompass, supporting:

  • Unified evaluation of open-source & closed-source models
  • Task-aware rubrics with LLM-as-Judge protocols (a judging sketch follows this list)
  • Real-time leaderboard updates
  • Dynamic task and model integration
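
As a rough illustration of an LLM-as-Judge call, the sketch below grades one answer against a rubric through an OpenAI-compatible endpoint (the judge-api-key / judge-base-url flags in the Quick Start suggest such an endpoint). The prompt wording and the JSON verdict format are assumptions, not the platform's actual protocol.

    # Minimal LLM-as-Judge sketch against an OpenAI-compatible endpoint.
    # Prompt and output schema are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI(api_key="your_judge_api_key", base_url="your_judge_base_url")

    def judge(question: str, answer: str, rubric: str) -> str:
        resp = client.chat.completions.create(
            model="your_judge_model",
            messages=[
                {"role": "system", "content": (
                    "You are a strict financial-domain grader. Score the answer "
                    'against the rubric and reply with JSON: {"score": 0-5, "rationale": "..."}'
                )},
                {"role": "user", "content": (
                    f"Question:\n{question}\n\nAnswer:\n{answer}\n\nRubric:\n{rubric}"
                )},
            ],
        )
        return resp.choices[0].message.content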

🔗 Visit the leaderboard:
👉 https://cnfinbench.opencompass.org.cn/home

Platform Overview

Below is a snapshot of the current leaderboard (updated in real time on the platform):

Leaderboard Snapshot

📊 Benchmark Scale

  • 29 subtasks across Expertise / Autonomy / Integrity
  • 11,947 single-turn QA instances
  • 321 multi-turn adversarial dialogues (4 rounds each)
  • 22 evaluated models (open-source, closed-source, finance-tuned)

💻 Code Usage

Multi-turn Dialogue Evaluation

For multi-turn adversarial dialogue evaluation, please refer to the detailed guides in the multi-turn directory.

The multi-turn evaluation pipeline includes:

  1. Generate multi-turn dialogue tests using scripts in multi-turn/pred/
  2. Merge output files using multi-turn/pred/merge.py
  3. Evaluate results using scripts in multi-turn/judge/

Detailed documentation for each evaluation script is provided alongside it in the multi-turn directory.

Quick Start

  1. Install dependencies:

    cd multi-turn
    pip install -r requirements.txt
  2. Generate multi-turn dialogues:

    cd pred
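    # Flag semantics (our reading; confirm in the multi-turn README):
    # --attack-* configures the adversarial (red-team) model,
    # --defense-* configures the model under evaluation.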
    python main.py --data-dir ../data --output-dir ../output --model-name your_model_name \
        --attack-api-key your_attack_api_key --attack-base-url your_attack_base_url \
        --attack-model-name your_attack_model --defense-api-key your_defense_api_key \
        --defense-base-url your_defense_base_url --defense-model-name your_defense_model
  3. Merge output files:

    python merge.py --output-dir ../output
  4. Evaluate results:

    cd ..
    python -m judge.evaluate --output-dir ./output \
        --judge-api-key your_judge_api_key --judge-base-url your_judge_base_url \
        --judge-model-name your_judge_model

For more detailed instructions and examples, please see the multi-turn README.


📖 Citation

If you use CNFinBench, please cite:

@misc{ding2025cnfinbenchbenchmarksafetycompliance,
      title={CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance}, 
      author={Jinru Ding and Chao Ding and Wenrao Pang and Boyi Xiao and Zhiqiang Liu and Pengcheng Chen and Jiayuan Chen and Tiantian Yuan and Junming Guan and Yidong Jiang and Dawei Cheng and Jie Xu},
      year={2025},
      eprint={2512.09506},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2512.09506}, 
}

About

CNFinBench is the first comprehensive benchmark for evaluating large language models and agentic systems in high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real business contexts, reconstructing end-to-end agent execution chains from requirement parsing and path planning to tool invocation and result verification.
