
✨ FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information ✨

📁 Benchmark Data | 📖 arXiv | 🛠️ Evaluation Framework


🌟 Overview

📚 Datasets Released

| 📂 Dataset | 📝 Description |
| --- | --- |
| FinNI-eval | Evaluation set for the FinNI subtask of the FinTagging benchmark (see the loading sketch below). |
| FinCL-eval | Evaluation set for the FinCL subtask of the FinTagging benchmark (see the loading sketch below). |
| FinTagging_Original | The original benchmark dataset without preprocessing, suitable for custom research. The annotated data (benchmark_ground_truth_pipeline.json) is provided in the "annotation" folder. |
| FinTagging_BIO | BIO-format dataset tailored for token-level tagging with BERT-series models (see the parsing sketch below). Also provided in the "BERT/data" folder as "test_data_benchmark.bio". |
| FinTagging_Trainset | Training set for the BERT-series models, provided in two formats: JSON in the "annotation" folder as "TrainingSet_Annotation.json", and BIO in the "BERT/data" folder as "train_data_all.bio". |
| FinTagging_Subset | Subsets of the benchmark for the FinNI and FinCL tasks. |
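
For convenience, here is a minimal sketch of loading the two evaluation sets with the Hugging Face datasets library. The repository IDs and the split name below are assumptions; check the dataset pages linked above for the exact identifiers.

```python
# Minimal sketch: load the FinTagging evaluation sets with the Hugging Face "datasets" library.
# NOTE: the repository IDs and the split name are assumptions -- check the dataset pages
# linked above for the exact identifiers.
from datasets import load_dataset

finni_eval = load_dataset("TheFinAI/FinNI-eval", split="test")  # assumed repo id / split
fincl_eval = load_dataset("TheFinAI/FinCL-eval", split="test")  # assumed repo id / split

print(finni_eval)      # shows the available columns
print(finni_eval[0])   # inspect one example before building prompts
```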
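
For the BIO-format files, a minimal parsing sketch is shown below, assuming the usual CoNLL-style layout (one token-tag pair per line, blank lines between sentences); adjust the parsing if the actual file layout differs.

```python
# Minimal sketch: read a CoNLL/BIO-style file into (tokens, tags) pairs per sentence
# for a BERT-series tagger. Assumes one "token<whitespace>tag" pair per line and
# blank lines between sentences; adjust if the actual file layout differs.
def read_bio(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                     # blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])          # surface token
            tags.append(parts[-1])           # BIO tag
    if tokens:                               # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences

# Example: sentences = read_bio("BERT/data/test_data_benchmark.bio")
```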

🧑‍💻 Evaluated LLMs and PLMs

We benchmarked 10 cutting-edge LLMs and 3 advanced PLMs on FinTagging:

  • 🌐 GPT-4o — OpenAI’s multimodal flagship model with structured output support.
  • 🚀 DeepSeek-V3 — A Mixture-of-Experts (MoE) model with strong reasoning and efficient inference via Multi-head Latent Attention (MLA).
  • 🧠 Qwen2.5 Series — Multilingual models optimized for reasoning, coding, and math. Here, we assessed the 14B, 1.5B, and 0.5B Instruct models.
  • 🦙 Llama-3 Series — Meta’s open-source instruction-tuned models for long context. Here, we assessed the Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct models.
  • 🧭 DeepSeek-R1 Series — RL-tuned first-gen reasoning models with zero-shot strength. Here, we only assessed the DeepSeek-R1-Distill-Qwen-32B model.
  • 🧪 Gemma-2 Model — Google’s instruction-tuned model with open weights. Here, we only assessed the gemma-2-27b-it model.
  • 💎 Fino1-8B — Our in-house financial LLM with strong reasoning capability.
  • 🏛️ BERT-large — The classic transformer encoder for language understanding.
  • 📉 FinBERT — A financial domain-tuned BERT for sentiment analysis.
  • 🧾 SECBERT — BERT model fine-tuned on SEC filings for financial disclosure tasks.

📌 Evaluation Methodology

  • Local model inference: conducted via FinBen, which is built on the vLLM framework (a minimal vLLM sketch follows this list).
  • We provide task-specific evaluation scripts through our forked version of the FinBen framework, available at: https://github.com/Yan2266336/FinBen.
  • For the FinNI task, you can directly execute the provided script to evaluate a variety of LLMs, including both local and API-based models.
  • For the FinCL task, first run the retrieval script from the repository to obtain US-GAAP candidate concepts (a retrieval sketch also follows this list). Then use our provided prompts to construct instruction-style inputs, and apply the reranking method implemented in the forked FinBen to identify the most appropriate US-GAAP concept.
  • Taxonomy: We provide the original US-GAAP taxonomy file ("us-gaap-2024.xsd") as well as the processed BM25 index document ("us_gaap_2024_BM25.jsonl") in the "taxonomy" folder.
  • Note: Running the retrieval script requires a local installation of Elasticsearch. We provide our embedding index document on Google Drive: https://drive.google.com/file/d/1cyMONjP9WdHtD8-WGezmgh_LNhbY3qtR/view?usp=drive_link. Alternatively, you can build your own index from the original US-GAAP taxonomy file instead of using ours.
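
For local models, here is a minimal standalone vLLM generation sketch. The model id, sampling settings, and prompt are illustrative placeholders; the actual FinNI prompts and harness logic live in the forked FinBen evaluation scripts.

```python
# Minimal sketch: local inference with vLLM, outside the FinBen harness.
# The model id, sampling settings, and prompt are illustrative placeholders;
# use the prompts from the forked FinBen evaluation scripts for real runs.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # any locally served HF model
params = SamplingParams(temperature=0.0, max_tokens=512)  # greedy decoding, capped length

prompts = [
    "Extract every numeric financial fact from the following filing text ..."  # placeholder
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```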
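
As a reference for the FinCL candidate-retrieval step, here is a minimal sketch using the official Elasticsearch Python client; BM25 is Elasticsearch's default text-ranking function, so a plain match query returns BM25-ranked concepts. The index name and field names below are assumptions, since the real schema is defined by the retrieval script and the shared index document.

```python
# Minimal sketch: BM25 retrieval of candidate US-GAAP concepts from a local
# Elasticsearch instance. The index name ("us_gaap_2024") and the field names
# ("concept", "documentation") are assumptions -- adapt them to the schema used
# by the retrieval script or the shared index document.
import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_taxonomy(jsonl_path, index="us_gaap_2024"):
    """Index each taxonomy entry from the processed JSONL file (one JSON object per line)."""
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            es.index(index=index, id=i, document=json.loads(line))

def retrieve_candidates(fact_context, k=10, index="us_gaap_2024"):
    """Return the top-k concepts ranked by Elasticsearch's default BM25 scoring."""
    resp = es.search(
        index=index,
        query={"match": {"documentation": fact_context}},  # assumed text field
        size=k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# Example:
# index_taxonomy("taxonomy/us_gaap_2024_BM25.jsonl")
# candidates = retrieve_candidates("Revenue recognized from contracts with customers")
```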

📊 Key Performance Metrics

Table: Overall Performance
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source LLM | GPT-4o | 0.0764 🥈 | 0.0576 🥈 | 0.0508 🥈 | 0.0947 | 0.0788 | 0.0860 |
| Open-source LLMs | DeepSeek-V3 | 0.0813 🥇 | 0.0696 🥇 | 0.0582 🥇 | 0.1058 | 0.1217 🥉 | 0.1132 🥉 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.0482 🥉 | 0.0288 🥉 | 0.0266 🥉 | 0.0692 | 0.0223 | 0.0337 |
| | Qwen2.5-14B-Instruct | 0.0423 | 0.0256 | 0.0235 | 0.0197 | 0.0133 | 0.0159 |
| | gemma-2-27b-it | 0.0430 | 0.0273 | 0.0254 | 0.0519 | 0.0453 | 0.0483 |
| | Llama-3.1-8B-Instruct | 0.0287 | 0.0152 | 0.0137 | 0.0462 | 0.0154 | 0.0231 |
| | Llama-3.2-3B-Instruct | 0.0182 | 0.0109 | 0.0083 | 0.0151 | 0.0102 | 0.0121 |
| | Qwen2.5-1.5B-Instruct | 0.0180 | 0.0079 | 0.0069 | 0.0248 | 0.0060 | 0.0096 |
| | Qwen2.5-0.5B-Instruct | 0.0014 | 0.0003 | 0.0004 | 0.0047 | 0.0001 | 0.0002 |
| Financial LLM | Fino1-8B | 0.0299 | 0.0146 | 0.0140 | 0.0355 | 0.0133 | 0.0193 |
| Fine-tuned PLMs | BERT-large | 0.0135 | 0.0200 | 0.0126 | 0.1397 🥈 | 0.1145 🥈 | 0.1259 🥈 |
| | FinBERT | 0.0088 | 0.0143 | 0.0087 | 0.1293 🥉 | 0.0963 | 0.1104 |
| | SECBERT | 0.0308 | 0.0483 | 0.0331 | 0.2144 🥇 | 0.2146 🥇 | 0.2145 🥇 |

Table: The FinNI Task Performance
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Closed-source LLM | GPT-4o | 0.6105 🥈 | 0.5941 🥈 | 0.6022 🥈 |
| Open-source LLMs | DeepSeek-V3 | 0.6329 🥇 | 0.8452 🥇 | 0.7238 🥇 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.5490 🥉 | 0.2238 | 0.3180 |
| | Qwen2.5-14B-Instruct | 0.3632 | 0.0018 | 0.0035 |
| | gemma-2-27b-it | 0.5319 | 0.5490 🥉 | 0.5403 🥉 |
| | Llama-3.1-8B-Instruct | 0.3346 | 0.1746 | 0.2295 |
| | Llama-3.2-3B-Instruct | 0.1887 | 0.1794 | 0.1839 |
| | Qwen2.5-1.5B-Instruct | 0.1323 | 0.0636 | 0.0859 |
| | Qwen2.5-0.5B-Instruct | 0.0116 | 0.0027 | 0.0043 |
| Financial LLM | Fino1-8B | 0.3416 | 0.1481 | 0.2066 |

Table: The FinCL Task Performance
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Accuracy |
| --- | --- | --- |
| Closed-source LLM | GPT-4o | 0.1664 🥈 |
| Open-source LLMs | DeepSeek-V3 | 0.1715 🥇 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.1013 |
| | Qwen2.5-14B-Instruct | 0.1072 🥉 |
| | gemma-2-27b-it | 0.1009 |
| | Llama-3.1-8B-Instruct | 0.0807 |
| | Llama-3.2-3B-Instruct | 0.0375 |
| | Qwen2.5-1.5B-Instruct | 0.0419 |
| | Qwen2.5-0.5B-Instruct | 0.0246 |
| Financial LLM | Fino1-8B | 0.0704 |

📖 Citation

If you find our benchmark useful, please cite:

@misc{wang2025fintaggingbenchmarkingllmsextracting,
      title={FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information}, 
      author={Yan Wang and Yang Ren and Lingfei Qian and Xueqing Peng and Keyi Wang and Yi Han and Dongji Feng and Fengran Mo and Shengyuan Lin and Qinchuan Zhang and Kaiwen He and Chenri Luo and Jianxing Chen and Junwei Wu and Jimin Huang and Guojun Xiong and Xiao-Yang Liu and Qianqian Xie and Jian-Yun Nie},
      year={2025},
      eprint={2505.20650},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20650}, 
}
