📁 Benchmark Data | 📖 arXiv | 🛠️ Evaluation Framework
| 📂 Dataset | 📝 Description |
|---|---|
| FinNI-eval | Evaluation set for the FinNI subtask within the FinTagging benchmark. |
| FinCL-eval | Evaluation set for the FinCL subtask within the FinTagging benchmark. |
| FinTagging_Original | The original benchmark dataset without preprocessing, suitable for custom research. The annotated data ("benchmark_ground_truth_pipeline.json") is provided in the "annotation" folder. |
| FinTagging_BIO | BIO-format dataset tailored for token-level tagging with BERT-series models. This data is also provided in the "BERT/data" folder under the name "test_data_benchmark.bio" (a loading sketch follows the table). |
| FinTagging_Trainset | Training set for the BERT-series models, provided in two formats: JSON ("TrainingSet_Annotation.json" in the "annotation" folder) and BIO ("train_data_all.bio" in the "BERT/data" folder). |
| FinTagging_Subset | Subsets for the FinNI and FinCL tasks. |
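If the released BIO files follow the common CoNLL-style layout (one whitespace-separated token/label pair per line, blank lines separating sentences), they can be loaded with a minimal reader like the sketch below. This is an illustrative assumption, not a guarantee about the files; verify the exact delimiter and layout against "BERT/data/test_data_benchmark.bio" before relying on it.

```python
# Minimal sketch for loading a BIO file such as BERT/data/test_data_benchmark.bio.
# Assumes the common CoNLL-style layout: one "token<whitespace>label" pair per line,
# with blank lines separating sentences -- verify against the released files.
def read_bio(path: str):
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            token, label = line.rsplit(maxsplit=1)
            tokens.append(token)
            labels.append(label)
    if tokens:                                # flush the final sentence
        sentences.append((tokens, labels))
    return sentences

data = read_bio("BERT/data/test_data_benchmark.bio")
print(len(data), data[0])                     # number of sentences and the first (tokens, labels) pair
```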
We benchmarked 10 cutting-edge LLMs and 3 advanced PLMs on FinTagging:
- 🌐 GPT-4o — OpenAI’s multimodal flagship model with structured output support.
- 🚀 DeepSeek-V3 — A MoE reasoning model with efficient inference via MLA.
- 🧠 Qwen2.5 Series — Multilingual models optimized for reasoning, coding, and math. Here, we assessed the 14B, 1.5B, and 0.5B Instruct models.
- 🦙 Llama-3 Series — Meta’s open-source instruction-tuned models for long context. Here, we assessed the Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct models.
- 🧭 DeepSeek-R1 Series — RL-tuned first-generation reasoning models with strong zero-shot performance. Here, we assessed only the DeepSeek-R1-Distill-Qwen-32B model.
- 🧪 Gemma-2 Model — Google’s latest instruction-tuned model with open weights. Here, we assessed only the gemma-2-27b-it model.
- 💎 Fino1-8B — Our in-house financial LLM with strong reasoning capability.
- 🏛️ BERT-large — The classic transformer encoder for language understanding.
- 📉 FinBERT — A financial domain-tuned BERT for sentiment analysis.
- 🧾 SECBERT — BERT model fine-tuned on SEC filings for financial disclosure tasks.
- Local Model Inference: Conducted via FinBen (vLLM framework); a minimal standalone sketch of the vLLM backend appears after this list.
- We provide task-specific evaluation scripts through our forked version of the FinBen framework, available at: https://github.com/Yan2266336/FinBen.
- For the FinNI task, you can directly execute the provided script to evaluate a variety of LLMs, including both local and API-based models.
- For the FinCL task, first run the retrieval script from the repository to obtain US-GAAP candidate concepts. Then, use our provided prompts to construct instruction-style inputs, and apply the reranking method implemented in the forked FinBen to identify the most appropriate US-GAAP concept.
- Taxonomy: We provide the original US-GAAP taxonomy file ("us-gaap-2024.xsd") as well as the processed BM25 taxonomy index document ("us_gaap_2024_BM25.jsonl") in the "taxonomy" folder.
- Note: Running the retrieval script requires a local installation of Elasticsearch. We provide our embedding index document on Google Drive: https://drive.google.com/file/d/1cyMONjP9WdHtD8-WGezmgh_LNhbY3qtR/view?usp=drive_link. Alternatively, you can construct your own index from the original US-GAAP taxonomy file instead of using ours (an indexing and retrieval sketch also follows this list).
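The sketch below is not the FinBen harness itself; it is a hedged, standalone example of the vLLM backend that FinBen uses for local inference, showing how one of the benchmarked open-weight models (Llama-3.1-8B-Instruct, chosen here only as an example) can be loaded and prompted. The instruction text is illustrative and is not the benchmark's official FinNI prompt.

```python
# Hypothetical standalone sketch of the vLLM backend used for local inference.
# This is NOT the FinBen harness; it only shows how an open-weight model from the
# benchmark can be loaded and queried locally. The prompt text is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")        # a local model path also works
params = SamplingParams(temperature=0.0, max_tokens=512)   # greedy decoding for reproducible extraction

prompt = (
    "Extract every numerical financial fact from the sentence below and return "
    "a JSON list of objects with 'value' and 'type' fields.\n"
    "Sentence: Net revenue increased to $4.2 billion in fiscal 2024."
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)                          # raw model output to be parsed and scored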
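For the first step of the FinCL pipeline, an index can also be built locally from the released "us_gaap_2024_BM25.jsonl" file rather than downloading ours. The sketch below is a hedged example, not the repository's retrieval script: it assumes the Elasticsearch 8.x Python client and assumes each JSONL record carries fields such as "concept" and "text" (adjust these names to the actual schema of the file).

```python
# Hypothetical sketch: build a local Elasticsearch index from the provided
# us_gaap_2024_BM25.jsonl file and retrieve candidate US-GAAP concepts for a
# numeric fact. Elasticsearch scores text matches with BM25 by default.
# Field names ("concept", "text") are assumptions -- adjust to the real schema.
import json
from elasticsearch import Elasticsearch, helpers  # assumes the 8.x Python client

INDEX = "us_gaap_2024"
es = Elasticsearch("http://localhost:9200")

# 1. Index every taxonomy entry once.
if not es.indices.exists(index=INDEX):
    es.indices.create(index=INDEX)
    with open("taxonomy/us_gaap_2024_BM25.jsonl", encoding="utf-8") as f:
        actions = ({"_index": INDEX, "_source": json.loads(line)} for line in f)
        helpers.bulk(es, actions)

# 2. Retrieve the top-k candidate concepts for the sentence containing a numeric fact.
def retrieve_candidates(context: str, k: int = 10) -> list[str]:
    resp = es.search(index=INDEX, query={"match": {"text": context}}, size=k)
    return [hit["_source"]["concept"] for hit in resp["hits"]["hits"]]

candidates = retrieve_candidates("Revenue from contracts with customers was $4.2 billion in 2024")
print(candidates)  # candidate US-GAAP concepts to feed into the LLM reranking prompt
```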
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|---|
| Closed-source LLM | GPT-4o | 0.0764 🥈 | 0.0576 🥈 | 0.0508 🥈 | 0.0947 | 0.0788 | 0.0860 |
| Open-source LLMs | DeepSeek-V3 | 0.0813 🥇 | 0.0696 🥇 | 0.0582 🥇 | 0.1058 | 0.1217 🥉 | 0.1132 🥉 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.0482 🥉 | 0.0288 🥉 | 0.0266 🥉 | 0.0692 | 0.0223 | 0.0337 |
| | Qwen2.5-14B-Instruct | 0.0423 | 0.0256 | 0.0235 | 0.0197 | 0.0133 | 0.0159 |
| | gemma-2-27b-it | 0.0430 | 0.0273 | 0.0254 | 0.0519 | 0.0453 | 0.0483 |
| | Llama-3.1-8B-Instruct | 0.0287 | 0.0152 | 0.0137 | 0.0462 | 0.0154 | 0.0231 |
| | Llama-3.2-3B-Instruct | 0.0182 | 0.0109 | 0.0083 | 0.0151 | 0.0102 | 0.0121 |
| | Qwen2.5-1.5B-Instruct | 0.0180 | 0.0079 | 0.0069 | 0.0248 | 0.0060 | 0.0096 |
| | Qwen2.5-0.5B-Instruct | 0.0014 | 0.0003 | 0.0004 | 0.0047 | 0.0001 | 0.0002 |
| Financial LLM | Fino1-8B | 0.0299 | 0.0146 | 0.0140 | 0.0355 | 0.0133 | 0.0193 |
| Fine-tuned PLMs | BERT-large | 0.0135 | 0.0200 | 0.0126 | 0.1397 🥈 | 0.1145 🥈 | 0.1259 🥈 |
| | FinBERT | 0.0088 | 0.0143 | 0.0087 | 0.1293 🥉 | 0.0963 | 0.1104 |
| | SECBERT | 0.0308 | 0.0483 | 0.0331 | 0.2144 🥇 | 0.2146 🥇 | 0.2145 🥇 |
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Precision | Recall | F1 |
|---|---|---|---|---|
| Closed-source LLM | GPT-4o | 0.6105 🥈 | 0.5941 🥈 | 0.6022 🥈 |
| Open-source LLMs | DeepSeek-V3 | 0.6329 🥇 | 0.8452 🥇 | 0.7238 🥇 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.5490 🥉 | 0.2238 🥉 | 0.3180 🥉 |
| | Qwen2.5-14B-Instruct | 0.3632 | 0.0018 | 0.0035 |
| | gemma-2-27b-it | 0.5319 | 0.5490 🥉 | 0.5403 🥉 |
| | Llama-3.1-8B-Instruct | 0.3346 | 0.1746 | 0.2295 |
| | Llama-3.2-3B-Instruct | 0.1887 | 0.1794 | 0.1839 |
| | Qwen2.5-1.5B-Instruct | 0.1323 | 0.0636 | 0.0859 |
| | Qwen2.5-0.5B-Instruct | 0.0116 | 0.0027 | 0.0043 |
| Financial LLM | Fino1-8B | 0.3416 | 0.1481 | 0.2066 |
🥇 = best, 🥈 = second-best, 🥉 = third-best
| Category | Models | Accuracy |
|---|---|---|
| Closed-source LLM | GPT-4o | 0.1664 🥈 |
| Open-source LLMs | DeepSeek-V3 | 0.1715 🥇 |
| | DeepSeek-R1-Distill-Qwen-32B | 0.1013 |
| | Qwen2.5-14B-Instruct | 0.1072 🥉 |
| | gemma-2-27b-it | 0.1009 |
| | Llama-3.1-8B-Instruct | 0.0807 |
| | Llama-3.2-3B-Instruct | 0.0375 |
| | Qwen2.5-1.5B-Instruct | 0.0419 |
| | Qwen2.5-0.5B-Instruct | 0.0246 |
| Financial LLM | Fino1-8B | 0.0704 |
If you find our benchmark useful, please cite:
```bibtex
@misc{wang2025fintaggingbenchmarkingllmsextracting,
  title={FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information},
  author={Yan Wang and Yang Ren and Lingfei Qian and Xueqing Peng and Keyi Wang and Yi Han and Dongji Feng and Fengran Mo and Shengyuan Lin and Qinchuan Zhang and Kaiwen He and Chenri Luo and Jianxing Chen and Junwei Wu and Jimin Huang and Guojun Xiong and Xiao-Yang Liu and Qianqian Xie and Jian-Yun Nie},
  year={2025},
  eprint={2505.20650},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20650},
}
```