Juncheng Wu*, Sheng Liu*, Haoqin Tu*, Hang Yu*, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou
In this work, we propose a fine-grained evaluation framework that includes two novel step-by-step metrics for LLM reasoning:
- Knowledge Index (KI): measures the correctness of the knowledge used in the reasoning
- Information Gain (Info Gain): measures the quality of each reasoning step
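As a rough intuition for Info Gain (a minimal sketch, not the paper's exact formula), a reasoning step is informative if it raises the model's confidence in the correct answer relative to the previous step. The function and input probabilities below are purely illustrative:

```python
# Sketch of the Info Gain intuition: per-step change in the model's
# probability of the correct answer (hypothetical values, not the
# repo's implementation).

def info_gain_per_step(answer_probs):
    """answer_probs[i]: model's probability of the correct answer
    after conditioning on reasoning steps 0..i (illustrative inputs)."""
    return [curr - prev for prev, curr in zip(answer_probs, answer_probs[1:])]

# probability of the correct answer after each successive step
gains = info_gain_per_step([0.10, 0.35, 0.60, 0.90])
print(gains)  # each entry is one step's contribution
```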
- Setup: install the conda environment from `environment.yml`
- Reasoning Decomposition: call the function `llm_decompose_reasoning` in `utils.py`
- Info Gain: refer to the `InformationGainScorer` class in `metrics.py`
- Knowledge Index: refer to the `RetrievalScorer` class in `metrics.py`
```python
import utils
from metrics import RetrievalScorer, InformationGainScorer

# reasoning decomposition
# reasoning_steps: the raw reasoning output from the LLM
decomposed_steps = utils.llm_decompose_reasoning(reasoning_steps)

# initialize the metric scorers
retrieval_scorer = RetrievalScorer()
information_gain_scorer = InformationGainScorer(model_name='Qwen/Qwen2.5-7B')

# metric calculation
KI = retrieval_scorer.forward(decomposed_steps)
InfoGain = information_gain_scorer.forward(decomposed_steps)
```
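For the Knowledge Index, a toy sketch of the intuition (assumed for illustration, not the repo's `RetrievalScorer` implementation): KI can be read as the fraction of knowledge statements in the reasoning that are supported by a reference knowledge source. The reference set and statements below are made up:

```python
# Toy sketch of the Knowledge Index intuition: fraction of knowledge
# statements supported by a reference source (illustrative only; the
# repo uses a retrieval model for this check).

def knowledge_index(statements, is_supported):
    """statements: knowledge statements extracted from reasoning steps;
    is_supported: predicate checking a statement against references."""
    if not statements:
        return 0.0
    correct = sum(1 for s in statements if is_supported(s))
    return correct / len(statements)

# toy reference set standing in for a real retrieval model
reference = {"insulin lowers blood glucose", "the heart has four chambers"}
statements = ["insulin lowers blood glucose", "the liver produces insulin"]
print(knowledge_index(statements, lambda s: s in reference))  # 0.5
```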
This work was partially funded by an unrestricted gift from Google. We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
We gratefully thank the MedRAG authors for the knowledge retrieval model and toolkit!
```bibtex
@misc{wu2025knowledgereasoningcloselook,
      title={Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains},
      author={Juncheng Wu and Sheng Liu and Haoqin Tu and Hang Yu and Xiaoke Huang and James Zou and Cihang Xie and Yuyin Zhou},
      year={2025},
      eprint={2506.02126},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.02126},
}
```