Clinically-Informed Evaluation of Vision-Language Models for Radiology Report Generation
This repository provides the source code and evaluation tools accompanying our AMIA 2025 oral paper, "A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric".
This project introduces a clinically-grounded evaluation framework for assessing the quality of chest X-ray radiology reports generated by Vision-Language Models (VLMs). We propose:
- A taxonomy of 12 clinically meaningful error types, covering omissions, hallucinations, uncertainty handling, and misclassifications.
- A novel evaluation metric: Clinical Risk-weighted Error Score for Text-generation (CREST), which incorporates clinical severity into report assessment.
- An in-depth comparison of three open-source VLMs: DeepSeek VL2, CXR-LLaVA, and CheXagent.
Requirements:
- Python ≥ 3.6
- pandas, matplotlib, numpy, openpyxl
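For convenience, the dependencies can typically be installed with `pip install pandas matplotlib numpy openpyxl`.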
We use 685 gold-standard chest X-ray cases from the MIMIC-CXR dataset (test split), each annotated for 13 clinical conditions:
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Lesion
- Lung Opacity
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices
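As a quick orientation, the minimal sketch below (not part of the notebooks) shows how the gold-standard file could be loaded and checked with pandas; the exact column names are an assumption and may differ in the actual spreadsheet.

```python
import pandas as pd

# Minimal sketch: load the gold-standard labels and sanity-check the 13 condition
# columns. The column names below are assumed to match the condition list above.
CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Lesion", "Lung Opacity",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices",
]

gold = pd.read_excel("mimic-cxr-gold-standard.xlsx")     # reading .xlsx requires openpyxl
print(len(gold))                                         # expected: 685 cases
print([c for c in CONDITIONS if c not in gold.columns])  # columns missing from the sheet
```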
Predictions from the following models are provided:
- `labels_DeepSeek_fixed.xlsx` (DeepSeek VL2)
- `labels_CXR_LLaVA_fixed.xlsx` (CXR-LLaVA)
- `labels_cheXagent_fixed.xlsx` (CheXagent)
Each file contains per-case, per-condition predictions in the format:
`1` = positive, `0` = negative, `-1` = uncertain, `-2` = not mentioned
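For illustration only, this encoding could be decoded with pandas as in the sketch below; the per-condition column layout and the "Pneumonia" column name are assumptions.

```python
import pandas as pd

# Illustrative sketch: read one model's predictions and decode the label values
# described above. The "Pneumonia" column name is an assumed example.
LABEL_MEANING = {1: "positive", 0: "negative", -1: "uncertain", -2: "not mentioned"}

preds = pd.read_excel("labels_DeepSeek_fixed.xlsx")
print(preds["Pneumonia"].map(LABEL_MEANING).value_counts())
```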
The provided Jupyter notebooks support the following analyses (illustrative sketches follow this list):
- `Label_Distribution.ipynb`
  - Visualize label frequency across all 13 clinical conditions.
- `Error_Analysis.ipynb`
  - Compute total error counts (i.e., mismatches with the gold-standard labels).
  - Normalize error rates by ground-truth frequency for each of the 12 predefined error types.
  - Analyze condition-level error rates.
- `CREST_Evaluation.ipynb`
  - Compute the CREST score for each model, incorporating the clinical severity of each error.
  - Analyze per-condition normalized CREST and macro-averaged performance.
📁 VLM-CREST/
├── mimic-cxr-gold-standard.xlsx # Gold-standard labels
├── labels_DeepSeek_fixed.xlsx # DeepSeek VL2 predictions
├── labels_CXR_LLaVA_fixed.xlsx # CXR-LLaVA predictions
├── labels_cheXagent_fixed.xlsx # CheXagent predictions
│
├── Label_Distribution.ipynb # Label distribution for conditions
├── Error_Analysis.ipynb # Error type and condition error analysis
├── CREST_Evaluation.ipynb # CREST score calculation and plots