
Clinically-Informed Evaluation of Vision-Language Models for Radiology Report Generation

This repository provides the source code and evaluation tools accompanying our AMIA 2025 oral paper: "A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric".


🔍 Overview

This project introduces a clinically-grounded evaluation framework for assessing the quality of chest X-ray radiology reports generated by Vision-Language Models (VLMs). We propose:

  • A taxonomy of 12 clinically meaningful error types, covering omissions, hallucinations, uncertainty handling, and misclassifications.
  • A novel evaluation metric: Clinical Risk-weighted Error Score for Text-generation (CREST), which incorporates clinical severity into report assessment.
  • An in-depth comparison of three open-source VLMs: DeepSeek VL2, CXR-LLaVA, and CheXagent.

🛠 Requirements

  • Python ≥ 3.6
  • pandas, matplotlib, numpy, openpyxl
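
These can be installed with pip, e.g. pip install pandas matplotlib numpy openpyxl.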

🗂 Data

We use 685 gold-standard chest X-ray cases from the MIMIC-CXR dataset (test split), each annotated for 13 clinical conditions:

  • Enlarged Cardiomediastinum
  • Cardiomegaly
  • Lung Lesion
  • Lung Opacity
  • Edema
  • Consolidation
  • Pneumonia
  • Atelectasis
  • Pneumothorax
  • Pleural Effusion
  • Pleural Other
  • Fracture
  • Support Devices

Predictions from the following models are provided:

  • labels_DeepSeek_fixed.xlsx
  • labels_CXR_LLaVA_fixed.xlsx
  • labels_cheXagent_fixed.xlsx

Each file contains per-case, per-condition predictions in the format:
{1: positive, 0: negative, -1: uncertain, -2: not mentioned}
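
As a minimal sketch (assuming each spreadsheet has one row per case and one column per condition; the actual column names and layout should be checked against the files), the label files can be loaded and decoded with pandas:

```python
# Minimal sketch: load the gold labels and one model's predictions, then decode
# the label codes. Assumes one row per case and one column per condition;
# column names/layout should be verified against the actual spreadsheets.
import pandas as pd

LABEL_MEANING = {1: "positive", 0: "negative", -1: "uncertain", -2: "not mentioned"}

gold = pd.read_excel("mimic-cxr-gold-standard.xlsx")   # gold-standard labels
pred = pd.read_excel("labels_DeepSeek_fixed.xlsx")     # one model's predictions

# Show a few decoded predictions for illustration.
print(pred.head().replace(LABEL_MEANING))
```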


⚙️ Functionality

The provided Python code (as Jupyter notebooks) supports:

Label Distribution

  • Visualize label frequency across all 13 clinical conditions.
  • Script: "Label_Distribution.ipynb".

Error Analysis

  • Compute total error counts (i.e., mismatches with gold labels).
  • Normalize error rates by ground-truth frequency for each of the 12 predefined error types.
  • Analyze condition-level error rates.
  • Script: "Error_Analysis.ipynb".

Clinical Risk Analysis (CREST)

  • Compute the CREST score for each model, incorporating the clinical severity of each error.
  • Analyze per-condition normalized CREST and macro-average performance.
  • Script: "CREST_Evaluation.ipynb".

📂 File Structure

📁 VLM-CREST/
├── mimic-cxr-gold-standard.xlsx         # Gold-standard labels
├── labels_DeepSeek_fixed.xlsx           # DeepSeek VL2 predictions
├── labels_CXR_LLaVA_fixed.xlsx          # CXR-LLaVA predictions
├── labels_cheXagent_fixed.xlsx          # CheXagent predictions
│
├── Label_Distribution.ipynb             # Label distribution for conditions
├── Error_Analysis.ipynb                 # Error type and condition error analysis
└── CREST_Evaluation.ipynb               # CREST score calculation and plots
