Clinically-Informed Evaluation of Vision-Language Models for Radiology Report Generation
This repository provides the source code and evaluation tools accompanying our AMIA 2025 oral paper, "A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric".
This project introduces a clinically-grounded evaluation framework for assessing the quality of chest X-ray radiology reports generated by Vision-Language Models (VLMs). We propose:
- A taxonomy of 12 clinically meaningful error types, covering omissions, hallucinations, uncertainty handling, and misclassifications.
- A novel evaluation metric: Clinical Risk-weighted Error Score for Text-generation (CREST), which incorporates clinical severity into report assessment.
- An in-depth comparison of three open-source VLMs: DeepSeek VL2, CXR-LLaVA, and CheXagent.
Requirements:
- Python ≥ 3.6
- pandas, matplotlib, numpy, openpyxl
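For convenience, the dependencies can typically be installed with `pip install pandas matplotlib numpy openpyxl`.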
We use 685 gold-standard chest X-ray cases from the MIMIC-CXR dataset (test split), each annotated for 13 clinical conditions:
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Lesion
- Lung Opacity
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices
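As a quick orientation, the minimal sketch below (not part of the notebooks) shows how the gold-standard file could be loaded and checked with pandas; the exact column names are an assumption and may differ in the actual spreadsheet.

```python
import pandas as pd

# Minimal sketch: load the gold-standard labels and sanity-check the 13 condition
# columns. The column names below are assumed to match the condition list above.
CONDITIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Lesion", "Lung Opacity",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices",
]

gold = pd.read_excel("mimic-cxr-gold-standard.xlsx")     # reading .xlsx requires openpyxl
print(len(gold))                                         # expected: 685 cases
print([c for c in CONDITIONS if c not in gold.columns])  # columns missing from the sheet
```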
Predictions from the following models are provided:
- `labels_DeepSeek_fixed.xlsx` (DeepSeek VL2)
- `labels_CXR_LLaVA_fixed.xlsx` (CXR-LLaVA)
- `labels_cheXagent_fixed.xlsx` (CheXagent)
Each file contains per-case, per-condition predictions in the format:
`1` = positive, `0` = negative, `-1` = uncertain, `-2` = not mentioned
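For illustration only, this encoding could be decoded with pandas as in the sketch below; the per-condition column layout and the "Pneumonia" column name are assumptions.

```python
import pandas as pd

# Illustrative sketch: read one model's predictions and decode the label values
# described above. The "Pneumonia" column name is an assumed example.
LABEL_MEANING = {1: "positive", 0: "negative", -1: "uncertain", -2: "not mentioned"}

preds = pd.read_excel("labels_DeepSeek_fixed.xlsx")
print(preds["Pneumonia"].map(LABEL_MEANING).value_counts())
```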
The provided Jupyter notebooks support the following analyses (illustrative sketches follow this list):
- `Label_Distribution.ipynb`
  - Visualize label frequency across all 13 clinical conditions.
- `Error_Analysis.ipynb`
  - Compute total error counts (i.e., mismatches with the gold-standard labels).
  - Normalize error rates by ground-truth frequency for each of the 12 predefined error types.
  - Analyze condition-level error rates.
- `CREST_Evaluation.ipynb`
  - Compute the CREST score for each model, incorporating the clinical severity of each error.
  - Analyze per-condition normalized CREST and macro-averaged performance.
📁 VLM-CREST/
├── mimic-cxr-gold-standard.xlsx # Gold-standard labels
├── labels_DeepSeek_fixed.xlsx # DeepSeek VL2 predictions
├── labels_CXR_LLaVA_fixed.xlsx # CXR-LLaVA predictions
├── labels_cheXagent_fixed.xlsx # CheXagent predictions
│
├── Label_Distribution.ipynb # Label distribution for conditions
├── Error_Analysis.ipynb # Error type and condition error analysis
├── CREST_Evaluation.ipynb # CREST score calculation and plots