*Equal contribution. †Corresponding author.
2 Suzhou Institute for Advanced Research, University of Science and Technology of China
3 Stanford University, Palo Alto, CA, 94025, United States
4 Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology, Suzhou, Jiangsu, 215123, China
5 Key Laboratory of Precision and Intelligent Chemistry, USTC, Hefei Anhui, 230026, China
6 Anhui IFLYTEK CO., Ltd.
News 🥰:
- ECAMP has been accepted by Medical Image Analysis 2025! 🎉
Despite significant advancements in medical vision-language pre-training, existing methods largely overlook the inherent linguistic complexity and imbalance within medical reports, as well as the complex cross-modality contextual relationships between texts and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which establishes a more entity-centered, context-sensitive, and balanced understanding of medical reports to effectively pre-train the vision encoder. We first distill entity-centered context from medical reports using large language models, enabling ECAMP to draw more precise supervision from the text modality. By further incorporating an entity-aware re-balanced factor and a descriptor masking strategy into masked language modeling, ECAMP significantly enhances the knowledge of entities within the reports. A context-guided super-resolution task is proposed alongside a multi-scale context fusion design to improve the semantic integration of both coarse- and fine-level image representations, which yields better performance for multi-scale downstream applications. ECAMP integrates these innovations, leading to significant performance leaps over current state-of-the-art methods and establishing a new standard for cross-modality pre-training in medical imaging. The effectiveness of ECAMP is demonstrated by extensive experiments on various domains and organs, achieving cutting-edge results on multiple tasks including classification, segmentation, and detection across 5 public chest X-ray datasets and 4 fundoscopy datasets.
Clone this repository:
git clone https://github.com/ToniChopp/ECAMP.git
Install Python dependencies:
conda env create -f environment.yml
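Then activate the environment before running any of the commands below. This assumes the environment defined in environment.yml is named `ecamp`; check its `name:` field if your setup differs:

```bash
# Activate the conda environment created from environment.yml.
# "ecamp" is an assumed name; use whatever the `name:` field specifies.
conda activate ecamp
```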
We provide the pre-training and fine-tuning code of ECAMP, which learns representative, multi-scale features from complex and imbalanced medical reports. We pre-train our method on both the MIMIC-CXR and FFA-IR datasets.
- MIMIC-CXR: We download the MIMIC-CXR-JPG dataset as the radiographs. Paired medical reports can be downloaded from MIMIC-CXR.
- FFA-IR: We download the FFA-IR dataset as the fundus images and paired reports.
You can download the ViT-B/16 checkpoint here for pre-training.
Our pre-trained model can be found here for evaluation.
Our LLM-distilled reports have been released. You can fetch them here.
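As a suggestion only, keeping the downloaded artifacts in one folder makes the later commands easier to fill in; the checkpoint filename matches the one referenced by `--pretrained_path` below, and everything else is a placeholder:

```bash
# Illustrative layout only; adjust run.sh and --pretrained_path to your actual paths.
mkdir -p checkpoints
mv ECAMP_ViT_Base_16.pth checkpoints/   # pre-trained ECAMP weights used by --pretrained_path
```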
We pre-train ECAMP on MIMIC-CXR using this command:
cd ECAMP/ECAMP/Pre-training
chmod a+x run.sh
./run.sh
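For long pre-training runs, it can help to launch the script in the background and keep a log; this is a generic shell pattern rather than part of the original instructions:

```bash
# Optional: run pre-training in the background and keep a log file.
nohup ./run.sh > pretrain.log 2>&1 &
tail -f pretrain.log   # follow training progress
```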
Note that this framework can be flexibly extended to develop other pre-training models.
For downstream tasks, we perform fine-tuned classification, linear-probe classification, fine-tuned segmentation, and detection.
- ChestX-ray14: We download the ChestX-ray14 dataset using its official split for classification.
- CheXpert: We use the CheXpert dataset, consisting of 224,316 chest radiographs of 65,240 patients.
- RSNA: We use stage 2 of the RSNA Pneumonia dataset.
- COVIDx: We use version 7 of the COVIDx CXR dataset.
- SIIM-ACR Pneumothorax: We use stage 1 of the SIIM-ACR Pneumothorax dataset.
- ODIR-5K: We download ODIR-5K from its official site.
- APTOS-2019: We download APTOS-2019 from Kaggle.
- MuReD: We download MuReD from its official site.
- RIGA: We download RIGA from its official site.
We evaluate the fine-tuned classification performance of our model using this command:
CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task ChestX-ray14 --num_classes 14 \
--pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO ChestX-ray14/' \
--output_dir "output/ChestX-ray14/1/" --data_volume '1' --num_steps 3000 --eval_batch_size 512 --img_size 224 \
--learning_rate 3e-2 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96
You can change `--task` to select the dataset for fine-tuning classification. Seven datasets are available: ChestX-ray14, CheXpert, RSNA, COVIDx, ODIR-5K, APTOS-2019, and MuReD. The `--data_volume` parameter specifies the fraction of training data used for fine-tuning.
For linear-probe classification, please set `--mode` to `LinearProbe`.
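For example, a linear-probe run on ChestX-ray14 might look like the sketch below, which simply adds `--mode LinearProbe` to the fine-tuning command above; the remaining hyperparameters are kept unchanged for illustration and may need retuning for linear probing:

```bash
# Sketch of a linear-probe run (uses the flags documented above; hyperparameters untuned).
CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task ChestX-ray14 --num_classes 14 \
    --mode LinearProbe --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO ChestX-ray14/' \
    --output_dir "output/ChestX-ray14/linear_probe/1/" --data_volume '1' --num_steps 3000 --eval_batch_size 512 --img_size 224 \
    --learning_rate 3e-2 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96
```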
We evaluate the fine-tuned segmentation performance of our model using this command:
CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task RSNA --img_size 224 \
--pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO RSNA/' \
--output_dir "output/RSNA/1/" --data_volume '1' --num_steps 3000 --eval_batch_size 512 \
--learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05
You can change `--task` to select the dataset for segmentation; three datasets are available: SIIM, RSNA, and RIGA. The `--data_volume` parameter specifies the fraction of training data used for fine-tuning.
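As an illustration, the same command can target SIIM with a larger training fraction. This sketch assumes `--data_volume` accepts values such as '1', '10', and '100' (percentages of the training set); check the training script for the exact accepted values:

```bash
# Sketch: fine-tuned segmentation on SIIM with an assumed 10% training fraction.
CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task SIIM --img_size 224 \
    --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO SIIM/' \
    --output_dir "output/SIIM/10/" --data_volume '10' --num_steps 3000 --eval_batch_size 512 \
    --learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05
```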
We evaluate the fine-tuned detection performance of our model using this command:
CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task RSNA --img_size 224 \
--pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO RSNA/' \
--output_dir "output/RSNA/1/" --data_volume '1' --num_steps 3000 --eval_batch_size 512 \
--learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05
Some code is borrowed from MAE, huggingface and MRM.
If you find our work valuable for your research, please cite:
@article{WANG2025ECAMP,
title = {ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training},
journal = {Medical Image Analysis},
volume = {105},
pages = {103690},
year = {2025},
issn = {1361-8415},
doi = {10.1016/j.media.2025.103690},
url = {https://www.sciencedirect.com/science/article/pii/S1361841525002373},
author = {Rongsheng Wang and Qingsong Yao and Zihang Jiang and Haoran Lai and Zhiyang He and Xiaodong Tao and S. Kevin Zhou},
keywords = {Medical Vision-language Pre-training, Masked Modeling, Cross-modality Learning},
abstract = {Despite significant advancements in medical vision-language pre-training, existing methods have largely overlooked the inherent linguistic complexity and imbalanced issue within medical reports, as well as the complex cross-modality contextual relationships between texts and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which establishes a more entity-centered, context-sensitive, and balanced understanding of medical reports to effectively pre-train the vision encoder. We first distill entity-centered context from medical reports utilizing large language models, enabling ECAMP to draw more precise supervision from the text modality. By further incorporating entity-aware re-balanced factor and descriptor masking strategies into masked language modeling, ECAMP significantly enhances the knowledge of entities within the reports. A context-guided super-resolution task is proposed alongside a multi-scale context fusion design to improve the semantic integration of both coarse and fine-level image representations, which prompts better performance for multi-scale downstream applications. ECAMP integrates these innovations together, leading to significant performance leaps over current state-of-the-art methods and establish a new standard for cross-modality pre-training in medical imaging. The effectiveness of ECAMP is demonstrated by extensive experiments on various domains and organs, which achieves cutting-edge results on multiple tasks including classification, segmentation, and detection across 5 public chest X-ray and 4 fundoscopy datasets respectively.}
}
This project is released under the MIT license. Please see the LICENSE file for more information.
Hope you enjoy!