ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training

[ECAMP framework overview figure]

Rongsheng Wang1,2,6*, Qingsong Yao3*, Zihang Jiang1,2†, Haoran Lai1,2,6, Zhiyang He6, Xiaodong Tao6, S. Kevin Zhou1,2,4,5†

*Equal contribution. †Corresponding author.


1 School of Biomedical Engineering, University of Science and Technology of China 
2 Suzhou Institute for Advanced Research, University of Science and Technology of China 
3 Stanford University, Palo Alto, CA, 94025, United States 
4 Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology, Suzhou, Jiangsu, 215123, China 
5 Key Laboratory of Precision and Intelligent Chemistry, USTC, Hefei, Anhui, 230026, China 
6 Anhui iFLYTEK Co., Ltd.



News 🥰:

  • ECAMP has been accepted by Medical Image Analysis (2025)! 🎉

Introduction

Despite significant advancements in medical vision-language pre-training, existing methods have largely overlooked the inherent linguistic complexity and imbalance within medical reports, as well as the complex cross-modality contextual relationships between texts and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which establishes a more entity-centered, context-sensitive, and balanced understanding of medical reports to effectively pre-train the vision encoder. We first distill entity-centered context from medical reports using large language models, enabling ECAMP to draw more precise supervision from the text modality. By further incorporating entity-aware re-balancing factors and descriptor masking strategies into masked language modeling, ECAMP significantly enhances the knowledge of entities within the reports. A context-guided super-resolution task is proposed alongside a multi-scale context fusion design to improve the semantic integration of coarse- and fine-level image representations, which yields better performance on multi-scale downstream applications. ECAMP integrates these innovations, leading to significant performance leaps over current state-of-the-art methods and establishing a new standard for cross-modality pre-training in medical imaging. The effectiveness of ECAMP is demonstrated by extensive experiments on various domains and organs, achieving cutting-edge results on multiple tasks including classification, segmentation, and detection across 5 public chest X-ray datasets and 4 fundoscopy datasets.

Installation

Clone this repository:

git clone https://github.com/ToniChopp/ECAMP.git

Install Python dependencies:

conda env create -f environment.yml
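
After the environment is created, activate it before running any scripts. The environment name below is an assumption; use the name defined in environment.yml:

conda activate ecamp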

Resource fetching

We offer the pre-training and fine-tuning code of ECAMP, whose contribution is pre-training representative, multi-scale features from complex and imbalanced medical reports. We pre-train our method on both the MIMIC-CXR and FFA-IR datasets.

  • MIMIC-CXR: We download the MIMIC-CXR-JPG dataset for the radiographs. Paired medical reports can be downloaded from MIMIC-CXR.
  • FFA-IR: We download the FFA-IR dataset for the fundus images and paired reports.

You can download the ViT-B/16 checkpoint here for pre-training.
Our pre-trained model can be found here for evaluation.

Our LLM-distilled reports have been released. You can fetch them here.

Pre-training

We pre-train ECAMP on MIMIC-CXR with the following commands:

cd ECAMP/ECAMP/Pre-training
chmod a+x run.sh
./run.sh

Note that other pre-training models can be flexibly developed under this framework.

Fine-tuning

For downstream tasks, we perform fine-tuned classification, linear-probing classification, fine-tuned segmentation, and detection.

Datasets

  • ChestX-ray14: We download the ChestX-ray14 dataset using its official split for classification.
  • CheXpert: We use the CheXpert dataset, consisting of 224,316 chest radiographs of 65,240 patients.
  • RSNA: We use stage 2 of the RSNA Pneumonia dataset.
  • COVIDx: We use version 7 of the COVIDx CXR dataset.
  • SIIM-ACR Pneumothorax: We use stage 1 of the SIIM-ACR Pneumothorax dataset.
  • ODIR-5K: We download ODIR-5K from its official site.
  • APTOS-2019: We download APTOS-2019 from Kaggle.
  • MuReD: We download MuReD from its official site.
  • RIGA: We download RIGA from its official site.

Classification

We evaluate the fine-tuned classification performance of our model with the following command:

CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task ChestX-ray14 --num_classes 14 \
    --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO ChestX-ray14/' \
    --output_dir "output/ChestX-ray14/1/" --data_volume '1' --num_steps 3000  --eval_batch_size 512 --img_size 224 \
    --learning_rate 3e-2 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96

You can change --task to select a specific dataset for fine-tuning classification. Seven datasets are available here: ChestX-ray14, CheXpert, RSNA, COVIDx, ODIR-5K, APTOS-2019, and MuReD. The --data_volume parameter specifies the fraction of training data used for fine-tuning.

For linear-probing classification, set --mode to LinearProbe.
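
As a sketch combining these flags, a linear-probing run on ChestX-ray14 with 10% of the training data could look as follows; only --mode, --data_volume, and --output_dir differ from the fine-tuning command above, and the remaining hyperparameters are carried over from that example rather than being tuned settings:

CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task ChestX-ray14 --num_classes 14 \
    --mode LinearProbe --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO ChestX-ray14/' \
    --output_dir "output/ChestX-ray14/10_linear/" --data_volume '10' --num_steps 3000 --eval_batch_size 512 --img_size 224 \
    --learning_rate 3e-2 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96

Here --data_volume '10' assumes the same percentage convention as '1' in the command above; adjust it to the fraction you need.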

Segmentation

We evaluate the fine-tuned segmentation performance of our model with the following command:

CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task RSNA --img_size 224 \
    --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO RSNA/' \
    --output_dir "output/RSNA/1/" --data_volume '1' --num_steps 3000  --eval_batch_size 512 \
    --learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05

You can change --task to select a specific dataset for segmentation; 3 datasets are available: SIIM, RSNA, and RIGA. The --data_volume parameter specifies the fraction of training data used for fine-tuning.
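
For example, a SIIM-ACR Pneumothorax run could be launched as sketched below; it only swaps --task, the dataset path, and the output directory relative to the RSNA command above, while the other hyperparameters are carried over unchanged rather than being tuned for SIIM:

CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task SIIM --img_size 224 \
    --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO SIIM/' \
    --output_dir "output/SIIM/1/" --data_volume '1' --num_steps 3000 --eval_batch_size 512 \
    --learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05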

Detection

We evaluate the fine-tuned detection performance of our model with the following command:

CUDA_VISIBLE_DEVICES=0 python train.py --name ecamp --stage train --model vit_base_patch16 --task RSNA --img_size 224 \
    --pretrained_path '$PATH TO ECAMP_ViT_Base_16.pth' --dataset_path '$PATH TO RSNA/' \
    --output_dir "output/RSNA/1/" --data_volume '1' --num_steps 3000  --eval_batch_size 512 \
    --learning_rate 3e-4 --warmup_steps 50 --fp16 --fp16_opt_level O2 --train_batch_size 96 --weight_decay 0.05

Acknowledgement

Some code is borrowed from MAE, Hugging Face, and MRM.

Reference

If you find our work valuable for your research, please consider acknowledging and citing our contribution:

@article{WANG2025ECAMP,
title = {ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training},
journal = {Medical Image Analysis},
volume = {105},
pages = {103690},
year = {2025},
issn = {1361-8415},
doi = {10.1016/j.media.2025.103690},
url = {https://www.sciencedirect.com/science/article/pii/S1361841525002373},
author = {Rongsheng Wang and Qingsong Yao and Zihang Jiang and Haoran Lai and Zhiyang He and Xiaodong Tao and S. Kevin Zhou},
keywords = {Medical Vision-language Pre-training, Masked Modeling, Cross-modality Learning},
abstract = {Despite significant advancements in medical vision-language pre-training, existing methods have largely overlooked the inherent linguistic complexity and imbalanced issue within medical reports, as well as the complex cross-modality contextual relationships between texts and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which establishes a more entity-centered, context-sensitive, and balanced understanding of medical reports to effectively pre-train the vision encoder. We first distill entity-centered context from medical reports utilizing large language models, enabling ECAMP to draw more precise supervision from the text modality. By further incorporating entity-aware re-balanced factor and descriptor masking strategies into masked language modeling, ECAMP significantly enhances the knowledge of entities within the reports. A context-guided super-resolution task is proposed alongside a multi-scale context fusion design to improve the semantic integration of both coarse and fine-level image representations, which prompts better performance for multi-scale downstream applications. ECAMP integrates these innovations together, leading to significant performance leaps over current state-of-the-art methods and establish a new standard for cross-modality pre-training in medical imaging. The effectiveness of ECAMP is demonstrated by extensive experiments on various domains and organs, which achieves cutting-edge results on multiple tasks including classification, segmentation, and detection across 5 public chest X-ray and 4 fundoscopy datasets respectively.}
}

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Hope you enjoy!
