- Download the CT-RATE dataset into the `data` folder.
- Download the ImageNet pre-trained ViT weights from link, and the BiomedVLP-CXR-BERT-specialized text encoder from link, as used by CT-CLIP.
- Download the decomposed anatomy-wise descriptions from our provided supplementary materials link, and process the CT volumes with the following commands:

```bash
cd data
python fix_data.py --split [train/valid]
python generate_mask.py --split [train/valid]
python resize.py --split [train/valid]
python preprocess.py --split [train/valid]
```
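Each preprocessing script must be run for both splits, in order. As a convenience, the eight invocations above can be sketched as a small driver; the script names and `--split` flag come from the commands above, while the driver itself is an illustrative helper, not part of the repository.

```python
import subprocess

# Preprocessing scripts from the README, in the order they must run.
STEPS = ["fix_data.py", "generate_mask.py", "resize.py", "preprocess.py"]

def preprocess_commands(splits=("train", "valid")):
    """Build the full command list: every step for every split, order preserved."""
    return [
        ["python", step, "--split", split]
        for split in splits
        for step in STEPS
    ]

if __name__ == "__main__":
    for cmd in preprocess_commands():
        print(" ".join(cmd))
        # Uncomment to actually execute each step (requires the CT-RATE data):
        # subprocess.run(cmd, check=True, cwd="data")
```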
After these steps, the processed results should be organized as follows:
```
|-- BiomedVLP-CXR-BERT
|-- data
|   |-- train
|   |-- valid
|   |-- train_fixed
|   |-- valid_fixed
|   |-- train_mask
|   |-- valid_mask
|   |-- resized_train_images
|   |-- resized_train_masks
|   |-- resized_valid_images
|   |-- resized_valid_masks
|   |-- processed_train_images
|   |-- processed_train_masks
|   |-- processed_valid_images
|   |-- processed_valid_masks
|   |-- multi_abnormality_labels
|   |-- desc_info.json
|   |-- conc_info.json
|-- mae_pretrain_vit_base.pth
```
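Before launching training, it can help to confirm the layout matches the tree above. The check below is a minimal sketch: it only samples a few representative paths from the tree, and it assumes you run it from the repository root.

```python
from pathlib import Path

# A representative subset of the expected paths from the directory tree above.
EXPECTED = [
    "BiomedVLP-CXR-BERT",
    "data/processed_train_images",
    "data/processed_valid_images",
    "data/desc_info.json",
    "data/conc_info.json",
    "mae_pretrain_vit_base.pth",
]

def missing_paths(root="."):
    """Return the expected paths that do not yet exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    print("All expected paths present." if not missing else f"Missing: {missing}")
```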
Launch pre-training on 4 GPUs:

```bash
torchrun --nproc_per_node=4 train.py
```
Run evaluation, which saves the predictions to a CSV file:

```bash
torchrun --nproc_per_node=4 eval.py
```
Then, you can calculate the metrics from the generated CSV file:

```bash
python calc_metrics.py --csv_file res/xxx.csv
```
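For reference, a common way to score multi-abnormality predictions like these is per-abnormality AUROC over the CSV rows. The sketch below is illustrative only: the column names (`abnormality`, `label`, `score`) are assumptions, not the actual schema produced by `eval.py`, and `calc_metrics.py` may compute different metrics.

```python
import csv
from collections import defaultdict

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation: the fraction of
    positive/negative pairs in which the positive gets the higher score
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_abnormality_auroc(csv_file):
    """Group rows by abnormality and score each group.
    Columns `abnormality`, `label`, `score` are hypothetical."""
    groups = defaultdict(lambda: ([], []))
    with open(csv_file, newline="") as f:
        for row in csv.DictReader(f):
            labels, scores = groups[row["abnormality"]]
            labels.append(int(row["label"]))
            scores.append(float(row["score"]))
    return {name: auroc(l, s) for name, (l, s) in groups.items()}
```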
If you find this repository useful, please cite:
```bibtex
@inproceedings{fvlm_iclr25,
  title={Large-scale and fine-grained vision-language pre-training for enhanced CT image understanding},
  author={Shui, Zhongyi and Zhang, Jianpeng and Cao, Weiwei and Wang, Sinuo and Guo, Ruizhe and Lu, Le and Yang, Lin and Ye, Xianghua and Liang, Tingbo and Zhang, Qi and Zhang, Ling},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```