Paper link: https://doi.org/10.1038/s41467-025-66071-6.
Tumors are a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, progress is heavily hampered by the scarcity of annotated datasets, since annotation demands extensive effort from radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework that enables large-scale tumor synthesis to mitigate data scarcity. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor can synthesize large numbers of realistic tumors on images to augment training datasets. To validate the fidelity of the synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. This rigorous clinician evaluation validated the high quality of our synthetic tumors: the radiologists achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor shows notable superiority over state-of-the-art AI methods, including various synthesis methods and foundation models. These findings indicate promising prospects for FreeTumor in clinical applications, potentially advancing tumor treatment and improving patient survival rates.
We have released the solutions for leaderboard tasks at link.
Our models can be downloaded from link. The datasets are available at link, and processed labels can be found at link. NOTE THAT we are not the authors of these datasets. Although all these datasets are publicly available for academic research, you need to cite the original works as shown in our paper.
The labels of the training datasets can be found on Hugging Face; they contain the organ labels used for tumor synthesis. For abdominal organs, most labels are obtained from the original sources, while some are generated by the CT foundation model VoCo. The lung labels are generated by lungmask. Note that the organ labels are not perfectly accurate, since we only use them for tumor position simulation. You can download the images of the datasets from their original sources or from our previous work Large-Scale-Medical. Specifically, the organ labels are defined as follows (a quick sanity-check sketch is given after the list):
# abdomen
0: background
1: liver
2: liver tumors
3: pancreas
4: pancreas tumors
5: kidney
6: kidney tumors
# chest
0: background
1: lung
2: lesion
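If you want to sanity-check a downloaded label map against this scheme, a minimal sketch could look like the following (assuming NIfTI labels and the nibabel package; the file path is a placeholder):

```python
# Count voxels per label in one abdomen label map and flag any value
# outside the scheme above. The file path is a placeholder.
import nibabel as nib
import numpy as np

ABDOMEN_LABELS = {
    0: "background", 1: "liver", 2: "liver tumors", 3: "pancreas",
    4: "pancreas tumors", 5: "kidney", 6: "kidney tumors",
}

label_path = "/data/FreeTumor/Dataset003_Liver/labelsTr/liver_0.nii.gz"
data = np.asarray(nib.load(label_path).dataobj).astype(np.int64)

for value, count in zip(*np.unique(data, return_counts=True)):
    name = ABDOMEN_LABELS.get(int(value), "UNEXPECTED")
    print(f"label {value} ({name}): {count} voxels")
```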
The datasets should be organized as follows (a layout-check sketch is given after the tree):
├── /data/FreeTumor
├── Dataset003_Liver
├── imagesTr
└── labelsTr
├── Dataset007_Pancreas
├── Dataset220_KiTS2023
├── Covid19_20
├── BTCV
├── Flare22
├── Flare23
├── Amos2022
├── WORD
├── PANORAMA
├── AbdomenCT-1K
├── CHAOS
├── Dataset082_TCIA_Pancreas-CT
├── Dataset009_Spleen
├── Dataset010_Colon
├── Dataset224_AbdomenAtlas1.0
├── MELA
├── 3Dircadb1_convert
├── TCIAcovid19
├── stoic21
├── LIDC
└── ...
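As a convenience, here is a minimal sketch to verify this layout. It assumes every dataset folder contains 'imagesTr' and 'labelsTr' subfolders of NIfTI files, which may not hold for all of the datasets listed above, so treat it as a layout check only:

```python
# Report image/label counts per dataset folder under the FreeTumor root.
from pathlib import Path

ROOT = Path("/data/FreeTumor")

for dataset in sorted(p for p in ROOT.iterdir() if p.is_dir()):
    images, labels = dataset / "imagesTr", dataset / "labelsTr"
    n_img = len(list(images.glob("*.nii.gz"))) if images.is_dir() else 0
    n_lbl = len(list(labels.glob("*.nii.gz"))) if labels.is_dir() else 0
    print(f"{dataset.name}: {n_img} images, {n_lbl} labels")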
First, you need to train a baseline segmentation model as the discriminator for synthesis training (or you can download ours). The baseline segmentation model should be placed in './baseline/' (a quick checkpoint inspection is sketched after the tree):
├── baseline
├── model_baseline_segmentor.pt ### for abdomen, 7 output channels
└── model_covid_voco160k.pt
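To quickly inspect a downloaded checkpoint, a minimal sketch follows; it assumes a standard PyTorch checkpoint, and the key layout inside the file is not guaranteed:

```python
# Load the baseline checkpoint on CPU and print a few parameter shapes.
import torch

ckpt = torch.load("./baseline/model_baseline_segmentor.pt", map_location="cpu")
# Some checkpoints nest weights under 'state_dict'; fall back to the dict itself.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```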
Synthesis training is conducted on 8×H800 GPUs, while segmentation training can be done on a single 3090 GPU. Simple commands for training:
# Synthesis training
sh Syn_train.sh
# Segmentation training
sh Free_train.sh
Note that we currently provide code to train a generalist model, which can synthesize liver tumors, pancreas tumors, and kidney tumors (output through different channels). If you want to train specialist models for specific tumor types (e.g., one model for liver tumors and another for pancreas tumors), you need to check the code here and modify the labels as follows (a remapping sketch is given after the list):
0: background
1: organ
2: tumor/lesion
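For example, a minimal remapping sketch for a liver specialist, given the generalist labels above. Folding all non-target organs and tumors into the background is our assumption, so please check the repository code for the exact convention:

```python
# Remap the 7-class generalist scheme to the 3-class specialist scheme
# for a liver-tumor specialist: liver -> 1, liver tumors -> 2, rest -> 0.
import nibabel as nib
import numpy as np

lut = np.zeros(7, dtype=np.uint8)   # default: every class -> background (assumption)
lut[1], lut[2] = 1, 2               # keep liver and liver tumors

src = nib.load("labelsTr/case_0.nii.gz")          # placeholder path
data = np.asarray(src.dataobj).astype(np.int64)
nib.save(nib.Nifti1Image(lut[data], src.affine), "labelsTr_liver/case_0.nii.gz")
```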
For synthesis training, you can modify the number of GPUs in the 'Syn_train.sh' script. The number of training epochs, batch size, and other parameters can be modified in 'Syn_train.py'.
After synthesis training, we use the generative model to synthesize tumors during segmentation training. The parameters for segmentation training are as follows (a hypothetical sketch of the interface is given after the list):
- data: 'lits', 'panc', or 'kits', for training separate segmentation models for different tumor types.
- task: 'onlylabeled' or 'freesyn'. 'onlylabeled' is the baseline, training with only real tumors; 'freesyn' additionally trains with synthetic tumors.
- use_ssl_pretrained: whether to use pre-trained models. Optional; we mainly want to advertise our work VoCo. You need to download our pretrained model and place it at './pretrained/VoCo_B_SSL_head.pt'.
- baseline_seg_dir: the path to the baseline segmentation model, which serves for tumor quality control during segmentation training.
- TGAN_checkpoint: the path to the generative model.
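A hypothetical argparse sketch mirroring these options. The option names follow this README, but the actual entry point in the repository may differ, so treat this as documentation rather than the real interface:

```python
# Sketch of the documented segmentation-training options; not the real parser.
import argparse

parser = argparse.ArgumentParser(description="FreeTumor segmentation training (sketch)")
parser.add_argument("--data", choices=["lits", "panc", "kits"], default="lits",
                    help="which tumor type the segmentation model is trained for")
parser.add_argument("--task", choices=["onlylabeled", "freesyn"], default="freesyn",
                    help="baseline with only real tumors, or training with synthesis")
parser.add_argument("--use_ssl_pretrained", action="store_true",
                    help="initialize from './pretrained/VoCo_B_SSL_head.pt'")
parser.add_argument("--baseline_seg_dir", type=str, default="./baseline/",
                    help="baseline segmentation model used for tumor quality control")
parser.add_argument("--TGAN_checkpoint", type=str,
                    help="path to the trained generative model")
print(parser.parse_args())
```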
We provide an initial baseline segmentation model and a generative model (trained on the 1.6K abdomen scans listed below). You can download them here.
├── /data/FreeTumor
├── Dataset003_Liver
├── Dataset007_Pancreas
├── Dataset220_KiTS2023
├── BTCV
├── Flare22
├── Amos2022
├── WORD
├── CHAOS
├── Dataset082_TCIA_Pancreas-CT
├── Dataset009_Spleen
├── Dataset010_Colon
└── ...
Synthesis visualization: the code for saving offline datasets for visualization can be found under '/Syn_data'. You need to modify the data path and the path for saving your results. In addition, you need to make sure the organ labels follow this scheme (a verification sketch is given after the list):
# abdomen
0: background
1: liver
3: pancreas
5: kidney
# chest
0: background
1: lung
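A minimal verification sketch for the abdomen case, assuming NIfTI labels; how any leftover tumor labels should be handled is not specified here, so the script only reports them:

```python
# Check that a label volume contains only the expected organ labels.
import nibabel as nib
import numpy as np

EXPECTED = {0, 1, 3, 5}  # background, liver, pancreas, kidney

data = np.asarray(nib.load("labelsTr/case_0.nii.gz").dataobj).astype(np.int64)
unexpected = set(np.unique(data).tolist()) - EXPECTED
print("unexpected labels:", sorted(unexpected) if unexpected else "none")
```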
For validation and testing of segmentation, please check our previous work Large-Scale-Medical.
This work is highly inspired by a series of pioneering works led by Prof. Zongwei Zhou. We highly appreciate their great efforts.
If you find our code or datasets useful, please consider leaving a star and citing our paper as follows; we would be highly grateful (^o^)/.
In addition, some previous papers that contributed to our work are listed for reference.
@article{wu2025freetumor,
title={Large-scale generative tumor synthesis in computed tomography images for improving tumor recognition},
author={Wu, Linshan and Zhuang, Jiaxin and Zhou, Yanning and He, Sunan and Ma, Jiabo and Luo, Luyang and Wang, Xi and Ni, Xuefeng and Zhong, Xiaoling and Wu, Mingxiang and others},
journal={Nature Communications},
volume={16},
number={1},
pages={11053},
year={2025},
publisher={Nature Publishing Group UK London}
}
@article{voco,
title={Large-Scale 3D Medical Image Pre-training with Geometric Context Priors},
author={Wu, Linshan and Zhuang, Jiaxin and Chen, Hao},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
@inproceedings{hu2023label,
title={Label-free liver tumor segmentation},
author={Hu, Qixin and Chen, Yixiong and Xiao, Junfei and Sun, Shuwen and Chen, Jieneng and Yuille, Alan L and Zhou, Zongwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={7422--7432},
year={2023}
}
@inproceedings{chen2024towards,
title={Towards generalizable tumor synthesis},
author={Chen, Qi and Chen, Xiaoxi and Song, Haorui and Xiong, Zhiwei and Yuille, Alan and Wei, Chen and Zhou, Zongwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11147--11158},
year={2024}
}