
Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Official code for the paper Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation.

About this repo:

This repository contains the official implementation of MeDiM and is still under development.

🏥 Introduction

We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across different medical modalities without requiring modality-specific components. MeDiM unifies multiple generative tasks: it flexibly translates between images and text or jointly produces image–report pairs across domains in response to user prompts. It builds on a discrete diffusion framework that unifies vision and language representations by modeling their shared probabilistic distribution. To empower the diffusion process to support unified and versatile medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its rich prior knowledge and cross-modal reasoning abilities. Because MLLMs are trained with causal (autoregressive) masking while diffusion denoising benefits from bidirectional context, MeDiM introduces two key designs: 1) removing the causal attention mask to enable a fully bidirectional information flow essential for mutual alignment, and 2) injecting continuous timestep embeddings to make the MLLM aware of the diffusion steps. Extensive experiments validate MeDiM as a unified foundation model capable of high-fidelity medical generation across various modalities, including medical image generation (16.60 FID on MIMIC-CXR; 24.19 FID on PathGen) and report generation (0.2650 METEOR on MIMIC-CXR; 0.2580 METEOR on PathGen). In addition, the jointly generated medical image–report pairs improve downstream task performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, and +4.80% METEOR on PathGen), enabling the use of multimodal inputs and the production of coherent, clinically grounded outputs.


🔥 News/TODO

🧑‍⚕️ Framework

Overview of the MeDiM architecture. The framework integrates an MLLM backbone within a discrete diffusion process for unified medical multimodal generation. During the forward process, data is tokenized and diffused over timesteps; the MLLM is then trained to reverse this process. Key adaptations, namely removing the causal attention mask, injecting timestep embeddings, and applying AdaLN, tailor the autoregressive MLLM to the bidirectional denoising required for unified medical generation.
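For intuition, the minimal PyTorch sketch below illustrates these two adaptations on a single transformer block: attention runs without a causal mask, and a continuous timestep embedding modulates the block through AdaLN (scale/shift). It is illustrative only and does not reproduce the actual Liquid/MeDiM backbone; all class names and dimensions here are made up.

```python
# Minimal sketch (not the repository's code): bidirectional attention + AdaLN timestep conditioning.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of (continuous) diffusion timesteps, shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class BidirectionalAdaLNBlock(nn.Module):
    """One transformer block with (1) no causal mask and (2) AdaLN timestep conditioning."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # AdaLN: the timestep embedding predicts a per-feature scale and shift.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.ada(t_emb).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale) + shift          # timestep-aware normalization
        h, _ = self.attn(h, h, h, need_weights=False)   # no attn_mask -> fully bidirectional
        return x + h


# Toy usage: 2 sequences of 16 token embeddings at diffusion steps 3 and 7.
block = BidirectionalAdaLNBlock(dim=64, heads=4)
x = torch.randn(2, 16, 64)
out = block(x, timestep_embedding(torch.tensor([3.0, 7.0]), 64))
print(out.shape)  # torch.Size([2, 16, 64])
```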


Getting Started

Step 1. Install the dependencies:

```bash
# create a new anaconda env
conda create -n MedDiM python=3.10
conda activate MedDiM

# install packages
pip install -r requirements.txt
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
```
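After installation, an optional check such as the following confirms that the pinned CUDA build of PyTorch was picked up:

```python
# Optional sanity check: verify the pinned PyTorch/TorchVision versions and CUDA availability.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.7.0 and 0.22.0 (cu128 build)
print("CUDA available:", torch.cuda.is_available())
```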

Step 2. Prepare the pretrained checkpoints:

```bash
# download VQVAE config and weights
cd ./models
wget -P chameleon/ https://huggingface.co/spaces/Junfeng5/Liquid_demo/resolve/main/chameleon/vqgan.ckpt
wget -P chameleon/ https://huggingface.co/spaces/Junfeng5/Liquid_demo/resolve/main/chameleon/vqgan.yaml
cd ..

# download Liquid 7B
huggingface-cli login
huggingface-cli download Junfeng5/Liquid_V1_7B --local-dir ./models/Liquid_V1_7B --local-dir-use-symlinks False

# download Llama-2-7b-hf
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir ./models/Llama-2-7b-hf --local-dir-use-symlinks False
```

Step 3. Set num_hidden_layers to 10 in ./models/Liquid_V1_7B/config.json.
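The edit can also be scripted, for example (a small helper, assuming the standard Hugging Face config.json layout):

```python
# Set num_hidden_layers to 10 in the downloaded Liquid config (Step 3).
import json

path = "./models/Liquid_V1_7B/config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["num_hidden_layers"] = 10
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
print("num_hidden_layers ->", cfg["num_hidden_layers"])
```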

Step 4. Prepare the datasets:

For MIMIC-CXR, you can download the dataset from here with your PhysioNet license.

For PathGen, you can download the dataset from here; you then need to follow our PathGen setting to split the dataset.

Make sure MIMIC-CXR and PathGen are organized as follows (a small pairing check is sketched after the layout):

```
./dataset/pathgen/train/
    TCGA-05-4244-01Z-00-DX1.d4ff32cd-38cf-40ea-8213-45c2b100ac01_10336_14272.png
    TCGA-05-4244-01Z-00-DX1.d4ff32cd-38cf-40ea-8213-45c2b100ac01_10336_14272.txt
    ...
./dataset/pathgen/test/
    TCGA-05-4244-01Z-00-DX1.d4ff32cd-38cf-40ea-8213-45c2b100ac01_21088_18976.png
    TCGA-05-4244-01Z-00-DX1.d4ff32cd-38cf-40ea-8213-45c2b100ac01_21088_18976.txt
    ...
./dataset/mimic-cxr/data/
    p10_p10000032_s50414267_02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg
    p10_p10000032_s50414267_02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.txt
    ...
./dataset/mimic-cxr/test/
    p10_p10032725_s50331901_687754ce-7420bfd3-0a19911f-a27a3916-9019cd53.jpg
    p10_p10032725_s50331901_687754ce-7420bfd3-0a19911f-a27a3916-9019cd53.txt
    ...
```
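The optional check below (a minimal sketch, assuming the layout above) reports images that are missing their paired .txt report:

```python
# Optional: verify that every image has a matching .txt report with the same stem.
from pathlib import Path

splits = ["./dataset/pathgen/train", "./dataset/pathgen/test",
          "./dataset/mimic-cxr/data", "./dataset/mimic-cxr/test"]
for split in splits:
    images = [p for p in Path(split).iterdir() if p.suffix in {".png", ".jpg"}]
    missing = [p.name for p in images if not p.with_suffix(".txt").exists()]
    print(f"{split}: {len(images)} images, {len(missing)} missing reports")
```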

Step 5. Launch MedUnidisc training:

```bash
# training
accelerate launch --num_processes 8 --multi_gpu --main_process_port=$RANDOM main.py +experiments='[large_scale_train]' debug=true loader.batch_size=1 data_path_dir_train=./dataset/pathgen/train data_path_dir_val=./dataset/pathgen/test data_mimic_dir_train=./dataset/mimic-cxr/data data_mimic_dir_val=./dataset/mimic-cxr/test model.vqgan_config=./models/chameleon/vqgan.yaml model.vqgan_ckpt=./models/chameleon/vqgan.ckpt model.llama_ckpt=./models/Llama-2-7b-hf model.liquid_ckpt=./models/Liquid_V1_7B
```

Step 6. Find the latest checkpoint path:

```bash
# find ckpt path
python find_latest_ckpt.py ./medunidisc/outputs/outputs/debug
```
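find_latest_ckpt.py ships with the repository; conceptually it only needs to return the most recently written checkpoint under the output directory. A minimal sketch of that idea (the *.ckpt naming and directory layout are assumptions here):

```python
# Illustrative sketch only; use the repository's find_latest_ckpt.py in practice.
import sys
from pathlib import Path

root = Path(sys.argv[1] if len(sys.argv) > 1 else "./medunidisc/outputs/outputs/debug")
ckpts = sorted(root.rglob("*.ckpt"), key=lambda p: p.stat().st_mtime)  # assumed *.ckpt naming
print(ckpts[-1] if ckpts else "no checkpoint found")
```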

Step 7. Resume MedUnidisc training:

```bash
# resume
accelerate launch --num_processes 8 --multi_gpu --main_process_port=$RANDOM main.py +experiments='[large_scale_train]' debug=true loader.batch_size=1 data_path_dir_train=./dataset/pathgen/train data_path_dir_val=./dataset/pathgen/test data_mimic_dir_train=./dataset/mimic-cxr/data data_mimic_dir_val=./dataset/mimic-cxr/test model.vqgan_config=./models/chameleon/vqgan.yaml model.vqgan_ckpt=./models/chameleon/vqgan.ckpt model.llama_ckpt=./models/Llama-2-7b-hf model.liquid_ckpt=./models/Liquid_V1_7B
```

🙏 Acknowledgement

We deeply appreciate these wonderful open-source projects: unidisc, pathgen-1.6m, mimic-cxr.

🩺 Citation

If you find this repository useful, please consider giving a star ⭐ and citation 💓:

```bibtex
@misc{mao2025discretediffusionmodelsmllms,
      title={Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation},
      author={Jiawei Mao and Yuhan Wang and Lifeng Chen and Can Zhao and Yucheng Tang and Dong Yang and Liangqiong Qu and Daguang Xu and Yuyin Zhou},
      year={2025},
      eprint={2510.06131},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.06131},
}
```
