- 2025-09-25: CC-Diff++ has been accepted by TGRS (DOI: 10.1109/TGRS.2025.3616376).
- 2024-12-03: Publish initial code.
- 2024-12-12: The arXiv preprint is available (arXiv: 2412.08464).
- CC-Diff++ can generate RS images with enhanced context coherence.
- An overview of CC-Diff++'s pipeline.
- Detailed structure of Co-Resampler and Conditional Generation Module.
- Gallery
DIOR-RSVG
DOTA
HRSC
| Method | DIOR-RSVG CLIP-L ↑ | DIOR-RSVG CLIP-G ↑ | DIOR-RSVG FID ↓ | DIOR-RSVG mAP50 | DIOR-RSVG mAP50:95 | DOTA CLIP-L ↑ | DOTA CLIP-G ↑ | DOTA FID ↓ | DOTA mAP50 | DOTA mAP50:95 | HRSC CLIP-L ↑ | HRSC CLIP-G ↑ | HRSC FID ↓ | HRSC mAP50 | HRSC mAP50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Txt2Img-MHN | 18.91 | 23.46 | 123.84 | 0.43 | 0.16 | 19.58 | 25.99 | 137.76 | 0.07 | 0.02 | 17.22 | 21.82 | 91.27 | 0.10 | 0.02 |
| DiffusionSat | 19.84 | 32.68 | 78.16 | 2.87 | 0.84 | 19.78 | 31.61 | 65.19 | 0.22 | 0.05 | 17.36 | 30.09 | 83.53 | 0.13 | 0.01 |
| GLIGEN | 20.55 | 32.22 | 73.02 | 7.90† | 2.74† | 20.72 | 29.98 | 61.05 | 0.95† | 0.28† | 19.48 | 30.51 | 68.12 | 1.89† | 0.56† |
| AeroGen | 20.28 | 30.75 | 74.90 | 43.16 | 27.97 | 21.55 | 26.13 | 55.02 | 41.88 | 26.37 | 18.96 | 28.24 | 62.40 | 9.56 | 4.37 |
| LayoutDiffusion | 19.31 | 30.65 | 79.03 | 41.22 | 24.23 | 20.49 | 27.67 | 64.77 | 42.70 | 22.66 | 19.48 | 29.39 | 73.95 | 11.44 | 3.84 |
| MIGC | 21.59 | 32.36 | 79.93 | 46.92 | 30.10 | 22.21 | 30.96 | 63.95 | 57.06 | 33.66 | 21.69 | 30.91 | 66.13 | 27.05 | 16.93 |
| CC-Diff | 21.82 | 32.36 | 70.68 | 56.59 | 39.97 | 22.60 | 30.92 | 47.72 | 57.72 | 34.04 | 21.27 | 30.48 | 64.96 | 28.91 | 17.31 |
| CC-Diff++ | 21.94 | 32.52 | 65.86 | 56.94 | 39.99 | 22.66 | 30.93 | 47.02 | 58.59 | 35.66 | 23.21 | 30.96 | 61.52 | 30.48 | 17.37 |
| Method | DIOR-RSVG mAP | DIOR-RSVG mAP50 | DIOR-RSVG mAP75 | DOTA mAP | DOTA mAP50 | DOTA mAP75 | HRSC mAP | HRSC mAP50 | HRSC mAP75 |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 50.17 | 75.84 | 54.38 | 35.53 | 62.10 | 35.83 | 34.88 | 51.45 | 40.90 |
| Txt2Img-MHN | 50.12 | 75.87 | 54.74 | 35.91 | 62.53 | 36.43 | 33.45 | 49.65 | 38.67 |
| DiffusionSat | 49.95 | 75.59 | 55.26 | 36.15 | 62.50 | 36.76 | 29.67 | 45.62 | 33.51 |
| AeroGen | 51.39 | 76.85 | 56.75 | 36.65 | 63.15 | 36.96 | 44.43 | 60.38 | 50.60 |
| LayoutDiffusion | 51.96 | 77.31 | 56.82 | 35.15 | 61.54 | 35.18 | 43.74 | 60.27 | 51.34 |
| MIGC | 51.87 | 76.65 | 57.20 | 35.93 | 62.36 | 36.32 | 42.18 | 60.01 | 49.44 |
| CC-Diff | 52.18 | 77.39 | 57.59 | 37.36 | 63.18 | 38.55 | 42.15 | 61.15 | 50.74 |
| CC-Diff++ | 52.62 | 77.51 | 58.09 | 37.57 | 63.18 | 38.05 | 46.69 | 66.58 | 55.73 |
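As a reading aid for the detector table above, the following minimal sketch computes the mAP gain of CC-Diff++ over the Baseline row for each dataset. The numbers are copied from the table; treating the Baseline row as the reference is an assumption, and the helper name `map_gain` is hypothetical.

```python
# mAP values copied from the Baseline and CC-Diff++ rows of the table above.
baseline = {"DIOR-RSVG": 50.17, "DOTA": 35.53, "HRSC": 34.88}
cc_diff_pp = {"DIOR-RSVG": 52.62, "DOTA": 37.57, "HRSC": 46.69}


def map_gain(method, reference):
    """Per-dataset difference in mAP (method minus reference), rounded to 2 decimals."""
    return {name: round(method[name] - reference[name], 2) for name in reference}


# Assumption: the Baseline row is the reference the other rows are compared against.
gains = map_gain(cc_diff_pp, baseline)
print(gains)
```

The same helper can be applied to any other pair of rows, for example CC-Diff++ against CC-Diff.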
To get started, first clone the CC-Diff repository:
git clone https://github.com/AZZMM/CC-Diff.git
cd CC-Diff
Environment setup:
conda create -n CC-Diff python=3.9 -y
conda activate CC-Diff
pip install -r requirement.txt
2.1 Dataset processing
An example of the dataset directory layout:
DIOR
├── train
│ ├── 00003.jpg
│ ├── ...
│ ├── metadata.jsonl
├── val
│ ├── 00011.jpg
│ ├── ...
│ ├── metadata.jsonl
├── results
│ ├── ...
├── dior_emb.pt
Dataset processing scripts are in data_tools; see data_process.md for details.
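Before training or inference, it can help to confirm that a prepared dataset matches the layout shown above. The sketch below is not part of the repository: the function name `check_dataset_layout` is hypothetical, and it only checks what the example tree shows (train and val folders with .jpg images and a metadata.jsonl, plus dior_emb.pt). It makes no assumption about the fields inside metadata.jsonl beyond each line being valid JSON.

```python
import json
from pathlib import Path


def check_dataset_layout(root):
    """Return a list of problems found under `root`; an empty list means none were found."""
    root = Path(root)
    problems = []
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}")
            continue
        if not any(split_dir.glob("*.jpg")):
            problems.append(f"no .jpg images in {split}")
        meta = split_dir / "metadata.jsonl"
        if not meta.is_file():
            problems.append(f"missing file: {split}/metadata.jsonl")
            continue
        # Every non-empty line of a .jsonl file should parse as one JSON value.
        with meta.open(encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                if not line.strip():
                    continue
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    problems.append(f"{split}/metadata.jsonl line {lineno}: invalid JSON")
    # The embedding file sits at the dataset root in the example tree.
    if not (root / "dior_emb.pt").is_file():
        problems.append("missing file: dior_emb.pt")
    return problems
```

Calling `check_dataset_layout("DIOR")` from the directory that contains the dataset returns an empty list when nothing is missing.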
2.2 Checkpoint
Download the DIOR_checkpoint or DOTA_checkpoint.
Controllable RS image generation:
python infer_dior.py
Train CC-Diff model:
./dist_train.sh
We include the implementation of the enhanced version, CC-Diff++, in the ccdiff_pp folder.
Training:
To train the CC-Diff++ model, replace the script train_dior.py in ./dist_train.sh with train_dior_pp.py.
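The swap can also be scripted. The sketch below is a hypothetical helper, not part of the repository: it assumes dist_train.sh mentions train_dior.py by that exact name and does a plain text substitution. Whether a path prefix is also needed depends on where train_dior_pp.py sits relative to the launcher, which this sketch does not try to guess.

```python
from pathlib import Path


def use_pp_training_script(launcher_text):
    """Return the launcher text with train_dior.py swapped for train_dior_pp.py."""
    if "train_dior.py" not in launcher_text:
        raise ValueError("train_dior.py not found in launcher script")
    return launcher_text.replace("train_dior.py", "train_dior_pp.py")


def patch_launcher(path="dist_train.sh"):
    """Rewrite the launcher file in place (assumed to be a UTF-8 text file)."""
    launcher = Path(path)
    patched = use_pp_training_script(launcher.read_text(encoding="utf-8"))
    launcher.write_text(patched, encoding="utf-8")
```

Running `patch_launcher()` from the repository root edits the file once; a second run raises `ValueError`, since the original script name is no longer present.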
Inference:
accelerate launch \
--main_process_port=29500 \
--num_processes=8 \
infer_dior_pp.py
We publish the synthetic images used in the trainability experiments for each dataset to enable exact reproduction of our results.
DIOR-RSVG: link, Password: ccdp
DOTA: link, Password: ccdp
HRSC: link, Password: ccdp
Our work is based on Stable Diffusion, diffusers, and CLIP. We appreciate their outstanding contributions.
@ARTICLE{11187367,
author={Zhang, Mu and Liu, Yunfan and Liu, Yue and Zhao, Yuzhong and Ye, Qixiang},
journal={IEEE Transactions on Geoscience and Remote Sensing},
title={CC-Diff++: Spatially Controllable Text-to-Image Synthesis for Remote Sensing With Enhanced Contextual Coherence},
year={2025},
volume={63},
number={},
pages={1-16},
keywords={Semantics;Image synthesis;Coherence;Layout;Diffusion models;Remote sensing;Feature extraction;Visualization;Roads;Training;Diffusion model;generative model;image synthesis;layout-to-image (L2I) generation;text-to-image (T2I) generation},
doi={10.1109/TGRS.2025.3616376}}






