This repository provides the official implementation of the paper
“Enhancing Scene Graph Generation via Semantic-Aligned Masked Vision-and-Language Pre-training”,
submitted to The Visual Computer (2025).
The code is developed and tested with the following environment:
- Python ≥ 3.8
- PyTorch ≥ 1.10
- torchvision ≥ 0.11
- CUDA ≥ 11.3
- Transformers ≥ 4.30
To install the basic dependencies:

```bash
pip install torch torchvision transformers opencv-python numpy tqdm pillow
```
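After installing, you can confirm the environment matches the requirements above with a quick check (a minimal sketch, not part of the repository):

```python
import torch
import torchvision
import transformers

# Versions the training scripts depend on (see the requirements above).
print(f"PyTorch:      {torch.__version__}")
print(f"torchvision:  {torchvision.__version__}")
print(f"Transformers: {transformers.__version__}")

# CUDA and at least one GPU are needed for the distributed pre-training stage.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version:   {torch.version.cuda}")
print(f"GPU count:      {torch.cuda.device_count()}")
```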
The Conceptual Captions 3M (CC3M) dataset is used for caption-based pre-training. Download it from the official page:
🔗 [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/)
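CC3M is distributed as TSV files of (caption, image URL) pairs, so the images themselves must be fetched separately. A minimal downloader sketch, assuming the training TSV is named `Train_GCC-training.tsv` and images go under `data/cc3m/images` (both are assumptions; the repository may expect a different layout or ship its own script):

```python
import csv
import os
import urllib.request

TSV_PATH = "Train_GCC-training.tsv"  # assumed name of the CC3M training TSV
OUT_DIR = "data/cc3m/images"         # hypothetical image directory

os.makedirs(OUT_DIR, exist_ok=True)

with open(TSV_PATH, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for idx, (caption, url) in enumerate(reader):
        out_path = os.path.join(OUT_DIR, f"{idx:08d}.jpg")
        if os.path.exists(out_path):
            continue  # already fetched on a previous run
        try:
            urllib.request.urlretrieve(url, out_path)
        except Exception:
            pass  # many CC3M URLs are no longer live; skip failures
```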
The Visual Genome dataset is used for fine-tuning and evaluation on the SGG task.
Download: [Visual Genome Dataset](https://visualgenome.org/)
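Once downloaded, a quick sanity check that the annotations load (a sketch; the directory layout `data/visual_genome` is an assumption, and `relationships.json` is the standard VG annotation file):

```python
import json

VG_DIR = "data/visual_genome"  # assumed location of the VG annotations

# relationships.json lists, per image, the annotated
# (subject, predicate, object) triplets.
with open(f"{VG_DIR}/relationships.json", encoding="utf-8") as f:
    relationships = json.load(f)

print(f"Images with relationship annotations: {len(relationships)}")
print("Example predicate:", relationships[0]["relationships"][0]["predicate"])
```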
Run the Vision-and-Language pre-training stage using CC3M:
```bash
python -m torch.distributed.run \
--nproc_per_node=4 \
train.py \
--cfg-path lavis/projects/sggp/train/pretrain.yaml
```
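Fine-tuning below expects the resulting weights at `checkpoints/pretrain/model_best.pth`. A quick way to inspect a saved checkpoint before fine-tuning (a sketch; whether the file is a bare state dict or wraps one under a `model` key is an assumption):

```python
import torch

CKPT_PATH = "checkpoints/pretrain/model_best.pth"

# Load on CPU so this also works on a machine without a GPU.
ckpt = torch.load(CKPT_PATH, map_location="cpu")

# Handle both a bare state dict and a {"model": state_dict, ...} wrapper
# (which layout this repository uses is an assumption).
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"Parameter tensors: {len(state_dict)}")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```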
After pre-training, fine-tune the model on Visual Genome:

```bash
python train_finetune.py \
--dataset vg \
--batch_size 32 \
--epochs 10 \
--lr 5e-5 \
--pretrained checkpoints/pretrain/model_best.pth \
--output_dir checkpoints/finetune/
```
Evaluate the fine-tuned model (e.g., for PredCls):

```bash
python eval_sgg.py \
--dataset vg \
--split test \
--model checkpoints/finetune/model_best.pth
```
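PredCls is typically reported as Recall@K: the fraction of ground-truth (subject, predicate, object) triplets that appear among the model's top-K predictions for each image. A simplified illustration of the metric, ignoring bounding-box matching (not the repository's evaluation code; the labels are made up):

```python
def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth triplets found in the top-k predictions.

    gt_triplets:   set of (subject, predicate, object) tuples for one image
    pred_triplets: triplets sorted by predicted confidence, best first
    """
    top_k = set(pred_triplets[:k])
    hits = sum(1 for triplet in gt_triplets if triplet in top_k)
    return hits / len(gt_triplets) if gt_triplets else 0.0

# Toy example with hypothetical labels.
gt = {("man", "riding", "horse"), ("man", "wearing", "hat")}
preds = [("man", "riding", "horse"),
         ("horse", "on", "grass"),
         ("man", "wearing", "hat")]
print(recall_at_k(gt, preds, k=2))  # 0.5 -- only one GT triplet in the top 2
```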