This repository provides the official implementation of the paper
“Enhancing Scene Graph Generation via Semantic-Aligned Masked Vision-and-Language Pre-training”,
submitted to The Visual Computer (2025).
The code is developed and tested with the following environment:
- Python ≥ 3.8
- PyTorch ≥ 1.10
- torchvision ≥ 0.11
- CUDA ≥ 11.3
- Transformers ≥ 4.30
To install the basic dependencies:

```bash
pip install torch torchvision transformers opencv-python numpy tqdm pillow
```
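After installing, you can confirm the environment matches the requirements above with a quick check (a minimal sketch, not part of the repository):

```python
import torch
import torchvision
import transformers

# Versions the training scripts depend on (see the requirements above).
print(f"PyTorch:      {torch.__version__}")
print(f"torchvision:  {torchvision.__version__}")
print(f"Transformers: {transformers.__version__}")

# CUDA and at least one GPU are needed for the distributed pre-training stage.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version:   {torch.version.cuda}")
print(f"GPU count:      {torch.cuda.device_count()}")
```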
The Conceptual Captions 3M (CC3M) dataset is used for caption-based pre-training. Download it from the official page:
🔗 [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/)
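CC3M is distributed as TSV files of (caption, image URL) pairs, so the images themselves must be fetched separately. A minimal downloader sketch, assuming the training TSV is named `Train_GCC-training.tsv` and images go under `data/cc3m/images` (both are assumptions; the repository may expect a different layout or ship its own script):

```python
import csv
import os
import urllib.request

TSV_PATH = "Train_GCC-training.tsv"  # assumed name of the CC3M training TSV
OUT_DIR = "data/cc3m/images"         # hypothetical image directory

os.makedirs(OUT_DIR, exist_ok=True)

with open(TSV_PATH, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for idx, (caption, url) in enumerate(reader):
        out_path = os.path.join(OUT_DIR, f"{idx:08d}.jpg")
        if os.path.exists(out_path):
            continue  # already fetched on a previous run
        try:
            urllib.request.urlretrieve(url, out_path)
        except Exception:
            pass  # many CC3M URLs are no longer live; skip failures
```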
The Visual Genome dataset is used for fine-tuning and evaluation on the SGG task.
Download: [Visual Genome Dataset](https://visualgenome.org/)
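Once downloaded, a quick sanity check that the annotations load (a sketch; the directory layout `data/visual_genome` is an assumption, and `relationships.json` is the standard VG annotation file):

```python
import json

VG_DIR = "data/visual_genome"  # assumed location of the VG annotations

# relationships.json lists, per image, the annotated
# (subject, predicate, object) triplets.
with open(f"{VG_DIR}/relationships.json", encoding="utf-8") as f:
    relationships = json.load(f)

print(f"Images with relationship annotations: {len(relationships)}")
print("Example predicate:", relationships[0]["relationships"][0]["predicate"])
```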
Run the Vision-and-Language pre-training stage using CC3M:
```bash
python -m torch.distributed.run \
--nproc_per_node=4 \
train.py \
--cfg-path lavis/projects/sggp/train/pretrain.yaml
```
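Fine-tuning below expects the resulting weights at `checkpoints/pretrain/model_best.pth`. A quick way to inspect a saved checkpoint before fine-tuning (a sketch; whether the file is a bare state dict or wraps one under a `model` key is an assumption):

```python
import torch

CKPT_PATH = "checkpoints/pretrain/model_best.pth"

# Load on CPU so this also works on a machine without a GPU.
ckpt = torch.load(CKPT_PATH, map_location="cpu")

# Handle both a bare state dict and a {"model": state_dict, ...} wrapper
# (which layout this repository uses is an assumption).
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"Parameter tensors: {len(state_dict)}")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```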
After pre-training, fine-tune the model on Visual Genome:

```bash
python train_finetune.py \
--dataset vg \
--batch_size 32 \
--epochs 10 \
--lr 5e-5 \
--pretrained checkpoints/pretrain/model_best.pth \
--output_dir checkpoints/finetune/
```
Evaluate the fine-tuned model (e.g., for PredCls):

```bash
python eval_sgg.py \
--dataset vg \
--split test \
--model checkpoints/finetune/model_best.pth
```
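PredCls is typically reported as Recall@K: the fraction of ground-truth (subject, predicate, object) triplets that appear among the model's top-K predictions for each image. A simplified illustration of the metric, ignoring bounding-box matching (not the repository's evaluation code; the labels are made up):

```python
def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth triplets found in the top-k predictions.

    gt_triplets:   set of (subject, predicate, object) tuples for one image
    pred_triplets: triplets sorted by predicted confidence, best first
    """
    top_k = set(pred_triplets[:k])
    hits = sum(1 for triplet in gt_triplets if triplet in top_k)
    return hits / len(gt_triplets) if gt_triplets else 0.0

# Toy example with hypothetical labels.
gt = {("man", "riding", "horse"), ("man", "wearing", "hat")}
preds = [("man", "riding", "horse"),
         ("horse", "on", "grass"),
         ("man", "wearing", "hat")]
print(recall_at_k(gt, preds, k=2))  # 0.5 -- only one GT triplet in the top 2
```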