Enhancing Scene Graph Generation via Semantic-Aligned Masked Vision-and-Language Pre-training

This repository provides the official implementation of the paper
“Enhancing Scene Graph Generation via Semantic-Aligned Masked Vision-and-Language Pre-training”,
submitted to The Visual Computer (2025).


🧩 Environment Setup

The code is developed and tested with the following environment:

  • Python ≥ 3.8
  • PyTorch ≥ 1.10
  • torchvision ≥ 0.11
  • CUDA ≥ 11.3
  • Transformers ≥ 4.30

To install the basic dependencies:

pip install torch torchvision transformers opencv-python numpy tqdm pillow
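
Before launching training, it can help to verify that the dependencies above are importable. A minimal sketch (note that the import names for `opencv-python` and `pillow` are `cv2` and `PIL`, not the pip package names):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that is not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names corresponding to the pip install line above.
required = ["torch", "torchvision", "transformers", "cv2", "numpy", "tqdm", "PIL"]
print("Missing packages:", missing_packages(required))
```

If the printed list is non-empty, install the corresponding pip packages before proceeding.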

📦 Dataset Preparation

1. CC3M (Conceptual Captions 3M)

Used for caption-based pre-training.

Download the dataset from the official page:
🔗 Conceptual Captions 3M

2. Visual Genome (VG)

Used for fine-tuning and evaluation on the SGG task.
Download: [Visual Genome Dataset](https://visualgenome.org/)
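
A quick way to verify the Visual Genome download is to check for the annotation files most SGG pipelines expect. The names below follow the widely used `VG-SGG-with-attri` release; the exact layout this repo reads may differ, so treat the list as an assumption:

```python
from pathlib import Path

# Commonly used VG files in SGG pipelines (assumed layout, not repo-verified).
EXPECTED_VG_FILES = [
    "VG_100K",                       # image directory
    "VG-SGG-with-attri.h5",          # scene-graph annotations
    "VG-SGG-dicts-with-attri.json",  # object/predicate label dictionaries
    "image_data.json",               # per-image metadata
]

def missing_vg_files(root):
    """Return the expected files/dirs that are absent under the VG data root."""
    root = Path(root)
    return [name for name in EXPECTED_VG_FILES if not (root / name).exists()]
```

An empty return value means the assumed layout is in place.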

🚀 Pre-training

Run the Vision-and-Language pre-training stage using CC3M:

python -m torch.distributed.run \
  --nproc_per_node=4 \
  train.py \
  --cfg-path lavis/projects/sggp/train/pretrain.yaml
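
The pre-training objective builds on masked language modeling. As background, the standard BERT-style replacement scheme (80% `[MASK]`, 10% random token, 10% unchanged) can be sketched as below; the paper's semantic-aligned variant additionally selects which positions to mask using image-text alignment, which this generic sketch does not implement:

```python
import random

MASK_ID = 103       # [MASK] id in BERT-style vocabularies (assumption)
VOCAB_SIZE = 30522  # BERT-base vocabulary size (assumption)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking. Returns (masked_ids, labels); labels are -100
    at unmasked positions so the loss ignores them."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)  # predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)               # replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # random token
            else:
                masked.append(tok)                   # keep unchanged
        else:
            labels.append(-100)
            masked.append(tok)
    return masked, labels
```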

🔧 Fine-tuning on Scene Graph Generation

After pre-training, fine-tune the model on Visual Genome:

python train_finetune.py \
  --dataset vg \
  --batch_size 32 \
  --epochs 10 \
  --lr 5e-5 \
  --pretrained checkpoints/pretrain/model_best.pth \
  --output_dir checkpoints/finetune/
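
The `--lr 5e-5` setting is typically paired with a warmup-then-decay schedule when fine-tuning transformers. A minimal sketch of linear warmup followed by linear decay (an assumption; the repo's actual scheduler may differ):

```python
def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=500):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)
```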

Evaluate the fine-tuned model (e.g., for PredCls):

python eval_sgg.py \
  --dataset vg \
  --split test \
  --model checkpoints/finetune/model_best.pth
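
PredCls (and the other SGG protocols) are conventionally scored with triplet Recall@K. A minimal reference implementation, assuming predicted (subject, predicate, object) triplets are already sorted by descending confidence:

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not gt_triplets:
        return 0.0
    topk = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in topk)
    return hits / len(gt_triplets)
```

For example, with two ground-truth triplets of which only one appears in the top-1 prediction, R@1 is 0.5.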
