Because asking an AI "what color is this bus?" shouldn't require a ₹5,000/month API bill.
Visual Question Answering (VQA) — answering natural language questions about images — is typically locked behind large closed models. This project builds a VQA system from scratch by fusing a text encoder, an image encoder, and a generative decoder into a single custom multimodal architecture, trained end-to-end on MS-COCO.
The goal: design and train a custom `MultiModalModel` that encodes questions with DistilBERT, encodes images with DINO ViT-B/16, projects both into a shared space, and generates answers autoregressively with GPT-2. The model is trained for 3 epochs on VQAv2 and evaluated on validation loss and exact-match accuracy on a held-out test split.
Image (URL or file) + Natural Language Question
↓
Image Encoder — facebook/dino-vitb16 → image embedding
Text Encoder — distilbert-base-uncased → question embedding
↓
Projection Layers — linear layers align both embeddings to shared dimensionality
↓
Decoder — GPT-2 generates answer tokens autoregressively
↓
Answer (text) — decoded and stripped of special tokens
- Custom multimodal architecture — text encoder + image encoder + generative decoder built as a single `nn.Module` from scratch
- DINO ViT image encoder — `facebook/dino-vitb16` for rich visual representations
- Autoregressive answer generation — GPT-2 decoder generates open-ended answers, not just classification
- Data augmentation — random horizontal flip + affine transforms on training images
- Consensus filtering — only training samples where annotators agree above a threshold are used
- WandB logging — full training tracked via Weights & Biases
- URL inference — load any image from a URL and run multi-question VQA in one call
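The consensus filtering above can be sketched as a small helper. The function name and the 0.5 agreement cutoff are illustrative assumptions; VQAv2 attaches ten annotator answers to each question, and samples whose majority answer falls below the threshold are dropped:

```python
from collections import Counter

def consensus_answer(answers, threshold=0.5):
    """Return the majority answer if enough annotators agree, else None.

    `answers` is the list of annotator answers attached to a VQAv2
    question; `threshold` is a hypothetical agreement cutoff.
    """
    counts = Counter(a.lower().strip() for a in answers)
    top, n = counts.most_common(1)[0]
    return top if n / len(answers) >= threshold else None

print(consensus_answer(["red"] * 7 + ["blue"] * 3))   # red
print(consensus_answer(["red"] * 4 + ["blue"] * 3 + ["green"] * 3))  # None
```

Filtering this way trades dataset size for cleaner labels, which matters when the decoder is trained to reproduce a single answer string.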
# 1. Clone the repo
git clone https://github.com/Daddy-Myth/Visual_Q-A_System.git
cd Visual_Q-A_System
# 2. Install dependencies
pip install datasets transformers[torch] wandb torchvision
# 3. Download the dataset
# Run the download cells in 09_constructing_a_vqa_system.ipynb
# or manually place these files in the project root:
# - v2_mscoco_train2014_annotations.json
# - v2_OpenEnded_mscoco_train2014_questions.json
# - train2014/ (image folder)
# 4. Train
jupyter notebook 09_constructing_a_vqa_system.ipynb
# 5. Run inference
jupyter notebook 09_using_our_vqa.ipynb

Visual_Q-A_System/
├── 09_constructing_a_vqa_system.ipynb # Model definition, training, evaluation
├── 09_using_our_vqa.ipynb # Inference on custom images and URLs
├── loss_validation_dataset.png # Validation loss chart (generated)
├── accuracy_test_dataset.png # Test accuracy chart (generated)
├── vqa_custom/ # Saved model checkpoint (generated after training)
└── README.md
| Component | Tool |
|---|---|
| Text Encoder | distilbert-base-uncased |
| Image Encoder | facebook/dino-vitb16 |
| Decoder | gpt2 |
| Framework | PyTorch + Hugging Face transformers |
| Dataset | VQAv2 / MS-COCO 2014 |
| Training | Hugging Face Trainer (custom subclass) |
| Experiment Tracking | Weights & Biases |
| Image Processing | ViTFeatureExtractor, torchvision.transforms |
| Language | Python 3 / CUDA / fp16 |
A custom `nn.Module` with three components wired together:
text_encoder (DistilBERT) → text_projection (Linear) ─┐
├→ concatenated context → decoder (GPT-2) → answer
image_encoder (DINO ViT) → image_projection (Linear) ─┘
- Text encoder: `distilbert-base-uncased` — encodes the question; the CLS token is used as the question representation
- Image encoder: `facebook/dino-vitb16` — encodes the image; the CLS token is used as the visual representation
- Projection layers: two learned linear layers align text and image embeddings to GPT-2's hidden size
- Decoder: `gpt2` — takes the projected embeddings as context and generates the answer token by token
- Freezing: configurable — encoders, decoder, or nothing can be frozen (`freeze='nothing'` used in the final run)
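The projection-and-concatenation fusion step can be sketched without loading any pretrained weights. `FusionSketch` is a hypothetical stand-in that feeds random tensors in place of the encoders' CLS outputs; 768 happens to be the hidden size of DistilBERT, ViT-B/16, and GPT-2 alike, but the learned projections keep the design general:

```python
import torch
import torch.nn as nn

# Stand-in dimensions for the real encoders and decoder.
TEXT_DIM, IMAGE_DIM, DECODER_DIM = 768, 768, 768

class FusionSketch(nn.Module):
    """Minimal sketch of the fusion step in MultiModalModel.

    The real model wraps pretrained encoders and a GPT-2 decoder;
    here random tensors stand in for the encoder CLS outputs.
    """
    def __init__(self):
        super().__init__()
        self.text_projection = nn.Linear(TEXT_DIM, DECODER_DIM)
        self.image_projection = nn.Linear(IMAGE_DIM, DECODER_DIM)

    def forward(self, text_cls, image_cls):
        # Project each modality into the decoder's embedding space,
        # then stack them as a 2-token prefix for GPT-2.
        t = self.text_projection(text_cls)     # (batch, DECODER_DIM)
        i = self.image_projection(image_cls)   # (batch, DECODER_DIM)
        return torch.stack([i, t], dim=1)      # (batch, 2, DECODER_DIM)

fusion = FusionSketch()
prefix = fusion(torch.randn(4, TEXT_DIM), torch.randn(4, IMAGE_DIM))
print(prefix.shape)  # torch.Size([4, 2, 768])
```

GPT-2 then attends to this short prefix while generating answer tokens, so the decoder sees both modalities through ordinary self-attention rather than a dedicated cross-attention module.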
- Annotations filtered by consensus threshold — only samples where annotators agree are kept, reducing label noise
- Train/val/test split: 80% / 10% / 10%
- Custom `data_collator` handles image preprocessing, question tokenization, and answer label masking (pad tokens are excluded from the loss)
- Image augmentation on the training set: random horizontal flip + random affine (degrees=5, shear=5)
TrainingArguments(
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=128,
gradient_accumulation_steps=2, # effective batch size = 64
evaluation_strategy="epoch",
fp16=True,
warmup_ratio=0.1,
learning_rate=2e-5,
lr_scheduler_type="cosine",
save_total_limit=1,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
report_to="wandb",
)| Metric | Before Training | After 3 Epochs |
|---|---|---|
| Validation Loss | 9.359 | 0.639 |
| Test Accuracy (exact match) | 0.0% | 46.9% |
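Exact match is the strictest way to score generated answers: the prediction counts only if it equals the reference string after normalization. A sketch of the metric, with `exact_match_accuracy` as a hypothetical helper (the notebook's normalization may differ in details):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference answer
    after lowercasing and stripping surrounding whitespace."""
    norm = lambda s: s.lower().strip()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

acc = exact_match_accuracy(["Red", "two ", "cat"], ["red", "two", "dog"])
print(round(acc, 3))  # 0.667
```

Because any paraphrase ("crimson" for "red") scores zero, 46.9% exact match understates how often the model's answers are actually acceptable.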
# Load model
trained_model = MultiModalModel(
image_encoder_model="facebook/dino-vitb16",
text_encoder_model="distilbert-base-uncased",
decoder_model="gpt2",
load_from="vqa_custom/pytorch_model.bin"
)
# Run VQA from a URL
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/STOP_sign.jpg/440px-STOP_sign.jpg"
image, answers = trained_model.generate(
url,
["What color is this sign?", "What does the sign say?"],
max_text_length=5,
eos_token_id=decoder_tokenizer.eos_token_id
)
for question, answer in answers.items():
print(f"{question}: {answer}")
# Output:
# What color is this sign?: red
# What does the sign say?: stop

The model is trained on VQAv2 (MS-COCO 2014). Dataset files are not included in the repo. Download links are provided inside 09_constructing_a_vqa_system.ipynb (Dropbox mirrors for the JSON annotation files + image zip).
Files needed:
- `v2_mscoco_train2014_annotations.json`
- `v2_OpenEnded_mscoco_train2014_questions.json`
- `train2014/` image folder
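Questions and annotations live in separate JSON files and must be joined on `question_id`. A sketch of that join, with `load_vqa_pairs` as a hypothetical loader; the field names (`questions`, `annotations`, `multiple_choice_answer`) follow the published VQAv2 schema:

```python
import json

def load_vqa_pairs(questions_path, annotations_path):
    """Join VQAv2 questions to their annotations by question_id,
    keeping the consensus answer for each question."""
    with open(questions_path) as f:
        questions = json.load(f)["questions"]
    with open(annotations_path) as f:
        annotations = json.load(f)["annotations"]
    by_qid = {a["question_id"]: a for a in annotations}
    return [
        {
            "image_id": q["image_id"],
            "question": q["question"],
            "answer": by_qid[q["question_id"]]["multiple_choice_answer"],
        }
        for q in questions
        if q["question_id"] in by_qid
    ]
```

The `image_id` maps back to a file in `train2014/` (COCO names follow the pattern `COCO_train2014_{image_id:012d}.jpg`), which is how the collator pairs each question with its image.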
- Upload trained weights to Hugging Face Hub for one-line inference
- Experiment with larger decoders (GPT-2 Medium / Large) for better answer generation
- Add cross-attention between image and text embeddings instead of simple concatenation
- Benchmark on the full VQAv2 validation split with standard VQA accuracy metric
- Try CLIP as the image encoder for stronger vision-language alignment out of the box
- Add a Gradio demo for interactive image + question input

