
👁️ Visual Q&A System

Because asking an AI "what color is this bus?" shouldn't require a ₹5,000/month API bill.

Problem Statement

Visual Question Answering (VQA) — answering natural language questions about images — is typically locked behind large closed models. This project builds a VQA system from scratch by fusing a text encoder, an image encoder, and a generative decoder into a single custom multimodal architecture, trained end-to-end on MS-COCO.

Project Objective

Design and train a custom MultiModalModel that encodes questions with DistilBERT, encodes images with DINO ViT-B/16, projects both into a shared space, and generates answers autoregressively with GPT-2. Trained for 3 epochs on the VQAv2 dataset, evaluated on validation loss and exact-match accuracy on the test set.


How It Works

Image (URL or file) + Natural Language Question
    ↓
Image Encoder — facebook/dino-vitb16 → image embedding
Text Encoder  — distilbert-base-uncased → question embedding
    ↓
Projection Layers — linear layers align both embeddings to shared dimensionality
    ↓
Decoder — GPT-2 generates answer tokens autoregressively
    ↓
Answer (text) — decoded and stripped of special tokens

Features

  • Custom multimodal architecture — text encoder + image encoder + generative decoder built as a single nn.Module from scratch
  • DINO ViT image encoder — facebook/dino-vitb16 for rich visual representations
  • Autoregressive answer generation — GPT-2 decoder generates open-ended answers, not just classification
  • Data augmentation — random horizontal flip + affine transforms on training images
  • Consensus filtering — only training samples where annotators agree above a threshold are used
  • WandB logging — full training tracked via Weights & Biases
  • URL inference — load any image from a URL and run multi-question VQA in one call

Setup

# 1. Clone the repo
git clone https://github.com/Daddy-Myth/Visual_Q-A_System.git
cd Visual_Q-A_System

# 2. Install dependencies
pip install datasets "transformers[torch]" wandb torchvision

# 3. Download the dataset
# Run the download cells in 09_constructing_a_vqa_system.ipynb
# or manually place these files in the project root:
#   - v2_mscoco_train2014_annotations.json
#   - v2_OpenEnded_mscoco_train2014_questions.json
#   - train2014/  (image folder)

# 4. Train
jupyter notebook 09_constructing_a_vqa_system.ipynb

# 5. Run inference
jupyter notebook 09_using_our_vqa.ipynb

Project Structure

Visual_Q-A_System/
├── 09_constructing_a_vqa_system.ipynb   # Model definition, training, evaluation
├── 09_using_our_vqa.ipynb               # Inference on custom images and URLs
├── loss_validation_dataset.png          # Validation loss chart (generated)
├── accuracy_test_dataset.png            # Test accuracy chart (generated)
├── vqa_custom/                          # Saved model checkpoint (generated after training)
└── README.md

Tech Stack

| Component | Tool |
| --- | --- |
| Text Encoder | distilbert-base-uncased |
| Image Encoder | facebook/dino-vitb16 |
| Decoder | gpt2 |
| Framework | PyTorch + Hugging Face transformers |
| Dataset | VQAv2 / MS-COCO 2014 |
| Training | Hugging Face Trainer (custom subclass) |
| Experiment Tracking | Weights & Biases |
| Image Processing | ViTFeatureExtractor, torchvision.transforms |
| Language | Python 3 / CUDA / fp16 |

Model Architecture (Detailed)

MultiModalModel

A custom nn.Module with three components wired together:

text_encoder   (DistilBERT)  →  text_projection   (Linear) ─┐
                                                              ├→ concatenated context → decoder (GPT-2) → answer
image_encoder  (DINO ViT)    →  image_projection  (Linear) ─┘
  • Text encoder: distilbert-base-uncased — encodes the question, CLS token used as the question representation
  • Image encoder: facebook/dino-vitb16 — encodes the image, CLS token used as the visual representation
  • Projection layers: two learned linear layers align text and image embeddings to GPT-2's hidden size
  • Decoder: gpt2 — takes projected embeddings as context, generates the answer token by token
  • Freezing: configurable — can freeze encoders, decoder, or nothing (freeze='nothing' used in final run)
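
The projection-and-concatenation step above can be sketched in isolation. This is a hypothetical reimplementation of the fusion wiring, not the repo's exact code; all three pretrained models happen to use 768-dimensional hidden states, but the learned projections would let the dimensions differ:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of the fusion stage: project each CLS embedding to the
    decoder's hidden size, then concatenate them as a context prefix."""

    def __init__(self, text_dim: int, image_dim: int, decoder_dim: int):
        super().__init__()
        # Two learned linear projections, as described in the architecture notes
        self.text_projection = nn.Linear(text_dim, decoder_dim)
        self.image_projection = nn.Linear(image_dim, decoder_dim)

    def forward(self, text_cls: torch.Tensor, image_cls: torch.Tensor) -> torch.Tensor:
        # text_cls: (batch, text_dim) from DistilBERT's CLS token
        # image_cls: (batch, image_dim) from DINO ViT's CLS token
        t = self.text_projection(text_cls)    # (batch, decoder_dim)
        v = self.image_projection(image_cls)  # (batch, decoder_dim)
        # Stack into a 2-token context sequence the GPT-2 decoder conditions on
        return torch.stack([v, t], dim=1)     # (batch, 2, decoder_dim)

fusion = MultiModalFusion(text_dim=768, image_dim=768, decoder_dim=768)
prefix = fusion(torch.randn(4, 768), torch.randn(4, 768))
print(prefix.shape)  # torch.Size([4, 2, 768])
```

The resulting prefix would be passed to GPT-2 as `inputs_embeds` context, with answer tokens generated autoregressively after it.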

Data Pipeline

  • Annotations filtered by consensus threshold — only samples where annotators agree are kept, reducing label noise
  • Train/val/test split: 80% / 10% / 10%
  • Custom data_collator handles image preprocessing, question tokenization, and answer label masking (pad tokens masked in loss)
  • Image augmentation on training set: random horizontal flip + random affine (degrees=5, shear=5)
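
The consensus filter can be illustrated with a small sketch. The exact threshold, field names, and answer normalization used in the notebook are assumptions here; VQAv2 provides roughly ten annotator answers per question, and the idea is to keep a sample only when the most common answer reaches the threshold:

```python
from collections import Counter

def consensus_answer(answers, threshold=5):
    """Return the majority answer if enough annotators agree, else None.
    `threshold` is a hypothetical value, not the notebook's actual setting."""
    counts = Counter(a.lower().strip() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= threshold else None

# 8 of 10 annotators agree -> sample is kept with label "red"
sample = ["red", "red", "red", "red", "red", "red",
          "maroon", "red", "red", "dark red"]
print(consensus_answer(sample))  # red

# No answer reaches the threshold -> sample is dropped
print(consensus_answer(["a"] * 3 + ["b"] * 3 + ["c"] * 4))  # None
```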

Training Configuration

TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=2,       # effective batch size = 64
    evaluation_strategy="epoch",
    fp16=True,
    warmup_ratio=0.1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="wandb",
)
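
The pad-token loss masking the collator performs can be sketched as follows. The detail assumed here is the standard PyTorch/Hugging Face convention of setting masked label positions to `-100`, which `CrossEntropyLoss` ignores by default; the notebook's collator may differ in specifics:

```python
import torch

def mask_pad_labels(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Copy answer token ids into labels, masking padding so it
    contributes nothing to the loss (-100 = ignore_index)."""
    labels = input_ids.clone()
    labels[labels == pad_token_id] = -100
    return labels

ids = torch.tensor([[15, 22, 7, 0, 0]])  # 0 = pad token (illustrative)
print(mask_pad_labels(ids, pad_token_id=0))
# tensor([[  15,   22,    7, -100, -100]])
```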

Results

| Metric | Before Training | After 3 Epochs |
| --- | --- | --- |
| Validation Loss | 9.359 | 0.639 |
| Test Accuracy (exact match) | 0.0% | 46.9% |

Charts: loss_validation_dataset.png (validation loss), accuracy_test_dataset.png (test accuracy).
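
Exact match here presumably means a normalized string comparison between the generated answer and the reference; the normalization choices below (lowercasing and whitespace stripping) are assumptions, not the notebook's verified metric:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of generated answers that match the reference exactly
    after lowercasing and stripping surrounding whitespace."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["red", "stop", "two dogs"]
refs  = ["Red", "stop", "dogs"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match -> ~0.667
```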


Inference

# Load model
trained_model = MultiModalModel(
    image_encoder_model="facebook/dino-vitb16",
    text_encoder_model="distilbert-base-uncased",
    decoder_model="gpt2",
    load_from="vqa_custom/pytorch_model.bin"
)

# Run VQA from a URL
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/STOP_sign.jpg/440px-STOP_sign.jpg"

image, answers = trained_model.generate(
    url,
    ["What color is this sign?", "What does the sign say?"],
    max_text_length=5,
    eos_token_id=decoder_tokenizer.eos_token_id
)

for question, answer in answers.items():
    print(f"{question}: {answer}")

# Output:
# What color is this sign?: red
# What does the sign say?: stop

Dataset

The model is trained on VQAv2 (MS-COCO 2014). Dataset files are not included in the repo. Download links are provided inside 09_constructing_a_vqa_system.ipynb (Dropbox mirrors for the JSON annotation files + image zip).

Files needed:

  • v2_mscoco_train2014_annotations.json
  • v2_OpenEnded_mscoco_train2014_questions.json
  • train2014/ image folder

Future Improvements

  • Upload trained weights to Hugging Face Hub for one-line inference
  • Experiment with larger decoders (GPT-2 Medium / Large) for better answer generation
  • Add cross-attention between image and text embeddings instead of simple concatenation
  • Benchmark on the full VQAv2 validation split with standard VQA accuracy metric
  • Try CLIP as the image encoder for stronger vision-language alignment out of the box
  • Add a Gradio demo for interactive image + question input
