Because asking an AI "what color is this bus?" shouldn't require a ₹5,000/month API bill.
Visual Question Answering (VQA) — answering natural language questions about images — is typically locked behind large closed models. This project builds a VQA system from scratch by fusing a text encoder, an image encoder, and a generative decoder into a single custom multimodal architecture, trained end-to-end on MS-COCO.
The goal: design and train a custom `MultiModalModel` that encodes questions with DistilBERT, encodes images with DINO ViT-B/16, projects both into a shared space, and generates answers autoregressively with GPT-2. The model is trained for 3 epochs on VQAv2 and evaluated on validation loss and exact-match accuracy on a held-out test split.
Image (URL or file) + Natural Language Question
↓
Image Encoder — facebook/dino-vitb16 → image embedding
Text Encoder — distilbert-base-uncased → question embedding
↓
Projection Layers — linear layers align both embeddings to shared dimensionality
↓
Decoder — GPT-2 generates answer tokens autoregressively
↓
Answer (text) — decoded and stripped of special tokens
- Custom multimodal architecture — text encoder + image encoder + generative decoder built as a single `nn.Module` from scratch
- DINO ViT image encoder — `facebook/dino-vitb16` for rich visual representations
- Autoregressive answer generation — GPT-2 decoder generates open-ended answers, not just classification
- Data augmentation — random horizontal flip + affine transforms on training images
- Consensus filtering — only training samples where annotators agree above a threshold are used
- WandB logging — full training tracked via Weights & Biases
- URL inference — load any image from a URL and run multi-question VQA in one call
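The consensus filtering above can be sketched as a small helper. The function name and the 0.5 agreement cutoff are illustrative assumptions; VQAv2 attaches ten annotator answers to each question, and samples whose majority answer falls below the threshold are dropped:

```python
from collections import Counter

def consensus_answer(answers, threshold=0.5):
    """Return the majority answer if enough annotators agree, else None.

    `answers` is the list of annotator answers attached to a VQAv2
    question; `threshold` is a hypothetical agreement cutoff.
    """
    counts = Counter(a.lower().strip() for a in answers)
    top, n = counts.most_common(1)[0]
    return top if n / len(answers) >= threshold else None

print(consensus_answer(["red"] * 7 + ["blue"] * 3))   # red
print(consensus_answer(["red"] * 4 + ["blue"] * 3 + ["green"] * 3))  # None
```

Filtering this way trades dataset size for cleaner labels, which matters when the decoder is trained to reproduce a single answer string.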
# 1. Clone the repo
git clone https://github.com/Daddy-Myth/Visual_Q-A_System.git
cd Visual_Q-A_System
# 2. Install dependencies
pip install datasets transformers[torch] wandb torchvision
# 3. Download the dataset
# Run the download cells in 09_constructing_a_vqa_system.ipynb
# or manually place these files in the project root:
# - v2_mscoco_train2014_annotations.json
# - v2_OpenEnded_mscoco_train2014_questions.json
# - train2014/ (image folder)
# 4. Train
jupyter notebook 09_constructing_a_vqa_system.ipynb
# 5. Run inference
jupyter notebook 09_using_our_vqa.ipynb

Visual_Q-A_System/
├── 09_constructing_a_vqa_system.ipynb # Model definition, training, evaluation
├── 09_using_our_vqa.ipynb # Inference on custom images and URLs
├── loss_validation_dataset.png # Validation loss chart (generated)
├── accuracy_test_dataset.png # Test accuracy chart (generated)
├── vqa_custom/ # Saved model checkpoint (generated after training)
└── README.md
| Component | Tool |
|---|---|
| Text Encoder | distilbert-base-uncased |
| Image Encoder | facebook/dino-vitb16 |
| Decoder | gpt2 |
| Framework | PyTorch + Hugging Face transformers |
| Dataset | VQAv2 / MS-COCO 2014 |
| Training | Hugging Face Trainer (custom subclass) |
| Experiment Tracking | Weights & Biases |
| Image Processing | ViTFeatureExtractor, torchvision.transforms |
| Language | Python 3 / CUDA / fp16 |
A custom `nn.Module` with three components wired together:
text_encoder (DistilBERT) → text_projection (Linear) ─┐
├→ concatenated context → decoder (GPT-2) → answer
image_encoder (DINO ViT) → image_projection (Linear) ─┘
- Text encoder: `distilbert-base-uncased` — encodes the question; the CLS token is used as the question representation
- Image encoder: `facebook/dino-vitb16` — encodes the image; the CLS token is used as the visual representation
- Projection layers: two learned linear layers align text and image embeddings to GPT-2's hidden size
- Decoder: `gpt2` — takes the projected embeddings as context and generates the answer token by token
- Freezing: configurable — encoders, decoder, or nothing can be frozen (`freeze='nothing'` used in the final run)
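The projection-and-concatenation fusion step can be sketched without loading any pretrained weights. `FusionSketch` is a hypothetical stand-in that feeds random tensors in place of the encoders' CLS outputs; 768 happens to be the hidden size of DistilBERT, ViT-B/16, and GPT-2 alike, but the learned projections keep the design general:

```python
import torch
import torch.nn as nn

# Stand-in dimensions for the real encoders and decoder.
TEXT_DIM, IMAGE_DIM, DECODER_DIM = 768, 768, 768

class FusionSketch(nn.Module):
    """Minimal sketch of the fusion step in MultiModalModel.

    The real model wraps pretrained encoders and a GPT-2 decoder;
    here random tensors stand in for the encoder CLS outputs.
    """
    def __init__(self):
        super().__init__()
        self.text_projection = nn.Linear(TEXT_DIM, DECODER_DIM)
        self.image_projection = nn.Linear(IMAGE_DIM, DECODER_DIM)

    def forward(self, text_cls, image_cls):
        # Project each modality into the decoder's embedding space,
        # then stack them as a 2-token prefix for GPT-2.
        t = self.text_projection(text_cls)     # (batch, DECODER_DIM)
        i = self.image_projection(image_cls)   # (batch, DECODER_DIM)
        return torch.stack([i, t], dim=1)      # (batch, 2, DECODER_DIM)

fusion = FusionSketch()
prefix = fusion(torch.randn(4, TEXT_DIM), torch.randn(4, IMAGE_DIM))
print(prefix.shape)  # torch.Size([4, 2, 768])
```

GPT-2 then attends to this short prefix while generating answer tokens, so the decoder sees both modalities through ordinary self-attention rather than a dedicated cross-attention module.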
- Annotations filtered by consensus threshold — only samples where annotators agree are kept, reducing label noise
- Train/val/test split: 80% / 10% / 10%
- Custom `data_collator` handles image preprocessing, question tokenization, and answer label masking (pad tokens are excluded from the loss)
- Image augmentation on the training set: random horizontal flip + random affine (degrees=5, shear=5)
TrainingArguments(
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=128,
gradient_accumulation_steps=2, # effective batch size = 64
evaluation_strategy="epoch",
fp16=True,
warmup_ratio=0.1,
learning_rate=2e-5,
lr_scheduler_type="cosine",
save_total_limit=1,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
report_to="wandb",
)| Metric | Before Training | After 3 Epochs |
|---|---|---|
| Validation Loss | 9.359 | 0.639 |
| Test Accuracy (exact match) | 0.0% | 46.9% |
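Exact match is the strictest way to score generated answers: the prediction counts only if it equals the reference string after normalization. A sketch of the metric, with `exact_match_accuracy` as a hypothetical helper (the notebook's normalization may differ in details):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference answer
    after lowercasing and stripping surrounding whitespace."""
    norm = lambda s: s.lower().strip()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

acc = exact_match_accuracy(["Red", "two ", "cat"], ["red", "two", "dog"])
print(round(acc, 3))  # 0.667
```

Because any paraphrase ("crimson" for "red") scores zero, 46.9% exact match understates how often the model's answers are actually acceptable.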
# Load model
trained_model = MultiModalModel(
image_encoder_model="facebook/dino-vitb16",
text_encoder_model="distilbert-base-uncased",
decoder_model="gpt2",
load_from="vqa_custom/pytorch_model.bin"
)
# Run VQA from a URL
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/STOP_sign.jpg/440px-STOP_sign.jpg"
image, answers = trained_model.generate(
url,
["What color is this sign?", "What does the sign say?"],
max_text_length=5,
eos_token_id=decoder_tokenizer.eos_token_id
)
for question, answer in answers.items():
print(f"{question}: {answer}")
# Output:
# What color is this sign?: red
# What does the sign say?: stop

The model is trained on VQAv2 (MS-COCO 2014). Dataset files are not included in the repo. Download links are provided inside 09_constructing_a_vqa_system.ipynb (Dropbox mirrors for the JSON annotation files + image zip).
Files needed:
- `v2_mscoco_train2014_annotations.json`
- `v2_OpenEnded_mscoco_train2014_questions.json`
- `train2014/` image folder
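Questions and annotations live in separate JSON files and must be joined on `question_id`. A sketch of that join, with `load_vqa_pairs` as a hypothetical loader; the field names (`questions`, `annotations`, `multiple_choice_answer`) follow the published VQAv2 schema:

```python
import json

def load_vqa_pairs(questions_path, annotations_path):
    """Join VQAv2 questions to their annotations by question_id,
    keeping the consensus answer for each question."""
    with open(questions_path) as f:
        questions = json.load(f)["questions"]
    with open(annotations_path) as f:
        annotations = json.load(f)["annotations"]
    by_qid = {a["question_id"]: a for a in annotations}
    return [
        {
            "image_id": q["image_id"],
            "question": q["question"],
            "answer": by_qid[q["question_id"]]["multiple_choice_answer"],
        }
        for q in questions
        if q["question_id"] in by_qid
    ]
```

The `image_id` maps back to a file in `train2014/` (COCO names follow the pattern `COCO_train2014_{image_id:012d}.jpg`), which is how the collator pairs each question with its image.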
- Upload trained weights to Hugging Face Hub for one-line inference
- Experiment with larger decoders (GPT-2 Medium / Large) for better answer generation
- Add cross-attention between image and text embeddings instead of simple concatenation
- Benchmark on the full VQAv2 validation split with standard VQA accuracy metric
- Try CLIP as the image encoder for stronger vision-language alignment out of the box
- Add a Gradio demo for interactive image + question input

