Visual Question Answering (VQA) - Abstract Scenes

This project implements a Visual Question Answering (VQA) system using the VQA Abstract Scenes Dataset. It takes a cartoon-style image and a natural language question about the image, and predicts an answer.

🔍 Dataset

Name: VQA Abstract Scenes Dataset (v2)

Source: VQA Dataset (https://visualqa.org/)

The dataset contains synthetic scene images built from clipart objects, designed to test reasoning over structured visual scenes.
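
The questions and answers ship as JSON in the official VQA layout. A minimal loading sketch follows; the file names are illustrative placeholders, and the `questions`/`annotations`/`multiple_choice_answer` keys are assumed from the standard VQA format:

```python
import json

# Illustrative file names -- point these at the downloaded abstract-scenes JSON files.
QUESTIONS_JSON = "OpenEnded_abstract_v002_train2015_questions.json"
ANNOTATIONS_JSON = "abstract_v002_train2015_annotations.json"

with open(QUESTIONS_JSON) as f:
    questions = json.load(f)["questions"]        # [{"image_id", "question", "question_id"}, ...]
with open(ANNOTATIONS_JSON) as f:
    annotations = json.load(f)["annotations"]    # includes "multiple_choice_answer" per question

# Pair each question with its consensus answer for training.
answer_by_qid = {a["question_id"]: a["multiple_choice_answer"] for a in annotations}
qa_pairs = [(q["image_id"], q["question"], answer_by_qid[q["question_id"]]) for q in questions]
print(qa_pairs[0])
```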

🧠 Model Architecture

The model uses a combination of the following components; a minimal Keras sketch follows the list.

Image Encoder: Frozen ResNet50 with Global Average Pooling and a Dense Layer to extract image features (input size 224x224).

LSTM-based Question Encoder: Pretrained word embeddings are passed into stacked LSTMs to create a Question Context Vector.

Cross-modal Attention: Multi-Head Cross-Modal Attention is used to align image and question representations.

Fusion Layer: Combines attended image features, global image features, and the question summary.

Classifier: Fully-connected Dense layers (512 → 256) followed by a Softmax over 1,000 answer classes.
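
The Keras sketch below illustrates how these components fit together. Sizes not stated above (LSTM width, number of attention heads, vocabulary size, question length, embedding dimension) are assumptions for illustration, not the exact implementation:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

VOCAB_SIZE, MAX_Q_LEN, NUM_ANSWERS = 10000, 22, 1000   # illustrative values

# --- Image encoder: frozen ResNet50 + global average pooling + dense projection ---
image_in = layers.Input(shape=(224, 224, 3), name="image")
backbone = ResNet50(include_top=False, weights="imagenet")
backbone.trainable = False                              # frozen, as described above
feat_map = backbone(image_in)                           # (7, 7, 2048) feature map
img_global = layers.GlobalAveragePooling2D()(feat_map)
img_global = layers.Dense(512, activation="relu")(img_global)
img_regions = layers.Reshape((49, 2048))(feat_map)      # 49 region vectors for attention
img_regions = layers.Dense(512)(img_regions)

# --- Question encoder: (pretrained) embeddings -> stacked LSTMs ---
q_in = layers.Input(shape=(MAX_Q_LEN,), name="question")
q_emb = layers.Embedding(VOCAB_SIZE, 300)(q_in)         # pass pretrained vectors via `weights=` in practice
q_seq = layers.LSTM(512, return_sequences=True)(q_emb)
q_ctx = layers.LSTM(512)(q_seq)                         # question context vector

# --- Multi-head cross-modal attention: question attends over image regions ---
q_query = layers.Reshape((1, 512))(q_ctx)
attended = layers.MultiHeadAttention(num_heads=8, key_dim=64)(
    query=q_query, value=img_regions, key=img_regions)
attended = layers.Reshape((512,))(attended)

# --- Fusion of attended image features, global image features, and question summary ---
fused = layers.Concatenate()([attended, img_global, q_ctx])

# --- Classifier: 512 -> 256 -> softmax over 1,000 answer classes ---
x = layers.Dense(512, activation="relu")(fused)
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax")(x)

model = Model(inputs=[image_in, q_in], outputs=out)
```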

The model is trained using the Categorical Crossentropy loss and the Adam optimizer. Checkpointing is used to save model weights after each epoch.

⚙️ Hyperparameters

Loss Function: Categorical Crossentropy

Optimizer: Adam

Evaluation Metric: Accuracy

Batch Size: 128 (initial), 32 (for fine-tuning)

Learning Rate: 0.00001
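
Putting the training configuration together, a hedged sketch of the compile, checkpoint, and fit steps; the checkpoint filename pattern and the `train_ds`/`val_ds` dataset objects are assumptions:

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-5),   # learning rate 0.00001 from the list above
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Save weights after every epoch (checkpointing, as described in the architecture section).
checkpoint = ModelCheckpoint(
    "vqa_weights_epoch_{epoch:02d}.h5",   # hypothetical filename pattern
    save_weights_only=True,
    save_freq="epoch",
)

# `train_ds` and `val_ds` are assumed to be tf.data.Dataset objects yielding
# ({"image": ..., "question": ...}, one_hot_answer) pairs.
model.fit(
    train_ds.batch(128),                  # drop the batch size to 32 for the fine-tuning phase
    validation_data=val_ds.batch(128),
    epochs=30,
    callbacks=[checkpoint],
)
```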

📊 Best Model Performance

Best Epoch: 24 (out of 30 total)

Exact Match: 27.55%

Partial Match: 41.21%

Validation Accuracy: 51.8%

Validation Loss: 1.7154

Training Accuracy: 54.76%

Training Loss: 1.5631
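
A scoring routine along the following lines reproduces exact-match and partial-match style metrics. Since the README does not define "partial match", the definition below (the prediction appears among any of the annotator answers) is an assumption:

```python
from collections import Counter

def exact_match(pred, gt_answers):
    # Prediction equals the consensus (most common) ground-truth answer.
    return pred == Counter(gt_answers).most_common(1)[0][0]

def partial_match(pred, gt_answers):
    # Prediction appears among any of the annotator answers (assumed definition).
    return pred in gt_answers

# Toy example: 3 questions, each with 10 annotator answers.
preds = ["yes", "2", "red"]
gts = [["yes"] * 10, ["3"] * 8 + ["2"] * 2, ["blue"] * 10]

em = sum(exact_match(p, g) for p, g in zip(preds, gts)) / len(preds)
pm = sum(partial_match(p, g) for p, g in zip(preds, gts)) / len(preds)
print(f"Exact match: {em:.2%}  Partial match: {pm:.2%}")  # 33.33% / 66.67%
```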
