A deep learning project that automatically generates natural language descriptions of images using an encoder-decoder architecture with a Bahdanau attention mechanism.
- Attention Mechanism: Bahdanau (additive) attention for focusing on relevant image regions
- Pre-trained Encoder: ResNet-50 (ImageNet) for robust feature extraction
- Spatial Features: 7×7 grid (49 regions) for fine-grained attention
- Interactive GUI: Gradio-based web interface for easy caption generation
- Comprehensive Evaluation: BLEU-1/2/3/4 metrics with visual reports
- Training Visualization: Real-time loss curves and performance tracking
- Encoder Model: ResNet-50 (pretrained on ImageNet)
- Output: 49 spatial regions (7×7 grid)
- Feature Dimension: 2048 per region
- Attention Type: Bahdanau (additive) attention
- Attention Dim: 512
- Purpose: Dynamic focus on relevant image regions per word
- Decoder Model: LSTM with attention
- Embedding Dim: 512
- Hidden Dim: 512
- Vocabulary Size: 2,590 words
- Parameters: 13.4M
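
The attention step implied by the settings above (region features from the encoder, a decoder hidden state, and a shared attention space) can be sketched in plain Python. This is a minimal illustration of additive (Bahdanau) scoring, not the repository's implementation: the names `additive_attention`, `W_f`, `W_h`, and `v` are hypothetical, and the toy dimensions stand in for the 2048-dimensional features and 512-dimensional hidden and attention sizes.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(features, hidden, W_f, W_h, v):
    """Return (context, weights) for one decoding step.

    features: one feature vector per image region
    hidden:   current decoder hidden state
    W_f, W_h: projections into the attention space (hypothetical names)
    v:        vector that reduces each projected region to a scalar score
    """
    projected_hidden = matvec(W_h, hidden)
    scores = []
    for region in features:
        projected_region = matvec(W_f, region)
        # Additive scoring: v . tanh(W_f a_i + W_h h)
        energy = [math.tanh(a + b) for a, b in zip(projected_region, projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    feature_dim = len(features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(feature_dim)
    ]
    return context, weights


def random_matrix(rows, cols, rng):
    """Small random matrix used only for the demo below."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sizes: 49 regions as in the 7x7 grid, but tiny vector dimensions.
    regions, feature_dim, hidden_dim, attention_dim = 49, 8, 4, 4
    features = random_matrix(regions, feature_dim, rng)
    hidden = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    W_f = random_matrix(attention_dim, feature_dim, rng)
    W_h = random_matrix(attention_dim, hidden_dim, rng)
    v = [rng.uniform(-0.1, 0.1) for _ in range(attention_dim)]
    context, weights = additive_attention(features, hidden, W_f, W_h, v)
    print(len(weights), len(context), sum(weights))
```

The weights form a probability distribution over the 49 regions, and the context vector has the same dimension as a single region feature. In a trained model the decoder would recompute both for every generated word.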
| Metric | Score | Industry Baseline |
|---|---|---|
| BLEU-1 | 0.647 | 0.50-0.60 |
| BLEU-2 | 0.443 | 0.30-0.40 |
| BLEU-3 | 0.306 | 0.18-0.25 |
| BLEU-4 | 0.208 | 0.10-0.15 |
The model scores above the baseline range listed in the table on all four BLEU metrics.
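
For readers unfamiliar with how BLEU-1 through BLEU-4 are derived, the sketch below computes a sentence-level score from clipped n-gram precisions and a brevity penalty. It is an illustrative pure-Python version without smoothing, not the evaluation code used for the table above; reported scores are normally computed at corpus level, for example with NLTK's `corpus_bleu`, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference where it is most frequent."""
    candidate_counts = ngrams(candidate, n)
    if not candidate_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(
        min(count, max_ref_counts[gram]) for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())


def bleu(references, candidate, max_n):
    """Sentence-level BLEU-max_n with uniform weights and brevity penalty."""
    precisions = [
        modified_precision(references, candidate, n) for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty uses the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    references = [["a", "dog", "is", "running", "on", "the", "grass"]]
    candidate = ["a", "dog", "runs", "on", "the", "grass"]
    for n in range(1, 5):
        print(f"BLEU-{n}: {bleu(references, candidate, n):.3f}")
```

Because the geometric mean includes every n-gram order up to `max_n`, a single caption with no matching 4-gram scores zero on BLEU-4 here. Corpus-level aggregation or smoothing avoids that, which is one reason per-sentence and corpus scores are not directly comparable.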
- Training Loss: 2.22 (final)
- Validation Loss: 3.04 (best)
- Training Time: ~25 minutes (15 epochs on RTX 3060)
- GPU Utilization: 80-95%
Below are sample predictions on unseen test images:
An interactive HTML report is available: open `test_evaluation_report.html` to explore all test predictions alongside their images.
- Python 3.12+
- CUDA-capable GPU (recommended)
- 8GB+ RAM
- 10GB+ disk space
- Clone the repository:

      git clone https://github.com/Triplejw/image-caption-generator.git
      cd image-caption-generator

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate

- Install dependencies:

      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
      pip install -r requirements.txt

- Download the Flickr8k dataset (the `chmod` step requires your Kaggle API token to already be saved as `~/.kaggle/kaggle.json`):

      mkdir -p ~/.kaggle
      chmod 600 ~/.kaggle/kaggle.json
      kaggle datasets download -d adityajn105/flickr8k
      unzip flickr8k.zip -d data

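
Once the dataset is unpacked, a quick way to sanity-check it is to load the captions and build a vocabulary. The sketch below assumes the archive contains a `captions.txt` file in CSV form with an `image,caption` header row, which lands at `data/captions.txt` after the `unzip` step. The file layout, the function names, the special tokens, and the `min_count` threshold are all assumptions for illustration, not the repository's actual preprocessing.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path


def load_captions(path):
    """Map each image filename to its list of lowercased, tokenized captions."""
    captions = defaultdict(list)
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed "image,caption" header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            image_name = row[0]
            text = ",".join(row[1:])  # re-join captions that contain commas
            captions[image_name].append(text.lower().split())
    return dict(captions)


def build_vocab(captions, min_count=5):
    """Assign an index to every word seen at least min_count times.

    The special tokens and the threshold are hypothetical choices.
    """
    counts = Counter(
        word for caps in captions.values() for tokens in caps for word in tokens
    )
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


if __name__ == "__main__":
    captions_path = Path("data/captions.txt")  # assumed location
    if captions_path.exists():
        captions = load_captions(captions_path)
        vocab = build_vocab(captions)
        print(f"{len(captions)} images, {len(vocab)} vocabulary entries")
    else:
        print(f"{captions_path} not found; download the dataset first")
```

The resulting vocabulary size depends on the tokenization and frequency threshold, so it will not necessarily match the size reported for this project.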
- Deep Learning: PyTorch 2.5
- Computer Vision: ResNet-50
- NLP: NLTK, BLEU metrics
- GUI: Gradio
- Hardware: NVIDIA RTX 3060
- Dataset: Flickr8k from Kaggle
- Architecture: "Show, Attend and Tell" (Xu et al., 2015)
- Pre-trained Model: ResNet-50
MIT License
Joshua JJ Wonder
- GitHub: @Triplejw
- Email: wonderjj2017@gmail.com
Built with ❤️ using PyTorch and Attention

