This project implements an image captioning model with an encoder-decoder architecture: a pretrained ResNet-50 CNN extracts image features, and a stacked LSTM network generates textual descriptions. The model is trained and evaluated on the Flickr8k dataset, achieving a BLEU-1 score of 65% and a BLEU-2 score of 42%, surpassing the original benchmark.
- Encoder-decoder architecture using ResNet-50 + LSTM
- BLEU score evaluation
- Tokenization and padding of captions (see the sketch below)
- Data pipeline with preprocessing and feature extraction
- Training visualization and performance tracking
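Caption preprocessing follows the standard Keras workflow. A minimal sketch, assuming `startseq`/`endseq` boundary markers and an `<unk>` out-of-vocabulary token (both illustrative, not confirmed by the repo):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "startseq a dog runs through the grass endseq",
    "startseq two children play on the beach endseq",
]

# Fit a word-level tokenizer on every training caption.
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

# Convert captions to integer sequences and pad them to a fixed length.
sequences = tokenizer.texts_to_sequences(captions)
max_length = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
```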
- Flickr8k Dataset
  - 8,000 images
  - 5 human-annotated captions per image
  - Download link: Flickr8k Dataset
  - Captions: Flickr8k Text
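For reference, a minimal loader for the caption file, assuming the standard `Flickr8k.token.txt` layout where each line holds `image.jpg#N<TAB>caption` (the filename and helper name are illustrative):

```python
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):
    """Parse the Flickr8k caption file into {image_id: [captions]}."""
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # e.g. '1000268201_693b08cb0e.jpg#0\tA child in a pink dress ...'
            image_tag, caption = line.split("\t", 1)
            image_id = image_tag.split("#")[0]  # drop the '#0'..'#4' suffix
            captions[image_id].append(caption.lower())
    return captions
```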
- Python
- TensorFlow & Keras
- ResNet-50 (pretrained on ImageNet)
- LSTM for sequence generation
- Numpy, Matplotlib, Pickle, tqdm
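Before the architecture details, here is a minimal sketch of feature extraction with the pretrained ResNet-50 listed above, using `pooling="avg"` to obtain a single 2048-dimensional vector per image (the helper name is illustrative):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# ResNet-50 without its classification head; global average pooling
# collapses the final feature map into one 2048-d vector.
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x, verbose=0)[0]  # shape: (2048,)
```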
- Pretrained ResNet-50 with the final classification layer removed
- Extracts 2048-dimensional feature vectors
- Embedding layer for word vectors
- Stacked LSTM layers
- Dense layers to predict the next word in the sequence
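A minimal sketch of how such a decoder could be wired up with the Keras functional API; the layer widths (256), dropout rate, `vocab_size`, and `max_length` values are illustrative assumptions, not the repo's exact configuration:

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumption: size of the fitted tokenizer vocabulary
max_length = 35     # assumption: longest padded caption

# Image branch: project the 2048-d ResNet feature into the decoder space.
img_input = Input(shape=(2048,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: embed the partial caption and run it through stacked LSTMs.
txt_input = Input(shape=(max_length,))
txt_embed = Embedding(vocab_size, 256, mask_zero=True)(txt_input)
lstm_out = LSTM(256, return_sequences=True)(txt_embed)
lstm_out = LSTM(256)(lstm_out)

# Merge both branches and predict the next word over the vocabulary.
merged = add([img_dense, lstm_out])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At training time the model sees an image feature plus a partial caption and learns to predict the next word; at inference the same model is called word by word until the end-of-sequence token is produced.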
| Metric | Score |
|---|---|
| BLEU-1 | 65% |
| BLEU-2 | 42% |
| BLEU-3 | 27% |
| BLEU-4 | 18% |
Scores surpass the original paper, which achieved a BLEU-1 of 61% and a BLEU-2 of 41%.
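Scores of this kind can be computed with NLTK's `corpus_bleu`, pairing each generated caption with its five reference captions. A minimal sketch (the tokenized captions shown are placeholders):

```python
from nltk.translate.bleu_score import corpus_bleu

# references: one list of tokenized ground-truth captions per image;
# hypotheses: the single generated caption per image, tokenized.
references = [[["a", "dog", "runs", "through", "the", "grass"],
               ["a", "brown", "dog", "is", "running", "outside"]]]
hypotheses = [["a", "dog", "runs", "in", "the", "grass"]]

print("BLEU-1: %.2f" % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %.2f" % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3: %.2f" % corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4: %.2f" % corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```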