This project implements an Image Caption Generator, a deep learning model that automatically generates descriptive captions for images. It combines Convolutional Neural Networks (CNNs) for image feature extraction and Recurrent Neural Networks (LSTMs) for language modeling, trained on image–caption datasets.
You can upload an image via a FastAPI web interface, and the app returns a meaningful caption generated by the trained model.
Watch the full demo here: YouTube Video
---
- Dataset Used: Flickr8k Dataset (8,000 images with 5 captions each).
- Preprocessing Steps:
  - Cleaned captions (removed punctuation, converted to lowercase, tokenized).
  - Added special "start" and "end" tokens to each caption.
  - Used InceptionV3 (pretrained on ImageNet) for image feature extraction.
  - Extracted a 2048-dimensional feature vector for each image.
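The feature-extraction step above can be sketched as follows (assuming TensorFlow/Keras is installed; the function names here are illustrative, and `weights="imagenet"` downloads the pretrained weights on first use):

```python
import numpy as np
import tensorflow as tf

def build_feature_extractor(weights="imagenet"):
    """InceptionV3 without its classifier head; global average pooling
    reduces the final convolutional maps to a single 2048-dim vector."""
    return tf.keras.applications.InceptionV3(
        include_top=False, pooling="avg", weights=weights
    )

def extract_features(extractor, image_path):
    """Load an image, resize it to InceptionV3's 299x299 input, and encode it."""
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return extractor.predict(x, verbose=0)[0]  # shape: (2048,)
```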
- Used the Keras Tokenizer to build a vocabulary from all captions.
- Converted captions to integer sequences.
- Applied padding so all sequences have equal length.
- Defined max_length based on the longest caption.
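The cleaning, vocabulary, and padding steps above can be sketched without any framework dependency (the project itself uses the Keras Tokenizer; the helper names and post-padding choice below are illustrative):

```python
import string

def clean_caption(caption: str) -> str:
    """Lowercase, strip punctuation, and wrap with the start/end tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return "start " + " ".join(words) + " end"

def build_vocab(captions):
    """Map each word to a 1-based integer id (0 is reserved for padding)."""
    words = sorted({w for c in captions for w in c.split()})
    return {w: i + 1 for i, w in enumerate(words)}

def to_padded_sequence(caption, vocab, max_length):
    """Convert a caption to integer ids and zero-pad it to max_length."""
    seq = [vocab[w] for w in caption.split()]
    return seq + [0] * (max_length - len(seq))

caps = [clean_caption("Two dogs play, outside!"), clean_caption("A dog runs.")]
vocab = build_vocab(caps)
max_length = max(len(c.split()) for c in caps)
padded = [to_padded_sequence(c, vocab, max_length) for c in caps]
```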
The model follows a CNN + LSTM Encoder–Decoder approach.
🧩 Encoder (Image Feature Extractor)
Input: Extracted feature vector (2048-dim).
Layers:
Dropout(0.5)
Dense(256, activation='relu')
Output: 256-dim projected feature.
🧩 Decoder (Sequence Processor)
Input: Sequence of tokens (padded to max_length).
Layers:
Embedding(vocab_size, 256, mask_zero=True)
LSTM(256)
The encoder and decoder outputs are merged via add(), followed by Dense(256, activation='relu') and Dense(vocab_size, activation='softmax') layers to predict the next word in the sequence.
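A minimal Keras sketch of this merge architecture, using the layer sizes listed above (the `build_model` name and the 256-unit Dense layer after the merge are assumptions based on the standard CNN + LSTM merge design):

```python
from tensorflow.keras.layers import Input, Dropout, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_model(vocab_size, max_length):
    # Encoder: project the 2048-dim InceptionV3 feature down to 256 dims
    img_in = Input(shape=(2048,))
    x = Dropout(0.5)(img_in)
    x = Dense(256, activation="relu")(x)

    # Decoder: embed the token sequence and summarize it with an LSTM
    seq_in = Input(shape=(max_length,))
    y = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
    y = LSTM(256)(y)

    # Merge both 256-dim branches and predict the next word
    z = add([x, y])
    z = Dense(256, activation="relu")(z)
    out = Dense(vocab_size, activation="softmax")(z)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```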
Epochs: 17
Batch Size: 32
Optimizer: Adam
Loss: Categorical Cross-Entropy
Used a data generator to feed (image features, input sequence, output word) tuples in memory-efficient batches.
Validation captions were generated at intervals to monitor quality.
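The (image features, input sequence, output word) generator described above might look like this sketch (function and argument names are illustrative; the zero post-padding mirrors an assumed padding scheme):

```python
import numpy as np

def data_generator(captions, features, vocab, max_length, vocab_size, batch_size=32):
    """Yield ([image_features, input_sequence], next_word) batches lazily,
    so the full expanded training set never sits in memory at once."""
    X1, X2, y = [], [], []
    while True:
        for img_id, caption in captions:
            seq = [vocab[w] for w in caption.split() if w in vocab]
            # each caption yields one training sample per next-word position
            for i in range(1, len(seq)):
                in_seq = seq[:i] + [0] * (max_length - i)  # zero-padded input
                out_word = np.zeros(vocab_size)
                out_word[seq[i]] = 1.0                     # one-hot next word
                X1.append(features[img_id])
                X2.append(in_seq)
                y.append(out_word)
                if len(X1) == batch_size:
                    yield [np.array(X1), np.array(X2)], np.array(y)
                    X1, X2, y = [], [], []
```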
Generated captions for random test images.
Actual:
- two dogs are playing with each other on the pavement
- black dog and tri-colored dog playing with each other on the road
Predicted:
- two dogs are playing on the road
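Captions like the prediction above are typically produced by greedy decoding: start from the "start" token and repeatedly feed the growing sequence back into the model. A sketch, assuming a trained model and the vocabulary mappings (names here are illustrative):

```python
import numpy as np

def generate_caption(model, photo_feature, vocab, inv_vocab, max_length):
    """Greedy decoding: pick the most probable next word at each step
    until the 'end' token (or max_length) is reached."""
    words = ["start"]
    for _ in range(max_length):
        seq = [vocab[w] for w in words]
        seq = seq + [0] * (max_length - len(seq))  # zero-pad to input length
        probs = model.predict([np.array([photo_feature]), np.array([seq])],
                              verbose=0)
        next_word = inv_vocab[int(np.argmax(probs))]
        if next_word == "end":
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the leading 'start' token
```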
- Backend (main.py)
Built a REST API using FastAPI.
The /predict/ endpoint accepts an uploaded image and returns the generated caption.
Uses the pre-trained model (model.h5), the tokenizer (tokenizer.pkl), and the InceptionV3 feature extractor.
- Frontend
Simple and elegant HTML + CSS form.
Upload an image → get caption → view output instantly.
Deployed locally via: uvicorn main:app --reload

Ali Ahmad
Data Scientist & AI/ML Engineer