🧠 Image Caption Generator Using CNN-LSTM (FastAPI Deployment)

📘 Overview

This project implements an Image Caption Generator, a deep learning model that automatically generates descriptive captions for images. It combines Convolutional Neural Networks (CNNs) for image feature extraction and Recurrent Neural Networks (LSTMs) for language modeling, trained on image–caption datasets.

You can upload an image via a FastAPI web interface, and the app returns a meaningful caption generated by the trained model.

🎥 Project Demo

Watch the full demo here: YouTube Video

---

🚀 Project Workflow

1. Data Collection & Preprocessing

  • Dataset Used: Flickr8k Dataset (8,000 images with 5 captions each).

  • Preprocessing Steps (see the sketch below):

      • Cleaned captions (removed punctuation, converted to lowercase, tokenized).

      • Added the special tokens "start" and "end" to each caption.

      • Used InceptionV3 (pretrained on ImageNet) for feature extraction.

      • Extracted a 2048-dimensional feature vector for each image.
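A minimal sketch of these steps, assuming captions are plain Python strings and images are files on disk; `clean_caption` and `extract_features` are illustrative helper names, not code taken from this repo:

```python
import string

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import img_to_array, load_img

def clean_caption(caption):
    # Lowercase, strip punctuation, keep alphabetic words longer than one
    # character, then wrap the result in the special start/end tokens.
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return 'start ' + ' '.join(words) + ' end'

# InceptionV3 minus its classification head: the pooled 2048-dim
# activations of the penultimate layer serve as the image feature.
base = InceptionV3(weights='imagenet')
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_path):
    img = load_img(image_path, target_size=(299, 299))  # InceptionV3 input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0)  # shape (1, 2048)
```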

2. Text Tokenization

  • Used Tokenizer from Keras to build a vocabulary from all captions.

  • Converted captions to integer sequences.

  • Applied padding to make all sequences of equal length.

  • Defined max_length based on the longest caption.
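A sketch of this step, assuming `captions` is a flat list of cleaned caption strings (the name is illustrative):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)            # build the vocabulary
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
max_length = max(len(c.split()) for c in captions)

# Captions become integer sequences, padded to a common length.
sequences = tokenizer.texts_to_sequences(captions)
padded = pad_sequences(sequences, maxlen=max_length)
```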

3. Model Architecture

The model follows a CNN + LSTM Encoder–Decoder approach.

🧩 Encoder (Image Feature Extractor)

Input: Extracted feature vector (2048-dim).

Layers:

  • Dropout(0.5)

  • Dense(256, activation='relu')

Output: 256-dim projected feature.

🧠 Decoder (Language Model)

Input: Sequence of tokens (padded to max_length).

Layers:

  • Embedding(vocab_size, 256, mask_zero=True)

  • LSTM(256)

Output: 256-dim summary of the partial caption.

🏗️ Combined Model

The 256-dim encoder and decoder outputs are merged via add(), followed by Dense(256, activation='relu') and a final Dense(vocab_size, activation='softmax') that predicts the next word in the sequence.
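Putting the pieces together, a sketch of the merge architecture described above (`vocab_size` and `max_length` come from the tokenization step):

```python
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from tensorflow.keras.models import Model

# Encoder: project the 2048-dim InceptionV3 feature down to 256 dims.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Decoder: embed the partial caption and summarize it with an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = LSTM(256)(se1)

# Merge both 256-dim representations and predict the next word.
decoder1 = add([fe2, se2])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```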

🏋️ Training

Epochs: 17

Batch Size: 32

Optimizer: Adam

Loss: Categorical Cross-Entropy

Used a data generator to feed (image features, input sequence, output word) tuples in memory-efficient batches (sketched below).

Validation captions were generated at intervals to monitor quality.
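A minimal sketch of such a generator, assuming `captions` maps image IDs to lists of cleaned captions and `features` maps image IDs to 2048-dim vectors (both names are illustrative):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_length, vocab_size, batch_size=32):
    X1, X2, y = [], [], []
    while True:  # Keras generators loop forever; steps_per_epoch bounds each epoch
        for image_id, caps in captions.items():
            for cap in caps:
                seq = tokenizer.texts_to_sequences([cap])[0]
                # Each caption yields one training pair per word:
                # (image feature, sequence prefix) -> next word (one-hot).
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(features[image_id])
                    X2.append(in_seq)
                    y.append(out_word)
                    if len(X1) == batch_size:
                        yield [np.array(X1), np.array(X2)], np.array(y)
                        X1, X2, y = [], [], []
```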

🧩 Model Evaluation

Generated captions for random test images (see the decoding sketch after the example below).

Actual:

  • two dogs are playing with each other on the pavement
  • black dog and tri-colored dog playing with each other on the road

Predicted:

  • two dogs are playing on the road
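The README doesn't state the decoding strategy; a greedy (argmax) decoder is the usual default for this architecture. A sketch, reusing the names introduced above:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # photo_feature: (1, 2048) vector from the InceptionV3 extractor.
    text = 'start'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == 'end':  # stop at the end token
            break
        text += ' ' + word
    return text.replace('start ', '', 1)
```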

⚙️ FastAPI Deployment

  1. Backend (main.py)

Built a REST API using FastAPI.

The /predict/ endpoint accepts an uploaded image and returns the generated caption.

Loads the trained model (model.h5), the tokenizer (tokenizer.pkl), and the InceptionV3 feature extractor (sketched below).
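A minimal sketch of the endpoint, assuming the helpers from earlier sections (`extract_features`, `generate_caption`) are importable and that `MAX_LENGTH` matches the value computed during tokenization:

```python
import pickle
import shutil
import tempfile

from fastapi import FastAPI, File, UploadFile
from tensorflow.keras.models import load_model

app = FastAPI()
model = load_model('model.h5')
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)
MAX_LENGTH = 34  # hypothetical value; must equal the max_length used in training

@app.post('/predict/')
async def predict(file: UploadFile = File(...)):
    # Persist the upload to a temporary file so Keras can read it from disk.
    with tempfile.NamedTemporaryFile(delete=False, suffix='.jpg') as tmp:
        shutil.copyfileobj(file.file, tmp)
        path = tmp.name
    feature = extract_features(path)  # (1, 2048) InceptionV3 vector
    caption = generate_caption(model, tokenizer, feature, MAX_LENGTH)
    return {'caption': caption}
```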

  2. Frontend

Simple and elegant HTML + CSS form.

Upload an image → get caption → view output instantly.

Deployed locally via:

uvicorn main:app --reload

✨ Author

Ali Ahmad

Data Scientist & AI/ML Engineer

📧 aliahmaddawana@gmail.com
