This repository contains an image captioning model built with TensorFlow. The model combines a Convolutional Neural Network (CNN) for visual feature extraction with a Recurrent Neural Network (RNN) for language generation to produce descriptive captions for images. It is trained on the Flickr8K dataset and evaluated with the SacreBLEU metric.
- CNN for Feature Extraction: The model uses a pre-trained Convolutional Neural Network (Xception) to extract visual features from images (see the sketch after this list).
- RNN for Caption Generation: A Recurrent Neural Network (LSTM) processes the extracted features and generates captions.
- Tokenizer and Embeddings: The textual data is tokenized and embedded for effective processing by the RNN.
- BLEU Score Evaluation: The generated captions are evaluated using the SacreBLEU metric.
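The feature-extraction step is not spelled out in this README, but a minimal sketch of it, assuming the standard Keras `Xception` application with global average pooling and 299×299 inputs, could look like this:

```python
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load Xception without its classification head; global average pooling
# yields a 2048-dimensional feature vector per image.
feature_extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    # Xception expects 299x299 RGB inputs scaled to [-1, 1].
    image = load_img(image_path, target_size=(299, 299))
    array = img_to_array(image)
    array = preprocess_input(np.expand_dims(array, axis=0))
    return feature_extractor.predict(array, verbose=0)  # shape: (1, 2048)
```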
The model is trained on the Flickr8K dataset, which contains 8,000 images, each paired with five reference captions. The dataset is preprocessed by tokenizing the captions and encoding them as integer sequences for training the sequence model.
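A rough sketch of this preprocessing, assuming the Keras `Tokenizer` and `pad_sequences` utilities and using placeholder captions rather than the real Flickr8K caption file, might look like this:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder captions; the real pipeline reads the Flickr8K caption file.
captions = [
    "startseq a dog runs across the grass endseq",
    "startseq two children play on the beach endseq",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)               # build the word index from all captions
vocab_size = len(tokenizer.word_index) + 1

sequences = tokenizer.texts_to_sequences(captions)
max_length = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
```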
- Feature Extraction:
  - A CNN (Xception) extracts feature vectors from input images.
- Text Processing:
  - Tokenization and word embeddings are applied to the captions.
  - The sequences are padded to a uniform length.
- Caption Generation:
  - The extracted image features and text sequences are passed through an LSTM-based RNN.
  - The output is a sequence of words forming the caption (see the sketch after this list).
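The exact architecture is defined in the training code; purely as an illustration, the sketch below wires up one common merge-style design in Keras, combining a projected 2048-dimensional image feature with an LSTM-encoded caption prefix to predict the next word. The vocabulary size, caption length, and layer widths are assumptions, not values taken from this project.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 35     # assumed maximum caption length

# Image branch: 2048-d Xception features projected to the decoder width.
image_input = Input(shape=(2048,))
image_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: embedded caption prefix encoded by an LSTM.
text_input = Input(shape=(max_length,))
text_embed = Embedding(vocab_size, 256, mask_zero=True)(text_input)
text_lstm = LSTM(256)(Dropout(0.5)(text_embed))

# Merge both branches and predict the next word of the caption.
decoder = Dense(256, activation="relu")(add([image_dense, text_lstm]))
outputs = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, text_input], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```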
The model is evaluated using the SacreBLEU metric, which measures n-gram overlap between the generated captions and the reference captions.
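Evaluation with the `sacrebleu` package might look roughly like the following; the hypothesis and reference strings here are placeholders only:

```python
import sacrebleu

# Generated captions and their reference captions. SacreBLEU takes one list
# of references per reference set (Flickr8K provides five per image).
hypotheses = ["a dog runs across the grass"]
references = [["a dog is running through the grass"],
              ["a brown dog runs in a field"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU: {bleu.score:.2f}")
```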