The key components of the project:
- The model uses an encoder-decoder architecture: MobileNet_V3_Small as the encoder and a TransformerDecoder as the decoder.
- Preprocessing the captions and building the tokenizer.
- Model training and evaluation.
- Greedy search and beam search algorithms for inference.
- Saving model weights and visualizing the results.
This repository uses the Flickr8k dataset and the PyTorch framework. The dataset is organized as follows:
- flickr8k
  - images
    - image files
  - captions.txt
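A small sketch for reading `captions.txt`, assuming the common Flickr8k distribution where the file has a header line followed by comma-separated `image,caption` rows (the function name and lower-casing are assumptions, not the repo's exact preprocessing):

```python
from collections import defaultdict

def load_captions(path):
    """Map each image filename to its list of lower-cased captions.

    Assumes the Kaggle-style Flickr8k captions.txt: one header line,
    then rows of the form "image_name.jpg,A caption about the image."
    """
    caps = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "image,caption" header
        for line in f:
            name, _, caption = line.strip().partition(",")
            if caption:
                caps[name].append(caption.lower())
    return caps
```

Each Flickr8k image has five reference captions, so every value in the returned dict is typically a list of five strings.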
You can download the pre-trained `best_model.pt` weights and the precomputed image features `feature_extractor.pkl`.
You have to change the `config.root` path to your workspace path.
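A minimal sketch of the save/load round trip for the two downloadable artifacts. The stand-in model and the assumed contents of `feature_extractor.pkl` (a dict of precomputed features keyed by filename) are illustrative assumptions, not the repo's exact format:

```python
import pickle
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for the captioning model
torch.save(model.state_dict(), "best_model.pt")

# map_location="cpu" makes the checkpoint loadable on machines without a GPU
state = torch.load("best_model.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()                                   # switch to inference mode

# assumed layout: image filename -> precomputed encoder feature tensor
features = {"img1.jpg": torch.randn(576)}
with open("feature_extractor.pkl", "wb") as f:
    pickle.dump(features, f)
with open("feature_extractor.pkl", "rb") as f:
    loaded = pickle.load(f)
```

In practice you would build the paths to both files from `config.root` so the repo works from any workspace location.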
Beam search helps in generating the most optimal caption by considering multiple possibilities at each decoding step, rather than greedily selecting the word with the highest score. The example below demonstrates how using a beam width (k) of 3 results in better captions.
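The contrast between the two strategies can be shown with a toy next-token distribution (the transition table below is a made-up example, not model output). Greedy search commits to the highest-probability first word and gets stuck with a lower-probability sentence, while beam search with k = 3 keeps the weaker first word alive and finds the better overall sequence:

```python
import math

def greedy(step_fn, start, eos, max_len=10):
    """Pick the single highest-scoring token at every step."""
    seq = [start]
    while seq[-1] != eos and len(seq) < max_len:
        seq.append(max(step_fn(seq).items(), key=lambda kv: kv[1])[0])
    return seq

def beam_search(step_fn, start, eos, k=3, max_len=10):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

# Toy transitions: "a" looks best at step one (0.6 > 0.4), but the "b" path
# has the higher total probability (0.4 * 0.9 = 0.36 vs 0.6 * 0.55 = 0.33).
table = {
    ("<s>",):     {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"<e>": 0.55, "c": 0.45},
    ("<s>", "b"): {"<e>": 0.9},
}
def step_fn(seq):
    nxt = table.get(tuple(seq), {"<e>": 1.0})
    return {t: math.log(p) for t, p in nxt.items()}

print(greedy(step_fn, "<s>", "<e>"))            # ['<s>', 'a', '<e>']
print(beam_search(step_fn, "<s>", "<e>", k=3))  # ['<s>', 'b', '<e>']
```

With a real model, `step_fn` would run one decoder step and return log-probabilities over the vocabulary; summing log-probabilities along each beam is what lets the search trade a weaker first word for a stronger sentence overall.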