A simple app that generates captions for images using a Transformer decoder and ResNet-18 features. Upload your own image or try a sample to see what the model describes!
- Feature Extractor: ResNet-18 (pretrained)
- Decoder: Transformer (3 layers, 8 attention heads, 512-dim embeddings, 2048-dim feed-forward, dropout 0.2)
- Vocabulary: 7,234 words
- Metric: BLEU-4 score of 0.18
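
For readers unfamiliar with the BLEU-4 metric listed above, here is a minimal sketch of how such a score can be computed for a single caption. It is an illustration only: the function names `ngram_counts` and `bleu4` are hypothetical, the version shown is unsmoothed and sentence-level, and the evaluation in this project may differ (for example, it may be corpus-level, smoothed, or computed with a library).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4 (illustrative, not the project's code)."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0  # candidate has fewer than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the four precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / 4)


reference = "a dog runs across the green grass".split()
print(bleu4(reference, [reference]))  # identical caption scores 1.0
print(bleu4("a dog runs across the grass".split(), [reference]))
```

Scores range from 0 to 1. The score rewards captions that share 1- to 4-word sequences with the reference captions and penalizes captions that are too short.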
The app downloads the required model files automatically the first time you run it, so a manual download is optional.
- Clone this repo:
git clone https://github.com/paudelsamir/Image-Captioning-Transformer.git
cd Image-Captioning-Transformer
- Install dependencies:
pip install -r requirements.txt
- Run the app:
streamlit run app.py
streamlit run demo_app.py  # demo app (no requirements needed)
# Windows: run the setup script
setup.bat
# Linux/macOS: make the setup script executable and run it
chmod +x setup.sh
./setup.sh
This is a fun project for learning and demo purposes. For details, see the notebook in this repository.