Fine-tuned BLIP (Bootstrapping Language-Image Pre-training) on the Flickr8k dataset to generate captions for unseen images.
This project explores vision–language models and demonstrates a complete pipeline from data preprocessing → training → evaluation → interactive demo.
- Dataset: Flickr8k
- Model: Salesforce/blip-image-captioning-base
- Frameworks: PyTorch, Hugging Face Transformers, Datasets
- Training Strategy (see the LoRA sketch below):
  - Parameter-efficient fine-tuning (LoRA)
  - Early stopping on validation loss
  - Beam search decoding
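The LoRA setup could look roughly like the following minimal sketch using the PEFT library. The rank, alpha, dropout, and `target_modules` values here are illustrative assumptions, not the exact configuration used in the notebook:

```python
from peft import LoraConfig, get_peft_model
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

# Low-rank adapters on the text-side attention projections; which module
# names exist depends on the BLIP implementation in your Transformers version.
lora_config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices
    lora_alpha=32,                      # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only a small fraction is trainable
```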
👉 Try the model directly on Hugging Face Spaces:
https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo
```
multimodal-image-captioning/
│
├── notebooks/                          # Kaggle pipeline notebook
│   └── imagecaptioning-final-edited.ipynb
│
├── results/                            # Training & qualitative results
│   ├── train_vs_val_loss.png           # Training vs. validation loss curve
│   └── Sample_captions/                # Example generated captions
│       ├── photo1_captioned.jpg
│       ├── photo2_captioned.jpg
│       └── photo3_captioned.jpg
│
├── requirements.txt                    # Dependencies for local demo
├── README.md                           # Project documentation (this file)
└── .gitignore                          # Ignore large model files
```
- Environment Setup – install libraries, configure GPU (T4).
- Dataset Prep – parse Flickr8k `captions.txt` and resize images to 224×224 (illustrated in the training sketch after this list).
- Data Collator – augmentations for training; a clean, augmentation-free collator for eval.
- Model Setup – BLIP encoder–decoder, LoRA applied to reduce memory.
- Training – run with `Seq2SeqTrainer` (loss-only validation for speed); see the training sketch after this list.
- Evaluation – compute BLEU-1/2/3/4, ROUGE-L, and METEOR on the test set.
- Inference – generate captions for unseen images with beam search (see the inference sketch below).
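Put together, the data-prep, collator, and training steps could look roughly like this condensed sketch. It assumes the Kaggle layout of Flickr8k (`flickr8k/captions.txt` plus a `flickr8k/Images/` folder); the paths, split, and hyperparameters are illustrative, and for brevity the preprocessing is folded into the dataset class rather than a separate collator:

```python
import pandas as pd
import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)  # wrap with LoRA as in the sketch above before training

# captions.txt has one "image,caption" row per caption (five per image);
# split by image so no image appears in both train and validation.
captions = pd.read_csv("flickr8k/captions.txt")
val_images = captions["image"].drop_duplicates().sample(frac=0.1, random_state=0)
val_df = captions[captions["image"].isin(val_images)]
train_df = captions[~captions["image"].isin(val_images)]

class CaptionDataset(torch.utils.data.Dataset):
    """Pairs each image with one caption; the processor resizes to
    224x224 and tokenizes the caption text."""

    def __init__(self, df, image_dir):
        self.df, self.image_dir = df.reset_index(drop=True), image_dir

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        enc = processor(images=image, text=row["caption"],
                        padding="max_length", truncation=True,
                        return_tensors="pt")
        enc = {k: v.squeeze(0) for k, v in enc.items()}
        labels = enc["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding in the loss
        enc["labels"] = labels
        return enc

args = Seq2SeqTrainingArguments(
    output_dir="blip-flickr8k",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    eval_strategy="epoch",         # loss-only validation each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # required by the early-stopping callback
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=CaptionDataset(train_df, "flickr8k/Images"),
    eval_dataset=CaptionDataset(val_df, "flickr8k/Images"),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```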
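Inference on an unseen image then comes down to a `generate` call with beam search. The checkpoint path, image path, and generation hyperparameters below are placeholders (and if only LoRA adapters were saved, they would need to be loaded and merged first):

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("blip-flickr8k")  # fine-tuned weights

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, num_beams=5, max_new_tokens=40)  # beam search decoding
print(processor.decode(out[0], skip_special_tokens=True))
```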
| Metric    | Single-ref | Multi-ref |
|-----------|------------|-----------|
| test_loss | 1.7448     | –         |
| BLEU-1    | 0.2831     | 0.5676    |
| BLEU-2    | 0.1709     | 0.4111    |
| BLEU-3    | 0.1078     | 0.2912    |
| BLEU-4    | 0.0693     | 0.2039    |
| ROUGE-L   | 0.3267     | 0.4547    |
| METEOR    | 0.3388     | 0.5123    |
✔ Multi-reference scoring (5 captions per image) shows stronger alignment with human evaluation.
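Multi-reference scoring compares each generated caption against all five gold captions at once instead of a single one. A toy sketch of the idea with NLTK's `corpus_bleu` (one possible implementation, not necessarily the notebook's exact code):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis scored against two (in Flickr8k: five) references.
hypotheses = [["a", "dog", "runs", "through", "the", "grass"]]
references = [[
    ["a", "dog", "is", "running", "through", "the", "grass"],
    ["a", "brown", "dog", "runs", "in", "a", "field"],
]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.4f}, BLEU-4: {bleu4:.4f}")
```

Because an n-gram only needs to match *some* reference, scores rise as references are added, which is why the multi-ref column in the table above is uniformly higher.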
Sample generated captions are available in the `results/Sample_captions/` folder.
See `requirements.txt` for full details:

```
torch
transformers==4.56.0
evaluate==0.4.5
accelerate>=0.33.0
pandas
matplotlib
nltk
rouge-score
gradio
```
```bash
git clone https://github.com/YaekobB/multimodal-image-captioning.git
cd multimodal-image-captioning
pip install -r requirements.txt
```
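With the dependencies installed, a local demo equivalent to the hosted Space takes only a few lines of Gradio; the checkpoint path below is a placeholder for wherever the fine-tuned weights live:

```python
import gradio as gr
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("blip-flickr8k")  # placeholder path

def caption(image: Image.Image) -> str:
    """Generate a beam-search caption for an uploaded image."""
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, num_beams=5, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

gr.Interface(fn=caption, inputs=gr.Image(type="pil"), outputs="text").launch()
```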
- End-to-end pipeline: from dataset preprocessing to interactive demo.
- State-of-the-art BLIP model fine-tuned for captioning.
MIT License.
You’re free to use and modify this project for research and educational purposes.
- BLIP model (Salesforce)
- Flickr8k Dataset
- Kaggle for training environment