This repository contains a notebook for fine-tuning the BLIP-2 vision-language model with LoRA on the Flickr8k dataset for image captioning, along with a demo for interacting with the fine-tuned model.
A blog post explaining how I fine-tuned this VLM is also available on my website: Fine-Tuning BLIP-2 with LoRA on the Flickr8k Dataset for Image Captioning.
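
Below is a minimal sketch of how LoRA adapters can be attached to BLIP-2 with Hugging Face `transformers` and `peft`. The base checkpoint, LoRA hyperparameters, and target modules shown here are illustrative assumptions, not necessarily the exact configuration used in the notebook.

```python
# Minimal sketch: attach LoRA adapters to BLIP-2 for caption fine-tuning.
# The checkpoint name and LoRA settings below are assumptions for illustration.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

model_name = "Salesforce/blip2-opt-2.7b"  # assumed base checkpoint
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16
)

# LoRA config targeting the language model's attention projections (illustrative values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapter weights are trainable
```

With the adapters attached, the wrapped model can be trained on Flickr8k image-caption pairs while the original BLIP-2 weights stay frozen, which keeps memory use and checkpoint size small.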