InstructBLIP Pipeline

📝 Overview


This pipeline provides InstructBLIP multimodal support for Vicuna-family models running on oobabooga/text-generation-webui.


⏩ Just let me run the thing

Clone this repo into your extensions/multimodal/pipelines folder, then run the server with the --multimodal-pipeline flag set to your preferred pipeline, using the AutoGPTQ loader.

> cd text-generation-webui
> cd extensions/multimodal/pipelines
> git clone https://github.com/kjerk/instructblip-pipeline
> cd ../../../
> python server.py --auto-devices --chat --listen --loader autogptq --multimodal-pipeline instructblip-7b

👀 Examples

Generation Parameter Presets:
  • LLaMA-Precise

  • Big O

💸 Requirements

  • AutoGPTQ loader (ExLlama is not supported for multimodal)

  • No additional dependencies beyond what textgen-webui already provides

VRAM Requirements

| Combination                   | VRAM |
|-------------------------------|------|
| instructblip-7b + vicuna-7b   | ~6GB |
| instructblip-13b + vicuna-13b | 11GB |

Vanilla Vicuna-7b + InstructBLIP just barely runs on a 24GB GPU using HuggingFace transformers directly, and the 13b at fp16 is too much. Thanks to optimization efforts and quantized models via AutoGPTQ, InstructBLIP and Vicuna can comfortably run on 8GB to 12GB of VRAM on textgen-webui. 🙌
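For reference, this is roughly what the quantized load looks like when driving AutoGPTQ directly from Python. The checkpoint name below is illustrative; inside textgen-webui, starting the server with --loader autogptq performs the equivalent for you.

```python
# Sketch: loading a 4-bit GPTQ Vicuna checkpoint with AutoGPTQ directly.
# The repo name is illustrative; textgen-webui's AutoGPTQ loader does the
# equivalent of this when launched with --loader autogptq.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

repo = "TheBloke/vicuna-7B-v1.3-GPTQ"  # illustrative 4-bit checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

# 4-bit weights put the 7b LLM at roughly 4GB of VRAM, which is what leaves
# room for the InstructBLIP components inside the ~6GB figure above.
```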


Provided Pipelines
  • 'instructblip-7b' for Vicuna-7b family

  • 'instructblip-13b' for Vicuna-13b family

Non-Working Models
  • wizard-vicuna-13b-4bit-128g

🖥️ Inference

Due to the already heavy VRAM requirements of the respective models, the vision encoder and projector are kept on CPU, where they are still relatively quick, while the Q-Former is moved to GPU for speed.
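As a rough sketch of that split, using module names from the HuggingFace transformers InstructBLIP implementation (the pipeline's own code may organize this differently):

```python
# Sketch of the device split described above, assuming the HuggingFace
# transformers InstructBLIP module layout (vision_model, qformer,
# language_projection); the pipeline's internals may differ.
import torch
from transformers import InstructBlipForConditionalGeneration

model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float32
)

model.vision_model.to("cpu")         # vision encoder: runs once per image, fast enough on CPU
model.language_projection.to("cpu")  # projector into the LLM's embedding space: also CPU
model.qformer.to("cuda")             # Q-Former: moved to GPU for speed
```

With a split like this, tensors must be moved between devices explicitly at each boundary: CPU image embeddings into the GPU Q-Former, then its output back to CPU for projection.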

☑️ TODO List

  • ✅ Full readme doc

  • ✅ Add demonstration images

  • ☐ Eat something tasty

🔭 Consider List

  • ❔ Allow for GPU inference of the image encoder and projector?

  • ❔ Investigate problems caused by multiple image embeddings, and possible remediations.

📄 License

This pipeline follows the LAVIS license and is published under the BSD 3-Clause OSS license.


