ChatBridge

ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without relying on all combinations of paired data.

Introduction

ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, reasoning, and interacting with humans in natural language. Inspired by Flamingo and BLIP-2, we introduce perceiver modules to bridge the encoders and the LLM. We choose the open-source Vicuna-13B as the LLM, which is built upon LLaMA and reported to achieve 90% of ChatGPT's quality per GPT-4's evaluation. As for the modality-specific encoders, we choose EVA-ViT-G as the vision encoder to encode images and videos, and BEATs as the audio encoder to encode audio.
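For readers who want a concrete picture of the bridging design, the sketch below shows one way such a perceiver bridge could be wired up in PyTorch. The class name PerceiverBridge, the hidden dimensions, and the number of query tokens are illustrative assumptions for this README, not the repository's actual implementation, which follows BLIP-2 / LAVIS conventions.

# Minimal PyTorch sketch of a perceiver-style bridge between a frozen modality
# encoder and the LLM embedding space. All names and dimensions are assumptions;
# see the released code for the actual implementation.
import torch
import torch.nn as nn

class PerceiverBridge(nn.Module):
    """Learnable query tokens cross-attend over encoder features and are
    projected to the LLM embedding size, yielding a fixed-length prefix."""

    def __init__(self, enc_dim, llm_dim, num_queries=32, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim))
        layer = nn.TransformerDecoderLayer(d_model=enc_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_feats):
        # enc_feats: (batch, seq_len, enc_dim) from a frozen modality encoder
        q = self.queries.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        out = self.decoder(tgt=q, memory=enc_feats)   # queries attend to encoder features
        return self.proj(out)                         # (batch, num_queries, llm_dim)

# One bridge per modality; the frozen encoders and the frozen Vicuna-13B LLM are
# omitted. The dimensions below (1408 for the vision encoder, 768 for the audio
# encoder, 5120 for the LLM) are assumptions for illustration only.
vision_bridge = PerceiverBridge(enc_dim=1408, llm_dim=5120)
audio_bridge = PerceiverBridge(enc_dim=768, llm_dim=5120)
vision_tokens = vision_bridge(torch.randn(1, 257, 1408))   # dummy image features
audio_tokens = audio_bridge(torch.randn(1, 496, 768))      # dummy audio features
# The resulting token sequences are interleaved with text embeddings and fed to the LLM.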

  • Stage 1: Multimodal alignment training. Bridge each modality with language by leveraging large-scale language-paired two-modality data, including image-text, video-text, and audio-text pairs.
  • Stage 2: Multimodal instruction tuning. Instruction-finetune ChatBridge on a multimodal instruction dataset, MULTIS, to align the model with user intent and enable more effective zero-shot generalization on multimodal tasks. (A rough sketch of both stages follows this list.)
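As referenced above, the snippet below lays out the two stages as placeholder configurations. The field names, dataset identifiers, and the frozen/trainable split are assumptions for illustration, not the released training configs.

# Illustrative two-stage training setup; all names and values are placeholders.
STAGES = [
    {
        "name": "stage1_multimodal_alignment",
        "trainable": ["perceiver_bridges"],                    # assumption: bridges only
        "frozen": ["vision_encoder", "audio_encoder", "llm"],
        "datasets": ["image_text", "video_text", "audio_text"],
        "objective": "caption-style language modeling on language-paired data",
    },
    {
        "name": "stage2_multimodal_instruction_tuning",
        "trainable": ["perceiver_bridges"],
        "frozen": ["vision_encoder", "audio_encoder", "llm"],
        "datasets": ["MULTIS"],                                # multimodal instruction data
        "objective": "follow multimodal instructions to match user intent",
    },
]

def run_stage(stage):
    """Placeholder runner: a real one would build the datasets, freeze/unfreeze
    modules, and optimize the bridge parameters with a language-modeling loss."""
    print(f"Running {stage['name']} on {', '.join(stage['datasets'])}")

for stage in STAGES:
    run_stage(stage)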

Overview figure of the ChatBridge architecture.

Examples

More examples can be found on the project page.

Getting Started

Code and data will be released in June!

Acknowledgement

  • BLIP2 The model architecture of ChatBridge follows BLIP-2. Don't forget to check out this great open-source work if you haven't seen it before!
  • Lavis This repository is built upon Lavis!
  • Vicuna The language ability of Vicuna with only 13B parameters is simply amazing, and it is open-source!
  • MiniGPT4 and LLaVA. We use their instruction data and draw inspiration from their approach to design a more comprehensive multimodal instruction dataset. They are both open-source!

If you're using ChatBridge in your research or applications, please cite using this BibTeX:

@article{zhao2023chatbridge,
  title={ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst},
  author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Chen, Sihan and Shao, Shuai and Zhu, Xinxin and Yuan, Zehuan and Liu, Jing},
  journal={arXiv preprint arXiv:2305.16103},
  year={2023}
}

License

This repository is under the BSD 3-Clause License. Much of the code is based on Lavis, which is also released under the BSD 3-Clause License here.
