ChatBridge is an approach to learning a unified multimodal model that can interpret, correlate, and reason about various modalities without relying on all combinations of paired data.
ChatBridge is a multimodal language model capable of perceiving real-world multimodal information, as well as following instructions, thinking, and interacting with humans in natural language. Inspired by Flamingo and BLIP-2, we introduce perceiver modules to bridge the modality encoders and the LLM. We choose the open-sourced Vicuna-13B as the LLM, which is built upon LLaMA and reported to achieve 90% of ChatGPT's quality per GPT-4's evaluation. As for the modality-specific encoders, we choose EVA-ViT-G as the vision encoder to encode images and videos, and BEATs as the audio encoder to encode audio.
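To make the bridging idea concrete, below is a minimal PyTorch sketch of a perceiver-style module that compresses variable-length encoder features into a fixed number of tokens projected into the LLM embedding space. The class name, layer counts, and feature dimensions are illustrative assumptions for this sketch, not the exact implementation in this repository.

```python
import torch
import torch.nn as nn

class PerceiverBridge(nn.Module):
    """Illustrative perceiver module: a fixed set of learnable queries
    cross-attends to variable-length encoder features and yields a fixed
    number of tokens projected into the LLM embedding space."""

    def __init__(self, enc_dim, llm_dim, num_queries=32, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.proj = nn.Linear(enc_dim, llm_dim)  # map to the LLM token space

    def forward(self, enc_feats):                # enc_feats: (B, N, enc_dim)
        b = enc_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(q, enc_feats, enc_feats)
            q = q + out                          # residual update of the queries
        return self.proj(q)                      # (B, num_queries, llm_dim)

# Usage sketch: bridge frozen modality encoders to the frozen LLM.
# Dimensions are placeholders (e.g. EVA-ViT-G features -> Vicuna-13B hidden size).
vision_bridge = PerceiverBridge(enc_dim=1408, llm_dim=5120)
audio_bridge = PerceiverBridge(enc_dim=768, llm_dim=5120)
vision_tokens = vision_bridge(torch.randn(1, 257, 1408))  # image/video features
audio_tokens = audio_bridge(torch.randn(1, 496, 768))     # audio features
# These tokens are concatenated with text token embeddings as input to the LLM.
```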
- Stage 1: Multimodal alignment training. Bridge each modality with language by leveraging large-scale language-paired two-modality data, including image-text, video-text, and audio-text pairs.
- Stage 2: Multimodal instruction tuning. Instruction-finetune ChatBridge on our multimodal instruction dataset MULTIS to align the model with user intent, enabling more effective zero-shot generalization on multimodal tasks (a rough training sketch follows this list).
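As a rough illustration of the two-stage pipeline, the sketch below runs the same optimization loop over both stages and only swaps the training data; the `model(samples=...)` interface, the dataloader names, and the hyperparameters are assumptions for illustration, not the repository's actual training script.

```python
import torch

def train_stage(model, dataloader, epochs, lr=1e-4):
    # Only the perceiver bridges are assumed trainable; encoders and LLM stay frozen.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            # The model conditions the LLM on the bridged modality tokens and
            # computes a next-token language-modeling loss on the text target.
            loss = model(samples=batch)["loss"]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: multimodal alignment on language-paired two-modality data
# (image-text, video-text, audio-text pairs); targets are captions.
# train_stage(model, alignment_loader, epochs=1)

# Stage 2: multimodal instruction tuning on MULTIS; targets are responses to
# instruction-formatted prompts, improving zero-shot generalization.
# train_stage(model, multis_instruction_loader, epochs=1)
```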
More examples can be found on the project page.
Code and data will be released in June!
- BLIP-2: The model architecture of ChatBridge follows BLIP-2. Don't forget to check out this great open-source work if you haven't seen it before!
- Lavis: This repository is built upon Lavis!
- Vicuna: The language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!
- MiniGPT-4 and LLaVA: We use their instruction data and draw inspiration from their approach to design a more comprehensive multimodal instruction dataset. They are all open-source!
If you're using ChatBridge in your research or applications, please cite using this BibTeX:
@article{zhao2023chatbridge,
  title={ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst},
  author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Chen, Sihan and Shao, Shuai and Zhu, Xinxin and Yuan, Zehuan and Liu, Jing},
  journal={arXiv preprint arXiv:2305.16103},
  year={2023}
}
This repository is under the BSD 3-Clause License. Much of the code is based on Lavis, which is also licensed under the BSD 3-Clause License here.