The PyTorch implementation of the paper From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
COMM is an MLLM designed to integrate the visual embeddings of CLIP and DINOv2 through multi-level feature merging, enhancing the visual capabilities of multi-modal large language models.
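As a rough illustration of this idea (the official code is not yet released), below is a minimal PyTorch sketch of merging layer-wise features from two vision encoders. The class name, default dimensions, and the exact fusion scheme used here (a softmax-weighted sum over layers per encoder, channel-wise concatenation, and a single linear projector into the LLM embedding space) are assumptions of this sketch, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class MultiLevelFeatureMerging(nn.Module):
    """Sketch of merging layer-wise visual features from CLIP and DINOv2.

    Each encoder provides a list of per-layer patch-token features; each list
    is collapsed with learnable, softmax-normalized layer weights, the two
    results are concatenated along the channel axis, and a linear projector
    maps the fused tokens into the language model's embedding space.
    (Names and defaults are illustrative assumptions, not the official code.)
    """

    def __init__(self, clip_dim=1024, dino_dim=1024, llm_dim=4096,
                 num_clip_layers=24, num_dino_layers=24):
        super().__init__()
        # One learnable scalar per encoder layer.
        self.clip_layer_weights = nn.Parameter(torch.zeros(num_clip_layers))
        self.dino_layer_weights = nn.Parameter(torch.zeros(num_dino_layers))
        # Single projector from the concatenated visual channels to the LLM width.
        self.projector = nn.Linear(clip_dim + dino_dim, llm_dim)

    @staticmethod
    def _merge_layers(layer_feats, layer_weights):
        # layer_feats: list of [batch, tokens, dim] tensors, one per layer.
        stacked = torch.stack(layer_feats, dim=0)                # [L, B, T, D]
        weights = torch.softmax(layer_weights, dim=0)            # [L]
        return (weights[:, None, None, None] * stacked).sum(0)   # [B, T, D]

    def forward(self, clip_layer_feats, dino_layer_feats):
        clip_tokens = self._merge_layers(clip_layer_feats, self.clip_layer_weights)
        dino_tokens = self._merge_layers(dino_layer_feats, self.dino_layer_weights)
        # Assumes both encoders yield the same number of patch tokens.
        fused = torch.cat([clip_tokens, dino_tokens], dim=-1)    # [B, T, Dc+Dd]
        return self.projector(fused)                             # [B, T, llm_dim]
```

In such a setup, `clip_layer_feats` and `dino_layer_feats` would be the hidden states collected from the CLIP and DINOv2 vision towers, and the projected tokens would be fed to the language model alongside the text embeddings.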
[10/16] We released From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models, which integrates CLIP and DINOv2 with multi-level feature merging to enhance the visual capabilities of MLLMs. Check out the paper. (The PDF of the paper has been added to the /images folder.)
[10/18] We apologize that the paper and code are under the corporation's legal review, so the code release will be delayed. Thanks for your patience!
We evaluate the model's multi-modal capabilities on five major categories of vision-language tasks: Referring Expression Comprehension, Referring Expression Generation, Object Hallucination Benchmark, Visual Question Answering, and Image Captioning. COMM achieves state-of-the-art performance on multiple VL tasks, as shown below.
Please cite our paper if the code is helpful to your research.
@article{jiang2023from,
  author  = {Jiang, Dongsheng and Liu, Yuchen and Liu, Songlin and Zhang, Xiaopeng and Li, Jin and Xiong, Hongkai and Tian, Qi},
  title   = {From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models},
  journal = {arXiv preprint arXiv:2310.08825},
  year    = {2023}
}
- LLaVA and Shikra: the codebases we built upon, which have amazing multi-modal capabilities!
- Vicuna: the powerful LLM we use.
- DINOv2: the vision encoder we use.

Thanks for their wonderful work.