
COMM

The PyTorch implementation of the paper From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models.

Overview

COMM is an MLLM that integrates the visual embeddings of CLIP and DINOv2 via multi-level features merging to enhance the visual capabilities of multi-modal large language models.
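Since the official code is not yet released, the following is only an illustrative sketch of the multi-level merging idea described above, not the authors' implementation. The layer-weighting scheme, feature dimensions, and the name `MultiLevelMerge` are all assumptions for the example.

```python
import torch
import torch.nn as nn


class MultiLevelMerge(nn.Module):
    """Illustrative sketch (NOT the official COMM code): merge per-layer
    features from two vision encoders (e.g. CLIP and DINOv2) with learnable
    layer weights, then project the fused tokens to the LLM embedding width.
    """

    def __init__(self, num_layers: int, clip_dim: int, dino_dim: int, llm_dim: int):
        super().__init__()
        # One learnable scalar weight per transformer layer (assumption).
        self.clip_layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.dino_layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Project the concatenated features into the LLM's token space.
        self.proj = nn.Linear(clip_dim + dino_dim, llm_dim)

    def forward(self, clip_feats: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_layers, batch, tokens, clip_dim)
        # dino_feats: (num_layers, batch, tokens, dino_dim)
        w_c = torch.softmax(self.clip_layer_weights, dim=0)
        w_d = torch.softmax(self.dino_layer_weights, dim=0)
        # Weighted sum over the layer dimension.
        clip_merged = (w_c[:, None, None, None] * clip_feats).sum(dim=0)
        dino_merged = (w_d[:, None, None, None] * dino_feats).sum(dim=0)
        # Concatenate the two encoders' tokens along the channel dimension.
        fused = torch.cat([clip_merged, dino_merged], dim=-1)
        return self.proj(fused)


# Toy example with random stand-in features (dimensions are illustrative).
merger = MultiLevelMerge(num_layers=12, clip_dim=1024, dino_dim=768, llm_dim=4096)
clip_feats = torch.randn(12, 2, 256, 1024)
dino_feats = torch.randn(12, 2, 256, 768)
tokens = merger(clip_feats, dino_feats)
print(tokens.shape)  # torch.Size([2, 256, 4096])
```

The fused tokens would then be fed to the LLM as visual inputs, as in LLaVA-style architectures.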

News

[10/16] We released From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models, which integrates CLIP and DINOv2 with multi-level features merging to enhance the visual capabilities of MLLMs. Check out the paper (the PDF is included in the /images folder).
[10/18] We apologize that the paper and code are under the corporation's legal review, so the code release will be delayed. Thanks for your patience!

Performance

We evaluate COMM on five major categories of multi-modal tasks: Referring Expression Comprehension, Referring Expression Generation, Object Hallucination Benchmark, Visual Question Answering, and Image Captioning. COMM achieves SOTA performance on multiple vision-language tasks, as shown below.

Examples




Citation

Please cite our paper if the code is helpful to your research.

@article{jiang2023from,
    author  = {Jiang, Dongsheng and Liu, Yuchen and Liu, Songlin and Zhang, Xiaopeng and Li, Jin and Xiong, Hongkai and Tian, Qi},
    title   = {From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models},
    journal = {arXiv preprint arXiv:2310.08825},
    year    = {2023}
}

Acknowledgement

  • LLaVA and Shikra: The codebases we built upon, which have amazing multi-modal capabilities!
  • Vicuna: The powerful LLM we used.
  • DINOv2: The vision encoder we used.

Thanks for their wonderful work.
