Muhammad Maaz* , Hanoona Rasheed* , Salman Khan and Fahad Khan.
*Equal Contribution
Mohamed bin Zayed University of Artificial Intelligence
- May-26 : Our improved models, code and technical report will be released soon. Stay tuned!
- May-21 : Video-ChatGPT demo released.
🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥
🔥🔥 Or click the image to try the demo! 🔥🔥
You can access all the videos shown in the demo here.
- Video-ChatGPT is a large vision-language model with a dedicated video encoder and large language model (LLM), enabling video understanding and conversation about videos.
- A simple and scalable multimodal design on top of pretrained video and language encoders that adapts only a linear projection layer for multimodal alignment (a minimal sketch follows this list).
- Data-centric focus with a human-assisted and semi-automatic annotation framework for high-quality video instruction data.
- Unique multimodal (vision-language) capability combining video understanding and language generation, comprehensively evaluated with quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.
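The projection-layer idea can be illustrated with a minimal PyTorch sketch. This is not the released implementation; the class name, feature dimensions, and token counts below are illustrative assumptions. It only shows that a single trainable linear layer maps frozen video-encoder features into the LLM's embedding space.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Illustrative sketch (not the released code): a single trainable
    linear layer aligns frozen video-encoder features with the LLM's
    token-embedding space. Dimensions are hypothetical."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: (batch, num_tokens, vision_dim) spatio-temporal
        # features pooled from the frozen video encoder.
        return self.proj(video_features)  # (batch, num_tokens, llm_dim)

# During instruction tuning only the projection layer receives gradients;
# the video encoder and the LLM stay frozen.
projector = VisualProjector()
dummy_features = torch.randn(2, 256, 1024)  # hypothetical token count
visual_tokens = projector(dummy_features)   # ready to prepend to text tokens
```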
Access the video samples here.
- Develops the first quantitative video conversation evaluation framework for benchmarking the performance of video understanding generative models.
- Evaluates Video-ChatGPT on open-ended question answering tasks using the MSRVTT and MSVD datasets.
- Uses GPT-assisted evaluation to assess the model's capabilities, measuring the accuracy and relative score of generated predictions on a scale of 1-5 (shown on top of the bars in the bar chart below); a minimal scoring sketch follows this list.
- Compares the performance of Video-ChatGPT with other models, including the generic video foundation model InternVideo and the video generative model Ask-Anything (VideoChat).
- Achieves state-of-the-art (SOTA) performance on both the MSRVTT and MSVD datasets, demonstrating strong video understanding and question-answering capability.
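The GPT-assisted scoring loop can be sketched roughly as below. The prompt wording, the use of the legacy OpenAI ChatCompletion API, and the gpt-3.5-turbo model choice are assumptions for illustration, not the benchmark's exact protocol.

```python
# Hedged sketch of GPT-assisted evaluation: an LLM judge compares a
# prediction with the ground-truth answer and returns a yes/no verdict
# plus a 1-5 score. Prompt text and API usage are assumptions.
import json
import openai  # legacy (<1.0) API assumed

def gpt_judge(question: str, answer: str, prediction: str) -> dict:
    prompt = (
        "Evaluate a video question-answering prediction.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        'Reply only with JSON: {"pred": "yes" or "no", "score": 1-5}.'
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response["choices"][0]["message"]["content"])

def aggregate(results):
    # Accuracy = fraction judged correct; relative score = mean 1-5 rating.
    accuracy = sum(r["pred"] == "yes" for r in results) / len(results)
    mean_score = sum(r["score"] for r in results) / len(results)
    return accuracy, mean_score
```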
We present the different types of data included in the instructional data prepared for model tuning, along with the methods used to enrich the ground truth annotations.
- Data Types: The instructional data encompasses detailed descriptions, summarizations, question-answer pairs, creative/generative tasks, and conversational tasks, covering concepts such as appearance, temporal relations, reasoning, and more.
- Human Annotation Expansion: The original ground truth annotations are expanded and enriched by human annotators, who provide additional context and detail to enhance the instructional data.
- Incorporation of context from off-the-shelf dense captioning models: State-of-the-art dense captioning and prediction models are used to generate predictions that provide supplementary contextual information. These predictions are then combined, with some models used specifically to filter noisy context out of the data.
- GPT-Assisted Postprocessing: The enriched data is postprocessed with GPT models to refine and optimize the annotations, ensuring high-quality data for effective model training and improved performance (a minimal sketch follows this list).
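The GPT-assisted postprocessing step could look roughly like the sketch below: a ground-truth caption is combined with auxiliary off-the-shelf captions, and a GPT model rewrites them into an enriched description. The function name, prompt, and API call are illustrative assumptions, not the released annotation pipeline.

```python
# Illustrative sketch: combining a ground-truth caption with auxiliary
# (possibly noisy) off-the-shelf captions and asking a GPT model to
# produce an enriched description. Prompt and API usage are assumptions.
import openai  # legacy (<1.0) API assumed

def enrich_annotation(gt_caption: str, aux_captions: list) -> str:
    context = "\n".join(f"- {c}" for c in aux_captions)
    prompt = (
        "Ground-truth video caption:\n"
        f"{gt_caption}\n\n"
        "Additional frame-level captions (may contain noise):\n"
        f"{context}\n\n"
        "Write a detailed, self-contained description of the video, "
        "keeping only details consistent with the ground-truth caption."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"]
```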
- LLaMA: A great attempt towards open and efficient LLMs!
- Vicuna: Provides amazing language capabilities!
- LLaVA: Our architecture is inspired by LLaVA.
- Thanks to our colleagues at MBZUAI for their essential contributions to the video annotation task, including Dr. Salman Khan, Dr. Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Dr. Jiale Cao, Dr. Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Dr. Muzammal Naseer, Dr. Akshay Dudhane, Dr. Jean Lahoud, and Awais Rauf, without whom this project would not have been possible.
Please note that this is ongoing work: we are improving our architecture design and fine-tuning on the video instruction data. We will release our code and pretrained models very soon. Stay tuned!
If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:
@misc{maaz2023videochatgpt,
    title={Video-ChatGPT},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad},
    year={2023},
    howpublished={\url{https://github.com/hanoonaR/Video-ChatGPT}},
    note={GitHub repository}
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.