Oryx Video-ChatGPT

Muhammad Maaz*, Hanoona Rasheed*, Salman Khan and Fahad Khan.

*Equal Contribution

Mohamed bin Zayed University of Artificial Intelligence

Demo | YouTube Demo | Demo Clip-1 | Demo Clip-2 | Demo Clip-3 | Demo Clip-4


🚀 News

  • May-26: Our improved models, code, and technical report will be released soon. Stay tuned!
  • May-21: Video-ChatGPT demo released.

Online Demo

🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥

🔥🔥 Or click the image to try the demo! 🔥🔥 You can access all the videos we demonstrate here.

About Video-ChatGPT

  • Video-ChatGPT is a large vision-language model with a dedicated video encoder and large language model (LLM), enabling video understanding and conversation about videos.
  • A simple and scalable multimodal design on top of pretrained video and language encoders that trains only a linear projection layer for multimodal alignment (see the sketch after this list).
  • A data-centric focus, with a human-assisted and semi-automatic annotation framework for high-quality video instruction data.
  • A unique multimodal (vision-language) capability combining video understanding and language generation, comprehensively evaluated using quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.
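
As a rough illustration of this design, the Python sketch below shows a linear projection adapter that maps frozen video-encoder features into the LLM's embedding space. The dimensions (1024 and 4096), the token count, and the class name are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class VideoProjection(nn.Module):
    # Single trainable linear layer for vision-language alignment;
    # the video encoder and the LLM are assumed to stay frozen.
    def __init__(self, video_feat_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(video_feat_dim, llm_embed_dim)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: (batch, num_video_tokens, video_feat_dim) from a frozen video encoder
        return self.proj(video_features)  # -> (batch, num_video_tokens, llm_embed_dim)

# Usage: project spatio-temporal video tokens into the LLM's input embedding space.
video_tokens = torch.randn(2, 356, 1024)   # illustrative batch of video tokens
llm_inputs = VideoProjection()(video_tokens)
print(llm_inputs.shape)                     # torch.Size([2, 356, 4096])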

Figure: architectural overview of Video-ChatGPT.

Contributions

Figure: overview of contributions.

Qualitative Analysis

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks

Demo samples 1–4; access the video samples here.


Creative and Generative Tasks

Demo samples 5–7; access the video samples here.


Spatial Understanding

Demo samples 8–9; access the video samples here.


Video Understanding and Conversational Tasks

Demo samples 10–13; access the video samples here.


Question Answering Tasks

Demo samples 14–17; access the video samples here.


Temporal Understanding

Demo samples 18–21; access the video samples here.


Action Recognition

Demo samples 22–23; access the video samples here.


Quantitative Analysis

Benchmarking Video-ChatGPT's Performance with State-of-the-Art Metrics and Comparative Evaluation.

  • Develops the first quantitative video conversation evaluation framework for benchmarking the performance of video understanding generative models.
  • Evaluates Video-ChatGPT on open-ended question answering tasks using the MSRVTT and MSVD datasets.
  • Uses GPT-assisted evaluation to assess the model's capabilities, measuring the accuracy of the generated predictions and their relative score on a scale of 1–5 (the aggregation of these judgments is sketched after this list).
  • Compares the performance of Video-ChatGPT with other models, including the generic video foundation model InternVideo and the video generative model Ask-Anything Video Chat.
  • Achieves state-of-the-art (SOTA) performance on both the MSRVTT and MSVD datasets, showcasing the model's exceptional performance in video understanding and question answering tasks.
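
To make the reported metrics concrete, the snippet below aggregates hypothetical per-question judge outputs (a yes/no correctness verdict plus an integer score from 1 to 5) into an accuracy and an average relative score. The field names and values are illustrative assumptions, not the official evaluation script.

# Hypothetical GPT-judge outputs for three questions (illustrative values only).
judgments = [
    {"correct": "yes", "score": 4},
    {"correct": "no",  "score": 2},
    {"correct": "yes", "score": 5},
]

# Accuracy: fraction of predictions the judge marked as correct.
accuracy = sum(j["correct"] == "yes" for j in judgments) / len(judgments)
# Relative score: mean of the judge's 1-5 ratings.
avg_score = sum(j["score"] for j in judgments) / len(judgments)
print(f"Accuracy: {accuracy:.1%}, average score (1-5): {avg_score:.2f}")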

Instruction Data for Model Tuning

We present the different types of data included in the instruction data prepared for model tuning, along with the methods used to enrich the ground-truth annotations.

  • Data Types: The instruction data encompasses detailed descriptions, summarizations, question-answer pairs, creative/generative tasks, and conversational tasks, covering concepts such as appearance, temporal relations, and reasoning.
  • Human Annotation Expansion: The original ground-truth annotations are expanded and enriched by human annotators, who add context and detail to strengthen the instruction data.
  • Incorporation of Context from Off-the-Shelf Dense Image Captioning Models: State-of-the-art dense captioning and prediction models generate predictions that supply supplementary contextual information; these predictions are combined through a dedicated procedure, with some models used specifically to filter noisy context from the data.
  • GPT-Assisted Postprocessing: The enriched data is postprocessed with GPT models to refine and optimize the annotations, ensuring high-quality data for effective model training and improved performance (a hypothetical record layout is sketched after this list).
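
For concreteness, the snippet below sketches one possible shape for a single video instruction-tuning record that combines an enriched caption with a question-answer pair. Every field name and value here is a hypothetical illustration, not the released data format.

import json

# Hypothetical instruction-tuning record; all fields are illustrative assumptions.
sample = {
    "video_id": "v_example_0001",
    "source_caption": "A man is cooking pasta in a small kitchen.",
    "enriched_context": "White cabinets are visible; a pot is boiling on the stove.",
    "question": "What is the person preparing, and where?",
    "answer": "He is cooking pasta on the stove of a small kitchen with white cabinets.",
    "task_type": "question_answer",
}
print(json.dumps(sample, indent=2))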

Instruction Data Types

Figures: examples of the instruction data types.

Data Enrichment Methods

Figures: examples of the data enrichment methods.

Acknowledgement

  • LLaMA: A great attempt towards open and efficient LLMs!
  • Vicuna: Has amazing language capabilities!
  • LLaVA: Our architecture is inspired by LLaVA.
  • Thanks to our colleagues at MBZUAI for their essential contribution to the video annotation task, including Dr. Salman Khan, Dr. Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Dr. Jiale Cao, Dr. Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Dr. Muzammal Naseer, Dr. Akshay Dudhane, Dr. Jean Lahoud, and Awais Rauf, without whom this project would not have been possible.

Note:

Please note that this is ongoing work: we are improving the architecture design and fine-tuning on the video instruction data. We will release our code and pretrained models very soon. Stay tuned!

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

@misc{maaz2023videochatgpt,
      title={Video-ChatGPT},
      author={Muhammad Maaz and Hanoona Rasheed and Salman Khan and Fahad Khan},
      year={2023},
      howpublished={\url{https://github.com/hanoonaR/Video-ChatGPT}},
      note={GitHub repository}
}

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

