Muhammad Maaz* , Hanoona Rasheed* , Salman Khan and Fahad Khan.
*Equal Contribution
Mohamed bin Zayed University of Artificial Intelligence
- May-26 : Our improved models, code and technical report will be released soon. Stay tuned!
- May-21 : Video-ChatGPT demo released.
🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥
🔥🔥 Or click the image to try the demo! 🔥🔥
You can access all the videos shown in the demo here.
- Video-ChatGPT is a large vision-language model with a dedicated video encoder and large language model (LLM), enabling video understanding and conversation about videos.
- A simple and scalable multimodal design on top of pretrained video and language encoders that adapts only a linear projection layer for multimodal alignment (a minimal sketch follows this list).
- Data-centric focus with a human-assisted and semi-automatic annotation framework for high-quality video instruction data.
- Unique multimodal (vision-language) capability combining video understanding and language generation, comprehensively evaluated with quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.
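The projection-layer idea can be illustrated with a minimal PyTorch sketch. This is not the released implementation; the class name, feature dimensions, and token counts below are illustrative assumptions. It only shows that a single trainable linear layer maps frozen video-encoder features into the LLM's embedding space.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Illustrative sketch (not the released code): a single trainable
    linear layer aligns frozen video-encoder features with the LLM's
    token-embedding space. Dimensions are hypothetical."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: (batch, num_tokens, vision_dim) spatio-temporal
        # features pooled from the frozen video encoder.
        return self.proj(video_features)  # (batch, num_tokens, llm_dim)

# During instruction tuning only the projection layer receives gradients;
# the video encoder and the LLM stay frozen.
projector = VisualProjector()
dummy_features = torch.randn(2, 256, 1024)  # hypothetical token count
visual_tokens = projector(dummy_features)   # ready to prepend to text tokens
```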
Access the video samples here.
- Develops the first quantitative video conversation evaluation framework for benchmarking the performance of video understanding generative models.
- Evaluates Video-ChatGPT on open-ended question answering tasks using the MSRVTT and MSVD datasets.
- Uses GPT-assisted evaluation to assess the model's capabilities, measuring the accuracy and relative score of generated predictions on a scale of 1-5 (shown on top of the bars in the bar chart below); a minimal scoring sketch follows this list.
- Compares the performance of Video-ChatGPT with other models, including the generic video foundation model InternVideo and the video generative model Ask-Anything (VideoChat).
- Achieves state-of-the-art (SOTA) performance on both the MSRVTT and MSVD datasets, demonstrating strong video understanding and question-answering capability.
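The GPT-assisted scoring loop can be sketched roughly as below. The prompt wording, the use of the legacy OpenAI ChatCompletion API, and the gpt-3.5-turbo model choice are assumptions for illustration, not the benchmark's exact protocol.

```python
# Hedged sketch of GPT-assisted evaluation: an LLM judge compares a
# prediction with the ground-truth answer and returns a yes/no verdict
# plus a 1-5 score. Prompt text and API usage are assumptions.
import json
import openai  # legacy (<1.0) API assumed

def gpt_judge(question: str, answer: str, prediction: str) -> dict:
    prompt = (
        "Evaluate a video question-answering prediction.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        'Reply only with JSON: {"pred": "yes" or "no", "score": 1-5}.'
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response["choices"][0]["message"]["content"])

def aggregate(results):
    # Accuracy = fraction judged correct; relative score = mean 1-5 rating.
    accuracy = sum(r["pred"] == "yes" for r in results) / len(results)
    mean_score = sum(r["score"] for r in results) / len(results)
    return accuracy, mean_score
```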
We present the different types of data included in the instructional data prepared for model tuning, along with the methods used to enrich the ground truth annotations.
- Data Types: The instructional data encompasses detailed descriptions, summarizations, question-answer pairs, creative/generative tasks, and conversational tasks, covering concepts such as appearance, temporal relations, reasoning, and more.
- Human Annotation Expansion: The original ground truth annotations are expanded and enriched by human annotators, who provide additional context and detail to enhance the instructional data.
- Incorporation of context from off-the-shelf dense captioning models: State-of-the-art dense captioning and prediction models are used to generate predictions that provide supplementary contextual information. These predictions are then combined, with some models used specifically to filter noisy context out of the data.
- GPT-Assisted Postprocessing: The enriched data is postprocessed with GPT models to refine and optimize the annotations, ensuring high-quality data for effective model training and improved performance (a minimal sketch follows this list).
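The GPT-assisted postprocessing step could look roughly like the sketch below: a ground-truth caption is combined with auxiliary off-the-shelf captions, and a GPT model rewrites them into an enriched description. The function name, prompt, and API call are illustrative assumptions, not the released annotation pipeline.

```python
# Illustrative sketch: combining a ground-truth caption with auxiliary
# (possibly noisy) off-the-shelf captions and asking a GPT model to
# produce an enriched description. Prompt and API usage are assumptions.
import openai  # legacy (<1.0) API assumed

def enrich_annotation(gt_caption: str, aux_captions: list) -> str:
    context = "\n".join(f"- {c}" for c in aux_captions)
    prompt = (
        "Ground-truth video caption:\n"
        f"{gt_caption}\n\n"
        "Additional frame-level captions (may contain noise):\n"
        f"{context}\n\n"
        "Write a detailed, self-contained description of the video, "
        "keeping only details consistent with the ground-truth caption."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"]
```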
- LLaMA: A great attempt towards open and efficient LLMs!
- Vicuna: Provides amazing language capabilities!
- LLaVA: Our architecture is inspired by LLaVA.
- Thanks to our colleagues at MBZUAI for their essential contributions to the video annotation task, including Dr. Salman Khan, Dr. Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Dr. Jiale Cao, Dr. Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Dr. Muzammal Naseer, Dr. Akshay Dudhane, Dr. Jean Lahoud, and Awais Rauf, without whom this project would not have been possible.
Please note that this is ongoing work: we are improving our architecture design and fine-tuning on the video instruction data. We will release our code and pretrained models very soon. Stay tuned!
If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:
@misc{maaz2023videochatgpt,
    title={Video-ChatGPT},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad},
    year={2023},
    howpublished={\url{https://github.com/hanoonaR/Video-ChatGPT}},
    note={GitHub repository}
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.