A minimalistic demo for image inference and video understanding using OpenCV, built on top of several popular open-source Vision-Language Models (VLMs). This repository provides Colab notebooks demonstrating how to apply these VLMs to video and image tasks using Python and Gradio.
This project showcases lightweight inference pipelines for the following:
- Video frame extraction and preprocessing (see the sketch after this list)
- Image-level inference with VLMs
- Real-time or pre-recorded video understanding
- OCR-based text extraction from video frames
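For the frame-extraction step, a few lines of OpenCV are enough. The snippet below is a minimal sketch rather than the exact code in the notebooks; the video path and sampling interval are placeholders:

```python
import cv2

def extract_frames(video_path, every_n_frames=30):
    """Yield every n-th frame of a video as an RGB numpy array."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            # OpenCV decodes frames as BGR; most VLM processors expect RGB.
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        index += 1
    cap.release()

# Example: sample roughly one frame per second from a 30 fps video.
frames = list(extract_frames("input.mp4", every_n_frames=30))
```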
The repository supports a variety of open-source models and configurations, including:
- Aya-Vision-8B
- Florence-2-Base
- Gemma3-VL
- MiMo-VL-7B-RL
- MiMo-VL-7B-SFT
- Qwen2-VL
- Qwen2.5-VL
- Qwen-2VL-MessyOCR
- RolmOCR-Qwen2.5-VL
- olmOCR-Qwen2-VL
- typhoon-ocr-7b-Qwen2.5VL
Each model has a dedicated Colab notebook showing how to run it on video inputs.
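The notebooks differ mainly in which checkpoint they load; image-level inference with the Qwen2-VL family, for example, follows the standard Hugging Face Transformers pattern. The snippet below is an illustrative sketch, assuming a recent transformers release and the public Qwen/Qwen2-VL-2B-Instruct checkpoint, not necessarily the exact code in any notebook:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder; swap in the checkpoint used by the notebook you open
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("frame.jpg")  # e.g. a frame extracted with OpenCV
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what is happening in this frame."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the generated answer is decoded.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```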
- Python
- OpenCV – for video and image processing
- Gradio – for interactive UI (see the sketch after this list)
- Jupyter Notebooks – for easy experimentation
- Hugging Face Transformers – for loading VLMs
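Gradio wires the inference code into a simple browser UI. The sketch below is a minimal example of that kind of interface; `describe_image` is a hypothetical stand-in for a real model call (for instance, the Qwen2-VL snippet above):

```python
import gradio as gr
from PIL import Image

def describe_image(image: Image.Image, question: str) -> str:
    # Hypothetical stand-in: call your VLM inference code here and return its text output.
    return f"Model answer for: {question}"

demo = gr.Interface(
    fn=describe_image,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="VLM image inference demo",
)

demo.launch()
```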
├── Aya-Vision-8B/
├── Florence-2-Base/
├── Gemma3-VL/
├── MiMo-VL-7B-RL/
├── MiMo-VL-7B-SFT/
├── Qwen2-VL/
├── Qwen2.5-VL/
├── Qwen-2VL-MessyOCR/
├── RolmOCR-Qwen2.5-VL/
├── olmOCR-Qwen2-VL/
├── typhoon-ocr-7b-Qwen2.5VL/
├── LICENSE
└── README.md
- Clone the repository:
```bash
git clone https://github.com/PRITHIVSAKTHIUR/VLM-Video-Understanding.git
cd VLM-Video-Understanding
```
- Open any of the Colab notebooks and follow the instructions to run image or video inference.
- Optionally, install dependencies locally:
```bash
pip install opencv-python gradio transformers
```

The models and examples are supported by a dataset on Hugging Face.
This project is licensed under the Apache-2.0 License.