✨✨ This repository collects MLLM works on real-world Scene-Text VQA tasks.
- Text Visual Question Answering (TextVQA) is a task where models answer questions based on textual information within visual scenes (images or videos), requiring both scene-text recognition and reasoning.
- With the rise of Multimodal Large Language Models (MLLMs), this task has become a key testbed for scene-text-aware multimodal understanding: it pushes MLLMs to integrate visual and textual information and to support real-world QA assistance.
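Scene-Text VQA benchmarks such as ST-VQA score open-ended answers with Average Normalized Levenshtein Similarity (ANLS) rather than exact match, so small OCR-level misspellings are only partially penalized. A minimal per-question sketch (the helper names are ours; the 0.5 threshold follows the ST-VQA evaluation protocol):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: the best normalized-edit-distance similarity
    against any ground-truth answer, zeroed out below the threshold tau."""
    best = 0.0
    for gt in answers:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl)
    return best if best >= tau else 0.0

print(anls("coca cola", ["Coca-Cola", "coca cola"]))  # → 1.0
```

The benchmark score is this value averaged over all questions; TextVQA instead uses the VQA accuracy metric, which counts agreement with human annotators.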
- [Awesome Datasets](#awesome-datasets)
- [Awesome Papers](#awesome-papers)

## Awesome Datasets
### Image Datasets

Name | Paper | Venue | Download Link | Split | Leaderboard |
---|---|---|---|---|---|
TextVQA | Towards VQA Models That Can Read | CVPR'2019 | GitHub | train / val / test | |
ST-VQA | Scene Text Visual Question Answering | ICCV'2019 | GitHub | train / val / test | |
OCRBench | OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | SCIS'2024 | GitHub | test | Link |
OCRBench v2 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | arXiv'2024 | GitHub | test | Link |
### Video Datasets

Name | Paper | Venue | Download Link | Split | Leaderboard |
---|---|---|---|---|---|
M4-ViteVQA | Towards Video Text Visual Question Answering: Benchmark and Baseline | NeurIPS'2022 | GitHub | train / val / test | |
RoadTextVQA | Reading Between the Lanes: Text VideoQA on the Road | ICDAR'2023 | GitHub | train / val / test | |
NewsVideoQA | Watching the News: Towards VideoQA Models that can Read | WACV'2023 | GitHub | test | Link |
ViTXT-GQA | Scene-Text Grounding for Text-Based Video Question Answering | arXiv'2024 | GitHub | test | |
EgoTextVQA | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | CVPR'2025 | GitHub | test | Link |
## Awesome Papers

Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MiniCPM-V 2.6: A GPT-4V Level MLLM for single image, multi-image and video understanding | arXiv | 2024-08-06 | GitHub | Demo |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | NeurIPS D&B track | 2024-10-01 | GitHub | Demo |
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution | arXiv | 2024-10-03 | GitHub | Demo |
Video Instruction Tuning with Synthetic Data (LLaVA-Video) | arXiv | 2024-10-04 | GitHub | Demo |
VILA: Optimized Vision Language Models | arXiv | 2024-12-05 | GitHub | Demo |
CogVLM2: Visual Language Models for Image and Video Understanding | arXiv | 2024-08-29 | GitHub | Demo |
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | arXiv | 2024-12-13 | GitHub | Demo |
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | arXiv | 2024-09-19 | GitHub | Demo |
Scaling Vision Pre-Training to 4K Resolution | arXiv | 2025-03-25 | GitHub | Demo |