✨✨ This repository collects MLLM works on real-world Scene-Text VQA tasks.
- Text Visual Question Answering (TextVQA) is a task where models answer questions based on textual information within visual scenes (images or videos), requiring both scene-text recognition and reasoning.
- With the rise of Multimodal Large Language Models (MLLMs), this task has become a key testbed for scene-text-aware multimodal understanding: it pushes MLLMs to integrate visual and textual information and to support real-world QA assistance.
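Scene-Text VQA benchmarks such as ST-VQA score open-ended answers with Average Normalized Levenshtein Similarity (ANLS) rather than exact match, so small OCR-level misspellings are only partially penalized. A minimal per-question sketch (the helper names are ours; the 0.5 threshold follows the ST-VQA evaluation protocol):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: the best normalized-edit-distance similarity
    against any ground-truth answer, zeroed out below the threshold tau."""
    best = 0.0
    for gt in answers:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl)
    return best if best >= tau else 0.0

print(anls("coca cola", ["Coca-Cola", "coca cola"]))  # → 1.0
```

The benchmark score is this value averaged over all questions; TextVQA instead uses the VQA accuracy metric, which counts agreement with human annotators.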
- [Awesome Datasets](#awesome-datasets)
- [Awesome Papers](#awesome-papers)

## Awesome Datasets
### Image Datasets

Name | Paper | Venue | Download Link | Split | Leaderboard |
---|---|---|---|---|---|
TextVQA | Towards VQA Models That Can Read | CVPR'2019 | GitHub | train / val / test | |
ST-VQA | Scene Text Visual Question Answering | ICCV'2019 | GitHub | train / val / test | |
OCRBench | OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | SCIS'2024 | GitHub | test | Link |
OCRBench v2 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | arXiv'2024 | GitHub | test | Link |
### Video Datasets

Name | Paper | Venue | Download Link | Split | Leaderboard |
---|---|---|---|---|---|
M4-ViteVQA | Towards Video Text Visual Question Answering: Benchmark and Baseline | NeurIPS'2022 | GitHub | train / val / test | |
RoadTextVQA | Reading Between the Lanes: Text VideoQA on the Road | ICDAR'2023 | GitHub | train / val / test | |
NewsVideoQA | Watching the News: Towards VideoQA Models that can Read | WACV'2023 | GitHub | test | Link |
ViTXT-GQA | Scene-Text Grounding for Text-Based Video Question Answering | arXiv'2024 | GitHub | test | |
EgoTextVQA | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | CVPR'2025 | GitHub | test | Link |
## Awesome Papers

Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MiniCPM-V 2.6: A GPT-4V Level MLLM for single image, multi-image and video understanding | arXiv | 2024-08-06 | GitHub | Demo |
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | NeurIPS D&B track | 2024-10-01 | GitHub | Demo |
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution | arXiv | 2024-10-03 | GitHub | Demo |
Video Instruction Tuning with Synthetic Data (LLaVA-Video) | arXiv | 2024-10-04 | GitHub | Demo |
VILA: Optimized Vision Language Models | arXiv | 2024-12-05 | GitHub | Demo |
CogVLM2: Visual Language Models for Image and Video Understanding | arXiv | 2024-08-29 | GitHub | Demo |
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | arXiv | 2024-12-13 | GitHub | Demo |
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | arXiv | 2024-09-19 | GitHub | Demo |
Scaling Vision Pre-Training to 4K Resolution | arXiv | 2025-03-25 | GitHub | Demo |