# Awesome-MLLM-TextVQA

✨✨ This repository collects MLLM works on real-world scene-text VQA tasks.

## Introduction

- Text Visual Question Answering (TextVQA) requires a model to answer questions about the textual information present in a visual scene (an image or a video), combining scene-text recognition with reasoning.
- With the rise of Multimodal Large Language Models (MLLMs), this task has become central to scene-text-aware multimodal understanding: it pushes MLLMs to integrate visual and textual information and improves real-world QA assistance.
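Most of the image benchmarks listed below score open-ended answers with the soft VQA accuracy metric, which gives full credit when at least three human annotators agree with the prediction. A minimal sketch (simplified: the official evaluator additionally averages over annotator subsets and applies answer normalization):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy used by TextVQA-style benchmarks:
    full credit if >= 3 of the (typically 10) human annotators
    gave the predicted answer, partial credit otherwise."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(1.0, matches / 3.0)

# Hypothetical annotation set for illustration.
answers = ["stop", "stop", "slow", "yield", "go", "go",
           "go", "stop sign", "red", "sign"]
print(vqa_accuracy("stop", answers))  # 0.666... (2 of 10 annotators agree)
print(vqa_accuracy("go", answers))    # 1.0 (3 annotators agree)
```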

## Awesome Datasets

### Image Datasets

| Name | Paper | Venue | Download | Split | Leaderboard |
|:---|:---|:---|:---|:---|:---|
| TextVQA | Towards VQA Models That Can Read | CVPR | GitHub | train / val / test | - |
| ST-VQA | Scene Text Visual Question Answering | ICCV | GitHub | train / val / test | - |
| OCRBench | OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | SCIS | GitHub | test | Link |
| OCRBench v2 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | arXiv | GitHub | test | Link |
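ST-VQA (and several later scene-text benchmarks) report ANLS (Average Normalized Levenshtein Similarity) rather than exact-match accuracy, so near-miss OCR readings earn partial credit. A self-contained sketch of the per-question score with the standard threshold τ = 0.5:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity over the ground-truth answers,
    zeroed out when the normalized distance exceeds the threshold tau."""
    pred = prediction.strip().lower()
    best = 0.0
    for gt in ground_truths:
        gt = gt.strip().lower()
        nl = levenshtein(pred, gt) / max(len(pred), len(gt), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

The benchmark-level score is then the mean of `anls` over all questions; answers that differ by a single OCR slip (e.g. "12" vs. "123") still score 2/3.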

### Video Datasets

| Name | Paper | Venue | Download | Split | Leaderboard |
|:---|:---|:---|:---|:---|:---|
| M4-ViteVQA | Towards Video Text Visual Question Answering: Benchmark and Baseline | NeurIPS'2022 | GitHub | train / val / test | - |
| RoadTextVQA | Reading Between the Lanes: Text VideoQA on the Road | ICDAR'2023 | GitHub | train / val / test | - |
| NewsVideoQA | Watching the News: Towards VideoQA Models that can Read | WACV'2023 | GitHub | test | Link |
| ViTXT-GQA | Scene-Text Grounding for Text-Based Video Question Answering | arXiv'2024 | GitHub | test | - |
| EgoTextVQA | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | CVPR'2025 | GitHub | test | Link |

## Awesome Papers

### Image-based Multimodal Large Language Models

| Title | Venue | Date | Code | Demo |
|:---|:---|:---|:---|:---|
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-10-13 | GitHub | Local Demo |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | AAAI | 2023-12-18 | GitHub | Local Demo |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR (Oral) | 2024-01-15 | GitHub | Demo |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | arXiv | 2024-01-30 | GitHub | - |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 | GitHub | Local Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2024-02-27 | GitHub | Local Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-05-15 | GitHub | Local Demo |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | GitHub | Demo |
| CogVLM: Visual Expert for Pretrained Language Models | arXiv | 2024-02-04 | GitHub | Local Demo |
| CogAgent: A Visual Language Model for GUI Agents | CVPR (Highlight) | 2024-04-05 | GitHub | Local Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-05-23 | GitHub | Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-14 | GitHub | - |
| Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration | arXiv | 2025-01-09 | GitHub | - |
| LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer | arXiv | 2024-12-18 | GitHub | - |
| Efficient Architectures for High Resolution Vision-Language Models | arXiv | 2025-01-05 | GitHub | - |

### Video-based Multimodal Large Language Models

| Title | Venue | Date | Code | Demo |
|:---|:---|:---|:---|:---|
| MiniCPM-V 2.6: A GPT-4V Level MLLM for single image, multi-image and video understanding | arXiv | 2024-08-06 | GitHub | Demo |
| ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | NeurIPS D&B track | 2024-10-01 | GitHub | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-10-03 | GitHub | Demo |
| Video Instruction Tuning with Synthetic Data (LLaVA-Video) | arXiv | 2024-10-04 | GitHub | Demo |
| VILA: Optimized Vision Language Models | arXiv | 2024-12-05 | GitHub | Demo |
| CogVLM2: Visual Language Models for Image and Video Understanding | arXiv | 2024-08-29 | GitHub | Demo |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | arXiv | 2024-12-13 | GitHub | Demo |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | arXiv | 2024-09-19 | GitHub | Demo |
| Scaling Vision Pre-Training to 4K Resolution | arXiv | 2025-03-25 | GitHub | Demo |

### Others
