- Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [paper]
- MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [paper]
- Chinese-Llama-2-7b: https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- Chinese-LLaMA-Alpaca-2: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
- Llama2-Chinese: https://github.com/LlamaFamily/Llama2-Chinese
- Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [paper] [code] (💥Visual GPT Time?)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [paper] [code]
- SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [paper] [code]
- SSA: Semantic Segment Anything [github 2023] [paper] [code]
- SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [paper] [code]
- RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [paper] [code]
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [paper] [code]
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [paper] [code]
- APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [paper] [code]
- GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [paper] [code]
- OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [paper] [code](https://github.com/lxtGH/OMG-Seg)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [paper] [code](https://github.com/LiheYoung/Depth-Anything)
- ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [paper] [code](https://github.com/Lszcoding/ClipSAM)
- PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [paper] [code](https://github.com/xzz2/pa-sam)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [arXiv 2401] [paper] [code](https://github.com/AILab-CVC/YOLO-World)
Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MME<sup>C</sup> | MMB | MMB<sup>CN</sup> | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MM-Vet | QBench |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9<sup>2</sup> | 60.3 | 60.6<sup>2</sup> | 47.7<sup>2</sup> | 32.9 | 58.2<sup>2</sup> | | | | | | | | | | | | |
MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8<sup>1</sup> | 60.1 | 62.9<sup>1</sup> | 51.5<sup>1</sup> | 53.6 | 58.8<sup>1</sup> | | | | | | | | | | | | |
Qwen-VL-Chat | | | Qwen-7B | | 57.5∗ | | | 38.9 | | 78.2∗ | 68.2 | 61.5 | | 1487.5 | 360.7<sup>2</sup> | 60.6 | 56.7 | 58.2 | | | |
LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0∗ | | | 50.0 | | 78.5∗ | 66.8 | 58.2 | 85.9<sup>1</sup> | 1510.7 | 316.1+ | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
LLaVA-1.5 +ShareGPT4V | | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6<sup>2</sup> | 68.4 | | | 1567.4<sup>2</sup> | 376.4<sup>1</sup> | 68.8 | 62.2 | 69.7<sup>1</sup> | 72.6 | 37.6 | 63.4<sup>1</sup>∗ |
LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3<sup>1</sup> | | | 53.6 | | 80.0∗ | 71.6 | 61.3 | 85.9<sup>1</sup> | 1531.3 | 295.4+ | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1<sup>2</sup>∗ |
VILA-7B | | | LLaMA-2-7B | | 62.3∗ | | | 57.8 | | 79.9∗ | 68.2 | 64.4 | 85.5<sup>2</sup>∗ | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
VILA-13B | | | LLaMA-2-13B | | 63.3<sup>1</sup>∗ | | | 60.6<sup>2</sup> | | 80.8<sup>1</sup>∗ | 73.7<sup>1</sup>∗ | 66.6<sup>1</sup>∗ | 84.2 | 1570.1<sup>1</sup>∗ | | 70.3<sup>2</sup>∗ | 64.3<sup>2</sup>∗ | 62.8<sup>2</sup>∗ | 73.0<sup>2</sup>∗ | 38.8<sup>2</sup>∗ | |
VILA-13B +ShareGPT4V | | | LLaMA-2-13B | | 63.2<sup>2</sup>∗ | | | 62.4<sup>1</sup> | | 80.6<sup>2</sup>∗ | 73.1<sup>2</sup>∗ | 65.3<sup>2</sup>∗ | 84.8 | 1556.5 | | 70.8<sup>1</sup>∗ | 65.4<sup>1</sup>∗ | 61.4 | 78.4<sup>1</sup>∗ | 45.7<sup>1</sup>∗ | |
SPHINX | | | | | | | | | | | | | | | | | | | | | |
SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
InternVL | | | | | | | | | | | | | | | | | | | | | |
LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |
+ indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.
∗ indicates that the training images of the datasets are observed during training.
Superscripts <sup>1</sup> and <sup>2</sup> mark the best and second-best result in each column.
- LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [paper] [code]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [paper] [code]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [paper] [code]
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [paper] [code]
- MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [paper] [code]
- VisualGLM-6B: Chinese and English Multimodal Conversational Language Model [ACL 2022] [paper] [code]
- Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [paper] [code]
- NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [paper] [code]
- LLaVA / LLaVA-1.5: Large Language and Vision Assistant [NeurIPS 2023] [paper] / [arXiv 2310] [paper] [code]
- 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [paper] [code]
- 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [paper] [code]
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [paper] [code]
- 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [paper] [code]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [paper] [code]
- LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [paper] [code]
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [paper] [code]
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [paper] [code]
- MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [arXiv 2310] [paper] [code]
- CogVLM: Visual Expert for Large Language Models [github 2310] [paper] [code]
- 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [paper] [code]
- SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [paper] [code]
- Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [paper] [code]
- 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [paper] [code]
- NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [paper] [project]
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [paper] [code]
- InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [paper] [code]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [paper] [code]
- 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [paper] [code]
- CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [paper] [code]
- 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [paper] [code]
- 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [paper] [code]
- VILA: On Pre-training for Visual Language Models [arXiv 2312] [paper] [code]
- CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [paper] [code] (support 1120×1120 resolution)
- PixelLLM: Pixel Aligned Language Models [arXiv 2312] [paper] [code]
- 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [paper] [code]
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [paper] [code]
- VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [paper] [code]
- Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [paper] [code]
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [paper] [code]
- BakLLaVA-1: A Mistral-7B base augmented with the LLaVA-1.5 architecture [github 2310] [paper] [code]
- LEGO: Language Enhanced Multi-modal Grounding Model [arXiv 2401] [paper] [code]
- MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [paper] [code]
- ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [paper] [code]
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [paper] [code]
- LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [paper] [code]
- 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [paper] [code]
- MouSi: Poly-Visual-Expert Vision-Language Models [arXiv 2401] [paper] [code]
- Yi Vision Language Model [HF 2401]
- Generating Images with Multimodal Language Models [NeurIPS 2023] [paper] [code]
- DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [paper] [code]
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [paper] [code]
- KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [paper] [code]
- LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [paper] [code]
- UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [paper] [code]
- Scene as Occupancy [arXiv 2306] [paper] [code]
- FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [paper] [code]
- BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [paper] [code]
- UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [paper] [code]
- Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [paper] [code]
- LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [blog]
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [paper] [code]
- VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [paper] [code]
- PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [paper] [code]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [paper] [code]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [paper] [project]
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [paper] [code]
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [paper] [code]
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [paper] [code]
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [paper] [code]
- Vlogger: Make Your Dream A Vlog [arXiv 2401] [paper] [code]
- BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [paper] [code]
- CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [paper] [code]
- MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [paper] [code] [blog]
- GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [paper] [code]
- ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [paper] [code]
- LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [paper] [code]
- Sora: Video generation models as world simulators [OpenAI 2402] [technical report] (💥Visual GPT Time?)
- [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [paper] [code]
- DriveLM: Drive on Language [paper] [project]
- MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [paper] [code]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project] [blog]
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [paper] [code] [dataset]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper] [code] [dataset]
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [paper] [code]
- VMamba: Visual State Space Model [arXiv 2401] [paper] [code]
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [paper] [code]
- SenseNova: SenseTime's SenseNova (日日新) Open Platform [url]