Yueen Ma¹, Zixing Song¹, Yuzheng Zhuang², Jianye Hao², Irwin King¹

¹ The Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China (Email: {yema21, zxsong, king}@cse.cuhk.edu.hk)

² Huawei Noah's Ark Lab, Shenzhen, China (Email: {zhuangyuzheng, haojianye}@huawei.com)
This is the official repository for the survey, containing a curated list of papers on Vision-Language-Action Models for Embodied AI.
Feel free to send us pull requests or emails to add papers!
If you find this repository useful, please consider citing it, starring it, and sharing it with others!
- Components of VLA
- Low-level Control Policies
- Task Planners
- Related Surveys
- Citation
- Generalized VLA: Input: state, instruction. Output: action.
- Large VLA: A special type of generalized VLA that is adapted from large VLMs. (Same as the VLA defined by RT-2.)
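For illustration only, below is a minimal sketch of the generalized VLA interface defined above: a policy that maps a state (e.g., an image observation plus proprioception) and a language instruction to a low-level action. All class, method, and parameter names are hypothetical placeholders, not taken from any paper in this list.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class State:
    image: np.ndarray           # RGB observation, e.g. shape (H, W, 3)
    proprioception: np.ndarray  # e.g. joint positions and gripper state


class GeneralizedVLA:
    """Hypothetical interface: (state, instruction) -> action."""

    def predict_action(self, state: State, instruction: str) -> np.ndarray:
        # A real policy (e.g. an RT-2-style large VLA) would encode the image
        # and instruction with a pretrained VLM and decode an action token
        # sequence or a continuous control vector; a zero action stands in here.
        return np.zeros(7)  # e.g. 6-DoF end-effector delta + gripper command


# Usage sketch (hypothetical):
# policy = GeneralizedVLA()
# action = policy.predict_action(State(image, proprio), "pick up the red block")
```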
- DT: "Decision Transformer: Reinforcement Learning via Sequence Modeling", NeurIPS, 2021 [Paper][Code]
- Trajectory Transformer: "Offline Reinforcement Learning as One Big Sequence Modeling Problem", NeurIPS, 2021 [Paper][Code]
- SEED: "Primitive Skill-based Robot Learning from Human Evaluative Feedback", IROS, 2023 [Paper][Code]
- Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS, 2023 [Paper][Code]
- "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 [Paper][Website][Code]
- Voltron: "Language-Driven Representation Learning for Robotics", RSS, 2023 [Paper]
- VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", NeurIPS, 2023 [Paper][Website][Code]
- "The (Un)surprising Effectiveness of Pre-Trained Vision Models for Control", ICML, 2022 [Paper]
- R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL, 2022 [Paper][Website][Code]
- VIP: "VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training", ICLR, 2023 [Paper][Website][Code]
- DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", Trans. Mach. Learn. Res., 2023 [Paper][Code]
- I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 [Paper]
- Theia: "Theia: Distilling Diverse Vision Foundation Models for Robot Learning", CoRL, 2024 [Paper]
- HRP: "HRP: Human Affordances for Robotic Pre-Training", RSS, 2024 [Paper][Website][Code]
- F3RM: "Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation", CoRL, 2023 [Paper][Website][Code]
- PhysGaussian: "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics", CVPR, 2024 [Paper][Website][Code]
- UniGS: "UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting", ICLR, 2025 [Paper][Code]
- That Sounds Right: "That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation", CoRL, 2023 [Paper][Code]
- MaskDP: "Masked Autoencoding for Scalable and Generalizable Decision Making", NeurIPS, 2022 [Paper][Code]
- PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", IROS, 2023 [Paper]
- GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR, 2024 [Paper]
- SMART: "SMART: Self-supervised Multi-task pretrAining with contRol Transformers", ICLR, 2023 [Paper]
- MIDAS: "Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning", ICML, 2024 [Paper][Website]
- Vi-PRoM: "Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods", IROS, 2023 [Paper][Website]
- VPT: "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", NeurIPS, 2022 [Paper]
- "A Path Towards Autonomous Machine Intelligence", OpenReview, 2022 [Paper]
- DreamerV1: "Dream to Control: Learning Behaviors by Latent Imagination", ICLR, 2020 [Paper]
- DreamerV2: "Mastering Atari with Discrete World Models", ICLR, 2021 [Paper]
- DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023 [Paper]
- DayDreamer: "DayDreamer: World Models for Physical Robot Learning", CoRL, 2022 [Paper]
- TWM: "Transformer-based World Models Are Happy With 100k Interactions", ICLR, 2023 [Paper]
- DECKARD: "Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling", ICML, 2023 [Paper][Website][Code]
- LLM-MCTS: "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", NeurIPS, 2023 [Paper]
- RAP: "Reasoning with Language Model is Planning with World Model", EMNLP, 2023 [Paper]
- LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", arXiv, Apr 2023 [Paper][Code]
- LLM-DM: "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS, 2023 [Paper][Website][Code]
- E2WM: "Language Models Meet World Models: Embodied Experiences Enhance Language Models", NeurIPS, 2023 [Paper][Code]
- ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023 [Paper]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper]
- RAT: "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation", arXiv, Mar 2024 [Paper]
- ECoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024 [Paper]
- Tree-Planner: "Tree-Planner: Efficient Close-loop Task Planning with Large Language Models", ICLR, 2024 [Paper]
- OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv, Jun 2024 [Paper]
- Transporter Networks: "Transporter Networks: Rearranging the Visual World for Robotic Manipulation", CoRL, 2020 [Paper]
- CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL, 2021 [Paper][Website][Code]
- BC-Z: "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", CoRL, 2021 [Paper][Website][Code]
- HULC: "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data", arXiv, Apr 2022 [Paper][Website][Code]
- HULC++: "Grounding Language with Visual Affordances over Unstructured Data", ICRA, 2023 [Paper][Website]
- MCIL: "Language Conditioned Imitation Learning over Unstructured Data", Robotics: Science and Systems, 2021 [Paper][Website]
- UniPi: "Learning Universal Policies via Text-Guided Video Generation", NeurIPS, 2023 [Paper][Website]
- RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", ICLR, 2024 [Paper][Website][Code]
- ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", Robotics: Science and Systems, 2023 [Paper]
- RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Jun 2023 [Paper]
- Gato: "A Generalist Agent", Trans. Mach. Learn. Res., 2022 [Paper]
- RT-Trajectory: "RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches", ICLR, 2024 [Paper]
- Q-Transformer: "Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions", arXiv, Sep 2023 [Paper]
- Interactive Language: "Interactive Language: Talking to Robots in Real Time", arXiv, Oct 2022 [Paper]
- MT-ACT: "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking", ICRA, 2024 [Paper][Code]
- Hiveformer: "Instruction-driven history-aware policies for robotic manipulations", CoRL, 2022 [Paper][Website][Code]
- VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, Oct 2022 [Paper]
- MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL, 2023 [Paper]
- VER: "Volumetric Environment Representation for Vision-Language Navigation", CVPR, 2024 [Paper][Code]
- RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 [Paper]
- RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", arXiv, Jun 2024 [Paper]
- RoboUniView: "RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation" [Code]
- PerAct: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", CoRL, 2022 [Paper]
- Act3D: "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation", CoRL, 2023 [Paper][Website][Code]
- MDT: "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- RDT-1B: "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024 [Paper][Website][Code]
- Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Robotics: Science and Systems, 2023 [Paper][Website][Code]
- Octo: "Octo: An Open-Source Generalist Robot Policy", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- SUDD: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL, 2023 [Paper][Code]
- 3D Diffuser Actor: "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations", arXiv, Feb 2024 [Paper][Code]
- DP3: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL, 2023 [Paper][Website][Code]
- Language costs: "Correcting Robot Plans with Natural Language Feedback", Robotics: Science and Systems, 2022 [Paper][Website]
- RoboTAP: "RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation", ICRA, 2024 [Paper][Website]
- ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation", arXiv, Sep 2024 [Paper][Website][Code]
- RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics", arXiv, Jun 2024 [Paper][Website][Code]
- PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", ICML, 2024 [Paper][Website]
- RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL, 2023 [Paper][Website]
- RT-H: "RT-H: Action Hierarchies Using Language", Robotics: Science and Systems, 2024 [Paper][Website]
- RT-X, OXE: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv, Oct 2023 [Paper][Website][Code]
- π0: "π0: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024 [Paper][Website]
- (SL)^3: "Skill Induction and Planning with Latent Language", ACL, 2022 [Paper]
- Translated <LM>: "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", ICML, 2022 [Paper][Code]
- SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL, 2022 [Paper][Website][Code]
- EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 [Paper][Code]
- MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", CVPR, 2024 [Paper]
- ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", ECCV, 2024 [Paper][Website][Code]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper][Website][Code]
- Socratic Models: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 [Paper]
- LID: "Pre-Trained Language Models for Interactive Decision-Making", NeurIPS, 2022 [Paper][Website][Code]
- Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv, Jul 2022 [Paper][Website]
- LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 [Paper][Website]
- ChatGPT for Robotics: "ChatGPT for Robotics: Design Principles and Model Abilities", IEEE Access, 2023 [Paper][Website][Code]
- DEPS: "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents", arXiv, Feb 2023 [Paper][Code]
- ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", ICRA, 2024 [Paper][Website][Code]
- CaP: "Code as Policies: Language Model Programs for Embodied Control", ICRA, 2023 [Paper][Website][Code]
- ProgPrompt: "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA, 2023 [Paper][Website][Code]
- COME-robot: "Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V", arXiv, Apr 2024 [Paper][Website]
- "Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv, Dec 2023 [Paper]
- "Real-World Robot Applications of Foundation Models: A Review", arXiv, Feb 2024 [Paper]
- "Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", arXiv, Jan 2024 [Paper]
- "Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", arXiv, Dec 2023 [Paper]
- "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, May 2024 [Paper]
Thank you for your interest! If you find our work helpful, please consider citing it as follows:
@article{DBLP:journals/corr/abs-2405-14093,
author = {Yueen Ma and
Zixing Song and
Yuzheng Zhuang and
Jianye Hao and
Irwin King},
title = {A Survey on Vision-Language-Action Models for Embodied {AI}},
journal = {CoRR},
volume = {abs/2405.14093},
year = {2024}
}