Yueen Ma¹, Zixing Song¹, Yuzheng Zhuang², Jianye Hao², Irwin King¹

¹ The Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China (Email: {yema21, zxsong, king}@cse.cuhk.edu.hk)

² Huawei Noah's Ark Lab, Shenzhen, China (Email: {zhuangyuzheng, haojianye}@huawei.com)
This is the official repository for the survey, containing a curated list of papers on Vision-Language-Action Models for Embodied AI.
Feel free to send us pull requests or emails to add papers!
If you find this repository useful, please consider citing it, starring it, and sharing it with others!
- Components of VLA
- Low-level Control Policies
- Task Planners
- Related Surveys
- Citation
- Generalized VLA: Input: state, instruction. Output: action.
- Large VLA: A special type of generalized VLA that is adapted from large VLMs. (Same as the VLA defined by RT-2.)
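For illustration only, below is a minimal sketch of the generalized VLA interface defined above: a policy that maps a state (e.g., an image observation plus proprioception) and a language instruction to a low-level action. All class, method, and parameter names are hypothetical placeholders, not taken from any paper in this list.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class State:
    image: np.ndarray           # RGB observation, e.g. shape (H, W, 3)
    proprioception: np.ndarray  # e.g. joint positions and gripper state


class GeneralizedVLA:
    """Hypothetical interface: (state, instruction) -> action."""

    def predict_action(self, state: State, instruction: str) -> np.ndarray:
        # A real policy (e.g. an RT-2-style large VLA) would encode the image
        # and instruction with a pretrained VLM and decode an action token
        # sequence or a continuous control vector; a zero action stands in here.
        return np.zeros(7)  # e.g. 6-DoF end-effector delta + gripper command


# Usage sketch (hypothetical):
# policy = GeneralizedVLA()
# action = policy.predict_action(State(image, proprio), "pick up the red block")
```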
- DT: "Decision Transformer: Reinforcement Learning via Sequence Modeling", NeurIPS, 2021 [Paper][Code]
- Trajectory Transformer: "Offline Reinforcement Learning as One Big Sequence Modeling Problem", NeurIPS, 2021 [Paper][Code]
- SEED: "Primitive Skill-based Robot Learning from Human Evaluative Feedback", IROS, 2023 [Paper][Code]
- Reflexion: "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS, 2023 [Paper][Code]
- "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 [Paper][Website][Code]
- Voltron: "Language-Driven Representation Learning for Robotics", RSS, 2023 [Paper]
- VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", NeurIPS, 2023 [Paper][Website][Code]
- "The (Un)surprising Effectiveness of Pre-Trained Vision Models for Control", ICML, 2022 [Paper]
- R3M: "R3M: A Universal Visual Representation for Robot Manipulation", CoRL, 2022 [Paper][Website][Code]
- VIP: "VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training", ICLR, 2023 [Paper][Website][Code]
- DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", Trans. Mach. Learn. Res., 2023 [Paper][Code]
- I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", CVPR, 2023 [Paper]
- Theia: "Theia: Distilling Diverse Vision Foundation Models for Robot Learning", CoRL, 2024 [Paper]
- HRP: "HRP: Human Affordances for Robotic Pre-Training", RSS, 2024 [Paper][Website][Code]
- F3RM: "Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation", CoRL, 2023 [Paper][Website][Code]
- PhysGaussian: "PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics", CVPR, 2024 [Paper][Website][Code]
- UniGS: "UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting", ICLR, 2025 [Paper][Code]
- That Sounds Right: "That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation", CoRL, 2023 [Paper][Code]
- MaskDP: "Masked Autoencoding for Scalable and Generalizable Decision Making", NeurIPS, 2022 [Paper][Code]
- PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", IROS, 2023 [Paper]
- GR-1: "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", ICLR, 2024 [Paper]
- SMART: "SMART: Self-supervised Multi-task pretrAining with contRol Transformers", ICLR, 2023 [Paper]
- MIDAS: "Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning", ICML, 2024 [Paper][Website]
- Vi-PRoM: "Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods", IROS, 2023 [Paper][Website]
- VPT: "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", NeurIPS, 2022 [Paper]
- "A Path Towards Autonomous Machine Intelligence", OpenReview, 2022 [Paper]
- DreamerV1: "Dream to Control: Learning Behaviors by Latent Imagination", ICLR, 2020 [Paper]
- DreamerV2: "Mastering Atari with Discrete World Models", ICLR, 2021 [Paper]
- DreamerV3: "Mastering Diverse Domains through World Models", arXiv, Jan 2023 [Paper]
- DayDreamer: "DayDreamer: World Models for Physical Robot Learning", CoRL, 2022 [Paper]
- TWM: "Transformer-based World Models Are Happy With 100k Interactions", ICLR, 2023 [Paper]
- DECKARD: "Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling", ICML, 2023 [Paper][Website][Code]
- LLM-MCTS: "Large Language Models as Commonsense Knowledge for Large-Scale Task Planning", NeurIPS, 2023 [Paper]
- RAP: "Reasoning with Language Model is Planning with World Model", EMNLP, 2023 [Paper]
- LLM+P: "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency", arXiv, Apr 2023 [Paper][Code]
- LLM-DM: "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning", NeurIPS, 2023 [Paper][Website][Code]
- E2WM: "Language Models Meet World Models: Embodied Experiences Enhance Language Models", NeurIPS, 2023 [Paper][Code]
- ThinkBot: "ThinkBot: Embodied Instruction Following with Thought Chain Reasoning", arXiv, Dec 2023 [Paper]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper]
- RAT: "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation", arXiv, Mar 2024 [Paper]
- ECoT: "Robotic Control via Embodied Chain-of-Thought Reasoning", arXiv, Jul 2024 [Paper]
- Tree-Planner: "Tree-Planner: Efficient Close-loop Task Planning with Large Language Models", ICLR, 2024 [Paper]
- OpenVLA: "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv, Jun 2024 [Paper]
- Transporter Networks: "Transporter Networks: Rearranging the Visual World for Robotic Manipulation", CoRL, 2020 [Paper]
- CLIPort: "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL, 2021 [Paper][Website][Code]
- BC-Z: "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", CoRL, 2021 [Paper][Website][Code]
- HULC: "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data", arXiv, Apr 2022 [Paper][Website][Code]
- HULC++: "Grounding Language with Visual Affordances over Unstructured Data", ICRA, 2023 [Paper][Website]
- MCIL: "Language Conditioned Imitation Learning over Unstructured Data", Robotics: Science and Systems, 2021 [Paper][Website]
- UniPi: "Learning Universal Policies via Text-Guided Video Generation", NeurIPS, 2023 [Paper][Website]
- RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", ICLR, 2024 [Paper][Website][Code]
- ACT: "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", Robotics: Science and Systems, 2023 [Paper]
- RoboCat: "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation", arXiv, Jun 2023 [Paper]
- Gato: "A Generalist Agent", Trans. Mach. Learn. Res., 2022 [Paper]
- RT-Trajectory: "RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches", ICLR, 2024 [Paper]
- Q-Transformer: "Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions", arXiv, Sep 2023 [Paper]
- Interactive Language: "Interactive Language: Talking to Robots in Real Time", arXiv, Oct 2022 [Paper]
- MT-ACT: "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking", ICRA, 2024 [Paper][Code]
- Hiveformer: "Instruction-driven history-aware policies for robotic manipulations", CoRL, 2022 [Paper][Website][Code]
- VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, Oct 2022 [Paper]
- MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", CoRL, 2023 [Paper]
- VER: "Volumetric Environment Representation for Vision-Language Navigation", CVPR, 2024 [Paper][Code]
- RVT: "RVT: Robotic View Transformer for 3D Object Manipulation", CoRL, 2023 [Paper]
- RVT-2: "RVT-2: Learning Precise Manipulation from Few Demonstrations", arXiv, Jun 2024 [Paper]
- RoboUniView: "RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation" [Code]
- PerAct: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", CoRL, 2022 [Paper]
- Act3D: "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation", CoRL, 2023 [Paper][Website][Code]
- MDT: "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- RDT-1B: "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", arXiv, Oct 2024 [Paper][Website][Code]
- Diffusion Policy: "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Robotics: Science and Systems, 2023 [Paper][Website][Code]
- Octo: "Octo: An Open-Source Generalist Robot Policy", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- SUDD: "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition", CoRL, 2023 [Paper][Code]
- 3D Diffuser Actor: "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations", arXiv, Feb 2024 [Paper][Code]
- DP3: "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations", Robotics: Science and Systems, 2024 [Paper][Website][Code]
- VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL, 2023 [Paper][Website][Code]
- Language costs: "Correcting Robot Plans with Natural Language Feedback", Robotics: Science and Systems, 2022 [Paper][Website]
- RoboTAP: "RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation", ICRA, 2024 [Paper][Website]
- ReKep: "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation", arXiv, Sep 2024 [Paper][Website][Code]
- RoboPoint: "RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics", arXiv, Jun 2024 [Paper][Website][Code]
- PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", ICML, 2024 [Paper][Website]
- RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL, 2023 [Paper][Website]
- RT-H: "RT-H: Action Hierarchies Using Language", Robotics: Science and Systems, 2024 [Paper][Website]
- RT-X, OXE: "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv, Oct 2023 [Paper][Website][Code]
- π0: "π0: A Vision-Language-Action Flow Model for General Robot Control", arXiv, Oct 2024 [Paper][Website]
- (SL)^3: "Skill Induction and Planning with Latent Language", ACL, 2022 [Paper]
- Translated <LM>: "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", ICML, 2022 [Paper][Code]
- SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", CoRL, 2022 [Paper][Website][Code]
- EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 [Paper][Code]
- MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", CVPR, 2024 [Paper]
- ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", ECCV, 2024 [Paper][Website][Code]
- ReAct: "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR, 2023 [Paper][Website][Code]
- Socratic Models: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 [Paper]
- LID: "Pre-Trained Language Models for Interactive Decision-Making", NeurIPS, 2022 [Paper][Website][Code]
- Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models", arXiv, Jul 2022 [Paper][Website]
- LLM-Planner: "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models", ICCV, 2023 [Paper][Website]
- ChatGPT for Robotics: "ChatGPT for Robotics: Design Principles and Model Abilities", IEEE Access, 2023 [Paper][Website][Code]
- DEPS: "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents", arXiv, Feb 2023 [Paper][Code]
- ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", ICRA, 2024 [Paper][Website][Code]
- CaP: "Code as Policies: Language Model Programs for Embodied Control", ICRA, 2023 [Paper][Website][Code]
- ProgPrompt: "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA, 2023 [Paper][Website][Code]
- COME-robot: "Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V", arXiv, Apr 2024 [Paper][Website]
- "Foundation Models in Robotics: Applications, Challenges, and the Future", arXiv, Dec 2023 [Paper]
- "Real-World Robot Applications of Foundation Models: A Review", arXiv, Feb 2024 [Paper]
- "Large Language Models for Robotics: Opportunities, Challenges, and Perspectives", arXiv, Jan 2024 [Paper]
- "Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis", arXiv, Dec 2023 [Paper]
- "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, May 2024 [Paper]
Thank you for your interest! If you find our work helpful, please consider citing it as follows:
@article{DBLP:journals/corr/abs-2405-14093,
author = {Yueen Ma and
Zixing Song and
Yuzheng Zhuang and
Jianye Hao and
Irwin King},
title = {A Survey on Vision-Language-Action Models for Embodied {AI}},
journal = {CoRR},
volume = {abs/2405.14093},
year = {2024}
}