Official implementation of "Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation".
Paper | Website | Video | Hugging Face
- Installation
- Simulation environment
- Policy evaluation
- Policy training
- Citation
- License & Acknowledgements
- Clone this repository
git clone git@github.com:yunhaif/reflect-vlm.git
cd reflect-vlm
- Install packages
conda create -n reflectvlm python=3.9 -y
conda activate reflectvlm
pip install -e .
- (Optional) Install additional packages if you want to train VLM policies.
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
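If you installed the training extras, a quick sanity check (not part of the repo) can confirm that PyTorch sees a GPU and that flash-attn imports cleanly:

```python
# Sanity check for the optional training setup: verifies a CUDA device is
# visible and that flash-attn was built correctly. Not part of the repo.
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)
```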
We provide a simple script to play with the simulation environment.
python scripts/interact.py
This will generate a task in MuJoCo with an interactive visualization. You can interact with the environment by typing actions at the prompt. Just launch the script and follow the instructions; it works on Mac too!
The task generated by our procedural task generator is controlled by a seed. You can generate as many tasks as you want by simply changing the environment seed!
python scripts/interact.py --env_seed 1000001
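If you want to script this, here is a minimal sketch (not part of the repo) that picks a seed and hands it to the interactive script. Only the documented `--env_seed` flag is used; the seed range below is purely illustrative.

```python
# Minimal sketch: launch the interactive demo with a randomly chosen task seed.
# Only the documented --env_seed flag is assumed; the seed range is illustrative.
import random
import subprocess

seed = random.randint(1_000_000, 1_999_999)
print(f"Launching task with env_seed={seed}")
subprocess.run(["python", "scripts/interact.py", "--env_seed", str(seed)], check=True)
```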
Models are available on Hugging Face, including:
- ReflectVLM-llava-v1.5-13b-base: a base VLM policy trained on a fixed expert dataset.
- ReflectVLM-llava-v1.5-13b-post-trained: the VLM policy trained using our post-training strategy with the reflection mechanism.
- ReflectVLM-diffusion: the diffusion dynamics model.
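If you want to fetch a checkpoint manually (the evaluation scripts below also download them automatically), a sketch with `huggingface_hub` looks like this; the repo id is an assumption inferred from the model name above.

```python
# Sketch: manually download a checkpoint from Hugging Face.
# The repo id is an assumption inferred from the model name above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="yunhaif/ReflectVLM-llava-v1.5-13b-base")
print("Checkpoint downloaded to:", local_dir)
```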
We provide scripts to run evaluation on the 100 procedurally-generated test tasks. Models will be automatically downloaded from Hugging Face.
To evaluate the base policy:
bash scripts/eval_base_vlm.sh
To evaluate our post-trained policy with reflection:
bash scripts/eval_reflect_vlm.sh {sim|diffusion}
Choose either `sim` or `diffusion` as the dynamics model used in the reflection mechanism.
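To compare both dynamics models back to back, a small wrapper (not part of the repo) can simply invoke the same script twice:

```python
# Sketch: evaluate the reflection policy with both dynamics models in sequence.
import subprocess

for dynamics in ("sim", "diffusion"):
    print(f"=== Evaluating with {dynamics} dynamics ===")
    subprocess.run(["bash", "scripts/eval_reflect_vlm.sh", dynamics], check=True)
```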
You can add your own agent under the `agent` folder. Create a new class and implement the `act()` method, which processes the observations and returns an action.
class MyAgent:
    def __init__(self, ...):
        ...  # initialize model etc.

    def act(self, img, goal_img, inp):
        """
        Args:
            img: the current image
            goal_img: the goal image
            inp: the input prompt
        Returns:
            action: the action as a string
        """
        action = ...  # get action from model
        return action
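For reference, here is a minimal sketch of an agent that conforms to this interface. The action strings and the absence of a real model are placeholders; the actual prompt and action formats are defined by the evaluation code.

```python
# Minimal sketch of a custom agent conforming to the interface above.
# The action strings below are placeholders, not the repo's actual
# action vocabulary.
import random


class RandomAgent:
    """Ignores the observations and returns a placeholder action string."""

    def __init__(self, actions=None):
        # Hypothetical action strings, for illustration only.
        self.actions = actions or ["pick up object A", "insert object A"]

    def act(self, img, goal_img, inp):
        # A real agent would feed img, goal_img, and the prompt inp to a
        # VLM and decode its answer into an action string.
        return random.choice(self.actions)
```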
Coming soon...
The script `scripts/diffusion_demo.py` can be used to test diffusion generation:
python scripts/diffusion_demo.py
We provide some sample images under `assets/images/diffusion_examples`.
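To see which sample images are bundled, a tiny snippet (assuming you run it from the repo root) just lists that folder:

```python
# List the bundled diffusion example images (run from the repo root).
from pathlib import Path

for path in sorted(Path("assets/images/diffusion_examples").iterdir()):
    print(path.name)
```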
If you find our work useful in your research, please consider citing with the following BibTeX:
@misc{feng2025reflective,
title={Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation},
author={Yunhai Feng and Jiaming Han and Zhuoran Yang and Xiangyu Yue and Sergey Levine and Jianlan Luo},
year={2025},
eprint={2502.16707},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2502.16707},
}
This repository is licensed under the MIT license. LLaVA is licensed under the Apache 2.0 license.
Part of the simulation environment is adapted from Metaworld and mjctrl.