The official implementation of our NeurIPS 2024 paper:
Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation
Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma and Hongyang Li
📬 If you have any questions, please feel free to contact: Qingwen Bu ( qwbu01@sjtu.edu.cn )
The full code and checkpoint release is coming soon. Please stay tuned.🦾
- 🍀 CLOVER employs a text-conditioned video diffusion model to generate visual plans as reference inputs. These sub-goals then guide the feedback-driven policy, which generates actions using an error measurement strategy.
- Owing to the closed-loop attribute, CLOVER is robust to visual distraction and object variation:
- This closed-loop mechanism allows the desired states to be reached accurately and reliably, which facilitates the execution of long-term tasks:
cook-fish.mp4
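The closed-loop idea described above can be illustrated with a toy sketch. This is not the CLOVER implementation: the states and sub-goals here are plain vectors rather than images or embeddings, and `l2_error`, `proportional_policy`, `closed_loop_rollout`, `tol`, and `max_steps` are all hypothetical names chosen for illustration. It only shows the control flow of acting until a measured error drops below a threshold and then moving on to the next sub-goal.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l2_error(state: Sequence[float], goal: Sequence[float]) -> float:
    """Euclidean distance between the current state and a sub-goal."""
    return sum((s - g) ** 2 for s, g in zip(state, goal)) ** 0.5


def proportional_policy(state: Sequence[float], goal: Sequence[float],
                        gain: float = 0.5) -> Vector:
    """Hypothetical feedback policy: step a fixed fraction of the way to the goal."""
    return [gain * (g - s) for s, g in zip(state, goal)]


def closed_loop_rollout(
    state: Sequence[float],
    sub_goals: Sequence[Sequence[float]],
    policy: Callable[[Sequence[float], Sequence[float]], Vector] = proportional_policy,
    tol: float = 1e-2,
    max_steps: int = 100,
) -> Tuple[Vector, int]:
    """Follow sub-goals in order; return the final state and how many were reached."""
    current = list(state)
    reached = 0
    for goal in sub_goals:
        steps = 0
        # Keep acting on this sub-goal until the measured error is small enough.
        while l2_error(current, goal) >= tol:
            if steps == max_steps:
                # Step budget exhausted: stop and report progress so far.
                return current, reached
            action = policy(current, goal)
            current = [s + a for s, a in zip(current, action)]
            steps += 1
        reached += 1
    return current, reached


if __name__ == "__main__":
    final_state, n_reached = closed_loop_rollout([0.0, 0.0], [[1.0, 0.0], [1.0, 1.0]])
    print(n_reached, final_state)
```

In this sketch the error check is what closes the loop: a sub-goal counts as reached only once the measured error is small, regardless of how many actions that takes.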
- [2024/09/16] We released our paper on arXiv.
- Training script for visual planner
- Checkpoints release (Scheduled Release Date: Mid-October, 2024)
- Evaluation code on CALVIN (Scheduled Release Date: Mid-October, 2024)
- Policy training code on CALVIN (Estimated Release Period: November, 2024)
Our training is conducted with PyTorch 1.13.1, CUDA 11.7, Ubuntu 22.04, and NVIDIA Tesla A100 (80 GB) GPUs. The closed-loop evaluation on CALVIN is run on a system with an NVIDIA RTX 3090.
We also tested PyTorch 2.2.0 + CUDA 11.8, and training works fine with that combination as well.
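As a convenience, a small helper like the one below could check whether a local setup matches one of the two combinations mentioned above. It is only a sketch: `TESTED_COMBOS`, `parse_torch_version`, and `is_tested_combo` are hypothetical names and not part of this repository. Other versions may well work; they are simply not listed here as tested.

```python
import re
from typing import Optional, Tuple

# (PyTorch release, CUDA version) pairs listed above as tested.
TESTED_COMBOS = {((1, 13, 1), "11.7"), ((2, 2, 0), "11.8")}


def parse_torch_version(version: str) -> Tuple[int, int, int]:
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized PyTorch version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_tested_combo(torch_version: str, cuda_version: Optional[str]) -> bool:
    """Return True if the pair matches one of the tested combinations."""
    return (parse_torch_version(torch_version), cuda_version) in TESTED_COMBOS


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        # torch.version.cuda is None for CPU-only builds.
        print(is_tested_combo(torch.__version__, torch.version.cuda))
```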
- (Optional) We use conda to manage the environment.
conda create -n clover python=3.8
conda activate clover
- Install dependencies.
cd visual_planner
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install git+https://github.com/hassony2/torch_videovision
pip install -e .
- Install the CALVIN simulator.
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
sh install.sh
We release the model weights of our Visual Planner and Feedback-driven Policy on Hugging Face.
- The visual planner requires 24 GB of GPU VRAM with a batch size of 4 (per GPU), a video length of 8, and an image size of 128.
- We use OpenAI-CLIP to encode task instructions for conditioning.
- Please modify accelerate_cfg.yaml first according to your setup.
accelerate launch --config_file accelerate_cfg.yaml train.py \
--learning_rate 1e-4 \
--train_num_steps 300000 \
--save_and_sample_every 10000 \
--train_batch_size 32 \
--sample_per_seq 8 \
--sampling_step 5 \
--with_text_conditioning \
--diffusion_steps 100 \
--sample_steps 10 \
--with_depth \
--flow_reg \
--results_folder *path_to_save_your_ckpts*
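To get a feel for how many checkpoints the command above may write into the results folder, the save schedule can be worked out from `--train_num_steps` and `--save_and_sample_every`. The sketch below assumes a save is triggered whenever the step count is a multiple of `--save_and_sample_every`; that behavior is an assumption and is not confirmed here, and `checkpoint_steps` is a hypothetical helper.

```python
from typing import List


def checkpoint_steps(train_num_steps: int, save_every: int) -> List[int]:
    """Steps at which a save would occur, assuming one save per multiple of save_every."""
    if train_num_steps < 0 or save_every <= 0:
        raise ValueError("train_num_steps must be >= 0 and save_every must be > 0")
    return list(range(save_every, train_num_steps + 1, save_every))


if __name__ == "__main__":
    steps = checkpoint_steps(train_num_steps=300_000, save_every=10_000)
    print(len(steps), steps[0], steps[-1])  # prints: 30 10000 300000
```

Under that assumption, the values in the command correspond to 30 save points, which is worth keeping in mind when choosing where `--results_folder` points.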
- Set your CALVIN and checkpoint paths in FeedbackPolicy/eval_calvin.sh
- We train our policy with an input size of 192*192. Please modify the VC-1 config file accordingly, setting img_size: 192 and use_cls: False.
cd ./FeedbackPolicy
bash eval_calvin.sh
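Before launching the evaluation, the two VC-1 settings mentioned above can be sanity-checked once the config has been loaded into a dictionary. The sketch below is illustrative only: it assumes the config is available as a flat Python dict, and `EXPECTED_VC1_OVERRIDES` and `find_mismatches` are hypothetical names, not part of VC-1 or this repository. The "unmodified" values in the example are placeholders, not the actual VC-1 defaults.

```python
from typing import Any, Dict, Tuple

# Values this README asks for in the VC-1 config.
EXPECTED_VC1_OVERRIDES: Dict[str, Any] = {"img_size": 192, "use_cls": False}


def find_mismatches(
    config: Dict[str, Any],
    expected: Dict[str, Any] = EXPECTED_VC1_OVERRIDES,
) -> Dict[str, Tuple[Any, Any]]:
    """Map each mismatching key to a (found, expected) pair; empty dict means OK."""
    return {
        key: (config.get(key), want)
        for key, want in expected.items()
        if config.get(key) != want
    }


if __name__ == "__main__":
    # Placeholder values standing in for a config that has not been edited yet.
    unmodified = {"img_size": 224, "use_cls": True}
    print(find_mismatches(unmodified))

    patched = {**unmodified, **EXPECTED_VC1_OVERRIDES}
    print(find_mismatches(patched))
```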
If you find the project helpful for your research, please consider citing our paper:
@article{bu2024clover,
title={Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation},
author={Bu, Qingwen and Zeng, Jia and Chen, Li and Yang, Yanchao and Zhou, Guyue and Yan, Junchi and Luo, Ping and Cui, Heming and Ma, Yi and Li, Hongyang},
journal={arXiv preprint arXiv:2409.09016},
year={2024}
}
We thank AVDC and RoboFlamingo for their open-sourced work!