
Xiao Fu<sup>1</sup>, Xintao Wang<sup>2 ✉</sup>, Xian Liu<sup>1</sup>, Jianhong Bai<sup>3</sup>, Runsen Xu<sup>1</sup>, Pengfei Wan<sup>2</sup>, Di Zhang<sup>2</sup>, Dahua Lin<sup>1 ✉</sup>

<sup>1</sup>The Chinese University of Hong Kong &nbsp; <sup>2</sup>Kling Team, Kuaishou Technology &nbsp; <sup>3</sup>Zhejiang University

✉: Corresponding Authors
🔥 RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory describing the motion of both the robotic arm and the manipulated object in decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
teaser.mp4
## TODO
- Add inference code with checkpoints.
- Add training code.
- Add evaluation code.
- Add Gradio demo to generate model inputs on in-the-wild images.
- Release full training data.
## Installation
- Our environment setup is identical to CogVideoX's; you can refer to their configuration to complete the setup.

```bash
conda create -n robomaster python=3.10
conda activate robomaster
```
- Download `ckpts` from here and place it under the base root `RoboMaster`. The checkpoints are organized as follows:

```
├── ckpts
│   ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
│   ├── RoboMaster                  (post-trained transformer)
```
## Inference
- **Robotic Manipulation on Diverse Out-of-Domain Objects**

```bash
python inference_inthewild.py \
    --input_path demos/diverse_ood_objs \
    --output_path samples/infer_diverse_ood_objs \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- **Robotic Manipulation with Diverse Skills**

```bash
python inference_inthewild.py \
    --input_path demos/diverse_skills \
    --output_path samples/infer_diverse_skills \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- **Long Video Generation in an Auto-Regressive Manner**

```bash
python inference_inthewild.py \
    --input_path demos/long_video \
    --output_path samples/long_video \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
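The auto-regressive scheme above can be sketched as follows. This is a minimal conceptual illustration, not the repository's implementation: `generate_clip` is a hypothetical stand-in for the actual model call, used only to show how the last frame of each generated clip conditions the next one.

```python
import numpy as np

def generate_clip(first_frame, num_frames=37):
    # Hypothetical stand-in for the video model: returns `num_frames`
    # frames conditioned on `first_frame` (a toy brightness drift here,
    # so the clip-to-clip continuity is visible).
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(np.clip(frames[-1] + 1, 0, 255))
    return np.stack(frames)

def generate_long_video(first_frame, num_clips=3):
    """Chain clips auto-regressively: the last frame of each clip
    becomes the conditioning frame of the next clip."""
    clips = []
    frame = first_frame
    for _ in range(num_clips):
        clip = generate_clip(frame)
        clips.append(clip)
        frame = clip[-1]
    # Drop the duplicated boundary frame when concatenating.
    return np.concatenate([clips[0]] + [c[1:] for c in clips[1:]])

video = generate_long_video(np.zeros((480, 640, 3), dtype=np.float32))
print(video.shape)  # (109, 480, 640, 3): 37 + 2 * 36 frames
```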
## Training
- We fine-tune the base model on videos at 640×480 resolution with 37 frames, using 8 GPUs. During preprocessing, videos with fewer than 16 frames are excluded.

```bash
cd scripts
bash train_injector.sh
```
## Evaluation
The evaluation assets are organized as follows:

```
├── RoboMaster
├── eval_metrics
│   ├── VBench
│   ├── common_metrics_on_video_quality
│   ├── eval_traj
│   ├── results
│   │   ├── bridge_eval_gt
│   │   ├── bridge_eval_ours
│   │   ├── bridge_eval_ours_tracking
```
- Download `eval_metrics.zip` from here and extract it under the base root.
- Generate `bridge_eval_ours`. (Note that results may vary slightly across computing machines, even with the same seed. Reference files are provided under `eval_metrics/results`.)

```bash
cd RoboMaster/
python inference_eval.py
```
- Generate `bridge_eval_ours_tracking`: install CoTracker3, then estimate tracking points with a grid size of 30 on `bridge_eval_ours`.
- Evaluation of VBench metrics:

```bash
cd eval_metrics/VBench
python evaluate.py \
    --dimension aesthetic_quality imaging_quality temporal_flickering motion_smoothness subject_consistency background_consistency \
    --videos_path ../results/bridge_eval_ours \
    --mode=custom_input \
    --output_path evaluation_results
```
- Evaluation of FVD and FID metrics:

```bash
cd eval_metrics/common_metrics_on_video_quality
python calculate.py -v1_f ../results/bridge_eval_ours -v2_f ../results/bridge_eval_gt
python -m pytorch_fid eval_1 eval_2
```
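For reference, the quantity behind the FID score is the Fréchet distance between two Gaussians fitted to feature sets (pytorch_fid computes it on Inception features; FVD applies the same formula to video features). A minimal sketch, using `scipy.linalg.sqrtm` as pytorch_fid does:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^0.5)
    between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8))
print(frechet_distance(feats, feats))  # ≈ 0: a distribution vs. itself
```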
- Estimation of the TrajError metric. (Note that we exclude the samples listed in `failed_track.txt` due to failed estimation by CoTracker3.)

```bash
cd eval_metrics/eval_traj
python calculate_traj.py \
    --input_path_1 ../results/bridge_eval_ours \
    --input_path_2 ../results/bridge_eval_gt \
    --tracking_path ../results/bridge_eval_ours_tracking \
    --output_path evaluation_results
```
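Conceptually, a trajectory error of this kind reduces to a mean per-point Euclidean distance between tracked and ground-truth trajectories. A minimal sketch (this illustrates the idea only and need not match `calculate_traj.py` exactly):

```python
import numpy as np

def traj_error(pred_tracks, gt_tracks):
    """Mean Euclidean distance (in pixels) between two trajectory sets,
    each of shape (num_frames, num_points, 2)."""
    assert pred_tracks.shape == gt_tracks.shape
    return float(np.linalg.norm(pred_tracks - gt_tracks, axis=-1).mean())

gt = np.zeros((37, 4, 2))          # 37 frames, 4 tracked points
pred = gt + np.array([3.0, 4.0])   # constant (3, 4) offset -> 5 px error
print(traj_error(pred, gt))  # 5.0
```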
- Check the visualization videos under `evaluation_results`. We blend the trajectories of the robotic arm and the object throughout the entire video for better illustration.
## Citation
If you find this work helpful, please consider citing:

```bibtex
@article{fu2025robomaster,
  title={Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control},
  author={Fu, Xiao and Wang, Xintao and Liu, Xian and Bai, Jianhong and Xu, Runsen and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  journal={arXiv preprint arXiv:2506.01943},
  year={2025}
}
```