Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control


Xiao Fu1, Xintao Wang2✉, Xian Liu1, Jianhong Bai3, Runsen Xu1,
Pengfei Wan2, Di Zhang2, Dahua Lin1✉

1The Chinese University of Hong Kong 2Kling Team, Kuaishou Technology 3Zhejiang University
✉: Corresponding Authors

🌟 Introduction

🔥 RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory describing the motion of both the robotic arm and the manipulated object in decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.

teaser.mp4

📝 TODO List

  • Add inference code with checkpoints.
  • Add training code.
  • Add evaluation code.
  • Add Gradio demo to generate model inputs on in-the-wild images.
  • Release full training data.

⚙️ Quick Start

1. Environment Setup

  1. Our environment setup is identical to that of CogVideoX; you can refer to their configuration to complete it.
    conda create -n robomaster python=3.10
    conda activate robomaster
  2. Download ckpts from here and place them under the base root RoboMaster. The checkpoints are organized as follows (a quick layout check is sketched below):
    ├── ckpts
        ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
        ├── RoboMaster                  (post-trained transformer)
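
A minimal layout check (our own sketch, not a repo script) can confirm that the folders above are in place before running inference:

    # Sketch: verify the checkpoint layout shown above. Assumes the current
    # working directory is the RoboMaster repo root; the check itself is ours.
    from pathlib import Path

    required = [
        "ckpts/CogVideoX-Fun-V1.5-5b-InP",  # pretrained model base
        "ckpts/RoboMaster",                 # post-trained transformer
    ]
    missing = [p for p in required if not Path(p).is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing checkpoint folders: {missing}")
    print("Checkpoint layout looks correct.")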
    

2. Generate Website Demos

  1. Robotic Manipulation on Diverse Out-of-Domain Objects.

    python inference_inthewild.py \
        --input_path demos/diverse_ood_objs \
        --output_path samples/infer_diverse_ood_objs \
        --transformer_path ckpts/RoboMaster \
        --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
  2. Robotic Manipulation with Diverse Skills

    python inference_inthewild.py \
        --input_path demos/diverse_skills \
        --output_path samples/infer_diverse_skills \
        --transformer_path ckpts/RoboMaster \
        --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
  3. Long Video Generation in an Auto-Regressive Manner

    python inference_inthewild.py \
        --input_path demos/long_video \
        --output_path samples/long_video \
        --transformer_path ckpts/RoboMaster \
        --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
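
The three commands above differ only in their input and output folders. A small convenience loop (our sketch, not a repo script) runs them all with the same checkpoints:

    # Sketch: run all three website demos in sequence, reusing the flags of
    # the commands above. The (input, output) pairs are copied verbatim.
    import subprocess

    demos = [
        ("demos/diverse_ood_objs", "samples/infer_diverse_ood_objs"),
        ("demos/diverse_skills", "samples/infer_diverse_skills"),
        ("demos/long_video", "samples/long_video"),
    ]
    for input_path, output_path in demos:
        subprocess.run([
            "python", "inference_inthewild.py",
            "--input_path", input_path,
            "--output_path", output_path,
            "--transformer_path", "ckpts/RoboMaster",
            "--model_path", "ckpts/CogVideoX-Fun-V1.5-5b-InP",
        ], check=True)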

3. Start Training

  1. We fine-tune the base model on videos with a resolution of 640×480 and 37 frames, using 8 GPUs. During preprocessing, videos with fewer than 16 frames are excluded (a sketch of this filter follows this step).
    cd scripts
    bash train_injector.sh
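
The 16-frame filter mentioned above can be illustrated as follows; the actual preprocessing lives in the training scripts, and this standalone OpenCV version is only an approximation of it:

    # Sketch: drop videos with fewer than 16 frames, as described above.
    # Uses OpenCV's frame-count metadata; the real pipeline may differ.
    import cv2

    MIN_FRAMES = 16

    def has_enough_frames(video_path: str, min_frames: int = MIN_FRAMES) -> bool:
        cap = cv2.VideoCapture(video_path)
        n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        return n_frames >= min_frames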

🚀 Benchmark Evaluation

├── RoboMaster
  ├── eval_metrics
      ├── VBench
      ├── common_metrics_on_video_quality
      ├── eval_traj
      ├── results
          ├── bridge_eval_gt
          ├── bridge_eval_ours
          ├── bridge_eval_ours_tracking

1. Prepare Evaluation Files & Inference on Benchmark

  1. Download eval_metrics.zip from here and extract it under the base root.

  2. Generate bridge_eval_ours. (Note that results may vary slightly across machines, even with the same seed; reference files are provided under eval_metrics/results.)

    cd RoboMaster/
    python inference_eval.py
  3. Generate bridge_eval_ours_tracking: install CoTracker3, then estimate tracking points with a grid size of 30 on bridge_eval_ours (a minimal sketch follows below).
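
A minimal tracking sketch following CoTracker3's public torch.hub interface is shown below; the exact video loading and the on-disk format expected by the evaluation code are assumptions here:

    # Sketch: estimate a 30x30 grid of tracks with CoTracker3 on one video.
    # The torch.hub entry point follows CoTracker's documentation; the output
    # file format is a placeholder, not necessarily what eval_traj expects.
    import torch
    from torchvision.io import read_video

    device = "cuda" if torch.cuda.is_available() else "cpu"
    cotracker = torch.hub.load("facebookresearch/co-tracker",
                               "cotracker3_offline").to(device)

    frames, _, _ = read_video("sample.mp4", output_format="TCHW", pts_unit="sec")
    video = frames[None].float().to(device)  # B x T x C x H x W

    pred_tracks, pred_visibility = cotracker(video, grid_size=30)
    torch.save({"tracks": pred_tracks.cpu(), "visibility": pred_visibility.cpu()},
               "sample_tracking.pt")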

2. Evaluation on Visual Quality

  1. Evaluation of VBench metrics.
    cd eval_metrics/VBench
    python evaluate.py \
        --dimension aesthetic_quality imaging_quality temporal_flickering motion_smoothness subject_consistency background_consistency \
        --videos_path ../results/bridge_eval_ours \
        --mode=custom_input \
        --output_path evaluation_results
  2. Evaluation of FVD and FID metrics.
    cd eval_metrics/common_metrics_on_video_quality
    python calculate.py -v1_f ../results/bridge_eval_ours -v2_f ../results/bridge_eval_gt
    python -m pytorch_fid eval_1 eval_2
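
Note that pytorch_fid compares two directories of images. If calculate.py does not already dump per-frame images into eval_1 and eval_2, a sketch like the following can prepare them (the folder names follow the command above; the helper itself is our illustration, not part of the repo):

    # Sketch: dump every frame of each video as a PNG so pytorch_fid can
    # compare the generated and ground-truth sets. Hypothetical helper.
    from pathlib import Path
    import cv2

    def dump_frames(video_dir: str, out_dir: str) -> None:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for video_path in sorted(Path(video_dir).glob("*.mp4")):
            cap = cv2.VideoCapture(str(video_path))
            idx = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                cv2.imwrite(f"{out_dir}/{video_path.stem}_{idx:04d}.png", frame)
                idx += 1
            cap.release()

    dump_frames("../results/bridge_eval_ours", "eval_1")
    dump_frames("../results/bridge_eval_gt", "eval_2")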

3. Evaluation on Trajectory (Robotic Arm & Manipulated Object)

  1. Estimation of the TrajError metric. (Note that we exclude the samples listed in failed_track.txt, for which CoTracker3 estimation failed.)
    cd eval_metrics/eval_traj
    python calculate_traj.py \
        --input_path_1 ../results/bridge_eval_ours \
        --input_path_2 ../results/bridge_eval_gt \
        --tracking_path ../results/bridge_eval_ours_tracking \
        --output_path evaluation_results
  2. Check the visualization videos under evaluation_results. We blend the trajectories of the robotic arm and the manipulated object over the entire video for better illustration.
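
For intuition, a simplified trajectory-error computation is sketched below as the mean L2 distance between corresponding tracked points over time; the actual metric implemented in calculate_traj.py may differ:

    # Sketch: mean per-point L2 distance between predicted and ground-truth
    # tracks. Only an illustration of the idea behind a TrajError-style metric.
    import numpy as np

    def traj_error(tracks_pred: np.ndarray, tracks_gt: np.ndarray) -> float:
        """tracks_*: arrays of shape (T, N, 2) holding (x, y) positions."""
        return float(np.linalg.norm(tracks_pred - tracks_gt, axis=-1).mean())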

🔗 Citation

If you find this work helpful, please consider citing:

@article{fu2025robomaster,
  title={Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control},
  author={Fu, Xiao and Wang, Xintao and Liu, Xian and Bai, Jianhong and Xu, Runsen and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  journal={arXiv preprint arXiv:2506.01943},
  year={2025}
}
