SAMPO is a scale-wise autoregressive world model for video prediction and robotic control. It models temporal dynamics through frame-wise causal generation, while capturing spatial structure via multi-scale tokenization and coarse-to-fine prediction. A trajectory-aware motion prompt further enhances spatiotemporal grounding. SAMPO supports high-fidelity, action-conditioned rollouts for visual planning and model-based reinforcement learning.
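The coarse-to-fine, scale-wise prediction idea can be pictured with a minimal sketch. Everything below is illustrative: the scale sizes, the nearest-neighbour upsampling, and the "predicted residual" are placeholders, not SAMPO's actual architecture.

```python
import numpy as np

def scalewise_rollout(scales=(1, 2, 4, 8)):
    """Illustrative coarse-to-fine rollout: each finer token map is predicted
    conditioned on a nearest-neighbour upsampling of the coarser one."""
    pred = np.zeros((scales[0], scales[0]))
    for s in scales[1:]:
        factor = s // pred.shape[0]
        up = np.kron(pred, np.ones((factor, factor)))  # upsample coarse prediction
        pred = up + 0.1 * np.random.randn(s, s)        # placeholder predicted residual
    return pred

print(scalewise_rollout().shape)  # (8, 8)
```

In the real model, each scale's tokens are produced by the autoregressive transformer conditioned on past frames and actions; the sketch only shows the coarse-to-fine control flow.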
```shell
conda create -n SAMPO python==3.9
conda activate SAMPO
pip install -r requirements.txt
```

🔥 Note:
You need to replace the `diffusers/models/autoencoders/vae.py` file with the one in `SAMPO-main/SAMPO/vae_2/vae.py` to use the `VectorQuantizer2` version of SAMPO.
```shell
cp SAMPO-main/SAMPO/vae_2/vae.py envs/SAMPO/lib/python3.9/site-packages/diffusers/models/autoencoders/vae.py
```

To evaluate the FVD metric, download the pretrained I3D model to `pretrained_models/i3d/i3d_torchscript.pt`.
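The `cp` destination above hardcodes one particular conda layout (`envs/SAMPO/...`). A small stdlib-only helper (an illustrative convenience, not part of the repo) can locate the installed `diffusers` copy of `vae.py` in whatever environment is active:

```python
import pathlib
import sysconfig

def installed_vae_path():
    """Locate diffusers' vae.py inside the active environment's site-packages,
    i.e. the file the SAMPO replacement should overwrite."""
    site = pathlib.Path(sysconfig.get_paths()["purelib"])
    return site / "diffusers" / "models" / "autoencoders" / "vae.py"

print(installed_vae_path())
```

You can then copy the replacement with `shutil.copy("SAMPO-main/SAMPO/vae_2/vae.py", installed_vae_path())` instead of hand-editing the path.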
Install CoTracker:

```shell
git clone https://github.com/facebookresearch/co-tracker
cd co-tracker
pip install -e .
pip install matplotlib flow_vis tqdm tensorboard
```

Download the pretrained CoTracker model:
```shell
mkdir -p checkpoints
cd checkpoints
# download the online (multi-window) model
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_online.pth
cd ..
```

See the Hugging Face Model Hub for details.
Download datasets from Open X-Embodiment and extract single episodes as `.npz` files:

```shell
python datasets/oxe_data_converter.py --dataset_name {dataset name, e.g. bridge} --input_path {path to downloaded OXE} --output_path {path to stored npz}
```

To replicate our pre-training on OXE, you need to extract all datasets listed under `OXE_SELECT` in `SAMPO/data/dataset_mixes.py`.
See the instructions in `datasets/` for preprocessing more datasets.
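Each extracted episode is a single `.npz` archive. The snippet below writes and reads a toy episode to show the round trip; the `obs`/`action` field names and shapes are assumptions for illustration, not the converter's actual schema.

```python
import os
import tempfile
import numpy as np

# Toy episode: 16 frames of 64x64 RGB observations plus a 7-DoF action per frame.
path = os.path.join(tempfile.gettempdir(), "episode_000000.npz")
np.savez(path,
         obs=np.zeros((16, 64, 64, 3), dtype=np.uint8),
         action=np.zeros((16, 7), dtype=np.float32))

ep = np.load(path)
print(sorted(ep.files), ep["obs"].shape)  # ['action', 'obs'] (16, 64, 64, 3)
```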
Download datasets from the 1X World Model dataset and extract single episodes as `.npz` files:

```shell
python datasets/preprocess_1x.py --train_root {path to downloaded 1X data} --save_path {path to stored npz}
```

Pre-training instructions for the OXE dataset:
```shell
accelerate launch train_tokenizer.py \
    --exp_name oxe-64-act-free-tokenizer --output_dir log_vqgan --seed 0 --mixed_precision bf16 \
    --model_type ctx_vqgan \
    --learning_rate 5e-4 --discr_learning_rate 5e-4 \
    --train_batch_size 16 --gradient_accumulation_steps 1 --disc_start 1000005 \
    --oxe_data_mixes_type select_sthsth --resolution 64 --dataloader_num_workers 16 \
    --rand_select --video_stepsize 1 --segment_horizon 16 --segment_length 8 --context_length 2 \
    --dataset_path {path to preprocessed_OXE}
```
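The sampling flags can be read as: draw a 16-frame window (`--segment_horizon`), randomly pick an 8-frame clip inside it (`--rand_select`, `--segment_length`), and treat the first 2 frames as context (`--context_length`). A sketch of that logic, illustrative only and not the repo's actual dataloader:

```python
import random

def sample_clip(video_len, horizon=16, segment_length=8, context_length=2, stepsize=1):
    """Pick a horizon-long window, then a random segment inside it (rand_select)."""
    start = random.randrange(video_len - (horizon - 1) * stepsize)
    offset = random.randrange(horizon - segment_length + 1)
    frames = [start + (offset + i) * stepsize for i in range(segment_length)]
    return frames[:context_length], frames[context_length:]  # context, targets

ctx, tgt = sample_clip(video_len=100)
print(len(ctx), len(tgt))  # 2 6
```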
```shell
accelerate launch train_var.py \
    --exp_name oxe-64-act-free-transformer --output_dir log_trm --seed 0 --mixed_precision bf16 \
    --vqgan_type ctx_vqgan \
    --pretrained_model_name_or_path {log directory of finetuned tokenizer}/unwrapped_model \
    --per_device_train_batch_size 16 --gradient_accumulation_steps 1 \
    --learning_rate 1e-4 --lr_scheduler_type cosine \
    --oxe_data_mixes_type select --resolution 64 --dataloader_num_workers 16 \
    --dataset_path {path to preprocessed_OXE} \
    --video_stepsize 1 --segment_length 16 --context_length 2 \
    --weight_decay 0.01 --llama_attn_drop 0.1 --embed_no_wd
```

Fine-tuning instructions for the BAIR dataset:
```shell
accelerate launch train_tokenizer.py \
    --exp_name bair_tokenizer_ft --output_dir log_vqgan --seed 0 --mixed_precision bf16 \
    --model_type ctx_vqgan \
    --train_batch_size 16 --gradient_accumulation_steps 1 --disc_start 1000005 \
    --oxe_data_mixes_type bair --resolution 64 --dataloader_num_workers 16 \
    --rand_select --video_stepsize 1 --segment_horizon 16 --segment_length 8 --context_length 1 \
    --pretrained_model_name_or_path pretrained_models/SAMPO-oxe-64-act-free/tokenizer \
    --max_train_steps 200005
```
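`--disc_start 1000005` follows the usual VQGAN recipe of keeping the discriminator disabled until a late step, so the tokenizer first trains on reconstruction alone; note that if step counting starts from zero, the adversarial term never activates within the `--max_train_steps 200005` fine-tuning budget. An illustrative sketch of that gating (not the repo's actual loss code):

```python
def tokenizer_loss(step, recon_loss, gan_loss, disc_start=1000005):
    # The adversarial term is gated off until `disc_start` steps have elapsed,
    # letting reconstruction stabilize before the discriminator joins.
    disc_weight = 1.0 if step >= disc_start else 0.0
    return recon_loss + disc_weight * gan_loss

print(tokenizer_loss(step=0, recon_loss=1.0, gan_loss=0.5))          # 1.0
print(tokenizer_loss(step=2_000_000, recon_loss=1.0, gan_loss=0.5))  # 1.5
```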
```shell
accelerate launch train_var.py \
    --exp_name bair_llama_ft --output_dir log_trm --seed 0 --mixed_precision bf16 \
    --vqgan_type ctx_vqgan \
    --pretrained_model_name_or_path {log directory of finetuned tokenizer}/unwrapped_model \
    --config_name configs/llama/config.json \
    --pretrained_transformer_path pretrained_models/SAMPO-oxe-64-act-free/transformer \
    --per_device_train_batch_size 16 --gradient_accumulation_steps 1 \
    --learning_rate 1e-4 --lr_scheduler_type cosine \
    --oxe_data_mixes_type bair --resolution 64 --dataloader_num_workers 16 \
    --video_stepsize 1 --segment_length 16 --context_length 1 \
    --use_eval_dataset --use_fvd --use_frame_metrics \
    --weight_decay 0.01 --llama_attn_drop 0.1 --embed_no_wd \
    --max_train_steps 70005
```

SAMPO performs well across multiple datasets, demonstrating its effectiveness on a variety of tasks, including human and robotic manipulation prediction.
Visualization of motion prompts.
Our codebase is based on thuml/iVideoGPT and facebookresearch/CoTracker. We sincerely thank the authors of iVideoGPT for their groundbreaking work, which has laid a solid foundation for the development of this project.
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025. All rights reserved.
This repository contains modified components from the iVideoGPT and CoTracker projects. These modifications are provided under the MIT License, which allows for commercial use, distribution, modification, and private use, under the conditions specified in the license.
By using this repository, you agree to the terms of the MIT License, including the disclaimers and limitations of liability.
Note: The original authors of iVideoGPT and CoTracker are not responsible for the modifications made in this repository.


