Krish Agarwal1, Zhuoming Chen1, Cheng Luo, Yongqi Chen3, Haizhong Zheng1, Xun Huang3, Atri Rudra2, Beidi Chen1
1 Carnegie Mellon University 2 University at Buffalo 3 Morpheus AI
This repository is an implementation of MonarchRT, a method that sparsely parameterizes the attention maps of video diffusion transformer (DiT) models using Monarch matrices with minimal quality degradation. MonarchRT is effective even for real-time video DiTs that employ few-step diffusion and autoregressive generation. Using our efficient Triton kernel implementation, MonarchRT achieves, for the first time, true real-time 16 FPS video generation with Self-Forcing on a single RTX 5090 while matching the quality of the dense model (even under a smaller training budget).
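For intuition, a Monarch matrix factors a dense map into two block-diagonal matrices interleaved with a fixed permutation, so an n × n attention map is parameterized with far fewer entries. The toy NumPy sketch below uses one common convention (M = PᵀLPR); the sizes and exact permutation are illustrative only, not this repository's implementation:

```python
import numpy as np

# Toy Monarch factorization M = P^T @ L @ P @ R, where L and R are
# block-diagonal and P is a fixed "block-transpose" permutation.
# Sizes and convention are illustrative, not the repo's implementation.
n, b = 16, 4                      # n = b * b for this square toy case
rng = np.random.default_rng(0)

L = np.zeros((n, n))
R = np.zeros((n, n))
for i in range(b):                # b dense blocks of size b x b per factor
    L[i * b:(i + 1) * b, i * b:(i + 1) * b] = rng.standard_normal((b, b))
    R[i * b:(i + 1) * b, i * b:(i + 1) * b] = rng.standard_normal((b, b))

perm = np.arange(n).reshape(b, b).T.reshape(-1)   # block-transpose permutation
P = np.eye(n)[perm]

M = P.T @ L @ P @ R               # dense n x n map realized from sparse factors

dense_params = n * n              # 256 entries for a dense 16 x 16 map
monarch_params = 2 * b * b * b    # 128: two factors, each b blocks of b x b
```

The parameter (and FLOP) savings grow with n, since the block-diagonal factors scale roughly as n^1.5 rather than n^2.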
demo.mp4
Create a conda environment and install dependencies:
conda create -n monarch_rt python=3.10 -y
conda activate monarch_rt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
If you see errors like ModuleNotFoundError: No module named 'pkg_resources', you may need to downgrade setuptools and pip first so that CLIP can build (following this comment):
pip install "setuptools>=65.0.0,<81"
pip install "pip==25.0"
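To confirm the pin took effect, a quick stdlib check can be handy. The helper name below is ours, and the accepted range simply mirrors the pin above:

```python
from importlib.metadata import version

# Mirrors the pin above: setuptools must be >= 65 and < 81 for CLIP to build.
# (Helper name is ours, for illustration.)
def satisfies_setuptools_pin(v: str) -> bool:
    major = int(v.split(".")[0])
    return 65 <= major < 81

print(satisfies_setuptools_pin(version("setuptools")))
```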
Download the released Wan2.1-T2V-1.3B and Self-Forcing checkpoints. We also provide instructions below for training MonarchRT versions of these models.
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download gdhe17/Self-Forcing checkpoints/self_forcing_dmd.pt --local-dir .
Example inference script using the chunk-wise autoregressive Self-Forcing checkpoint:
python inference.py \
--config_path configs/self_forcing_dmd.yaml \
--output_folder videos/self_forcing_dmd \
--checkpoint_path checkpoints/self_forcing_dmd.pt \
--data_path prompts/MovieGenVideoBench_extended.txt \
--use_ema
Other config files and corresponding checkpoints can be found in the configs folder. For example, you can run training-free MonarchRT on Self-Forcing with:
python inference.py \
--config_path configs/self_forcing_monarch_dmd.yaml \
--output_folder videos/self_forcing_dmd \
--checkpoint_path checkpoints/self_forcing_dmd.pt \
--data_path prompts/MovieGenVideoBench_extended.txt \
--use_ema
The first time you run this, Triton will likely take a while to compile the many autotune configs. The compiled kernels are cached, so subsequent runs start faster. However, autotuning itself still runs on every launch: although Triton can cache autotune timings, that cache is currently bypassed for kernels that use pre-hooks. As a result, you will only see latency improvements from MonarchRT when generating multiple videos in a single run.
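Concretely, the per-launch autotune cost amortizes across the videos generated in one run. With hypothetical timings (the numbers below are made up for illustration, not measurements from this repo):

```python
# Hypothetical numbers for illustration only, not measurements from this repo.
autotune_overhead_s = 30.0   # fixed cost paid once per process launch
per_video_s = 5.0            # steady-state generation time per video

def effective_seconds_per_video(n_videos: int) -> float:
    # total wall-clock time divided across the videos generated in one run
    return (autotune_overhead_s + n_videos * per_video_s) / n_videos

# the fixed cost dominates a single-video run but vanishes as n grows
print(effective_seconds_per_video(1))    # 35.0
print(effective_seconds_per_video(100))  # 5.3
```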
huggingface-cli download gdhe17/Self-Forcing checkpoints/ode_init.pt --local-dir .
huggingface-cli download gdhe17/Self-Forcing vidprom_filtered_extended.txt --local-dir prompts
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/self_forcing_monarch_dmd.yaml \
--logdir logs/self_forcing_monarch_dmd \
--disable-wandb
This will produce a torch distributed checkpoint under logs/self_forcing_monarch_dmd/checkpoint_model_000600, which you can convert using:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch logs/self_forcing_monarch_dmd/checkpoint_model_000600 logs/self_forcing_monarch_dmd/model.pt
Then you can use logs/self_forcing_monarch_dmd/model.pt as the checkpoint path when performing inference.
The Self-Forcing training algorithm is data-free when using their provided ODE initialization checkpoint. It is possible to achieve even higher quality by introducing MonarchRT earlier in the training pipeline, specifically during the causal initialization. Although CausVid performs this initialization using the teacher's ODE trajectories, we find that we can achieve similar results by directly training with diffusion loss on videos generated by the teacher.
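The diffusion loss in question can be sketched in a few lines. The rectified-flow-style interpolation and velocity target below are one common convention, chosen for illustration rather than taken from this repo's configs, and the tensor sizes are toy:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(student, video, t):
    # Corrupt a teacher-generated video by interpolating toward noise, then
    # regress the student onto the velocity target (rectified-flow style).
    noise = rng.standard_normal(video.shape)
    noisy = (1.0 - t) * video + t * noise
    target = noise - video
    pred = student(noisy, t)
    return float(np.mean((pred - target) ** 2))

# usage with a stand-in "student" that predicts zeros
video = rng.standard_normal((3, 8, 16, 16))   # (channels, frames, h, w), toy sizes
loss = diffusion_loss(lambda x, t: np.zeros_like(x), video, t=0.5)
```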
huggingface-cli download zhengqili/Self-Forcing-Data --repo-type dataset --include "wanx_14B_shift-3.0_cfg-5.0_lmdb_70K/**" --local-dir data/wanx_14B_shift-3.0_cfg-5.0_lmdb_70K --local-dir-use-symlinks False
Start with the causal initialization:
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/wan_monarch_causal_training.yaml \
--logdir logs/wan_monarch_causal \
--disable-wandb
Then convert the produced distributed checkpoint:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch logs/wan_monarch_causal/checkpoint_model_001000 logs/wan_monarch_causal/model.pt
Then perform the Self-Forcing DMD training:
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/self_forcing_monarch_from_monarch_dmd.yaml \
--logdir logs/self_forcing_monarch_from_monarch_dmd \
--disable-wandb
Then convert this distributed checkpoint:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch logs/self_forcing_monarch_from_monarch_dmd/checkpoint_model_000600 logs/self_forcing_monarch_from_monarch_dmd/model.pt
Now you can use logs/self_forcing_monarch_from_monarch_dmd/model.pt as the checkpoint path when performing inference.
MonarchRT is also compatible with traditional bidirectional models. Below are the instructions to train a MonarchRT version of Wan2.1-T2V-1.3B.
This is the same data used for causal initialization.
huggingface-cli download zhengqili/Self-Forcing-Data --repo-type dataset --include "wanx_14B_shift-3.0_cfg-5.0_lmdb_70K/**" --local-dir data/wanx_14B_shift-3.0_cfg-5.0_lmdb_70K --local-dir-use-symlinks False
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/wan_monarch_finetuning.yaml \
--logdir logs/wan_monarch_finetuning \
--disable-wandb
Then convert the produced distributed checkpoint:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch logs/wan_monarch_finetuning/checkpoint_model_001000 logs/wan_monarch_finetuning/model.pt
Then you can use logs/wan_monarch_finetuning/model.pt as the checkpoint path when performing inference.
You can also train a 4-step MonarchRT version of Wan2.1-T2V-1.3B using DMD (which we show results for in our paper).
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/wan_monarch_fewstep_dmd.yaml \
--logdir logs/wan_monarch_fewstep_dmd \
--disable-wandb
Then convert the produced distributed checkpoint:
python -m torch.distributed.checkpoint.format_utils dcp_to_torch logs/wan_monarch_fewstep_dmd/checkpoint_model_001000 logs/wan_monarch_fewstep_dmd/model.pt
Then you can use logs/wan_monarch_fewstep_dmd/model.pt as the checkpoint path when performing inference.
The code here is built on the open-source implementation of Self-Forcing, which is itself built on top of the open-source implementation of CausVid by Tianwei Yin and the Wan2.1 repo.
@misc{agarwal2026monarchrtefficientattentionrealtime,
title={MonarchRT: Efficient Attention for Real-Time Video Generation},
author={Krish Agarwal and Zhuoming Chen and Cheng Luo and Yongqi Chen and Haizhong Zheng and Xun Huang and Atri Rudra and Beidi Chen},
year={2026},
eprint={2602.12271},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.12271},
}