Details
cd HERO_release
conda create -n HERO python=3.10
conda activate HERO
pip install -r requirements.txt
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12/index.html
Visit [Google Drive] to download the models, then unzip and place the result in ./checkpoints.
Download the ViMo dataset, then unzip it and place the result in ../Data/VIMO.
Details
Note: you must train the RVQ BEFORE training the masked/residual transformers. The latter two can be trained simultaneously.
python train_vq_vimo.py --name rvq_bs256_finetune_ep10 --gpu_id 0 --window_size 20 \
--dataset_name vimo --batch_size 256 --num_quantizers 6 --max_epoch 10 \
--warm_up_iter 20 --milestones 1600 3200 --finetune
python train_mask_transformer_memo_cross_vimo.py --name mtrans_memo_cross_l6_bs64_ep200 --gpu_id 0 \
--dataset_name vimo --batch_size 64 --max_epoch 200 --vq_name rvq_bs256_finetune_ep10 \
--milestones 6000 --warm_up_iter 250 --n_layers 6
python train_res_transformer_memo_cross_vimo.py --name rtrans_memo_cross_l6_bs64_ep200 --gpu_id 1 \
--dataset_name vimo --batch_size 64 --max_epoch 200 --vq_name rvq_bs256_finetune_ep10 \
--milestones 6000 --warm_up_iter 250 --n_layers 6
--name: the name of your model. Checkpoints will be saved under ./checkpoints/<dataset_name>/<name>.
--batch_size: we use 256 for RVQ training and 64 for the masked/residual transformers.
--num_quantizers: number of quantization layers; 6 is used in our case.
--vq_name: when training the masked/residual transformer, specify the name of the RVQ model used for tokenization.
--n_layers: number of transformer decoder layers; 6 is used in our case.
All trained models and intermediate results will be saved under ./checkpoints/<dataset_name>/<name>.
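For reference, the save-path convention above can be sketched in Python (the dataset and model names below are just the examples from the training commands):

```python
import os

def checkpoint_dir(dataset_name: str, name: str) -> str:
    # Trained models and intermediate results live under
    # ./checkpoints/<dataset_name>/<name>
    return os.path.join(".", "checkpoints", dataset_name, name)

print(checkpoint_dir("vimo", "rvq_bs256_finetune_ep10"))
# → ./checkpoints/vimo/rvq_bs256_finetune_ep10
```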
Details
python eval_vq_vimo.py --gpu_id 0 --name rvq_bs256_finetune_ep10 --dataset_name vimo --ext rvq_nq6
python eval_trans_res_memo_cross_vimo.py --dataset_name vimo --vq_name rvq_bs256_finetune_ep10 \
--name mtrans_memo_cross_l6_bs64_ep200 --res_name rtrans_memo_cross_l6_bs64_ep200 \
--gpu_id 1 --cond_scale 4 --time_steps 10 --ext rvq1_rtrans1_bs64_cs4_ts10 \
--which_epoch all --test_txt test.txt
--name: model name of the masked transformer.
--res_name: model name of the residual transformer.
--cond_scale: scale of classifier-free guidance.
--time_steps: number of iterations for inference.
--ext: filename for saving evaluation results.
--which_epoch: checkpoint name of the masked transformer.
The final evaluation results will be saved in ./checkpoints/<dataset_name>/<name>/eval/<ext>.log
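--cond_scale follows standard classifier-free guidance. As a rough illustration only (not the repository's actual implementation), guided logits are typically formed by extrapolating from the unconditional prediction toward the conditional one:

```python
import numpy as np

def cfg_combine(cond_logits, uncond_logits, cond_scale):
    # Standard classifier-free guidance: extrapolate from the
    # unconditional prediction toward the conditional one.
    # cond_scale = 1 recovers the purely conditional logits;
    # larger values push further toward the condition.
    return uncond_logits + cond_scale * (cond_logits - uncond_logits)

cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 1.0])
print(cfg_combine(cond, uncond, 4.0))  # scale used in the eval command above
```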
Details
python gen.py --gpu_id 0 --ext exp1 --dataset_name vimo --vq_name rvq_bs256_finetune_ep10 \
--name mtrans_memo_cross_l6_bs64_ep200 --res_name rtrans_memo_cross_l6_bs64_ep200 \
--video_path <path to the input video> --motion_length <the number of poses for generation>
--motion_length specifies the number of poses; it must be an integer and will be rounded to a multiple of 4. The maximum value is 200.
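The length handling described above can be sketched as follows. This is a hypothetical helper that assumes rounding down to a multiple of 4 and a cap of 200; check gen.py for the exact rule:

```python
def clamp_motion_length(motion_length: int) -> int:
    # Hypothetical sketch: cap at 200 poses, then round down
    # to a multiple of 4 (gen.py may round differently).
    length = min(motion_length, 200)
    return (length // 4) * 4

print(clamp_motion_length(150))  # 148
print(clamp_motion_length(300))  # 200
```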
The generated motion and stick figure animation will be stored under folder ./generation/<ext>/.
For the motion visualization with SMPL, please refer to T2M-GPT and MLD.
We sincerely thank the authors of the following open-source works, on which our code is based:
MoMask, TC-CLIP, T2M-GPT and MLD.
If you find our code or paper helpful, please consider starring our repository and citing:
@article{yu2025hero,
title={HERO: Human Reaction Generation from Videos},
author={Yu, Chengjun and Zhai, Wei and Yang, Yuhang and Cao, Yang and Zha, Zheng-Jun},
journal={arXiv preprint arXiv:2503.08270},
year={2025}
}
