Intelligent Vision Processing Lab. (IVPL), Sookmyung Women's University, Seoul, Republic of Korea
This repository is the official PyTorch implementation of HiRT (the doctoral dissertation work of Young-Ju Choi).
Official paper: Young-Ju Choi, Byung-Gyu Kim∗, "HiRT: Hierarchical Recurrent Transformer Network for Video Super-Resolution (VSR)," Engineering Applications of Artificial Intelligence (Elsevier), Volume 166, Part B: 113714 (https://doi.org/10.1016/j.engappai.2025.113714), 15 February 2026 (ranked top 2.5%, IF = 8.0)
Video super-resolution (VSR) is a crucial technology for enhancing video frame quality; it relies on effectively exploiting spatial correlation within frames and temporal dependencies between consecutive frames. Existing methods struggle to restore fine details under various motion types and lack true bi-directional access. Recent research predominantly focuses on residual-block-based and transformer-based backbones, which have demonstrated notable effectiveness in VSR. However, many methods treat spatial features uniformly, resulting in inadequate information acquisition and detail enhancement during feature extraction. This paper proposes the hierarchical recurrent transformer (HiRT) to enhance recurrent propagation in the frequency domain. The hierarchical recurrent propagation in HiRT consists of uni-directional backward and forward stages and a bi-directional stage; this multi-stage structure can handle various types of motion. HiRT comprises three transformer modules: the global transformer block, the local transformer block, and the image transformer block. The global transformer block improves the low-frequency features, which contain the global background information of a frame, while the high-frequency components are enhanced in the local transformer block. Alongside the image transformer block, incorporating discrete wavelet transform (DWT)-based transformer processing enhances both background and edge details. Experimental comparisons with state-of-the-art (SOTA) methods on benchmark datasets demonstrate the superiority of the proposed approach. HiRT outperforms all compared methods in terms of SSIM on the REDS4 and Vid4 benchmarks. In particular, HiRT surpasses VRT, the transformer-based SOTA method, by 0.32 dB PSNR and 0.0068 SSIM on REDS4, and achieves about 0.12 dB and 0.07 dB higher PSNR than BasicVSR++ on REDS4 and Vid4, respectively.
Moreover, HiRT achieves SSIM improvements of about 0.0133 and 0.0067 on REDS4 over Multi-Scale-T and LGDFNet-BPP, two recent VSR methods, respectively.
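As a rough illustration of the frequency split described above, a one-level 2-D Haar DWT separates a frame into one low-frequency subband (global background, handled by the global transformer block) and three high-frequency subbands (edge details, handled by the local transformer block). The NumPy sketch below is illustrative only and is not code from this repository; HiRT's actual DWT-based processing operates on learned feature maps inside the network.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT: split a frame into one low-frequency
    subband (LL) and three high-frequency subbands (LH, HL, HH)."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low frequency: global background
    lh = (a + b - c - d) / 2.0   # high frequency: horizontal detail
    hl = (a - b + c - d) / 2.0   # high frequency: vertical detail
    hh = (a - b - c + d) / 2.0   # high frequency: diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse Haar DWT: perfect reconstruction of the frame,
    so the subbands can be enhanced separately and recombined."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

Because the transform is invertible, enhancing LL and the detail subbands independently and then applying the inverse transform recovers a full-resolution frame with both background and edges improved.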
- Anaconda3
- Python == 3.8
conda create --name hirt python=3.8
- Trained on PyTorch 1.9.1 and CUDA 11.1
Run in ./
pip install -r requirements.txt
BASICSR_EXT=True python setup.py develop
We used REDS and Vimeo90K datasets for training and Vid4 and REDS4 datasets for testing.
Prepare for REDS and REDS4
- Please refer to Dataset.md in our Deep-Video-Super-Resolution repository for more details.
- Download the dataset from the official website.
Prepare for Vimeo90K
- Please refer to Dataset.md in our Deep-Video-Super-Resolution repository for more details.
- Download the dataset from the official website.
- Generate LR data. Run in ./scripts/:
python generate_LR_Vimeo90K.py
Prepare for Vid4
- Please refer to Dataset.md in our Deep-Video-Super-Resolution repository for more details.
- Download the dataset from here.
- Generate LR data. Run in ./scripts/:
python generate_LR_Vid4.py
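The generate_LR scripts above produce the ×4 downsampled LR inputs from the HR frames (VSR pipelines typically use MATLAB-style bicubic resampling for this). As a minimal stand-in to show the HR→LR shape relation only, the NumPy sketch below uses 4×4 block averaging; it is not the resampling kernel the repository's scripts use.

```python
import numpy as np

def downsample_x4(hr, scale=4):
    """Illustrative x4 HR -> LR downsampling by block averaging.
    Crops the frame so its height/width divide evenly by `scale`,
    then averages each scale x scale block into one LR pixel."""
    h, w = hr.shape[:2]
    h, w = h - h % scale, w - w % scale   # crop to a multiple of scale
    hr = hr[:h, :w]
    # reshape to (H/4, 4, W/4, 4, C) and average each 4x4 block
    return hr.reshape(h // scale, scale, w // scale, scale, -1).mean(axis=(1, 3))
```

A 180×320 LR frame produced this way corresponds to a 720×1280 HR frame, matching the ×4 setting evaluated on REDS4 and Vid4.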
Pre-trained models are available at the link below.
Please save the pre-trained models to './experiments/pretrained_models/HiRT/'.
Run in ./
- REDS
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./dist_train.sh 8 ./options/train_HiRT_REDS.yml
- Vimeo90K
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ./dist_train.sh 8 ./options/train_HiRT_Vimeo90K.yml
Run in ./
- REDS4
CUDA_VISIBLE_DEVICES=0 bash ./dist_test.sh 1 ./options/test_HiRT_REDS4.yml
- Vid4
CUDA_VISIBLE_DEVICES=0 bash ./dist_test.sh 1 ./options/test_HiRT_Vid4.yml
The code is heavily based on BasicSR and PSRT. Thanks for their awesome work.
BasicSR :
@misc{basicsr,
author = {Xintao Wang and Liangbin Xie and Ke Yu and Kelvin C.K. Chan and Chen Change Loy and Chao Dong},
title = {{BasicSR}: Open Source Image and Video Restoration Toolbox},
howpublished = {\url{https://github.com/XPixelGroup/BasicSR}},
year = {2022}
}

PSRT :
@article{shi2022rethinking,
title={Rethinking Alignment in Video Super-Resolution Transformers},
author={Shi, Shuwei and Gu, Jinjin and Xie, Liangbin and Wang, Xintao and Yang, Yujiu and Dong, Chao},
journal={arXiv preprint arXiv:2207.08494},
year={2022}
}



