This is the official PyTorch implementation of the paper "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution"
Zhicheng Geng*, Luming Liang*, Tianyu Ding and Ilya Zharkov
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Space-time video super-resolution (STVSR) is the task of interpolating videos that have both Low Frame Rate (LFR) and Low Resolution (LR) to produce High-Frame-Rate (HFR) and High-Resolution (HR) counterparts. Existing methods based on Convolutional Neural Networks (CNNs) achieve visually satisfying results but suffer from slow inference speed due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not explicitly use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use a single end-to-end transformer architecture. Specifically, the encoders build a reusable dictionary from the input LFR and LR frames, which the decoder then uses to synthesize the HFR and HR frames.
Below is the performance of RSTT on the Vid4 dataset using small (S), medium (M) and large (L) architectures, compared to other baseline models. We plot FPS versus PSNR; note that 24 FPS is the standard cinematic frame rate. We also plot the number of parameters (in millions) versus PSNR.
The features extracted from four input LFR and LR frames are processed by encoders Ek, k = 0, 1, 2, 3 to build dictionaries that will be used as inputs for the decoders Dk, k = 0, 1, 2, 3. The query builder generates a vector of queries Q which are then used to synthesize a sequence of seven consecutive HFR and HR frames.
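The four-in, seven-out relationship described above is consistent with placing one additional time step between each pair of neighbouring input frames (2 * 4 - 1 = 7). The sketch below only illustrates that index bookkeeping. The function name `output_frame_roles` and the assumption that input frames fall on even output positions are illustrative and are not taken from the RSTT code.

```python
def output_frame_roles(num_inputs=4):
    """Label each output time step relative to the input frames.

    Assumption (not from the RSTT source): one extra time step is placed
    between each pair of consecutive inputs, giving 2 * n - 1 outputs.
    """
    if num_inputs < 2:
        raise ValueError("need at least two input frames to interpolate")
    num_outputs = 2 * num_inputs - 1
    roles = []
    for t in range(num_outputs):
        if t % 2 == 0:
            # Same time step as input frame t // 2 (still upscaled to HR).
            roles.append(("input", t // 2))
        else:
            # New time step lying between two neighbouring input frames.
            roles.append(("between", t // 2, t // 2 + 1))
    return roles


roles = output_frame_roles(4)
print(len(roles))  # 7
for t, role in enumerate(roles):
    print(t, role)
```

Every output frame is high resolution in this picture; the "input" label only marks time steps that already exist in the low-frame-rate sequence.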
CUDA 11.4
Python 3.8.11
torch 1.9.0 or higher
$ git clone https://github.com/llmpass/RSTT.git
$ cd RSTT
$ pip install -r requirements.txt
Note that the torch version must be compatible with the installed CUDA version; it does not have to be 1.9.0. For example, with CUDA 11.X, torch 1.9.0 may be too old and can cause errors such as:
CUDA error: no kernel image is available for execution on the device
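One way to diagnose this error is to compare the GPU's compute capability with the architectures the installed torch build was compiled for. The sketch below is a coarse check under stated assumptions: the helper names `build_supports_device` and `report` are hypothetical, and the matching rule (binary `sm_XY` code runs on the same major version with an equal or newer minor version, PTX `compute_XY` code can be compiled at load time for equal or newer devices) is a simplification of CUDA's compatibility rules.

```python
import re


def build_supports_device(capability, arch_list):
    """Coarse check: can a build with these arch targets run on this GPU?

    capability: (major, minor) compute capability of the GPU.
    arch_list:  strings such as "sm_80" (binary) or "compute_80" (PTX).
    """
    major, minor = capability
    for arch in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", arch)
        if m is None:
            continue  # ignore entries this sketch does not understand
        kind, a_major, a_minor = m.group(1), int(m.group(2)), int(m.group(3))
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # binary code: same major, equal-or-older minor target
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX: can be JIT-compiled for equal-or-newer devices
    return False


def report():
    """Print the installed torch / CUDA combination, if torch is importable."""
    try:
        import torch
    except (ImportError, OSError):
        print("torch could not be imported")
        return
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("no usable CUDA device found")
        return
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print("device capability:", cap, "| build targets:", archs)
    print("kernel image likely available:", build_supports_device(cap, archs))


if __name__ == "__main__":
    report()
```

If the last line prints `False`, installing a torch build that targets the GPU's architecture is the likely fix.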
Download the Vimeo-90K septuplet dataset for training and evaluation:
http://toflow.csail.mit.edu/index.html#septuplet
Choose "The original training + test set (82GB)".
cp datasets/vimeo_septuplet/*.txt /path/to/vimeo/
python ./datasets/prepare_vimeo.py --path /path/to/vimeo/
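Before running the preparation script on an 82GB download, it can help to confirm that the archive unpacked completely. The sketch below assumes the usual septuplet layout of `sequences/<folder>/<clip>/im1.png` through `im7.png` under the dataset root; that layout, the constant `FRAME_NAMES`, and the helper name `find_incomplete_septuplets` are assumptions for illustration and are not part of the RSTT scripts.

```python
import tempfile
from pathlib import Path

# Assumed frame file names inside each clip folder.
FRAME_NAMES = [f"im{i}.png" for i in range(1, 8)]


def find_incomplete_septuplets(root):
    """Return (clip, missing_frames) for clips under root/sequences lacking frames."""
    seq_root = Path(root) / "sequences"
    incomplete = []
    for clip_dir in sorted(seq_root.glob("*/*")):
        if not clip_dir.is_dir():
            continue
        missing = [n for n in FRAME_NAMES if not (clip_dir / n).is_file()]
        if missing:
            incomplete.append((clip_dir.relative_to(seq_root).as_posix(), missing))
    return incomplete


# Small demonstration on a throwaway directory with one deliberately broken clip.
with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "sequences" / "00001" / "0001"
    clip.mkdir(parents=True)
    for name in FRAME_NAMES[:6]:  # leave out im7.png on purpose
        (clip / name).touch()
    print(find_incomplete_septuplets(tmp))  # [('00001/0001', ['im7.png'])]
```

On a real download, calling `find_incomplete_septuplets("/path/to/vimeo/")` and getting an empty list suggests every clip folder has all seven frames.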
Download the Vid4 dataset for evaluation:
https://drive.google.com/drive/folders/10-gUO6zBeOpWEamrWKCtSkkUFukB9W5m
To train, make sure the yml config file has settings that point to the correct paths, for example:
python train.py --config ./configs/RSTT-S.yml
To evaluate on Vid4, make sure the yml config file has settings that point to the correct paths, for example:
python eval_vid4.py --config ./configs/RSTT-S-eval-vid4.yml
To evaluate on Vimeo-90K, make sure the yml config file has settings that point to the correct paths, for example:
python eval_vimeo90k.py --config ./configs/RSTT-S-eval-vimeo90k.yml
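Since all three commands depend on a config whose paths are correct, a quick pre-flight check can catch typos before a long run starts. The sketch below walks an already-loaded config (a nested dict or list) and reports string values that look like paths but do not exist on disk. The helper names, the "contains a slash" heuristic, and the example keys are hypothetical; the actual keys used in the RSTT yml files are not assumed here.

```python
import os


def looks_like_path(value):
    """Heuristic (an assumption): treat strings containing a separator as paths."""
    return "/" in value or os.sep in value


def missing_paths(config, prefix=""):
    """Recursively collect (setting_name, value) pairs whose path does not exist."""
    if isinstance(config, dict):
        items = config.items()
    elif isinstance(config, list):
        items = enumerate(config)
    else:
        return []
    missing = []
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            missing.extend(missing_paths(value, name))
        elif isinstance(value, str) and looks_like_path(value):
            if not os.path.exists(value):
                missing.append((name, value))
    return missing


# Hypothetical config contents, used only to demonstrate the check.
example = {
    "train": {"data_root": "/definitely/not/a/real/dir", "batch_size": 8},
    "output_dir": os.getcwd(),  # exists, so it is not reported
}
for name, value in missing_paths(example):
    print(f"missing: {name} -> {value}")
```

With PyYAML installed, the config dict can be obtained with `yaml.safe_load` on the opened yml file and passed to `missing_paths`. Output directories that a script creates on its own may be reported as missing, so the result is a hint rather than a verdict.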
@article{geng2022rstt,
title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
journal={arXiv preprint arXiv:2203.14186},
year={2022}
}
or
@inproceedings{geng2022rstt,
title={RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution},
author={Zhicheng Geng and Luming Liang and Tianyu Ding and Ilya Zharkov},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={17441--17451},
year={2022}
}
Our code is built on Zooming-Slow-Mo, EDVR, UFormer, and Swin-Transformer. We thank the authors for sharing their code.
The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for NonCommercial use only. Any commercial use should get formal permission first.