This repo contains the official PyTorch implementation of our paper: Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection.
- 2024-12-04: Training and test code is uploaded.
- 2024-07-23: The DVI-VI/OP/CP data are uploaded.
- 2024-07-21: Mumpy is accepted by BMVC 2024!
- 2024-05-08: The repo is created.
- Provide the generation code for YTVI.
- The pre-trained weights will be uploaded soon.
- Make model and code available.
- Make user guidelines available.
- Provide the YTVI, DVI and FVI datasets.
- More analysis on why the multi-pyramid decoder is used.
- Why is Mumpy not compared with VIFST: Video Inpainting Localization Using Multi-view Spatial-Frequency Traces (PRICAI 2023)?
- First, although the authors once released the source code, no pre-trained weights were provided, so we could not reproduce the results.
- Second, when evaluating IoU and F1, the authors used OpenCV operations to enhance the appearance of the predictions, which is both unreasonable and unfair.
- The relevant source code is shown below, and the reported VIFST results were produced by this operation.
```python
# Quoted from the VIFST repository; `to_binary` is defined elsewhere in their code.
import cv2

def process_mask(file, bin=False):
    if bin:
        gray = file
    else:
        img = cv2.imread(file)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # restrict to [0 - 255]
    _, threshold = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_MASK)
    # noise removal
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    opening = cv2.morphologyEx(threshold, cv2.MORPH_OPEN, kernel, iterations=2)
    sure_bg = cv2.dilate(opening, kernel, iterations=2)  # sure background area
    sure_op = cv2.erode(opening, kernel, iterations=2)  # sure foreground area
    sure_fg = cv2.erode(sure_bg, kernel, iterations=2)  # sure foreground area
    # cv2.imshow('1', to_binary(sure_fg))
    # cv2.imshow('2', to_binary(sure_op))
    # cv2.imshow('gray', gray)
    # cv2.waitKey(-1)
    return to_binary(sure_fg), to_binary(sure_op)
```
- Regarding prior work on video inpainting detection, I have confirmed with the first author that all prediction results were obtained using a threshold of 0.5. Here, I call on all researchers in this field to ensure fair comparisons. Additionally, the DVI dataset contains relatively few samples, and most of its videos exhibit high frame density, resulting in minimal motion across consecutive frames. As a result, spatial inconsistency cues play a predominant role. This explains why current SOTA image manipulation detection methods achieve highly competitive performance on the DVI dataset. Therefore, the YTVI dataset may serve as a more reliable benchmark for evaluation in this field.
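For reference, a fair evaluation only needs a plain 0.5 threshold on the predicted probability maps, with no morphological enhancement. A minimal sketch (the path argument is a placeholder):

```python
# Minimal sketch: binarize a predicted localization map at 0.5,
# without any morphological "enhancement" of the prediction.
import cv2
import numpy as np

def binarize_prediction(pred_path, thr=0.5):
    pred = cv2.imread(pred_path, cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
    return (pred >= thr).astype(np.uint8) * 255  # 0/255 binary mask
```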
```
./Mumpy
├── scripts (train or test scripts folder)
│   ├── train_davis.sh (script for training on DVI)
│   ├── train_youtube.sh (script for training on YTVI)
│   ├── measure.sh (script for measuring F1 and IoU)
│   └── test.sh (script for generating localization results)
├── configs (training or test configuration folder)
│   ├── davis (folder)
│   │   ├── config.py (DAVIS inpainting dataset configuration file)
│   │   └── db_info.yaml (DAVIS dataset basic information file)
│   └── youtube (folder)
│       ├── config.py (YouTube inpainting dataset configuration file)
│       └── youtubevos_2018.yaml (YouTube-VOS 2018 dataset basic information file)
├── dataloaders (dataloader folder)
│   ├── base.py (functional class)
│   ├── universaldataloader.py
│   └── universaldataset.py
├── models (encoder, decoder, model factory, modules)
│   ├── encoder (folder)
│   │   ├── encoder.py (overall encoder structure of Mumpy, baseline)
│   │   └── multiTemporalViewEncoder.py (implementation of multiTemporalViewEncoder)
│   ├── decoder (folder)
│   │   └── decoder.py (multi-pyramid decoder)
│   ├── factory (folder)
│   │   └── modelFactory.py (encoder factory)
│   └── modules (folder)
│       ├── blocks.py (ViT blocks)
│       ├── dct.py
│       ├── deformableAttention.py
│       └── swinTransformer.py
├── utils (util folder)
│   ├── optimizer (folder)
│   │   ├── scheduler.py
│   │   └── factory.py
│   ├── dataset_utils.py
│   ├── io_aux.py
│   ├── loss.py
│   ├── randaugment.py (data augmentation)
│   └── utils.py (load or save model, ...)
├── train.py (training script)
├── test.py (generates localization results)
├── measure.py (localization evaluation)
├── weights (ImageNet pre-trained weight folder)
├── results (trained model folder)
│   └── weights (put the pre-trained weights here)
└── images (images folder)
```
The code was run on:
- Ubuntu 22.04 LTS, Python 3.9, CUDA 11.7, GeForce RTX 3090 Ti
- Create the environment by running:
```bash
conda create -n mumpy python=3.9
# activate the environment before installing packages
conda activate mumpy
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
```
- Go to mumpy_weights_link to download the pre-trained weights, and then put them in `weights`.
Mumpy is trained on DVI or YTVI. Take training on DVI as an example:
- First, replace `__C.PATH.SEQUENCES`, `__C.PATH.SEQUENCES2` and `__C.PATH.SEQUENCES3` in `config.py` with your training data paths (see the configuration sketch after this list).
- The data-related global variables in the configuration file also need to be modified (e.g., `__C.PATH.ANNOTATIONS`, `__C.PATH.DATA`, `__C.FILES.DB_INFO`).
- Training Mumpy requires changing the parameters in the above-mentioned script; adjust them as needed.
- Run `./scripts/train_davis.sh`.
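The snippet below is a hypothetical sketch of the kind of edits needed in `./configs/davis/config.py`: the variable names come from the repo, while the `easydict` layout and all paths are placeholder assumptions and may not match the actual file.

```python
# Hypothetical excerpt of configs/davis/config.py; adjust the paths to your local data.
import os.path as osp
from easydict import EasyDict as edict

__C = edict()
__C.PATH = edict()
__C.FILES = edict()

__C.PATH.DATA = '/data/DVI'                                       # dataset root (placeholder)
__C.PATH.SEQUENCES  = osp.join(__C.PATH.DATA, 'JPEGImages_VI')    # frames inpainted by VI (placeholder)
__C.PATH.SEQUENCES2 = osp.join(__C.PATH.DATA, 'JPEGImages_OP')    # frames inpainted by OP (placeholder)
__C.PATH.SEQUENCES3 = osp.join(__C.PATH.DATA, 'JPEGImages_CP')    # frames inpainted by CP (placeholder)
__C.PATH.ANNOTATIONS = osp.join(__C.PATH.DATA, 'Annotations')     # ground-truth masks (placeholder)
__C.FILES.DB_INFO = 'configs/davis/db_info.yaml'                  # dataset information file
```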
Mumpy is tested on DVI, FVI and YTVI. Take testing on DVI as an example:
- Run `./scripts/test.sh`, where you can change `model_name` and `test_epoch`.
- When testing on different datasets, you only need to change the dataset parameter; for example, DVI corresponds to `davis` and YTVI corresponds to `youtubevos`. All the commands have already been provided in the file.
- Run `./scripts/measure.sh`, where `input` is the generated localization results and `mask_dir` is the corresponding ground truth (a minimal IoU/F1 sketch follows this list).
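For reference, below is a minimal sketch of per-frame IoU and F1 computed from binarized predictions and ground-truth masks; it is not the exact logic of `measure.py`.

```python
# Minimal per-frame IoU/F1 on binary masks (not the exact logic of measure.py).
import numpy as np

def iou_f1(pred_bin: np.ndarray, gt_bin: np.ndarray, eps: float = 1e-8):
    pred_bin = pred_bin.astype(bool)
    gt_bin = gt_bin.astype(bool)
    tp = np.logical_and(pred_bin, gt_bin).sum()   # true positives
    fp = np.logical_and(pred_bin, ~gt_bin).sum()  # false positives
    fn = np.logical_and(~pred_bin, gt_bin).sum()  # false negatives
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1
```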
Since the size of the images generated by inpainting methods can influence the richness of the information they provide, for a fair comparison we generated 224x224 inpainted images with OP and CP. However, since VI only supports 256x256 and 512x512 images, we resized its outputs accordingly. All the experiments on DVI follow this principle, as sketched below.
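The snippet below sketches this resizing convention; the function name and paths are placeholders, not part of the repo.

```python
# Resize a VI-inpainted frame (256x256 or 512x512) down to 224x224.
import cv2

def resize_vi_frame(in_path: str, out_path: str, size=(224, 224)):
    img = cv2.imread(in_path)
    cv2.imwrite(out_path, cv2.resize(img, size, interpolation=cv2.INTER_LINEAR))
```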
- YTVI is built upon YouTube-VOS 2018, which contains 3471 videos with 5945 object instances in its training set. Since only the training set of this dataset is fully annotated, we use it to construct YTVI.
- Specifically, with the goal of further improving the comprehensiveness, we adopt many more recent video inpainting methods on this dataset, including EG2 [1], FF [2], PP [3], and ISVI [4], together with VI, OP and CP. These inpainting methods are applied to the object regions annotated by ground truth masks.
- The file `./configs/youtube/youtubevos_2018.yaml` contains the videos selected for YTVI. Simply follow the instructions of each method to generate the corresponding inpainted videos, and you will obtain the YTVI dataset (a hypothetical loading sketch follows the reference list below).
[1] CVPR 2022 Towards an end-to-end framework for flow-guided video inpainting.
[2] ICCV 2021 Fuseformer: Fusing fine-grained information in transformers for video inpainting.
[3] ICCV 2023 ProPainter: Improving propagation and transformer for video inpainting.
[4] CVPR 2022 Inertia-guided flow completion and style fusion for video inpainting.
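Below is a hypothetical helper for enumerating the selected videos; it assumes the YAML stores sequence entries under a top-level `sequences` key, which may not match the actual schema of `youtubevos_2018.yaml`.

```python
# Hypothetical: list the YouTube-VOS videos selected for YTVI from the YAML file.
# Assumes a top-level "sequences" key; adapt to the actual schema of youtubevos_2018.yaml.
import yaml

def list_ytvi_sequences(path="./configs/youtube/youtubevos_2018.yaml"):
    with open(path, "r") as f:
        info = yaml.safe_load(f)
    return [seq["name"] if isinstance(seq, dict) else seq
            for seq in info.get("sequences", [])]

if __name__ == "__main__":
    for name in list_ytvi_sequences():
        # Run each inpainting method (VI/OP/CP/EG2/FF/PP/ISVI) on this video's object regions.
        print(name)
```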
The FVI dataset contains 100 test videos processed by object removal and is usually used to demonstrate detection generalization. The download link is here.
If you have any questions, please feel free to reach out to me at yingzhang@stu.ouc.edu.cn.
If you find our repo useful for your research, please consider citing our paper:
@inproceedings{Zhang_2024_BMVC,
author = {Ying Zhang and Yuezun Li and Bo Peng and Jiaran Zhou and Huiyu Zhou and Junyu Dong},
title = {Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year = {2024},
url = {https://papers.bmvc2024.org/0318.pdf}
}