This repo contains the official PyTorch implementation of our paper: Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection.
- 2024-12-04: Training and test code is uploaded.
- 2024-07-23: The DVI-VI/OP/CP data are uploaded.
- 2024-07-21: Mumpy is accepted by BMVC 2024!
- 2024-05-08: The repo is created.
- Provide the generation code for YTVI.
- The pre-trained weights will be uploaded soon.
- Make model and code available.
- Make user guidelines available.
- Provide the YTVI, DVI and FVI datasets.
- More analysis on why the multi-pyramid decoder is used.
- Why is Mumpy not compared with VIFST: Video Inpainting Localization Using Multi-view Spatial-Frequency Traces (PRICAI 2023)?
- First, although the authors once released the source code, no pre-trained weights were provided, so we could not reproduce the results.
- Second, when evaluating IoU and F1, the authors used OpenCV operations to enhance the appearance of the predictions, which is both unreasonable and unfair.
- The relevant source code is shown below, and the reported VIFST results were produced by this operation.
```python
# Quoted from the VIFST repository; `to_binary` is defined elsewhere in their code.
import cv2

def process_mask(file, bin=False):
    if bin:
        gray = file
    else:
        img = cv2.imread(file)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # restrict to [0 - 255]
    _, threshold = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_MASK)
    # noise removal
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    opening = cv2.morphologyEx(threshold, cv2.MORPH_OPEN, kernel, iterations=2)
    sure_bg = cv2.dilate(opening, kernel, iterations=2)  # sure background area
    sure_op = cv2.erode(opening, kernel, iterations=2)  # sure foreground area
    sure_fg = cv2.erode(sure_bg, kernel, iterations=2)  # sure foreground area
    # cv2.imshow('1', to_binary(sure_fg))
    # cv2.imshow('2', to_binary(sure_op))
    # cv2.imshow('gray', gray)
    # cv2.waitKey(-1)
    return to_binary(sure_fg), to_binary(sure_op)
```
- Regarding prior work on video inpainting detection, I have confirmed with the first author that all prediction results were obtained using a threshold of 0.5. Here, I call on all researchers in this field to ensure fair comparisons. Additionally, the DVI dataset contains relatively few samples, and most of its videos exhibit high frame density, resulting in minimal motion across consecutive frames. As a result, spatial inconsistency cues play a predominant role. This explains why current SOTA image manipulation detection methods achieve highly competitive performance on the DVI dataset. Therefore, the YTVI dataset may serve as a more reliable benchmark for evaluation in this field.
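For reference, a fair evaluation only needs a plain 0.5 threshold on the predicted probability maps, with no morphological enhancement. A minimal sketch (the path argument is a placeholder):

```python
# Minimal sketch: binarize a predicted localization map at 0.5,
# without any morphological "enhancement" of the prediction.
import cv2
import numpy as np

def binarize_prediction(pred_path, thr=0.5):
    pred = cv2.imread(pred_path, cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
    return (pred >= thr).astype(np.uint8) * 255  # 0/255 binary mask
```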
```
./Mumpy
├── scripts (train or test scripts folder)
│   ├── train_davis.sh (script for training on DVI)
│   ├── train_youtube.sh (script for training on YTVI)
│   ├── measure.sh (script for measuring F1 and IoU)
│   └── test.sh (script for generating localization results)
├── configs (training or test configuration folder)
│   ├── davis (folder)
│   │   ├── config.py (DAVIS inpainting dataset configuration file)
│   │   └── db_info.yaml (DAVIS dataset basic information file)
│   └── youtube (folder)
│       ├── config.py (YouTube inpainting dataset configuration file)
│       └── youtubevos_2018.yaml (YouTube-VOS 2018 dataset basic information file)
├── dataloaders (dataloader folder)
│   ├── base.py (functional class)
│   ├── universaldataloader.py
│   └── universaldataset.py
├── models (encoder, decoder, model factory, modules)
│   ├── encoder (folder)
│   │   ├── encoder.py (overall encoder structure of Mumpy, baseline)
│   │   └── multiTemporalViewEncoder.py (implementation of multiTemporalViewEncoder)
│   ├── decoder (folder)
│   │   └── decoder.py (multi-pyramid decoder)
│   ├── factory (folder)
│   │   └── modelFactory.py (encoder factory)
│   └── modules (folder)
│       ├── blocks.py (ViT blocks)
│       ├── dct.py
│       ├── deformableAttention.py
│       └── swinTransformer.py
├── utils (util folder)
│   ├── optimizer (folder)
│   │   ├── scheduler.py
│   │   └── factory.py
│   ├── dataset_utils.py
│   ├── io_aux.py
│   ├── loss.py
│   ├── randaugment.py (data augmentation)
│   └── utils.py (load or save model, ...)
├── train.py (training script)
├── test.py (generates localization results)
├── measure.py (localization evaluation)
├── weights (ImageNet pre-trained weight folder)
├── results (trained model folder)
│   └── weights (put the pre-trained weights here)
└── images (images folder)
```
The code was run on:
- Ubuntu 22.04 LTS, Python 3.9, CUDA 11.7, GeForce RTX 3090 Ti
- Create the environment by running:
```bash
conda create -n mumpy python=3.9
# activate the environment before installing packages
conda activate mumpy
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
```
- Go to mumpy_weights_link to download the pre-trained weights, and then put them in `weights`.
Mumpy is trained on DVI or YTVI. Take training on DVI as an example:
- First, replace `__C.PATH.SEQUENCES`, `__C.PATH.SEQUENCES2` and `__C.PATH.SEQUENCES3` in `config.py` with your training data paths (see the configuration sketch after this list).
- The data-related global variables in the configuration file also need to be modified (e.g., `__C.PATH.ANNOTATIONS`, `__C.PATH.DATA`, `__C.FILES.DB_INFO`).
- Training Mumpy requires changing the parameters in the above-mentioned script; adjust them as needed.
- Run `./scripts/train_davis.sh`.
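The snippet below is a hypothetical sketch of the kind of edits needed in `./configs/davis/config.py`: the variable names come from the repo, while the `easydict` layout and all paths are placeholder assumptions and may not match the actual file.

```python
# Hypothetical excerpt of configs/davis/config.py; adjust the paths to your local data.
import os.path as osp
from easydict import EasyDict as edict

__C = edict()
__C.PATH = edict()
__C.FILES = edict()

__C.PATH.DATA = '/data/DVI'                                       # dataset root (placeholder)
__C.PATH.SEQUENCES  = osp.join(__C.PATH.DATA, 'JPEGImages_VI')    # frames inpainted by VI (placeholder)
__C.PATH.SEQUENCES2 = osp.join(__C.PATH.DATA, 'JPEGImages_OP')    # frames inpainted by OP (placeholder)
__C.PATH.SEQUENCES3 = osp.join(__C.PATH.DATA, 'JPEGImages_CP')    # frames inpainted by CP (placeholder)
__C.PATH.ANNOTATIONS = osp.join(__C.PATH.DATA, 'Annotations')     # ground-truth masks (placeholder)
__C.FILES.DB_INFO = 'configs/davis/db_info.yaml'                  # dataset information file
```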
Mumpy is tested on DVI, FVI and YTVI. Take testing on DVI as an example:
- Run `./scripts/test.sh`, where you can change `model_name` and `test_epoch`.
- When testing on different datasets, you only need to change the dataset parameter; for example, DVI corresponds to `davis` and YTVI corresponds to `youtubevos`. All the commands have already been provided in the file.
- Run `./scripts/measure.sh`, where `input` is the generated localization results and `mask_dir` is the corresponding ground truth (a minimal IoU/F1 sketch follows this list).
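For reference, below is a minimal sketch of per-frame IoU and F1 computed from binarized predictions and ground-truth masks; it is not the exact logic of `measure.py`.

```python
# Minimal per-frame IoU/F1 on binary masks (not the exact logic of measure.py).
import numpy as np

def iou_f1(pred_bin: np.ndarray, gt_bin: np.ndarray, eps: float = 1e-8):
    pred_bin = pred_bin.astype(bool)
    gt_bin = gt_bin.astype(bool)
    tp = np.logical_and(pred_bin, gt_bin).sum()   # true positives
    fp = np.logical_and(pred_bin, ~gt_bin).sum()  # false positives
    fn = np.logical_and(~pred_bin, gt_bin).sum()  # false negatives
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1
```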
Since the size of the images generated by inpainting methods can influence the richness of the information they provide, for a fair comparison we generated 224x224 inpainted images with OP and CP. However, since VI only supports 256x256 and 512x512 images, we resized its outputs accordingly. All the experiments on DVI follow this principle, as sketched below.
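The snippet below sketches this resizing convention; the function name and paths are placeholders, not part of the repo.

```python
# Resize a VI-inpainted frame (256x256 or 512x512) down to 224x224.
import cv2

def resize_vi_frame(in_path: str, out_path: str, size=(224, 224)):
    img = cv2.imread(in_path)
    cv2.imwrite(out_path, cv2.resize(img, size, interpolation=cv2.INTER_LINEAR))
```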
- YTVI is built upon YouTube-VOS 2018, which contains 3471 videos with 5945 object instances in its training set. Since only the training set of this dataset is fully annotated, we use it to construct YTVI.
- Specifically, with the goal of further improving the comprehensiveness, we adopt many more recent video inpainting methods on this dataset, including EG2 [1], FF [2], PP [3], and ISVI [4], together with VI, OP and CP. These inpainting methods are applied to the object regions annotated by ground truth masks.
- The file `./configs/youtube/youtubevos_2018.yaml` contains the videos selected for YTVI. Simply follow the instructions of each method to generate the corresponding inpainted videos, and you will obtain the YTVI dataset (a hypothetical loading sketch follows the reference list below).
[1] CVPR 2022 Towards an end-to-end framework for flow-guided video inpainting.
[2] ICCV 2021 Fuseformer: Fusing fine-grained information in transformers for video inpainting.
[3] ICCV 2023 ProPainter: Improving propagation and transformer for video inpainting.
[4] CVPR 2022 Inertia-guided flow completion and style fusion for video inpainting.
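Below is a hypothetical helper for enumerating the selected videos; it assumes the YAML stores sequence entries under a top-level `sequences` key, which may not match the actual schema of `youtubevos_2018.yaml`.

```python
# Hypothetical: list the YouTube-VOS videos selected for YTVI from the YAML file.
# Assumes a top-level "sequences" key; adapt to the actual schema of youtubevos_2018.yaml.
import yaml

def list_ytvi_sequences(path="./configs/youtube/youtubevos_2018.yaml"):
    with open(path, "r") as f:
        info = yaml.safe_load(f)
    return [seq["name"] if isinstance(seq, dict) else seq
            for seq in info.get("sequences", [])]

if __name__ == "__main__":
    for name in list_ytvi_sequences():
        # Run each inpainting method (VI/OP/CP/EG2/FF/PP/ISVI) on this video's object regions.
        print(name)
```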
The FVI dataset contains 100 test videos processed by object removal and is usually used to demonstrate detection generalization. The download link is here.
If you have any questions, please feel free to reach out to me at yingzhang@stu.ouc.edu.cn.
If you find our repo useful for your research, please consider citing our paper:
@inproceedings{Zhang_2024_BMVC,
author = {Ying Zhang and Yuezun Li and Bo Peng and Jiaran Zhou and Huiyu Zhou and Junyu Dong},
title = {Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year = {2024},
url = {https://papers.bmvc2024.org/0318.pdf}
}