Official PyTorch code for PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
If PROVE is helpful to your projects, please consider giving this repo a star. Thanks!
PROVE is a unified evaluation framework for object removal in images and videos, addressing the critical gap between existing metrics and human perception. It consists of:
- RC-S (Removal Coherence - Spatial): Measures how well the inpainted region blends with the surrounding background within a single frame, via sliding-window MMD on DINOv2 patch features (see the sketch after this list).
- RC-T (Removal Coherence - Temporal): Measures temporal coherence of the inpainted region across consecutive frames via distribution tracking within shared restored regions.
- PROVE-Bench: A two-tier real-world benchmark comprising PROVE-M (80 motion-augmented paired videos with GT) and PROVE-H (100 challenging videos without GT).
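For intuition, the sketch below shows an unbiased RBF-kernel MMD² of the kind that underlies both scores: RC-S compares DINOv2 patch features inside the mask against features from surrounding background windows within a frame, while RC-T compares the restored region's features across consecutive frames. This is an illustrative toy, not the repository implementation; the kernel choice, bandwidth, and feature dimension here are assumptions, and the sliding-window extraction and score aggregation are omitted.

```python
import torch

def mmd2_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Unbiased squared MMD between feature sets x (n, d) and y (m, d) with an RBF kernel."""
    k_xx = torch.exp(-torch.cdist(x, x).pow(2) / (2 * sigma ** 2))
    k_yy = torch.exp(-torch.cdist(y, y).pow(2) / (2 * sigma ** 2))
    k_xy = torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))
    n, m = x.shape[0], y.shape[0]
    # Drop the diagonal self-similarity terms for the unbiased estimate.
    return ((k_xx.sum() - k_xx.trace()) / (n * (n - 1))
            + (k_yy.sum() - k_yy.trace()) / (m * (m - 1))
            - 2 * k_xy.mean())

# Toy usage with random stand-ins for DINOv2-giant patch features (dim 1536):
# inpainted-region patches vs. a background window (RC-S view), or the same
# region in two consecutive frames (RC-T view).
region = torch.randn(64, 1536)
background = torch.randn(128, 1536)
print(float(mmd2_rbf(region, background)))
```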
| Limitation | Existing Metrics | Our Solution |
|---|---|---|
| Full-Reference Bias | PSNR/SSIM/LPIPS reward copy-paste over genuine erasure | RC-S: no GT required, local region evaluation |
| No-Reference Blind Spots | ReMOVE/CFD favor blurry outputs | RC-S: DINOv2 + MMD robust to blur bias |
| Temporal Insensitivity | TC/TF dominated by unchanged background | RC-T: localized temporal distribution matching |
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ReMOVE↑ | CFD↓ | RC-S↑ | RC-T↓ |
|---|---|---|---|---|---|---|---|
| FGT | 21.6511 | 0.8619 | 0.2013 | 0.8622 | 0.3229 | 0.3797 | 0.8031 |
| ProPainter | 22.1846 | 0.8768 | 0.1559 | 0.8676 | 0.2774 | 0.4427 | 0.5951 |
| DiffuEraser | 22.0758 | 0.8706 | 0.1518 | 0.8681 | 0.3308 | 0.4787 | 0.4851 |
| VACE (1.3B) | 20.0826 | 0.8654 | 0.1545 | 0.8117 | 0.3283 | 0.4036 | 0.5217 |
| Minimax-Remover (1.3B) | 21.7476 | 0.8707 | 0.1542 | 0.8710 | 0.3202 | 0.4793 | 0.4485 |
| GenOmni (CogV5B) | 25.0165 | 0.9030 | 0.1223 | 0.8755 | 0.3842 | 0.5029 | 0.3145 |
| GenOmni (Wan1.3B) | 25.1480 | 0.9017 | 0.1109 | 0.8815 | 0.3457 | 0.5188 | 0.3238 |
| ROSE (1.3B) | 26.1333 | 0.9003 | 0.1212 | 0.8803 | 0.3364 | 0.4924 | 0.6538 |
| EffectErase (1.3B) | 27.0049 | 0.9098 | 0.1142 | 0.8841 | 0.3412 | 0.5270 | 0.2728 |
| UnderEraser (14B) | 28.3325 | 0.9156 | 0.0981 | 0.8824 | 0.2986 | 0.5188 | 0.3276 |
| SVOR (1.3B) | 27.4289 | 0.9239 | 0.0839 | 0.8836 | 0.2794 | 0.5236 | 0.2987 |
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ReMOVE↑ | CFD↓ | RC-S↑ | RC-T↓ |
|---|---|---|---|---|---|---|---|
| FGT | 29.4448 | 0.8615 | 0.1927 | 0.8474 | 0.3065 | 0.3716 | 0.5866 |
| ProPainter | 33.3531 | 0.9274 | 0.1063 | 0.8383 | 0.2830 | 0.3932 | 0.4453 |
| DiffuEraser | 31.4112 | 0.9178 | 0.1098 | 0.8440 | 0.3165 | 0.4387 | 0.3911 |
| VACE (1.3B) | 26.7266 | 0.8898 | 0.1071 | 0.8047 | 0.3288 | 0.4192 | 0.3438 |
| Minimax-Remover (1.3B) | 29.6021 | 0.8660 | 0.1315 | 0.8545 | 0.3320 | 0.4617 | 0.3277 |
| GenOmni (CogV5B) | 28.7643 | 0.8873 | 0.1183 | 0.8536 | 0.3516 | 0.5006 | 0.2141 |
| GenOmni (Wan1.3B) | 29.3140 | 0.8940 | 0.1027 | 0.8596 | 0.3422 | 0.5127 | 0.2368 |
| ROSE (1.3B) | 27.6261 | 0.8508 | 0.1402 | 0.8538 | 0.3361 | 0.4687 | 0.4373 |
| EffectErase (1.3B) | 24.3793 | 0.8156 | 0.1742 | 0.8532 | 0.3590 | 0.5081 | 0.2363 |
| UnderEraser (14B) | 27.4989 | 0.8485 | 0.1434 | 0.8560 | 0.3165 | 0.5075 | 0.2688 |
| SVOR (1.3B) | 27.5335 | 0.8907 | 0.1046 | 0.8574 | 0.3107 | 0.5166 | 0.2419 |
Note: Due to compliance requirements, the open-source data differs slightly from the data used in the paper. The results above are based on the open-source version and may exhibit minor numerical differences from the paper, but the overall trends remain consistent.
- Python 3.10+
- PyTorch 2.6+
- Transformers 4.51+
- opencv-python
- numpy
- scikit-image
- pandas
- tqdm
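The packages above can be installed with pip, for example (the version pins simply mirror the list above; adjust to your environment):

```bash
pip install "torch>=2.6" "transformers>=4.51" opencv-python numpy scikit-image pandas tqdm
```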
- Pretrained Models:
  - Download DINOv2-giant and update `DINO_PATH` in `run_prove_metrics.py`.
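A minimal loading sketch, assuming `DINO_PATH` points to a local copy of the Hugging Face `facebook/dinov2-giant` checkpoint loaded via `transformers` (the actual code inside `run_prove_metrics.py` may differ):

```python
# Sketch of what DINO_PATH is expected to point to; adapt to your local path.
from transformers import AutoImageProcessor, AutoModel

DINO_PATH = "/PATH/TO/dinov2-giant"  # local snapshot of facebook/dinov2-giant
processor = AutoImageProcessor.from_pretrained(DINO_PATH)
model = AutoModel.from_pretrained(DINO_PATH).eval()
```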
- Dataset Configuration:
  - Download the PROVE-Bench dataset from HuggingFace.
  - Update the `DATASET` dict in `utils/dataset.py` to match your setup:
```python
DATASET = {
    # Video datasets
    "PROVE-M": {
        "inputs": "/PATH/TO/RAW_VIDEOS",
        "masks": "/PATH/TO/MASKS",
        "type": "video"
    },
    "PROVE-H": {
        "inputs": "/PATH/TO/RAW_VIDEOS",
        "masks": "/PATH/TO/MASKS",
        "type": "video"
    },
    # Image dataset
    "rord": {
        "inputs": "/PATH/TO/RAW_IMAGES",
        "masks": "/PATH/TO/MASKS",
        "type": "image"
    }
}
```

Attention:
- Generated results must share the same filenames as the originals (extensions may differ).
- Masks are required for both metrics. White regions indicate the removed object.
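A hypothetical sanity check (not part of the repository) for the two requirements above, with placeholder paths; if your video masks are stored as per-video folders, adapt the globbing accordingly:

```python
from pathlib import Path
import cv2
import numpy as np

result_dir = Path("/PATH/TO/GENERATED_VIDEOS")
mask_dir = Path("/PATH/TO/MASKS")

# Pairing is by filename: stems must match even if extensions differ.
missing = {p.stem for p in result_dir.iterdir()} - {p.stem for p in mask_dir.iterdir()}
print("results without a matching mask:", sorted(missing) or "none")

# Spot-check that masks are binary with white (255) marking the removed object.
for mask_path in sorted(mask_dir.glob("*.png"))[:3]:
    mask = cv2.imread(str(mask_path), cv2.IMREAD_GRAYSCALE)
    assert set(np.unique(mask).tolist()) <= {0, 255}, f"{mask_path.name} is not binary"
```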
```
PROVE/
├── run_prove_metrics.py    # Main evaluation script
├── README.md
└── utils/
    ├── __init__.py
    ├── dataset.py          # Dataset configuration
    ├── media_utils.py      # Video/image I/O and pairing
    ├── metrics.py          # RC-S and RC-T implementations
    ├── bbox.py             # Bounding box utilities
    └── predictors.py       # DINOv2 feature predictor
```
Evaluate a video dataset:

```bash
python run_prove_metrics.py \
    --dataset PROVE-M \
    --result_dir /PATH/TO/GENERATED_VIDEOS \
    --metrics rc_s rc_t \
    --out_csv results.csv
```

Evaluate an image dataset:

```bash
python run_prove_metrics.py \
    --dataset rord \
    --result_dir /PATH/TO/GENERATED_IMAGES \
    --metrics rc_s \
    --out_csv results.csv
```

Note: RC-T is only applicable to video datasets and is automatically skipped for image datasets.
| Argument | Description | Default |
|---|---|---|
| `--dataset` | Dataset name (`PROVE-M`, `PROVE-H`, `rord`) | required |
| `--result_dir` | Directory containing generated results | required |
| `--metrics` | Metrics to compute: `rc_s`, `rc_t` | `rc_s rc_t` |
| `--out_csv` | Output CSV filename | `metrics_prove.csv` |
| `--mask_dir` | Override the default mask directory | `None` |
| `--max_items` | Limit the number of items to process | `None` |
| `--device` | Compute device | `cuda` |
The output CSV contains per-item scores and a summary row:
| case_id | rc_s | rc_t | time (s) |
|---|---|---|---|
| video_001.mp4 | 0.1523 | 0.1482 | 12.34 |
| video_002.mp4 | 0.1487 | 0.1501 | 11.87 |
| AVERAGE | 0.1505 | 0.1492 | 12.11 |
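To post-process the scores, the CSV can be loaded with pandas (already in the requirements); a small sketch assuming the layout shown above:

```python
import pandas as pd

df = pd.read_csv("results.csv")
summary = df[df["case_id"] == "AVERAGE"]   # the summary row
per_item = df[df["case_id"] != "AVERAGE"]  # per-item scores
print(summary[["rc_s", "rc_t"]])
print(per_item.sort_values("rc_s", ascending=False).head())
```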
- RC-S: higher is better (a higher score means a smaller distributional discrepancy between the inpainted region and the surrounding background).
- RC-T: lower is better (a lower score means stronger temporal consistency of the inpainted region across frames).
Our work benefits from the following open-source projects:
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{li2026prove,
  title={PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media},
  author={Li, Fuhao and You, Shaofeng and Hu, Jiagao and Liu, Yu and Chen, Yuxuan and Wang, Zepeng and Wang, Fei and Zhou, Daiguo and Luan, Jian},
  journal={arXiv preprint arXiv:2605.14534},
  year={2026}
}
```