Remove hardcoded subtitles, watermarks, and text overlays from video.
Auto-detect targets, generate masks, and inpaint — pip install videowipe and go.
videowipe detects and removes hardcoded text, watermarks, logos, and timestamps from video. A full pipeline runs in one command: sample frames → detect text regions → select targets (with optional OCR and natural-language intent parsing) → generate masks → inpaint the background.
No manual mask required. The built-in detector handles multilingual content out of the box.
STTN is the default inpainting backend. Any external model can be plugged in via --external-command — ProPainter has been validated as a higher-quality alternative.
Requires Python 3.8+ and either ONNX Runtime or PyTorch.
# If you already have PyTorch:
pip install videowipe
# Lightweight ONNX Runtime backend:
pip install videowipe[onnx]
# Or the PyTorch backend:
pip install videowipe[torch]
# Optional: OCR text recognition for better detection accuracy
pip install videowipe[ocr]
Model weights download automatically on first run to ~/.videowipe/weights/. No manual setup needed.
from videowipe import remove_text
# Mask is optional — subtitle regions are auto-detected if omitted
remove_text(
    video="input.mp4",
    output="result/",
)
# Or provide your own mask for full control
remove_text(
    video="input.mp4",
    mask="mask.png",
    output="result/",
)
Use task="clean" for the complete detection pipeline with target selection, intent parsing, and OCR:
from videowipe import WipeEngine
engine = WipeEngine(task="clean", detect_mode="balanced", ocr="auto")
engine.process(
    video="input.mp4",
    targets=["subtitle", "watermark"],
    regions=["bottom"],
    intent="remove Chinese subtitles and logo watermark",
    output="result/",
)
engine.cleanup()
Reuse the engine to avoid reloading the model:
from videowipe import WipeEngine
engine = WipeEngine(task="detext")
engine.process(video="clip1.mp4", output="result/")
engine.process(video="clip2.mp4", mask="mask.png", output="result/")
engine.cleanup()
# Auto-detect and remove all text overlays (recommended)
videowipe clean input.mp4 -o result/
# With manual mask
videowipe clean input.mp4 -m mask.png -o result/
# Only remove specific target types
videowipe clean input.mp4 --target subtitle
videowipe clean input.mp4 --target watermark
# Target a specific screen region
videowipe clean input.mp4 --region bottom
videowipe clean input.mp4 --region top-right
# Natural language intent
videowipe clean input.mp4 --intent "remove bottom Chinese subtitles"
# Preview detection results without processing
videowipe clean input.mp4 --preview -o result/
# Interactively confirm detected targets
videowipe clean input.mp4 --confirm

| Flag | Description | Default |
|---|---|---|
| --target | Target type to clean (can repeat): subtitle, timestamp, watermark, logo | auto-detect all |
| --region | Screen region (can repeat): top, bottom, top-left, top-right, bottom-left, bottom-right, center | all regions |
| --intent | Natural-language cleanup intent | - |
| --preview | Write detection artifacts only (no inpainting) | off |
| --confirm | Show detected targets and confirm before processing | off |
| --detect-mode | Detection preset: fast (24 frames), balanced (50), sensitive (80) | balanced |
| --ocr | OCR text recognition: auto, off, rapidocr | auto |
| --agent | Local LLM CLI for intent-based selection (e.g., claude, codex) | - |
| --external-command | External inpainting command (bypasses built-in STTN) | - |
| -g, --gap | Segment length per pass; higher = better quality, slower | 200 |
| -d, --dual | Show original video side-by-side in output | off |
| -m, --mask | Mask image path (auto-detect if omitted) | auto |
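When calling the CLI from a script, the repeatable flags are easiest to handle by building the argument list programmatically. The following is a minimal sketch that uses only flags from the table above; the helper name build_clean_command is hypothetical, and it assumes a repeated flag is passed as separate occurrences (for example, --target subtitle --target watermark).

```python
import shlex
from typing import List, Optional, Sequence


def build_clean_command(
    video: str,
    output: str,
    targets: Sequence[str] = (),
    regions: Sequence[str] = (),
    intent: Optional[str] = None,
    detect_mode: Optional[str] = None,
    preview: bool = False,
) -> List[str]:
    """Assemble an argv list for `videowipe clean` (hypothetical helper)."""
    cmd = ["videowipe", "clean", video, "-o", output]
    for target in targets:  # --target may be repeated
        cmd += ["--target", target]
    for region in regions:  # --region may be repeated
        cmd += ["--region", region]
    if intent:
        cmd += ["--intent", intent]
    if detect_mode:
        cmd += ["--detect-mode", detect_mode]
    if preview:
        cmd.append("--preview")
    return cmd


cmd = build_clean_command(
    "input.mp4",
    "result/",
    targets=["subtitle", "watermark"],
    regions=["bottom"],
    detect_mode="fast",
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with free-text values such as the --intent string.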
Legacy: detext command
The detext command auto-detects subtitles only. Prefer clean for new usage.
# Auto-detect subtitles
videowipe detext -v input.mp4 -o result/
# With manual mask
videowipe detext -v input.mp4 -m mask.png -o result/

| Flag | Description | Default |
|---|---|---|
| -v, --video | Input video path | required |
| -m, --mask | Mask image path (auto-detect if omitted) | auto |
| -o, --output | Output directory | result/ |
| -w, --weight | Model weight path. PyTorch accepts .pth/.pt; ONNX expects a prefix path ending in .onnx with matching _encoder, _transformer, and _decoder files. | auto |
| -g, --gap | Segment length per pass; higher = better quality, slower | 200 |
| -d, --dual | Show original video side-by-side in output | off |
| --external-command | External inpainting command (bypasses built-in STTN) | - |
Pass --external-command to use any third-party inpainting model instead of the built-in STTN. The command receives <video> <mask> <output_dir> and must produce an output video in the output directory.
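To make the contract concrete, here is a minimal sketch of a script that could be passed to --external-command. It only illustrates the argument order (video, mask, output directory) and the requirement to leave a video in the output directory. The script name my_inpainter.py and the output file name are assumptions, and the "model" is a placeholder that copies the input unchanged.

```python
import shutil
import sys
from pathlib import Path
from typing import Sequence


def run_inpainting(video: Path, mask: Path, output_dir: Path) -> Path:
    """Placeholder for a real model: copies the input video unchanged.

    Replace the copy with a call into your inpainting model. The output
    file name used here is an assumption, not something videowipe documents.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    result = output_dir / f"{video.stem}_inpainted{video.suffix}"
    shutil.copyfile(video, result)
    return result


def main(argv: Sequence[str]) -> int:
    """Parse <video> <mask> <output_dir> and return a process exit code."""
    if len(argv) != 3:
        print("usage: my_inpainter.py <video> <mask> <output_dir>", file=sys.stderr)
        return 2
    video, mask, output_dir = (Path(arg) for arg in argv)
    if not video.is_file() or not mask.is_file():
        print("error: video or mask not found", file=sys.stderr)
        return 1
    print(run_inpainting(video, mask, output_dir))
    return 0


# Only act when arguments are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv[1:]))
```

Saved as my_inpainter.py, it would be wired in with videowipe clean input.mp4 --external-command "python my_inpainter.py".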
ProPainter has been validated as a higher-quality alternative. A ready-to-use wrapper is included:
# Clone ProPainter outside this repo first
git clone https://github.com/sczhou/ProPainter.git ../models/ProPainter
# Use via the named model (recommended)
videowipe clean input.mp4 --model propainter --propainter-dir ../models/ProPainter
# Or via the generic external command (equivalent, now argv-form)
videowipe clean input.mp4 --external-command "python scripts/propainter_wipe.py"
Note: ProPainter requires a GPU with ~16 GB VRAM for 480p video and is licensed under NTU S-Lab License 1.0 (non-commercial).
Quality comparison: ProPainter vs STTN
Tested on a multilingual music video (Korean + Burmese subtitles, 852x480, 10s clip). Both models used the same mask.
[Figure: comparison frames for Original, ProPainter (GPU fp16), and STTN (CPU ONNX)]
Comparison images are in pics/comparison/.
Built-in detector locates text regions across multilingual content without manual masks:
| Video | Candidates | Selected | Types |
|---|---|---|---|
| Chinese drama | 4 | 2 | top subtitle, bottom subtitle |
| English clip | 2 | 2 | bottom subtitle |
| Music video (Korean + Burmese) | 7 | 5 | top watermark, bottom multilingual subtitles |
Tested with --detect-mode balanced (50 sampled frames). Green boxes show selected regions for inpainting.
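The presets differ only in how many frames are sampled (24, 50, or 80). How videowipe spaces those frames is not stated here; the sketch below assumes evenly spaced sampling, and the function name sample_frame_indices is hypothetical.

```python
from typing import List

# Frame budgets per preset, as listed for --detect-mode.
PRESET_FRAMES = {"fast": 24, "balanced": 50, "sensitive": 80}


def sample_frame_indices(total_frames: int, mode: str = "balanced") -> List[int]:
    """Pick evenly spaced frame indices (assumed strategy, not confirmed)."""
    if total_frames <= 0:
        return []
    # Never ask for more frames than the video contains.
    budget = min(PRESET_FRAMES[mode], total_frames)
    step = total_frames / budget
    # Take the middle of each equal-width bucket so both ends are covered.
    return [int(step * i + step / 2) for i in range(budget)]


indices = sample_frame_indices(total_frames=300, mode="balanced")
print(len(indices), indices[:5])
```

Under this assumption a 10-second clip at 30 fps is sampled every sixth frame in balanced mode, while a clip shorter than the budget is sampled at every frame.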
The pipeline has three stages:
- Detection: A DBNet-based text detector samples frames across the video, finds text regions in each frame, clusters them by position, and selects the best preview frame. Supports multilingual content out of the box.
- Target selection: Detected regions are classified by type (subtitle, watermark, logo, timestamp). Optional OCR reads the text content. An intent parser (rule-based or LLM-backed via --agent) lets you specify what to remove in natural language.
- Inpainting: Masked regions are filled in using temporal information from neighboring frames. The default backend is STTN (8-layer spatial-temporal transformer with CNN encoder). Any external model can be substituted via --external-command.
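The "clusters them by position" step in the detection stage can be pictured as merging boxes from different frames that overlap strongly. The following is an illustrative sketch using greedy intersection-over-union (IoU) merging. It is not videowipe's actual algorithm, and the names, threshold, and sample boxes are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


@dataclass
class Cluster:
    box: Box       # union of all member boxes
    hits: int = 1  # number of detections merged into this cluster


def cluster_boxes(detections: List[Box], threshold: float = 0.5) -> List[Cluster]:
    """Greedily merge each box into the first cluster it overlaps enough."""
    clusters: List[Cluster] = []
    for box in detections:
        for cluster in clusters:
            if iou(cluster.box, box) >= threshold:
                cb = cluster.box
                cluster.box = (min(cb[0], box[0]), min(cb[1], box[1]),
                               max(cb[2], box[2]), max(cb[3], box[3]))
                cluster.hits += 1
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append(Cluster(box))
    return clusters


# Boxes detected in sampled frames of a hypothetical 852x480 video.
detections = [
    (200, 420, 650, 460),  # bottom subtitle, frame A
    (210, 422, 640, 458),  # bottom subtitle, frame B
    (700, 10, 840, 50),    # top-right watermark, frame A
    (700, 10, 840, 50),    # top-right watermark, frame B
]
for cluster in cluster_boxes(detections):
    print(cluster.box, cluster.hits)
```

A hit count like this is one plausible signal for separating persistent overlays (watermarks, logos) from text that appears in only a few frames.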
No Python? No problem. Run videowipe directly with Docker.
CPU:
docker pull ghcr.io/kkenny0/videowipe:latest
docker run --rm -v "$(pwd)":/data ghcr.io/kkenny0/videowipe clean /data/input.mp4 -o /data/result/
GPU (requires NVIDIA Container Toolkit):
docker pull ghcr.io/kkenny0/videowipe:gpu
docker run --rm --gpus all -v "$(pwd)":/data ghcr.io/kkenny0/videowipe:gpu clean /data/input.mp4 -o /data/result/
Or use the included wrapper script (auto-detects GPU):
./scripts/docker-videowipe.sh clean input.mp4 -o result/

| Image | Size | GPU | Notes |
|---|---|---|---|
| videowipe:latest | ~480 MB | No | CPU only, smallest image |
| videowipe:gpu | ~1.4 GB | Yes | ONNX Runtime with CUDA |
To build the images locally, use docker build --target to select the image variant:
# CPU
docker build --target runtime-cpu -t videowipe:latest .
# GPU (requires NVIDIA Container Toolkit at build time for base image)
docker build --target runtime-gpu --build-arg VARIANT=gpu -t videowipe:gpu .
Note: The GPU image requires a machine with NVIDIA runtime to verify CUDA execution. Without it, ONNX Runtime silently falls back to CPU.
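Because the fallback is silent, it can help to ask ONNX Runtime directly which execution providers it sees. The sketch below uses onnxruntime.get_available_providers(), a standard ONNX Runtime call; the helper names are hypothetical, and how you run Python inside the container is left to your setup.

```python
from typing import Iterable, List, Optional


def uses_cuda(providers: Iterable[str]) -> bool:
    """True if ONNX Runtime's CUDA execution provider is in the list."""
    return "CUDAExecutionProvider" in set(providers)


def available_providers() -> Optional[List[str]]:
    """Return ONNX Runtime's providers, or None if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    return ort.get_available_providers()


providers = available_providers()
if providers is None:
    print("onnxruntime is not installed in this environment")
elif uses_cuda(providers):
    print("CUDA execution provider is available")
else:
    print("CPU only:", providers)
```

This reports what the installed build offers. To see what a loaded model actually runs on, check get_providers() on the InferenceSession instead.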
Run after building:
# CPU
docker run --rm -v "$(pwd)":/data videowipe:latest clean /data/input.mp4 -o /data/result/
# GPU
docker run --rm --gpus all -v "$(pwd)":/data videowipe:gpu clean /data/input.mp4 -o /data/result/
This project builds on STTN and the original Video-Auto-Wipe implementation. The built-in text detection model is from OnnxOCR.
MIT