Real-time on-device video matting + background blur with temporal stability (RVM recurrent states) - benchmarked on RTX 4060 Ti (ONNX Runtime).

mcherif/edgescope-studio


EdgeScope Studio is a local-first computer vision lab for prototyping image and video pipelines on your own machine (offline).

It has two modes:

  • Image mode (RTMDet + SAM): general object detection + promptable segmentation for still images.
  • Video mode (RVM): real-time portrait matting + background blur with temporal stability (recurrent states).

The core idea:

  • Image mode: load your images -> run a permissive detector (RTMDet Tiny on COCO) + SAM -> inspect boxes & masks -> iterate on thresholds, models, and logic without touching the cloud.
  • Video mode: RVM recurrent states -> alpha matte -> blur compositor.

Video mode uses RVM to produce a temporally-stable alpha matte per frame; no detector/SAM in the loop.

Why RVM vs detect+SAM for video: RVM is video-native and keeps recurrent state, so edges stay stable frame-to-frame and inference is faster than running detector + SAM on every frame.

This is designed as a general CV tool, but with a strong focus on on-device and privacy-preserving use cases (e.g. ergonomics / digital wellbeing, industrial inspection, etc.).

For real-time portrait effects we use Robust Video Matting (RVM), a video-native model; for general object segmentation in still images we use detect -> segment (RTMDet + SAM).

RVM keeps recurrent state across frames, which stabilizes edges and reduces flicker compared to per-frame-only inference: the temporal states let the model "remember" motion and fine hair detail, so the matte stays coherent over time.
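The recurrent-state threading can be sketched as follows. This is a hand-written illustration, not code from the repo: the ONNX I/O layout (src, r1i..r4i in, pha plus r1o..r4o out, with zero-filled dummy initial states) is assumed to match the public RVM export, so verify it against the IO metadata written by scripts/setup_video.py, and `fake_infer` is a stand-in for an onnxruntime session call.

```python
import numpy as np

def run_rvm_stream(frames, infer, downsample=0.25):
    """Thread recurrent states r1..r4 across frames (reset = reinit to zeros)."""
    # RVM's ONNX export accepts (1,1,1,1) zero tensors as initial states.
    rec = [np.zeros((1, 1, 1, 1), np.float32)] * 4
    mattes = []
    for src in frames:
        # Feed last frame's r*o outputs back in as this frame's r*i inputs.
        pha, *rec = infer(src, rec, downsample)
        mattes.append(pha)
    return mattes

def fake_infer(src, rec, downsample):
    """Stand-in for session.run; returns a matte and 'updated' states."""
    pha = np.full(src.shape[:2], 0.5, np.float32)
    new_rec = [r + 1 for r in rec]  # pretend the model updates its memory
    return (pha, *new_rec)

frames = [np.zeros((720, 1280, 3), np.float32)] * 3
mattes = run_rvm_stream(frames, fake_infer)
```

Pressing `r` in the demo corresponds to re-initializing `rec` to zeros, which discards the temporal memory.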

EdgeScope Studio UI

Quick start: Video mode (RVM background blur)

Requirements: Windows + NVIDIA GPU recommended (ONNX Runtime CUDA).
Model: Robust Video Matting (RVM) MobileNetV3 ONNX (downloaded locally; not committed).

1) Setup (download/verify model, write IO metadata)

python scripts/setup_video.py

2) Run webcam demo (OpenCV window)

python scripts/run_video.py --device cuda --input-size 512 --downsample 0.25
# CPU fallback (slower)
python scripts/run_video.py --device cpu --input-size 512 --downsample 0.25

Controls: q quit, b toggle blur/debug, r reset temporal state

3) Windows capture backend (auto-select + caching)

Default backend is auto (probe + cache). Auto mode will:

  • run a short blur-off probe (msmf vs dshow)
  • select the most stable backend and cache the decision
  • treat a backend as stable if it passed the health check (frames flowing, not stuck) and warning_count == 0
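The probe-and-cache behavior described above can be sketched like this. It is a hypothetical simplification (function names, cache format, and the fps tiebreak are assumptions, not the actual scripts/run_video.py logic):

```python
import json
import pathlib

def pick_backend(probe, cache_path="backend_cache.json", reprobe=False):
    """Probe msmf vs dshow once, cache the winner, reuse it afterwards."""
    cache = pathlib.Path(cache_path)
    if cache.exists() and not reprobe:  # --reprobe bypasses the cache
        return json.loads(cache.read_text())["backend"]
    results = {name: probe(name) for name in ("msmf", "dshow")}
    # "Stable" = health check passed and zero warnings during the probe.
    stable = {n: r for n, r in results.items()
              if r["healthy"] and r["warning_count"] == 0}
    if not stable:
        raise RuntimeError("no stable capture backend found")
    winner = max(stable, key=lambda n: stable[n]["fps"])
    cache.write_text(json.dumps({"backend": winner}))
    return winner
```

On the second call the cached decision is returned without touching the camera, which is why `--reprobe` exists for when drivers or virtual cameras change.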

Override / reprobe:

python scripts/run_video.py --backend dshow ...
python scripts/run_video.py --backend msmf ...
python scripts/run_video.py --backend auto --reprobe ...

Probe directly:

python scripts/probe_backends.py --device cuda --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 20

4) Benchmark (headless)

# Blur ON (pin backend for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --blur \
  --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 \
  --out benchmarks/rvm_512_ds025_720p_no_blur.json

Note: Results can vary by camera/driver/virtual-cam; run scripts/probe_backends.py to pick the best backend on your system.

Related scripts: scripts/run_video.py, scripts/benchmark_video.py, scripts/compare_compositing_precision.py.

Pipeline overview: Capture -> Preprocess -> ORT RVM -> Alpha Matte -> Compositing -> Output.
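As a rough illustration of the Alpha Matte -> Compositing step, here is a numpy-only sketch of `out = alpha * frame + (1 - alpha) * blur(background)` with the blur computed at reduced resolution (cf. blur_scale=0.5 in the frozen config). A box blur and nearest-neighbor resampling stand in for the real Gaussian/resize; all names here are hypothetical:

```python
import numpy as np

def box_blur(img, k=5):
    """Crude box blur; a stand-in for the pipeline's Gaussian blur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def composite(frame, alpha, blur_scale=0.5):
    """Blur the background at reduced scale, then alpha-blend."""
    h, w = frame.shape[:2]
    step = int(1 / blur_scale)
    small = frame[::step, ::step]                 # cheap downscale
    blurred = box_blur(small).repeat(step, 0).repeat(step, 1)[:h, :w]
    a = alpha[..., None]                          # broadcast over channels
    return a * frame + (1 - a) * blurred          # out = a*fg + (1-a)*blur(bg)
```

Blurring at half scale is why comp time drops in the optimization table below: the expensive blur touches a quarter of the pixels.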

EdgeScope Studio Video Pipeline

Results (frozen config)

| Metric | Value |
|---|---|
| Input | Pexels clip (ID 6517471), 720p file input |
| Config | 512 / 0.25 / dshow / blur_scale=0.5 / blur_sigma=8 |
| Throughput (optimized) | FPS mean 39.02 |
| Latency (optimized) | total p95 29.96 ms |
| Temporal stability (Sobel gradient edge jitter, mean) | OFF 0.08246 -> ON 0.06315 (-23.4%) |
| Compositing optimization | comp mean 9.51 -> 5.72 ms; FPS 32.67 -> 39.02 (+19.4%) |

File-input benchmarks are not real-time limited (no 30 FPS camera cadence), so FPS can exceed 30 and represents pipeline throughput.

Webcam note: webcam numbers vary with capture backend and scene motion; see Appendix.

Repro commands: see Runtime stacks matter for file-input throughput commands and Temporal stability (RVM vs no temporal state) for temporal jitter commands.

Reproducibility / Environment

Capture environment provenance for both venv and conda before comparing benchmark numbers.

Each snapshot (scripts/capture_runtime_env.py) records:

  • Python version/executable/platform
  • pip freeze (and conda list when applicable)
  • ONNX Runtime version/device/providers
  • OpenCV version + full build info
  • DLL resolution and loaded-module paths for: onnxruntime_providers_cuda.dll, cudnn64_9.dll, cublas64_12.dll, cudart64_12.dll
# Venv runtime snapshot (example path)
C:\Users\msi\AppData\Local\Temp\edgescope-studio-main\.venv-video\Scripts\python.exe `
  scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_venv_runtime.json

# Conda runtime snapshot
python scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_conda_runtime.json

Compare:

  • onnxruntime.providers_available
  • session_probe.providers_active
  • dlls.*.loaded_module_path

If DLL loaded-module paths differ between environments, benchmark differences are expected even on the same machine/driver.
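A small helper for that comparison could look like this. It is a hypothetical sketch: the key layout (`dlls.<name>.loaded_module_path`) is inferred from the fields listed above, so adapt it to the actual env_*.json schema:

```python
def dll_path_mismatches(env_a, env_b):
    """Return {dll_name: (path_a, path_b)} for DLLs whose loaded-module
    paths differ between two runtime snapshots."""
    dlls = set(env_a.get("dlls", {})) | set(env_b.get("dlls", {}))
    diffs = {}
    for name in sorted(dlls):
        pa = env_a.get("dlls", {}).get(name, {}).get("loaded_module_path")
        pb = env_b.get("dlls", {}).get(name, {}).get("loaded_module_path")
        if pa != pb:  # includes DLLs present in only one snapshot
            diffs[name] = (pa, pb)
    return diffs
```

An empty result means both environments resolved every tracked DLL to the same file, so benchmark numbers should be directly comparable.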

Runtime stacks matter

Same code/model/clip can yield different throughput depending on Python runtime and DLL search order.

| Stack | FPS mean | Infer mean (ms) | Comp mean (ms) | Total mean (ms) |
|---|---|---|---|---|
| Venv stack (ab_venv.json) | 46.86 | 16.09 | 4.43 | 21.32 |
| Conda stack (ab_conda_forced_dll_order.json) | 42.63 | 17.04 | 4.83 | 23.44 |
| Delta (venv - conda) | +4.23 (+9.9%) | -0.95 | -0.40 | -2.12 |

Artifacts:

  • benchmarks/ab_venv.json
  • benchmarks/ab_conda_forced_dll_order.json
  • benchmarks/env_venv_runtime.json
  • benchmarks/env_conda_runtime.json

Notes:

  • ab_*.json are benchmark outputs from scripts/benchmark_video.py (same flags, different environments).
  • env_*.json are runtime captures from scripts/capture_runtime_env.py.

File-input repro command (primary)

# Optimized compositing path
python scripts/benchmark_video.py --device cuda \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
  --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
  --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path optimized --backend dshow \
  --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_opt.json

# Legacy compositing path (A/B against optimized)
python scripts/benchmark_video.py --device cuda \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
  --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
  --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path legacy --backend dshow \
  --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_legacy.json

Runtime stack evidence (Windows)

Captured with:

  • benchmarks/env_venv_runtime.json
  • benchmarks/env_conda_runtime.json

Control checks (both environments):

  • onnxruntime==1.24.1
  • providers_active=["CUDAExecutionProvider","CPUExecutionProvider"]
  • Python executable differs by environment (venv vs conda)

Both environments report GPU execution active, but CUDA runtime DLLs are loaded from different locations:

| DLL | venv (pip CUDA wheels) | conda/system CUDA |
|---|---|---|
| onnxruntime_providers_cuda.dll | ...\.venv-video\Lib\site-packages\onnxruntime\capi\... | ...\miniconda3\envs\edgescope-cuda\Lib\site-packages\onnxruntime\capi\... |
| cudnn64_9.dll | ...\.venv-video\Lib\site-packages\nvidia\cudnn\bin\... | ...\miniconda3\envs\edgescope-cuda\Library\bin\... |
| cublas64_12.dll | ...\.venv-video\Lib\site-packages\nvidia\cublas\bin\... | ...\miniconda3\envs\edgescope-cuda\Library\bin\... |
| cudart64_12.dll | ...\.venv-video\Lib\site-packages\nvidia\cuda_runtime\bin\... | ...\CUDA\v12.1\bin\... |

Conclusion: same model + same benchmark flags can yield different latency distributions because ORT loads different CUDA/cuDNN/cuBLAS runtime DLLs depending on environment and DLL search order (PATH), affecting kernel selection and scheduling.

Repro:

  1. Run python scripts/capture_runtime_env.py --model models/rvm_mobilenetv3_fp32.onnx --out ... from each environment.
  2. Compare dlls.*.loaded_module_path and session_probe.providers_active.
  3. Run the same scripts/benchmark_video.py command in both environments and compare output JSONs.

One-command repro entry point: scripts/repro_video_bench.ps1 (clean PATH mode enabled by default).

Repro commands (webcam, secondary):

# Blur ON (backend pinned for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --blur \
  --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 \
  --out benchmarks/rvm_512_ds025_720p_no_blur.json

Appendix: Backend variability

Webcam capture (secondary)

Collected with scripts/benchmark_video.py (30s, --input-size 512 --downsample 0.25). Backend pinned to dshow (the cached winner on this machine).

| Blur | FPS (mean) | Total mean (ms) | Total p95 (ms) | Infer mean (ms) | Infer p95 (ms) | Comp mean (ms) | Comp p95 (ms) |
|---|---|---|---|---|---|---|---|
| ON | 29.6 | 33.7 | 41.9 | 20.1 | 25.3 | 10.1 | 11.4 |
| OFF | 29.9 | 33.4 | 45.8 | 20.5 | 28.0 | 0.0 | 0.0 |

Known issues

  • Windows capture backend variability (camera/driver/virtual-cam). Use scripts/probe_backends.py and pin --backend when benchmarking.
  • Virtual cameras can change capture timing; probe with the virtual cam ON if that's your usage.
  • First-run warmup effects; the benchmark includes a warmup phase to reduce first-frame skew.
  • Trimap compositing (hard fg/bg + soft edge band) was tested as an optimization but measured slower due to mask construction overhead (see benchmarks/rvm_512_ds025_720p_blur_soft_profile.json vs benchmarks/rvm_512_ds025_720p_blur_trimap_profile.json).

What's implemented

  • Video mode (RVM): real-time portrait matting + background blur with temporal stability, backend auto-probe/caching, and headless benchmarking.
  • Image demo with RTMDet Tiny (COCO) for boxes + labels.
  • Segment Anything (SAM ViT-B) turns those boxes into masks; toggleable in the UI.
  • Class whitelist + aliases in config/classes.yaml (single source of truth).
  • Gradio UI (scripts/run_image_app.py) with confidence slider and "Show SAM masks".

Setup

Use Python 3.10 and the provided requirements. CUDA builds are pinned; adjust if needed. (Video mode uses ONNX Runtime; see the Video quick start above.)

  1. Install deps (in your env, e.g. conda activate edgescope-cuda):

     pip install -r requirements.txt

  2. Download checkpoints:
     • RTMDet: already in rtmdet/ (rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth)
     • SAM ViT-B: place at sam/sam_vit_b_01ec64.pth (fallback name sam_vit_b.pth)

  3. Run the app:

     python scripts/run_image_app.py

Open http://127.0.0.1:7860, upload an image, set confidence, and toggle "Show SAM masks".

Notes

  • Detector is COCO-trained; class filtering/aliasing is controlled by config/classes.yaml.
  • SAM is class-agnostic; we prompt it with RTMDet boxes so we only segment detected objects (faster than running SAM across the whole image and it carries the detector's class labels).
  • Why detection first: without detector boxes you'd have to run SAM's auto-segmentation over the whole image (more masks, higher latency) and then classify each mask with another model to recover class labels, which is slower and less reliable than detect -> segment.
  • If the default port is busy, change server_port in scripts/run_image_app.py.
  • Performance snapshot on RTX 4060 Ti (1024x1536 image): first-run init ~33.4s (detector) + ~3.1s (SAM); per-image after init ~1.5s detector + ~1.1s SAM.
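The whitelist + alias step from config/classes.yaml can be sketched as below. The exact YAML schema and key names are assumptions for illustration, not the real config format:

```python
def filter_detections(dets, whitelist, aliases):
    """Keep detections whose (aliased) label is whitelisted, renaming the
    label to its canonical class name along the way."""
    kept = []
    for det in dets:
        label = aliases.get(det["label"], det["label"])  # apply alias if any
        if label in whitelist:
            kept.append({**det, "label": label})
    return kept
```

Keeping aliases and whitelist in one file gives the "single source of truth" mentioned above: the detector's raw COCO labels never leak past this step.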

Benchmark snapshots (steady state, RTX 4060 Ti)

| Resolution | Detections | Masks | Detector (s) | SAM (s) | Total (s) |
|---|---|---|---|---|---|
| 1024x1536 (orig) | 39 | 39 | 0.746 | 0.764 | 1.511 |
| 640x426 (downscale) | 38 | 38 | 0.125 | 0.594 | 0.718 |

Notes: models are already loaded; numbers exclude one-time init.

Temporal stability (RVM vs no temporal state)

Jitter metric: mean(abs(alpha_t - alpha_{t-1})) over frames (lower is better), i.e. the mean absolute change in alpha between consecutive frames, measuring frame-to-frame matte instability. Reported for all pixels and for edge regions.

  • Primary edge definition: Sobel gradient edges (--edge-mode grad, |Sobel(alpha)| > 0.02), chosen to give a stable edge fraction (~6-8%) on 720p portrait clips.
  • Auxiliary edge definition: alpha band (alpha in [0.1, 0.9]).

Attribution: Pexels video "A woman talking in front of the computer while drinking" (ID 6517471), downloaded locally for benchmarking; not redistributed.

Headline numbers are in the Results (frozen config) table above.
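A minimal numpy sketch of this metric (not the scripts/compare_temporal.py implementation; the Sobel filter is hand-rolled to stay self-contained):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], np.float32)

def sobel_mag(alpha):
    """Gradient magnitude of an alpha matte via 3x3 Sobel kernels."""
    pad = np.pad(alpha, 1, mode="edge")
    gx = np.zeros_like(alpha, dtype=np.float32)
    gy = np.zeros_like(alpha, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            win = pad[i:i + alpha.shape[0], j:j + alpha.shape[1]]
            gx += SOBEL_X[i, j] * win
            gy += SOBEL_X.T[i, j] * win  # transpose gives the Sobel-y kernel
    return np.hypot(gx, gy)

def jitter(alphas, edge_thresh=0.02):
    """Return (all-pixel jitter, edge-region jitter) over a matte sequence."""
    diffs = [np.abs(b - a) for a, b in zip(alphas, alphas[1:])]
    all_px = float(np.mean(diffs))
    edge_vals = []
    for prev, d in zip(alphas, diffs):
        mask = sobel_mag(prev) > edge_thresh  # |Sobel(alpha)| > 0.02
        if mask.any():
            edge_vals.append(d[mask].mean())
    return all_px, float(np.mean(edge_vals)) if edge_vals else 0.0
```

Lower values mean a steadier matte; the ON/OFF comparison in the results table is this number with recurrent states kept vs reset every frame.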

How to reproduce (temporal jitter)

python scripts/compare_temporal.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --edge-mode grad --edge-grad-thresh 0.02 \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 \
  --out-on benchmarks/temporal_on_512_ds025_720p_dshow_video6517471_grad.json \
  --out-off benchmarks/temporal_off_512_ds025_720p_dshow_video6517471_grad.json

Note: This metric is scene-dependent; rerun with real motion to see temporal benefits.

Video roadmap (optional)

  • Add a simple quality knob for blur (downscale/sigma) and document tradeoffs.
  • Add a temporal-stability comparison mode (reset recurrent states) + jitter metric.
  • Optional: integrate video into a UI (Gradio) once the core pipeline is rock-solid.

Acronyms

| Acronym | Meaning |
|---|---|
| RTMDet | Real-Time Multi-Object Detection |
| RVM | Robust Video Matting |
| SAM | Segment Anything Model |
