Real-time on-device video matting + background blur with temporal stability (RVM recurrent states) - benchmarked on RTX 4060 Ti (ONNX Runtime).

mcherif/edgescope-studio


EdgeScope Studio is a local-first computer vision lab for prototyping image and video pipelines on your own machine (offline).

It has two modes:

  • Image mode (RTMDet + SAM): general object detection + promptable segmentation for still images.
  • Video mode (RVM): real-time portrait matting + background blur with temporal stability (recurrent states).

The core idea:

  • Image mode: load your images -> run a permissive detector (RTMDet Tiny on COCO) + SAM -> inspect boxes & masks -> iterate on thresholds, models, and logic without touching the cloud.
  • Video mode: RVM recurrent states -> alpha matte -> blur compositor.

Video mode uses RVM to produce a temporally-stable alpha matte per frame; no detector/SAM in the loop.

Why RVM vs detect+SAM for video: RVM is video-native and keeps recurrent state, so edges stay stable frame-to-frame and inference is faster than running detector + SAM on every frame.

This is designed as a general CV tool, but with a strong focus on on-device and privacy-preserving use cases (e.g. ergonomics / digital wellbeing, industrial inspection, etc.).

For real-time portrait effects we use Robust Video Matting (RVM), a video-native model; for general object segmentation in still images we use detect -> segment (RTMDet + SAM).

RVM keeps recurrent state across frames, which stabilizes edges and reduces flicker compared to per-frame-only inference: the temporal states let the model "remember" motion and fine hair detail, so the matte stays coherent over time.
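The recurrent-state threading can be sketched as follows. This is a hand-written illustration, not code from the repo: the ONNX I/O layout (src, r1i..r4i in, pha plus r1o..r4o out, with zero-filled dummy initial states) is assumed to match the public RVM export, so verify it against the IO metadata written by scripts/setup_video.py, and `fake_infer` is a stand-in for an onnxruntime session call.

```python
import numpy as np

def run_rvm_stream(frames, infer, downsample=0.25):
    """Thread recurrent states r1..r4 across frames (reset = reinit to zeros)."""
    # RVM's ONNX export accepts (1,1,1,1) zero tensors as initial states.
    rec = [np.zeros((1, 1, 1, 1), np.float32)] * 4
    mattes = []
    for src in frames:
        # Feed last frame's r*o outputs back in as this frame's r*i inputs.
        pha, *rec = infer(src, rec, downsample)
        mattes.append(pha)
    return mattes

def fake_infer(src, rec, downsample):
    """Stand-in for session.run; returns a matte and 'updated' states."""
    pha = np.full(src.shape[:2], 0.5, np.float32)
    new_rec = [r + 1 for r in rec]  # pretend the model updates its memory
    return (pha, *new_rec)

frames = [np.zeros((720, 1280, 3), np.float32)] * 3
mattes = run_rvm_stream(frames, fake_infer)
```

Pressing `r` in the demo corresponds to re-initializing `rec` to zeros, which discards the temporal memory.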

EdgeScope Studio UI

Quick start: Video mode (RVM background blur)

Requirements: Windows + NVIDIA GPU recommended (ONNX Runtime CUDA).
Model: Robust Video Matting (RVM) MobileNetV3 ONNX (downloaded locally; not committed).

1) Setup (download/verify model, write IO metadata)

python scripts/setup_video.py

2) Run webcam demo (OpenCV window)

python scripts/run_video.py --device cuda --input-size 512 --downsample 0.25
# CPU fallback (slower)
python scripts/run_video.py --device cpu --input-size 512 --downsample 0.25

Controls: q quit, b toggle blur/debug, r reset temporal state

3) Windows capture backend (auto-select + caching)

Default backend is auto (probe + cache). Auto mode will:

  • run a short blur-off probe (msmf vs dshow)
  • select the most stable backend and cache the decision
  • treat a backend as stable if it passed the health check (frames flowing, not stuck) and warning_count == 0
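The probe-and-cache behavior described above can be sketched like this. It is a hypothetical simplification (function names, cache format, and the fps tiebreak are assumptions, not the actual scripts/run_video.py logic):

```python
import json
import pathlib

def pick_backend(probe, cache_path="backend_cache.json", reprobe=False):
    """Probe msmf vs dshow once, cache the winner, reuse it afterwards."""
    cache = pathlib.Path(cache_path)
    if cache.exists() and not reprobe:  # --reprobe bypasses the cache
        return json.loads(cache.read_text())["backend"]
    results = {name: probe(name) for name in ("msmf", "dshow")}
    # "Stable" = health check passed and zero warnings during the probe.
    stable = {n: r for n, r in results.items()
              if r["healthy"] and r["warning_count"] == 0}
    if not stable:
        raise RuntimeError("no stable capture backend found")
    winner = max(stable, key=lambda n: stable[n]["fps"])
    cache.write_text(json.dumps({"backend": winner}))
    return winner
```

On the second call the cached decision is returned without touching the camera, which is why `--reprobe` exists for when drivers or virtual cameras change.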

Override / reprobe:

python scripts/run_video.py --backend dshow ...
python scripts/run_video.py --backend msmf ...
python scripts/run_video.py --backend auto --reprobe ...

Probe directly:

python scripts/probe_backends.py --device cuda --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 20

4) Benchmark (headless)

# Blur ON (pin backend for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --blur \
  --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 \
  --out benchmarks/rvm_512_ds025_720p_no_blur.json

Note: Results can vary by camera/driver/virtual-cam; run scripts/probe_backends.py to pick the best backend on your system.

Related scripts: scripts/run_video.py, scripts/benchmark_video.py, scripts/compare_compositing_precision.py.

Pipeline overview: Capture -> Preprocess -> ORT RVM -> Alpha Matte -> Compositing -> Output.
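As a rough illustration of the Alpha Matte -> Compositing step, here is a numpy-only sketch of `out = alpha * frame + (1 - alpha) * blur(background)` with the blur computed at reduced resolution (cf. blur_scale=0.5 in the frozen config). A box blur and nearest-neighbor resampling stand in for the real Gaussian/resize; all names here are hypothetical:

```python
import numpy as np

def box_blur(img, k=5):
    """Crude box blur; a stand-in for the pipeline's Gaussian blur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def composite(frame, alpha, blur_scale=0.5):
    """Blur the background at reduced scale, then alpha-blend."""
    h, w = frame.shape[:2]
    step = int(1 / blur_scale)
    small = frame[::step, ::step]                 # cheap downscale
    blurred = box_blur(small).repeat(step, 0).repeat(step, 1)[:h, :w]
    a = alpha[..., None]                          # broadcast over channels
    return a * frame + (1 - a) * blurred          # out = a*fg + (1-a)*blur(bg)
```

Blurring at half scale is why comp time drops in the optimization table below: the expensive blur touches a quarter of the pixels.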

EdgeScope Studio Video Pipeline

Results (frozen config)

| Metric | Value |
|---|---|
| Input | Pexels clip (ID 6517471), 720p file input |
| Config | 512 / 0.25 / dshow / blur_scale=0.5 / blur_sigma=8 |
| Throughput (optimized) | FPS mean 39.02 |
| Latency (optimized) | total p95 29.96 ms |
| Temporal stability (Sobel gradient edge jitter, mean) | OFF 0.08246 -> ON 0.06315 (-23.4%) |
| Compositing optimization | comp mean 9.51 -> 5.72 ms; FPS 32.67 -> 39.02 (+19.4%) |

File-input benchmarks are not real-time limited (no 30 FPS camera cadence), so FPS can exceed 30 and represents pipeline throughput.

Webcam note: webcam numbers vary with capture backend and scene motion; see Appendix.

Repro commands: see Runtime stacks matter for file-input throughput commands and Temporal stability (RVM vs no temporal state) for temporal jitter commands.

Reproducibility / Environment

Capture environment provenance for both venv and conda before comparing benchmark numbers.

Each snapshot (scripts/capture_runtime_env.py) records:

  • Python version/executable/platform
  • pip freeze (and conda list when applicable)
  • ONNX Runtime version/device/providers
  • OpenCV version + full build info
  • DLL resolution and loaded-module paths for: onnxruntime_providers_cuda.dll, cudnn64_9.dll, cublas64_12.dll, cudart64_12.dll
# Venv runtime snapshot (example path)
C:\Users\msi\AppData\Local\Temp\edgescope-studio-main\.venv-video\Scripts\python.exe `
  scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_venv_runtime.json

# Conda runtime snapshot
python scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_conda_runtime.json

Compare:

  • onnxruntime.providers_available
  • session_probe.providers_active
  • dlls.*.loaded_module_path

If DLL loaded-module paths differ between environments, benchmark differences are expected even on the same machine/driver.
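A small helper for that comparison could look like this. It is a hypothetical sketch: the key layout (`dlls.<name>.loaded_module_path`) is inferred from the fields listed above, so adapt it to the actual env_*.json schema:

```python
def dll_path_mismatches(env_a, env_b):
    """Return {dll_name: (path_a, path_b)} for DLLs whose loaded-module
    paths differ between two runtime snapshots."""
    dlls = set(env_a.get("dlls", {})) | set(env_b.get("dlls", {}))
    diffs = {}
    for name in sorted(dlls):
        pa = env_a.get("dlls", {}).get(name, {}).get("loaded_module_path")
        pb = env_b.get("dlls", {}).get(name, {}).get("loaded_module_path")
        if pa != pb:  # includes DLLs present in only one snapshot
            diffs[name] = (pa, pb)
    return diffs
```

An empty result means both environments resolved every tracked DLL to the same file, so benchmark numbers should be directly comparable.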

Runtime stacks matter

Same code/model/clip can yield different throughput depending on Python runtime and DLL search order.

| Stack | FPS mean | Infer mean (ms) | Comp mean (ms) | Total mean (ms) |
|---|---|---|---|---|
| Venv stack (ab_venv.json) | 46.86 | 16.09 | 4.43 | 21.32 |
| Conda stack (ab_conda_forced_dll_order.json) | 42.63 | 17.04 | 4.83 | 23.44 |
| Delta (venv - conda) | +4.23 (+9.9%) | -0.95 | -0.40 | -2.12 |

Artifacts:

  • benchmarks/ab_venv.json
  • benchmarks/ab_conda_forced_dll_order.json
  • benchmarks/env_venv_runtime.json
  • benchmarks/env_conda_runtime.json

Notes:

  • ab_*.json are benchmark outputs from scripts/benchmark_video.py (same flags, different environments).
  • env_*.json are runtime captures from scripts/capture_runtime_env.py.

File-input repro command (primary)

# Optimized compositing path
python scripts/benchmark_video.py --device cuda \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
  --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
  --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path optimized --backend dshow \
  --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_opt.json

# Legacy compositing path (A/B against optimized)
python scripts/benchmark_video.py --device cuda \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
  --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
  --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path legacy --backend dshow \
  --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_legacy.json

Runtime stack evidence (Windows)

Captured with:

  • benchmarks/env_venv_runtime.json
  • benchmarks/env_conda_runtime.json

Control checks (both environments):

  • onnxruntime==1.24.1
  • providers_active=["CUDAExecutionProvider","CPUExecutionProvider"]
  • Python executable differs by environment (venv vs conda)

Both environments report GPU execution active, but CUDA runtime DLLs are loaded from different locations:

| DLL | venv (pip CUDA wheels) | conda/system CUDA |
|---|---|---|
| onnxruntime_providers_cuda.dll | ...\.venv-video\Lib\site-packages\onnxruntime\capi\... | ...\miniconda3\envs\edgescope-cuda\Lib\site-packages\onnxruntime\capi\... |
| cudnn64_9.dll | ...\.venv-video\Lib\site-packages\nvidia\cudnn\bin\... | ...\miniconda3\envs\edgescope-cuda\Library\bin\... |
| cublas64_12.dll | ...\.venv-video\Lib\site-packages\nvidia\cublas\bin\... | ...\miniconda3\envs\edgescope-cuda\Library\bin\... |
| cudart64_12.dll | ...\.venv-video\Lib\site-packages\nvidia\cuda_runtime\bin\... | ...\CUDA\v12.1\bin\... |

Conclusion: same model + same benchmark flags can yield different latency distributions because ORT loads different CUDA/cuDNN/cuBLAS runtime DLLs depending on environment and DLL search order (PATH), affecting kernel selection and scheduling.

Repro:

  1. Run python scripts/capture_runtime_env.py --model models/rvm_mobilenetv3_fp32.onnx --out ... from each environment.
  2. Compare dlls.*.loaded_module_path and session_probe.providers_active.
  3. Run the same scripts/benchmark_video.py command in both environments and compare output JSONs.

One-command repro entry point: scripts/repro_video_bench.ps1 (clean PATH mode enabled by default).

Repro commands (webcam, secondary):

# Blur ON (backend pinned for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --blur \
  --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 \
  --out benchmarks/rvm_512_ds025_720p_no_blur.json

Appendix: Backend variability

Webcam capture (secondary)

Collected with scripts/benchmark_video.py (30s, --input-size 512 --downsample 0.25). Backend pinned to dshow (the cached winner on this machine).

| Blur | FPS (mean) | Total mean (ms) | Total p95 (ms) | Infer mean (ms) | Infer p95 (ms) | Comp mean (ms) | Comp p95 (ms) |
|---|---|---|---|---|---|---|---|
| ON | 29.6 | 33.7 | 41.9 | 20.1 | 25.3 | 10.1 | 11.4 |
| OFF | 29.9 | 33.4 | 45.8 | 20.5 | 28.0 | 0.0 | 0.0 |

Known issues

  • Windows capture backend variability (camera/driver/virtual-cam). Use scripts/probe_backends.py and pin --backend when benchmarking.
  • Virtual cameras can change capture timing; probe with the virtual cam ON if that's your usage.
  • First-run warmup effects; the benchmark includes a warmup phase to reduce first-frame skew.
  • Trimap compositing (hard fg/bg + soft edge band) was tested as an optimization but measured slower due to mask construction overhead (see benchmarks/rvm_512_ds025_720p_blur_soft_profile.json vs benchmarks/rvm_512_ds025_720p_blur_trimap_profile.json).

What's implemented

  • Video mode (RVM): real-time portrait matting + background blur with temporal stability, backend auto-probe/caching, and headless benchmarking.
  • Image demo with RTMDet Tiny (COCO) for boxes + labels.
  • Segment Anything (SAM ViT-B) turns those boxes into masks; toggleable in the UI.
  • Class whitelist + aliases in config/classes.yaml (single source of truth).
  • Gradio UI (scripts/run_image_app.py) with confidence slider and "Show SAM masks".

Setup

Use Python 3.10 and the provided requirements. CUDA builds are pinned; adjust if needed. (Video mode uses ONNX Runtime; see the Video quick start above.)

  1. Install deps (in your env, e.g. conda activate edgescope-cuda):

     pip install -r requirements.txt

  2. Download checkpoints:
     • RTMDet: already in rtmdet/ (rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth)
     • SAM ViT-B: place at sam/sam_vit_b_01ec64.pth (fallback name sam_vit_b.pth)

  3. Run the app:

     python scripts/run_image_app.py

Open http://127.0.0.1:7860, upload an image, set confidence, and toggle "Show SAM masks".

Notes

  • Detector is COCO-trained; class filtering/aliasing is controlled by config/classes.yaml.
  • SAM is class-agnostic; we prompt it with RTMDet boxes so we only segment detected objects (faster than running SAM across the whole image and it carries the detector's class labels).
  • Why detection first: without detector boxes you'd have to run SAM's auto-segmentation over the whole image (more masks, higher latency) and then classify each mask with another model to recover class labels, which is slower and less reliable than detect -> segment.
  • If the default port is busy, change server_port in scripts/run_image_app.py.
  • Performance snapshot on RTX 4060 Ti (1024x1536 image): first-run init ~33.4s (detector) + ~3.1s (SAM); per-image after init ~1.5s detector + ~1.1s SAM.
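The whitelist + alias step from config/classes.yaml can be sketched as below. The exact YAML schema and key names are assumptions for illustration, not the real config format:

```python
def filter_detections(dets, whitelist, aliases):
    """Keep detections whose (aliased) label is whitelisted, renaming the
    label to its canonical class name along the way."""
    kept = []
    for det in dets:
        label = aliases.get(det["label"], det["label"])  # apply alias if any
        if label in whitelist:
            kept.append({**det, "label": label})
    return kept
```

Keeping aliases and whitelist in one file gives the "single source of truth" mentioned above: the detector's raw COCO labels never leak past this step.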

Benchmark snapshots (steady state, RTX 4060 Ti)

| Resolution | Detections | Masks | Detector (s) | SAM (s) | Total (s) |
|---|---|---|---|---|---|
| 1024x1536 (orig) | 39 | 39 | 0.746 | 0.764 | 1.511 |
| 640x426 (downscale) | 38 | 38 | 0.125 | 0.594 | 0.718 |

Notes: models are already loaded; numbers exclude one-time init.

Temporal stability (RVM vs no temporal state)

Jitter metric: mean(abs(alpha_t - alpha_{t-1})) over frames (lower is better), i.e. the mean absolute change in alpha between consecutive frames, measuring frame-to-frame matte instability. Reported for all pixels and for edge regions.

  • Primary edge definition: Sobel gradient edges (--edge-mode grad, |Sobel(alpha)| > 0.02), chosen to give a stable edge fraction (~6-8%) on 720p portrait clips.
  • Auxiliary edge definition: alpha band (alpha in [0.1, 0.9]).

Attribution: Pexels video "A woman talking in front of the computer while drinking" (ID 6517471), downloaded locally for benchmarking; not redistributed.

Headline numbers are in the Results (frozen config) table above.
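A minimal numpy sketch of this metric (not the scripts/compare_temporal.py implementation; the Sobel filter is hand-rolled to stay self-contained):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], np.float32)

def sobel_mag(alpha):
    """Gradient magnitude of an alpha matte via 3x3 Sobel kernels."""
    pad = np.pad(alpha, 1, mode="edge")
    gx = np.zeros_like(alpha, dtype=np.float32)
    gy = np.zeros_like(alpha, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            win = pad[i:i + alpha.shape[0], j:j + alpha.shape[1]]
            gx += SOBEL_X[i, j] * win
            gy += SOBEL_X.T[i, j] * win  # transpose gives the Sobel-y kernel
    return np.hypot(gx, gy)

def jitter(alphas, edge_thresh=0.02):
    """Return (all-pixel jitter, edge-region jitter) over a matte sequence."""
    diffs = [np.abs(b - a) for a, b in zip(alphas, alphas[1:])]
    all_px = float(np.mean(diffs))
    edge_vals = []
    for prev, d in zip(alphas, diffs):
        mask = sobel_mag(prev) > edge_thresh  # |Sobel(alpha)| > 0.02
        if mask.any():
            edge_vals.append(d[mask].mean())
    return all_px, float(np.mean(edge_vals)) if edge_vals else 0.0
```

Lower values mean a steadier matte; the ON/OFF comparison in the results table is this number with recurrent states kept vs reset every frame.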

How to reproduce (temporal jitter)

python scripts/compare_temporal.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
  --width 1280 --height 720 --duration 30 --edge-mode grad --edge-grad-thresh 0.02 \
  --video benchmarks/6517471-hd_1920_1080_30fps.mp4 \
  --out-on benchmarks/temporal_on_512_ds025_720p_dshow_video6517471_grad.json \
  --out-off benchmarks/temporal_off_512_ds025_720p_dshow_video6517471_grad.json

Note: This metric is scene-dependent; rerun with real motion to see temporal benefits.

Video roadmap (optional)

  • Add a simple quality knob for blur (downscale/sigma) and document tradeoffs.
  • Add a temporal-stability comparison mode (reset recurrent states) + jitter metric.
  • Optional: integrate video into a UI (Gradio) once the core pipeline is rock-solid.

Acronyms

| Acronym | Meaning |
|---|---|
| RTMDet | Real-Time Multi-Object Detection |
| RVM | Robust Video Matting |
| SAM | Segment Anything Model |
