EdgeScope Studio is a local-first computer vision lab for prototyping image and video pipelines on your own machine (offline).
It has two modes:
- Image mode (RTMDet + SAM): general object detection + promptable segmentation for still images.
- Video mode (RVM): real-time portrait matting + background blur with temporal stability (recurrent states).
The core idea:
- Image mode: load your images -> run a permissively licensed detector (RTMDet Tiny on COCO) + SAM -> inspect boxes & masks -> iterate on thresholds, models, and logic without touching the cloud.
- Video mode: RVM recurrent states -> alpha matte -> blur compositor.
Video mode uses RVM to produce a temporally-stable alpha matte per frame; no detector/SAM in the loop.
Why RVM vs detect+SAM for video: RVM is video-native and keeps recurrent state, so edges stay stable frame-to-frame and inference is faster than running detector + SAM on every frame.
This is designed as a general CV tool, but with a strong focus on on-device and privacy-preserving use cases (e.g. ergonomics / digital wellbeing, industrial inspection, etc.).
For real-time portrait effects we use Robust Video Matting (RVM) (video-native, recurrent temporal states). For general object segmentation in still images we use detect -> segment (RTMDet + SAM).
RVM is a video matting model that keeps recurrent state across frames, which stabilizes edges and reduces flicker compared to per-frame-only inference. Those temporal states let the model "remember" motion and fine hair detail so the mask stays coherent over time.
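The recurrent loop is the key structural difference from per-frame inference. A minimal sketch, assuming the published RVM ONNX export signature (frame plus four recurrent states in, foreground/alpha plus updated states out); the `infer` callable and tensor shapes here are illustrative, not this repo's actual code:

```python
import numpy as np

def rvm_video_loop(infer, frames, downsample_ratio=0.25):
    """Run RVM over a frame sequence, threading the four recurrent
    states from each step into the next.

    `infer` is any callable matching the RVM ONNX export layout:
    (src, r1, r2, r3, r4, downsample_ratio) -> (fgr, pha, r1, r2, r3, r4).
    """
    # RVM starts from zero-initialized recurrent states on the first frame.
    rec = [np.zeros((1, 1, 1, 1), dtype=np.float32)] * 4
    mattes = []
    for frame in frames:
        fgr, pha, *rec = infer(frame, *rec, downsample_ratio)
        mattes.append(pha)  # temporally-stable alpha matte for this frame
    return mattes

def reset_state():
    """Pressing `r` in the app corresponds to re-zeroing these states."""
    return [np.zeros((1, 1, 1, 1), dtype=np.float32)] * 4
```

Because each frame's states feed the next, dropping or reordering frames degrades edge stability, which is why the app exposes an explicit state reset.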
Quick start: Video mode (RVM background blur)
Requirements: Windows + NVIDIA GPU recommended (ONNX Runtime CUDA).
Model: Robust Video Matting (RVM) MobileNetV3 ONNX (downloaded locally; not committed).
```bash
python scripts/setup_video.py
python scripts/run_video.py --device cuda --input-size 512 --downsample 0.25

# CPU fallback (slower)
python scripts/run_video.py --device cpu --input-size 512 --downsample 0.25
```

Controls: `q` quit, `b` toggle blur/debug, `r` reset temporal state.
Default backend is `auto` (probe + cache). Auto mode will:
- run a short blur-off probe (`msmf` vs `dshow`)
- select the best stable backend and cache the decision
- define "stable" as: passed health check (frames flowing / not stuck) + `warning_count == 0`
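The auto selection described above reduces to a small probe-then-cache routine. A sketch under stated assumptions: the `probe` callable, its result keys, and the cache layout are hypothetical stand-ins for the real implementation.

```python
import json
from pathlib import Path

def select_backend(probe, cache_path, candidates=("msmf", "dshow")):
    """Probe each candidate backend; pick the best stable one and cache it.

    `probe(name)` is assumed to return a dict like
    {"healthy": bool, "warning_count": int, "fps": float}.
    """
    cache = Path(cache_path)
    if cache.exists():
        return json.loads(cache.read_text())["backend"]  # reuse cached decision

    stable = []
    for name in candidates:
        r = probe(name)
        # "stable" = health check passed and zero warnings during the probe
        if r["healthy"] and r["warning_count"] == 0:
            stable.append((r["fps"], name))
    if not stable:
        raise RuntimeError("no stable capture backend found; pin --backend manually")

    best = max(stable)[1]  # highest probe FPS among stable backends
    cache.write_text(json.dumps({"backend": best}))
    return best
```

The cache means subsequent runs skip the probe entirely; `--reprobe` would correspond to deleting or ignoring the cached decision.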
Override / reprobe:
```bash
python scripts/run_video.py --backend dshow ...
python scripts/run_video.py --backend msmf ...
python scripts/run_video.py --backend auto --reprobe ...
```

Probe directly:

```bash
python scripts/probe_backends.py --device cuda --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 20
```

```bash
# Blur ON (pin backend for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
    --width 1280 --height 720 --duration 30 --blur \
    --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
    --width 1280 --height 720 --duration 30 \
    --out benchmarks/rvm_512_ds025_720p_no_blur.json
```

Note: results can vary by camera/driver/virtual-cam; run `scripts/probe_backends.py` to pick the best backend on your system.
Related scripts: scripts/run_video.py, scripts/benchmark_video.py, scripts/compare_compositing_precision.py.
Pipeline overview: Capture -> Preprocess -> ORT RVM -> Alpha Matte -> Compositing -> Output.
| Metric | Value |
|---|---|
| Input | Pexels clip (ID 6517471), 720p file input |
| Config | 512 / 0.25 / dshow / blur_scale=0.5 / blur_sigma=8 |
| Throughput (optimized) | FPS mean 39.02 |
| Latency (optimized) | total p95 29.96 ms |
| Temporal stability (Sobel gradient edges jitter mean) | OFF 0.08246 -> ON 0.06315 (-23.4%) |
| Compositing optimization | comp mean 9.51 -> 5.72 ms; FPS 32.67 -> 39.02 (+19.4%) |
File-input benchmarks are not real-time limited (no 30 FPS camera cadence), so FPS can exceed 30 and represents pipeline throughput.
Webcam note: webcam numbers vary with capture backend and scene motion; see Appendix.
Repro commands: see Runtime stacks matter for file-input throughput commands and Temporal stability (RVM vs no temporal state) for temporal jitter commands.
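The compositing-optimization row rests on a standard trick: compute the expensive blur at reduced resolution (`blur_scale=0.5`), upsample, and alpha-composite so only the background is blurred. A dependency-free sketch of that idea; a crude neighbor-average blur stands in for the real Gaussian, and the function is illustrative, not the repo's compositor:

```python
import numpy as np

def composite_blur(frame, alpha, blur_scale=0.5):
    """Blur the background at reduced resolution, then alpha-composite.

    frame: HxWx3 float32 in [0,1]; alpha: HxW float32 matte (1 = subject).
    """
    h, w = frame.shape[:2]
    step = int(round(1 / blur_scale))
    small = frame[::step, ::step]  # cheap downscale (nearest)
    # one smoothing pass: average each pixel with its wrap-around neighbors
    sm = (small + np.roll(small, 1, 0) + np.roll(small, 1, 1)) / 3.0
    bg = sm.repeat(step, axis=0).repeat(step, axis=1)[:h, :w]  # upscale back
    a = alpha[..., None]
    return a * frame + (1.0 - a) * bg  # subject stays sharp, background blurred
```

Blurring at half resolution cuts the blur's pixel count by 4x, which is consistent with the measured comp-time drop (9.51 -> 5.72 ms) at a small quality cost.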
Capture environment provenance for both venv and conda before comparing benchmark numbers.
This captures:
- Python version/executable/platform
- `pip freeze` (and `conda list` when applicable)
- ONNX Runtime version/device/providers
- OpenCV version + full build info
- DLL resolution and loaded-module paths for `onnxruntime_providers_cuda.dll`, `cudnn64_9.dll`, `cublas64_12.dll`, `cudart64_12.dll`
```powershell
# Venv runtime snapshot (example path)
C:\Users\msi\AppData\Local\Temp\edgescope-studio-main\.venv-video\Scripts\python.exe `
  scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_venv_runtime.json

# Conda runtime snapshot
python scripts/capture_runtime_env.py `
  --model models/rvm_mobilenetv3_fp32.onnx `
  --out benchmarks/env_conda_runtime.json
```

Compare:
- `onnxruntime.providers_available`
- `session_probe.providers_active`
- `dlls.*.loaded_module_path`
If DLL loaded-module paths differ between environments, benchmark differences are expected even on the same machine/driver.
Same code/model/clip can yield different throughput depending on Python runtime and DLL search order.
| Stack | FPS mean | Infer mean (ms) | Comp mean (ms) | Total mean (ms) |
|---|---|---|---|---|
| Venv stack (`ab_venv.json`) | 46.86 | 16.09 | 4.43 | 21.32 |
| Conda stack (`ab_conda_forced_dll_order.json`) | 42.63 | 17.04 | 4.83 | 23.44 |
| Delta (venv - conda) | +4.23 (+9.9%) | -0.95 | -0.40 | -2.12 |
Artifacts:
- `benchmarks/ab_venv.json`
- `benchmarks/ab_conda_forced_dll_order.json`
- `benchmarks/env_venv_runtime.json`
- `benchmarks/env_conda_runtime.json`
Notes:
- `ab_*.json` are benchmark outputs from `scripts/benchmark_video.py` (same flags, different environments).
- `env_*.json` are runtime captures from `scripts/capture_runtime_env.py`.
```bash
# Optimized compositing path
python scripts/benchmark_video.py --device cuda \
    --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
    --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
    --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path optimized --backend dshow \
    --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_opt.json

# Legacy compositing path (A/B against optimized)
python scripts/benchmark_video.py --device cuda \
    --video benchmarks/6517471-hd_1920_1080_30fps.mp4 --video-frame-index 0 --video-frame-count 0 \
    --input-size 512 --downsample 0.25 --width 1280 --height 720 --duration 30 \
    --blur --blur-scale 0.5 --blur-sigma 8 --comp-mode soft --alpha-path legacy --backend dshow \
    --out benchmarks/rvm_512_ds025_720p_blur_soft_profile_video6517471_full10s_legacy.json
```

Captured with:
- `benchmarks/env_venv_runtime.json`
- `benchmarks/env_conda_runtime.json`
Control checks (both environments):
- `onnxruntime==1.24.1`
- `providers_active=["CUDAExecutionProvider","CPUExecutionProvider"]`
- Python executable differs by environment (venv vs conda)
Both environments report GPU execution active, but CUDA runtime DLLs are loaded from different locations:
| DLL | venv (pip CUDA wheels) | conda/system CUDA |
|---|---|---|
| `onnxruntime_providers_cuda.dll` | `...\.venv-video\Lib\site-packages\onnxruntime\capi\...` | `...\miniconda3\envs\edgescope-cuda\Lib\site-packages\onnxruntime\capi\...` |
| `cudnn64_9.dll` | `...\.venv-video\Lib\site-packages\nvidia\cudnn\bin\...` | `...\miniconda3\envs\edgescope-cuda\Library\bin\...` |
| `cublas64_12.dll` | `...\.venv-video\Lib\site-packages\nvidia\cublas\bin\...` | `...\miniconda3\envs\edgescope-cuda\Library\bin\...` |
| `cudart64_12.dll` | `...\.venv-video\Lib\site-packages\nvidia\cuda_runtime\bin\...` | `...\CUDA\v12.1\bin\...` |
Conclusion: same model + same benchmark flags can yield different latency distributions because ORT loads different CUDA/cuDNN/cuBLAS runtime DLLs depending on environment and DLL search order (PATH), affecting kernel selection and scheduling.
Repro:
- Run `python scripts/capture_runtime_env.py --model models/rvm_mobilenetv3_fp32.onnx --out ...` from each environment.
- Compare `dlls.*.loaded_module_path` and `session_probe.providers_active`.
- Run the same `scripts/benchmark_video.py` command in both environments and compare output JSONs.
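The loaded-module-path comparison above can be scripted. A minimal sketch, assuming the capture JSON follows the key layout quoted earlier (`dlls.<name>.loaded_module_path`):

```python
import json
from pathlib import Path

def dll_diff(env_a, env_b):
    """Diff dlls.*.loaded_module_path between two runtime captures.

    Any entry returned means the two stacks load a different binary
    behind the same DLL name, so benchmark deltas are expected.
    """
    a = json.loads(Path(env_a).read_text())["dlls"]
    b = json.loads(Path(env_b).read_text())["dlls"]
    diffs = {}
    for name in sorted(set(a) | set(b)):
        pa = a.get(name, {}).get("loaded_module_path")
        pb = b.get(name, {}).get("loaded_module_path")
        if pa != pb:
            diffs[name] = (pa, pb)
    return diffs
```

An empty result is the precondition for treating two environments' benchmark numbers as directly comparable.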
One-command repro entry point: scripts/repro_video_bench.ps1 (clean PATH mode enabled by default).
Repro commands (webcam, secondary):
```bash
# Blur ON (backend pinned for reproducibility)
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
    --width 1280 --height 720 --duration 30 --blur \
    --out benchmarks/rvm_512_ds025_720p_blur.json

# Blur OFF
python scripts/benchmark_video.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
    --width 1280 --height 720 --duration 30 \
    --out benchmarks/rvm_512_ds025_720p_no_blur.json
```

Collected with `scripts/benchmark_video.py` (30 s, `--input-size 512 --downsample 0.25`). Backend pinned to `dshow` (the cached winner on this machine).
| Blur | FPS (mean) | Total mean (ms) | Total p95 (ms) | Infer mean (ms) | Infer p95 (ms) | Comp mean (ms) | Comp p95 (ms) |
|---|---|---|---|---|---|---|---|
| ON | 29.6 | 33.7 | 41.9 | 20.1 | 25.3 | 10.1 | 11.4 |
| OFF | 29.9 | 33.4 | 45.8 | 20.5 | 28.0 | 0.0 | 0.0 |
Known issues
- Windows capture backend variability (camera/driver/virtual-cam). Use `scripts/probe_backends.py` and pin `--backend` when benchmarking.
- Virtual cameras can change capture timing; probe with the virtual cam ON if that's your usage.
- First-run warmup effects; the benchmark includes a warmup phase to reduce first-frame skew.
- Trimap compositing (hard fg/bg + soft edge band) was tested as an optimization but measured slower due to mask construction overhead (see `benchmarks/rvm_512_ds025_720p_blur_soft_profile.json` vs `benchmarks/rvm_512_ds025_720p_blur_trimap_profile.json`).
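The warmup caveat above matters on the analysis side too: summary statistics should exclude warmup frames before computing mean and p95. A sketch of that post-processing (nearest-rank percentile; not the repo's actual stats code, and the warmup count is illustrative):

```python
import math

def summarize_latency(samples_ms, warmup=30):
    """Drop warmup frames, then summarize as mean and p95.

    Uses the nearest-rank method for the percentile, matching the
    mean/p95 columns reported in the tables above in spirit.
    """
    xs = sorted(samples_ms[warmup:])
    if not xs:
        raise ValueError("need more samples than warmup frames")
    k = math.ceil(0.95 * len(xs)) - 1  # nearest-rank 95th percentile index
    return {"mean": sum(xs) / len(xs), "p95": xs[k]}
```

Without the warmup cut, one-time costs (CUDA context creation, first-kernel compilation) inflate both the mean and the tail.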
- Video mode (RVM): real-time portrait matting + background blur with temporal stability, backend auto-probe/caching, and headless benchmarking.
- Image demo with RTMDet Tiny (COCO) for boxes + labels.
- Segment Anything (SAM ViT-B) turns those boxes into masks; toggleable in the UI.
- Class whitelist + aliases in `config/classes.yaml` (single source of truth).
- Gradio UI (`scripts/run_image_app.py`) with confidence slider and "Show SAM masks".
Use Python 3.10 and the provided requirements. CUDA builds are pinned; adjust if needed. (Video mode uses ONNX Runtime; see the Video quick start above.)
- Install deps (in your env, e.g. `conda activate edgescope-cuda`):

```bash
pip install -r requirements.txt
```

- Download checkpoints:
  - RTMDet: already in `rtmdet/` (`rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth`)
  - SAM ViT-B: place at `sam/sam_vit_b_01ec64.pth` (fallback name `sam_vit_b.pth`)
- Run the app:

```bash
python scripts/run_image_app.py
```

Open http://127.0.0.1:7860, upload an image, set confidence, and toggle "Show SAM masks".
- Detector is COCO-trained; class filtering/aliasing is controlled by `config/classes.yaml`.
- SAM is class-agnostic; we prompt it with RTMDet boxes so we only segment detected objects (faster than running SAM across the whole image, and the masks inherit the detector's class labels).
- Why detection first: without detector boxes you'd have to run SAM's auto-segmentation over the whole image (more masks, higher latency) and then classify each mask with another model to recover class labels; that is slower and less reliable than detect -> segment.
- If the default port is busy, change `server_port` in `scripts/run_image_app.py`.
- Performance snapshot on RTX 4060 Ti (1024x1536 image): first-run init ~33.4 s (detector) + ~3.1 s (SAM); per-image after init ~1.5 s detector + ~1.1 s SAM.
| Resolution | Detections | Masks | Detector (s) | SAM (s) | Total (s) |
|---|---|---|---|---|---|
| 1024x1536 (orig) | 39 | 39 | 0.746 | 0.764 | 1.511 |
| 640x426 (downscale) | 38 | 38 | 0.125 | 0.594 | 0.718 |
Notes: models are already loaded; numbers exclude one-time init.
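The whitelist/alias filtering that `config/classes.yaml` and the confidence slider drive reduces, roughly, to the sketch below before boxes are handed to SAM. Field names and the detection tuple layout are illustrative, not the repo's actual interfaces:

```python
def filter_detections(dets, whitelist, aliases, conf_thresh=0.35):
    """Keep detections above the confidence threshold whose (aliased)
    class name is whitelisted; surviving boxes become SAM prompts.

    dets: iterable of (label, score, box) tuples.
    aliases: raw label -> canonical name, mirroring config/classes.yaml.
    """
    kept = []
    for label, score, box in dets:
        name = aliases.get(label, label)  # apply alias before whitelisting
        if score >= conf_thresh and name in whitelist:
            kept.append((name, score, box))
    return kept
```

Keeping this logic in one place (the YAML file) is what makes it a "single source of truth": the UI slider only changes `conf_thresh`, never the class logic.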
Temporal stability (RVM vs no temporal state)
Jitter metric: mean(abs(alpha_t - alpha_{t-1})) over frames (lower is better).
Jitter measures frame-to-frame matte instability: mean absolute change in alpha between consecutive frames.
Reported for all pixels and for edge regions (alpha in [0.1, 0.9] or |grad alpha| > 0.02).
Attribution: Pexels video "A woman talking in front of the computer while drinking" (ID 6517471). Downloaded locally for benchmarking; not redistributed.
Primary edge definition: Sobel gradient edges (--edge-mode grad, threshold 0.02). Auxiliary: alpha band 0.1-0.9.
We use |Sobel(alpha)| > 0.02 as the edge set; this was chosen to produce a stable edge fraction (~6-8%) on 720p portrait clips.
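The jitter metric reduces to a few lines of NumPy. A sketch only: a central-difference gradient stands in for the actual Sobel operator, and this is not the repo's implementation.

```python
import numpy as np

def edge_jitter(alphas, grad_thresh=0.02):
    """Temporal jitter: mean |alpha_t - alpha_{t-1}| over all pixels,
    and restricted to gradient edges (|grad alpha| > grad_thresh)."""
    all_vals, edge_vals = [], []
    for prev, cur in zip(alphas, alphas[1:]):
        diff = np.abs(cur - prev)
        gy, gx = np.gradient(prev)           # stand-in for Sobel gradients
        edges = np.hypot(gx, gy) > grad_thresh
        all_vals.append(diff.mean())
        if edges.any():
            edge_vals.append(diff[edges].mean())  # jitter on the edge set only
    return {
        "jitter_all": float(np.mean(all_vals)),
        "jitter_edges": float(np.mean(edge_vals)) if edge_vals else 0.0,
    }
```

Restricting to the edge set is what makes the metric sensitive to matte flicker: interior pixels are near 0 or 1 in both states, so most of the ON/OFF difference lives at the silhouette.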
Headline numbers are in the Results (frozen config) table above.
```bash
python scripts/compare_temporal.py --device cuda --backend dshow --input-size 512 --downsample 0.25 \
    --width 1280 --height 720 --duration 30 --edge-mode grad --edge-grad-thresh 0.02 \
    --video benchmarks/6517471-hd_1920_1080_30fps.mp4 \
    --out-on benchmarks/temporal_on_512_ds025_720p_dshow_video6517471_grad.json \
    --out-off benchmarks/temporal_off_512_ds025_720p_dshow_video6517471_grad.json
```

Note: this metric is scene-dependent; rerun with real motion to see temporal benefits.
- Add a simple quality knob for blur (downscale/sigma) and document tradeoffs.
- Add a temporal-stability comparison mode (reset recurrent states) + jitter metric.
- Optional: integrate video into a UI (Gradio) once the core pipeline is rock-solid.
| Acronym | Meaning |
|---|---|
| RTMDet | Real-Time Models for object Detection |
| RVM | Robust Video Matting |
| SAM | Segment Anything Model |

