Text Detector 🚀 (DBNet / PP-OCR det on ONNX Runtime, CPU-only)

C++17 OpenCV 4 ONNX Runtime OpenMP Meson + Ninja macOS / Linux


A fast, CPU-only text detector powered by ONNX Runtime. It supports tiled inference, polygon NMS, IOBinding (to eliminate per-frame allocations), and a benchmark mode with p50/p90/p99 latency reporting. Designed for production: clean code, robust shape handling (NCHW/NHWC/2D/3D), and safe defaults for multi-core servers. The output is an image with quadrilateral boxes drawn on it, plus the four (x, y) vertices of each box printed to stdout.


Highlights

  • Fast CPU inference (x86 / ARM, macOS & Linux)
  • 🧩 Tiled inference (RxC grid) with overlap + polygonal NMS
  • 💾 IOBinding: reuse input/output buffers, zero allocations per frame
  • 📈 Bench mode: p50/p90/p99 latency, warmup, optional no-draw
  • 🧠 Robust output shape support: [1,1,H,W], [1,H,W,1], [1,H,W], [H,W]
  • 🔒 Threading done right: separate knobs for OpenMP (tiles) and ORT (intra-op)
  • 🧪 Clean logging: detections to stdout, performance to stderr

How it works

File → OpenCV decode (BGR8)
     → Resize (dynamic --side or fixed --fixed_hw)
     → Normalize (RGB float32, CHW)
     → ONNX Runtime (backbone/neck/head) → probability map (or logits)
     → (optional) Sigmoid (--apply_sigmoid 1)
     → Threshold + morphology (--unclip)
     → Contours → minAreaRect → ordered quad
     → Map coords back to original image size
     → (Tiles) offset + polygon NMS
     → Draw boxes, print coordinates to stdout
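
For orientation, the resize + normalize stage maps to roughly the following C++ (a condensed sketch, not the repo's actual functions; the mean/std values are the ImageNet stats this app uses by default, see the Model Zoo notes):

#include <algorithm>
#include <opencv2/opencv.hpp>
#include <vector>

// Round a side length to a multiple of 32 (a common detector constraint).
static int roundTo32(int v) { return std::max(32, (v / 32) * 32); }

// Resize so the longer side is ~`side`, then convert BGR8 -> normalized RGB float32 CHW.
std::vector<float> preprocess(const cv::Mat& bgr, int side, int& outW, int& outH) {
    const float scale = static_cast<float>(side) / std::max(bgr.cols, bgr.rows);
    outW = roundTo32(static_cast<int>(bgr.cols * scale));
    outH = roundTo32(static_cast<int>(bgr.rows * scale));

    cv::Mat rgb;
    cv::resize(bgr, rgb, cv::Size(outW, outH));
    cv::cvtColor(rgb, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);

    const float mean[3] = {0.485f, 0.456f, 0.406f};   // ImageNet mean
    const float stdd[3] = {0.229f, 0.224f, 0.225f};   // ImageNet std

    std::vector<float> chw(3 * outH * outW);          // HWC -> CHW
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < outH; ++y)
            for (int x = 0; x < outW; ++x)
                chw[(c * outH + y) * outW + x] =
                    (rgb.at<cv::Vec3f>(y, x)[c] - mean[c]) / stdd[c];
    return chw;
}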

💡 Why separate thread knobs?

  • --tile_omp (OpenMP) parallelizes across tiles (outer level).
  • --threads (ONNX Runtime) parallelizes within a single tile (intra-op).
    On big CPUs, use many OMP threads and few ORT threads (often 1–2) to avoid oversubscription.
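
In code, the two knobs live at different levels. A minimal sketch (the model path and tile count are placeholders; the real flag plumbing is in the repo):

#include <omp.h>
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "text_det");
    Ort::SessionOptions so;
    so.SetIntraOpNumThreads(2);                   // inner level: maps to --threads
    Ort::Session session(env, "model.onnx", so);  // placeholder path

    const int n_tiles = 9;                        // e.g., a 3x3 grid
    omp_set_num_threads(8);                       // outer level: maps to --tile_omp
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < n_tiles; ++t) {
        // run inference for tile t; a single Ort::Session supports
        // concurrent Run() calls, so workers can share it
    }
    return 0;
}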

Requirements

  • C++17, Meson (≥ 1.0), Ninja, pkg-config
  • OpenCV 4.x (core, imgproc, imgcodecs)
  • ONNX Runtime (CPU EP)
  • OpenMP (recommended for tiling)

Linux (Ubuntu example)

sudo apt-get update
sudo apt-get install -y build-essential ninja-build meson cmake cmake-data pkg-config libopencv-dev python3 python3-pip libomp-dev 

Install ONNX Runtime:

Either use official binaries (copy headers+libs into /usr/local) or build from source (Release, CPU only):

git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config Release --build_shared_lib --parallel

After build finishes, copy headers+libs to /usr/local (adjust paths if needed):

sudo cp -r include/onnxruntime /usr/local/include/
sudo cp -d build/Linux/Release/libonnxruntime.so* /usr/local/lib/
sudo cp -d build/Linux/Release/libonnxruntime_providers_shared.so /usr/local/lib/
sudo ldconfig

macOS (Apple Silicon)

brew install meson ninja opencv onnxruntime libomp

Headers are typically at /opt/homebrew/Cellar/onnxruntime/<version>/include and libraries at /opt/homebrew/Cellar/onnxruntime/<version>/lib.

If you hit a Symbol not found: ___kmpc_barrier error, it means your binary was compiled with OpenMP but the OpenMP runtime library isn't being found/linked at launch. Try this:

brew install llvm

Install & Build

meson setup build
meson compile -C build 

Or run the helper script scripts/build.sh from the project root:

chmod +x ./scripts/build.sh
./scripts/build.sh

💡 If you see onnxruntime_cxx_api.h: No such file or directory, verify that the ORT headers are discoverable by Meson (e.g., the Homebrew path /opt/homebrew/Cellar/onnxruntime/<version>/include on macOS).

Model Zoo

This project is model-agnostic as long as your detector exports a single-channel probability (or logit) map. Below are two practical sources of ready-to-use models.

1) MMOCR (PyTorch) models → ONNX

MMOCR provides many DBNet-based detectors (R50, MobileNet, DCN variants, etc.). You can export them to ONNX and use them directly with this tool. Detailed information about the available models can be found at mmocr_models; also take a look at ONNX Runtime support at mmocr_support.

Export with MMOCR’s pytorch2onnx.py

  1. Clone and install MMOCR (use versions compatible with your checkpoint):
git clone https://github.com/open-mmlab/mmocr.git
cd mmocr
python3.11 -m venv mvenv
source ./mvenv/bin/activate
pip install -r requirements.txt
pip install onnx onnxsim
  2. Export to ONNX:
python tools/deployment/pytorch2onnx.py <CONFIG.py> --checkpoint <MODEL.pth> --output-file <OUT.onnx> --opset 11 --dynamic-export
  3. (Optional) Simplify the graph:
python -m onnxsim <OUT.onnx> <OUT-sim.onnx>

Notes & tips

  • Prefer opset ≥ 11. For CPU inference, 11–13 is typically safe.
  • If you need dynamic spatial sizes, keep --dynamic-export; otherwise static shapes plus --fixed_hw may be faster and more stable.
  • Some MMOCR configs already include the final Sigmoid in the head. If your output looks like logits, run with --apply_sigmoid 1.
  • Keep input channels at 3 unless you change the first conv to 1-channel and re-train/fine-tune (grayscale alone rarely gives a big speedup).

If you prefer MMDeploy, you can export via MMDeploy’s ONNX pipeline as well: just ensure the resulting model outputs a single-channel map and that pre/post-processing matches what this app expects.

2) PaddleOCR ONNX

There are pre-converted PaddleOCR detectors on the Hugging Face Hub: deepghs/paddleocr, including lightweight PP-OCR mobile variants. Typical model names you can find in the project's models directory:

  • ch_PP-OCRv2_det.onnx
  • ch_PP-OCRv3_det.onnx
  • ch_PP-OCRv4_det.onnx
  • ch_PP-OCRv4_server_det.onnx
  • ch_ppocr_mobile_slim_v2.0_det.onnx
  • ch_ppocr_mobile_v2.0_det.onnx
  • ch_ppocr_server_v2.0_det.onnx
  • en_PP-OCRv3_det.onnx

Important compatibility notes

  • Output often contains logits → run with --apply_sigmoid 1.
  • Normalization differs from ImageNet: PaddleOCR commonly uses img = (img/255.0 - 0.5) / 0.5 (i.e., mean=(0.5,0.5,0.5), std=(0.5,0.5,0.5)).
    The current code uses ImageNet stats (mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)). For best accuracy with Paddle models, adjust the normalization in code to Paddle’s scheme or re-export to match ImageNet stats.
  • Input sizes are typically dynamic with the constraint H,W % 32 == 0. Use --fixed_hw (e.g., 640x640) or --side to meet that requirement.
  • If you see Unexpected output shape, your detector might output a different tensor layout. This app handles [1,1,H,W], [1,H,W,1], [1,H,W], and [H,W]. If yours differs, inspect the model head or adjust the post-processing accordingly.

💡 If you switch to Paddle normalization, update mean / std in code accordingly.
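
The swap itself is a two-constant change; a sketch (variable names here are illustrative, not the actual ones in the code):

// ImageNet stats (the current default in this app):
const float mean_imagenet[3] = {0.485f, 0.456f, 0.406f};
const float std_imagenet[3]  = {0.229f, 0.224f, 0.225f};

// PaddleOCR scheme, equivalent to img = (img/255.0 - 0.5) / 0.5:
const float mean_paddle[3] = {0.5f, 0.5f, 0.5f};
const float std_paddle[3]  = {0.5f, 0.5f, 0.5f};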

💡 For highest stability in batch/production (hundreds of images): combine IOBinding (--bind_io 1) with a fixed input size (--fixed_hw WxH) and keep ORT threads small (--threads 1–2) while scaling tiles via OpenMP (--tile_omp).

Command-line Options

  • --model (string): Path to ONNX detector (DBNet / PP-OCR det).
  • --image (string): Path to input image.
  • --out (string, default out.png): Output image with drawn boxes.
  • --bin_thresh (float, default 0.3): Threshold for binarizing the probability map (0..1).
  • --box_thresh (float, default 0.6): Filter boxes by mean probability inside the polygon.
  • --side (int, default 960): Max side length (dynamic resize, keeps aspect ratio; rounded to a multiple of 32). Ignored if --fixed_hw is set.
  • --threads (int, default 0→1): ONNX Runtime intra-op threads per tile. Use 1–2 with tiling.
  • --unclip (float, default 1.5): Morphological “inflate” before contours (DB-style).
  • --apply_sigmoid (0/1, default 0): Apply sigmoid if the model outputs logits (not in [0,1]).
  • --tiles (RxC): Enable tiling (e.g., 3x3). Each tile runs inference separately.
  • --tile_overlap (float, default 0.10): Fractional overlap between tiles (0..0.5) to avoid cut words.
  • --nms_iou (float, default 0.30): Polygon NMS IoU threshold to drop duplicates between tiles.
  • --tile_omp (int, default 0→env/auto): OpenMP threads for tile-level parallelism.
  • --omp_places (string, default cores): Sets OMP_PLACES (e.g., cores, threads, sockets, or custom {…}).
  • --omp_bind (string, default close): Sets OMP_PROC_BIND (close, spread, master, true, false).
  • --bind_io (0/1, default 0): Enable IOBinding (reuses buffers; no per-frame allocations).
  • --fixed_hw (WxH): Fixed input size (e.g., 640x640, rounded to /32). Great with --bind_io.
  • --bench (int): Run benchmark for N iterations (p50/p90/p99).
  • --warmup (int, default 20): Warmup iterations (excluded from stats).
  • --no_draw (0/1, default 0): In bench mode, disable drawing/saving to keep timings clean.
  • -h, --help: Show usage.

⚠️ Output format (stdout) is one line per detection (vertices are in consistent clockwise order):

x0,y0 x1,y1 x2,y2 x3,y3
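
If a downstream consumer needs these lines, a hypothetical C++ parse could look like this (using %f so it works whether coordinates print as integers or floats):

#include <cstdio>
#include <string>

// Parse one detection line of the form "x0,y0 x1,y1 x2,y2 x3,y3".
bool parseQuad(const std::string& line, float x[4], float y[4]) {
    return std::sscanf(line.c_str(), "%f,%f %f,%f %f,%f %f,%f",
                       &x[0], &y[0], &x[1], &y[1],
                       &x[2], &y[2], &x[3], &y[3]) == 8;
}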

Quick Start

Demo script:

chmod +x ./scripts/run.sh
./scripts/run.sh

Basic (no tiling):

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --threads 4 --side 640 --bin_thresh 0.3 --box_thresh 0.6

Model that outputs logits (no final Sigmoid):

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --threads 4 --apply_sigmoid 1 --bin_thresh 0.3 --box_thresh 0.3

Common Recipes

Tiling on a big server (e.g., 96 cores)

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --threads 2 --tile_omp 8 --omp_places cores --omp_bind close
  • Keep ORT intra-op small (--threads 1–2).
  • Use lots of OpenMP threads for tiles (--tile_omp).

IOBinding + fixed size (best reuse, hundreds of images)

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --bind_io 1 --fixed_hw 640x640 --threads 4

Tiling + IOBinding + fixed size (stable latency under load)

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --bind_io 1 --fixed_hw 640x640 --threads 2 --tile_omp 8 --omp_places cores --omp_bind close

Performance Tuning Guide

  • Two levels of parallelism:
    • OpenMP (outer) = --tile_omp (or OMP_NUM_THREADS) → parallel tiles.
    • ONNX Runtime (inner) = --threads → parallel inside a tile.
  • Avoid oversubscription: on large CPUs, prefer many tiles (--tile_omp) and few ORT threads (--threads 1–2).
  • Pin threads for cache locality:
    • --omp_places cores + --omp_bind close is a safe default.
    • Dual-socket NUMA? Try --omp_bind spread.
  • IOBinding:
    • Enable --bind_io 1; ideally combine with --fixed_hw WxH (multiple of 32) to never re-bind.
  • Thresholds:
    • --bin_thresh usually 0.2–0.4, --box_thresh 0.5–0.7.
    • For small text, increase --side or use tiling with overlap 0.10–0.20.

Benchmark Mode

Measure end-to-end latency with warmup and tail-latency percentiles:

./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --bind_io 1 --fixed_hw 640x640 --threads 2 --tile_omp 8 --bench 200 --warmup 50 --no_draw 1

Report includes (stderr):

  • total_ms: avg, p50, p90, p99 (entire pipeline),
  • infer_ms: p50, p90, p99 (sum of ORT time across tiles),
  • fps@p50: quick throughput estimate at median.

💡 Tip: For consistent numbers, disable drawing/saving (--no_draw 1) and keep shapes fixed (--fixed_hw).
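
For reference, the reported percentiles can be reproduced from raw per-iteration timings with a simple nearest-rank computation (a sketch; the app's exact interpolation may differ):

#include <algorithm>
#include <vector>

// Nearest-rank percentile over measured latencies (ms), p in [0, 100].
double percentile(std::vector<double> ms, double p) {
    std::sort(ms.begin(), ms.end());
    const size_t idx = static_cast<size_t>(p / 100.0 * (ms.size() - 1));
    return ms[idx];
}
// e.g., percentile(latencies, 50), percentile(latencies, 90), percentile(latencies, 99)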

IOBinding Deep-Dive

What it is: binding ONNX input / output tensors directly to your pre-allocated buffers.
Why it matters: eliminates per-frame allocations & copies, improving latency stability.

Best practice:

  • Set --bind_io 1.
  • Use fixed shapes with --fixed_hw WxH (rounded to /32).
  • With tiling, each OpenMP worker gets its own binding context (no locks).

💡 Without --fixed_hw, the code will probe once per new size (first call), bind, and then reuse for that WxH in that worker.
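
In ONNX Runtime's C++ API the pattern looks roughly like this (a sketch: the tensor names "x" / "y" and the output shape are placeholders; use your model's actual names and dims):

#include <onnxruntime_cxx_api.h>
#include <vector>

void bindAndRun(Ort::Session& session, int H, int W) {
    auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

    // Pre-allocated host buffers, reused across frames.
    std::vector<float> in(3 * H * W), out(H * W);
    const int64_t inShape[]  = {1, 3, H, W};
    const int64_t outShape[] = {1, 1, H, W};

    Ort::Value inT  = Ort::Value::CreateTensor<float>(mem, in.data(),  in.size(),  inShape,  4);
    Ort::Value outT = Ort::Value::CreateTensor<float>(mem, out.data(), out.size(), outShape, 4);

    Ort::IoBinding binding(session);
    binding.BindInput("x", inT);    // placeholder input name
    binding.BindOutput("y", outT);  // placeholder output name

    // Per frame: refill `in`, then run; no new allocations.
    session.Run(Ort::RunOptions{nullptr}, binding);
}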

Tiling & NMS

  • --tiles RxC splits the image into a grid and runs inference per tile.
  • --tile_overlap avoids cutting words at tile borders.
  • After stitching, polygon NMS removes duplicate boxes across tiles using IoU (typical 0.2–0.4).
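
The polygon IoU at the heart of that NMS can be computed with OpenCV's convex-polygon intersection; a sketch of the idea (the repo's actual implementation may differ):

#include <algorithm>
#include <numeric>
#include <opencv2/imgproc.hpp>
#include <vector>

// IoU of two convex quads via cv::intersectConvexConvex.
float polyIoU(const std::vector<cv::Point2f>& a, const std::vector<cv::Point2f>& b) {
    std::vector<cv::Point2f> inter;
    const float ia = cv::intersectConvexConvex(a, b, inter, true);
    const float ua = static_cast<float>(cv::contourArea(a) + cv::contourArea(b)) - ia;
    return ua > 0.f ? ia / ua : 0.f;
}

// Greedy NMS: keep highest-scoring quads, drop overlaps above iouThresh.
std::vector<int> polyNMS(const std::vector<std::vector<cv::Point2f>>& quads,
                         const std::vector<float>& scores, float iouThresh) {
    std::vector<int> order(quads.size()), keep;
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int i, int j) { return scores[i] > scores[j]; });
    std::vector<bool> dropped(quads.size(), false);
    for (size_t i = 0; i < order.size(); ++i) {
        if (dropped[order[i]]) continue;
        keep.push_back(order[i]);
        for (size_t j = i + 1; j < order.size(); ++j)
            if (!dropped[order[j]] &&
                polyIoU(quads[order[i]], quads[order[j]]) > iouThresh)
                dropped[order[j]] = true;
    }
    return keep;
}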

💡 For heavy servers: tiling scales extremely well with OpenMP (outer) threads. Keep ORT threads small.

Troubleshooting

  • onnxruntime_cxx_api.h: No such file or directory
    Make sure ONNX Runtime is installed and headers are visible to Meson (e.g., /usr/local/include on Linux, /opt/homebrew/opt/onnxruntime/include on macOS).

  • Unexpected output shape
    This tool supports [1,1,H,W], [1,H,W,1], [1,H,W], [H,W]. If your model differs, verify your export and the final layers. If outputs are logits (not in [0,1]), pass --apply_sigmoid 1.

  • Performance flatlines when increasing threads
    Likely oversubscription. Lower --threads (ORT) to 1–2; increase --tile_omp; pin threads: --omp_places cores --omp_bind close.

  • Boxes are weak or too many false positives
    Tune --bin_thresh, --box_thresh, --unclip. If model lacks final sigmoid, set --apply_sigmoid 1.

FAQ

Q: Can I speed up by feeding grayscale instead of RGB?
Not unless the model itself is changed to accept [1,1,H,W]. Feeding one channel into [1,3,H,W] doesn’t reduce compute. Changing the first conv to 1-channel helps only a little overall; accuracy may drop.

Q: How are coordinates printed?
Each detection line on stdout: x0,y0 x1,y1 x2,y2 x3,y3 (ordered clockwise).

Q: Does the tool support dynamic sizes?
Yes. Dynamic path uses --side. For best latency and zero re-binding, prefer --fixed_hw WxH with --bind_io 1.

Roadmap

  • Optional AABB/connected-components fast postprocess mode
  • Optional micro-batch tiling (pack multiple tiles into a single N×C×H×W run)
  • Built-in accuracy eval (precision/recall/F1) against custom annotation formats
  • ...

License

MIT. Feel free to adapt it for your own needs.

Credits

This project uses OpenCV, OpenMP and ONNX Runtime. Model families supported include DBNet and PP-OCR det models exported to ONNX.

👾 Happy detecting! 👾
