A fast, CPU-only text detector powered by ONNX Runtime. It supports tiled inference, polygon NMS, IOBinding (to eliminate per-frame allocations), and a benchmark mode with p50/p90/p99 latency reporting. Designed for production: clean code, robust output-shape handling (NCHW / NHWC / 2D / 3D), and safe defaults for multi-core servers. The output is an image with quadrilateral boxes drawn on it, plus the 4 (x, y) points of each box printed to stdout.
- Highlights
- How it works
- Requirements
- Install & Build
- Model Zoo
- Command-line Options
- Quick Start
- Common Recipes
- Performance Tuning Guide
- Benchmark Mode
- IOBinding Deep-Dive
- Tiling & NMS
- Troubleshooting
- FAQ
- Roadmap
- License
- ⚡ Fast CPU inference (x86 / ARM, macOS & Linux)
- 🧩 Tiled inference (RxC grid) with overlap + polygonal NMS
- 💾 IOBinding: reuse input/output buffers, zero allocations per frame
- 📈 Bench mode: p50/p90/p99 latency, warmup, optional no-draw
- 🧠 Robust output shape support: `[1,1,H,W]`, `[1,H,W,1]`, `[1,H,W]`, `[H,W]`
- 🔒 Threading done right: separate knobs for OpenMP (tiles) and ORT (intra-op)
- 🧪 Clean logging: detections to stdout, performance to stderr
```
File → OpenCV decode (BGR8)
     → Resize (dynamic --side or fixed --fixed_hw)
     → Normalize (RGB float32, CHW)
     → ONNX Runtime (backbone/neck/head) → probability map (or logits)
     → (optional) Sigmoid (--apply_sigmoid 1)
     → Threshold + morphology (--unclip)
     → Contours → minAreaRect → ordered quad
     → Map coords back to original image size
     → (Tiles) offset + polygon NMS
     → Draw boxes, print coordinates to stdout
```
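The sigmoid and threshold steps of the pipeline can be sketched in plain C++ (a simplified, pure-STL illustration; the actual tool operates on OpenCV `cv::Mat` buffers):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Sigmoid for models that emit logits (the --apply_sigmoid 1 path).
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Binarize a probability map at bin_thresh (0..1), producing a 0/255
// mask that contour extraction can consume.
std::vector<uint8_t> binarize(const std::vector<float>& prob, float bin_thresh) {
    std::vector<uint8_t> mask(prob.size());
    for (size_t i = 0; i < prob.size(); ++i)
        mask[i] = prob[i] > bin_thresh ? 255 : 0;
    return mask;
}
```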
💡 Why separate thread knobs?
`--tile_omp` (OpenMP) parallelizes across tiles (outer level). `--threads` (ONNX Runtime) parallelizes within a single tile (intra-op).
On big CPUs, use many OMP threads and few ORT threads (often 1–2) to avoid oversubscription.
- C++17, Meson (≥ 1.0), Ninja, pkg-config
- OpenCV 4.x (core, imgproc, imgcodecs)
- ONNX Runtime (CPU EP)
- OpenMP (recommended for tiling)
```bash
sudo apt-get update
sudo apt-get install -y build-essential ninja-build meson cmake cmake-data pkg-config libopencv-dev python3 python3-pip libomp-dev
```
Either use official binaries (copy headers + libs into `/usr/local`) or build from source (Release, CPU only):
```bash
git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config Release --build_shared_lib --parallel
```
After the build finishes, copy headers and libs to `/usr/local` (adjust paths if needed):
```bash
sudo cp -r include/onnxruntime /usr/local/include/
sudo cp -d build/Linux/Release/libonnxruntime.so* /usr/local/lib/
sudo cp -d build/Linux/Release/libonnxruntime_providers_shared.so /usr/local/lib/
sudo ldconfig
```
On macOS:
```bash
brew install meson ninja opencv onnxruntime libomp
```
Headers are typically at `/opt/homebrew/Cellar/onnxruntime/<version>/include` and libraries at `/opt/homebrew/Cellar/onnxruntime/<version>/lib`.
If you hit a `Symbol not found: ___kmpc_barrier` error, the binary was compiled with OpenMP but the OpenMP runtime library isn't being found/linked at launch. Try:
```bash
brew install llvm
```
Build the project:
```bash
meson setup build
meson compile -C build
```
Or run the build script from the project root:
```bash
chmod +x ./scripts/build.sh
./scripts/build.sh
```
💡 If you see `onnxruntime_cxx_api.h: No such file or directory`, verify that ORT headers are discoverable by Meson (e.g., the Homebrew path `/opt/homebrew/Cellar/onnxruntime/<version>/include` on macOS).
This project is model-agnostic as long as your detector exports a single-channel probability (or logit) map. Below are two practical sources of ready-to-use models.
MMOCR provides many DBNet-based detectors (R50, MobileNet, DCN variants, etc.). You can export them to ONNX and use them directly with this tool. Detailed information about the available models is in mmocr_models; also see the ONNX Runtime support notes in mmocr_support.
Export with MMOCR’s pytorch2onnx.py
- Clone and install MMOCR (use versions compatible with your checkpoint):
  ```bash
  git clone https://github.com/open-mmlab/mmocr.git
  cd mmocr
  python3.11 -m venv mvenv
  source ./mvenv/bin/activate
  pip install -r requirements.txt
  pip install onnx onnxsim
  ```
- Export to ONNX:
  ```bash
  python tools/deployment/pytorch2onnx.py <CONFIG.py> --checkpoint <MODEL.pth> --output-file <OUT.onnx> --opset 11 --dynamic-export
  ```
- (Optional) Simplify the graph:
  ```bash
  python -m onnxsim <OUT.onnx> <OUT-sim.onnx>
  ```

Notes & tips
- Prefer opset ≥ 11. For CPU inference, 11–13 is typically safe.
- If you need dynamic spatial sizes, keep `--dynamic-export`; otherwise static shapes plus `--fixed_hw` may be faster and more stable.
- Some MMOCR configs already include the final Sigmoid in the head. If your output looks like logits, run with `--apply_sigmoid 1`.
- Keep input channels at 3 unless you change the first conv to 1-channel and re-train/fine-tune (grayscale alone rarely gives a big speedup).
If you prefer MMDeploy, you can export via MMDeploy’s ONNX pipeline as well: just ensure the resulting model outputs a single-channel map and that pre/post-processing matches what this app expects.
There are pre-converted PaddleOCR detectors on the Hugging Face Hub: deepghs/paddleocr, including lightweight PP-OCR mobile variants. Typical model names you can find in the `models` directory of the project:
- `ch_PP-OCRv2_det.onnx`
- `ch_PP-OCRv3_det.onnx`
- `ch_PP-OCRv4_det.onnx`
- `ch_PP-OCRv4_server_det.onnx`
- `ch_ppocr_mobile_slim_v2.0_det.onnx`
- `ch_ppocr_mobile_v2.0_det.onnx`
- `ch_ppocr_server_v2.0_det.onnx`
- `en_PP-OCRv3_det.onnx`
Important compatibility notes
- Output often contains logits → run with `--apply_sigmoid 1`.
- Normalization differs from ImageNet: PaddleOCR commonly uses `img = (img/255.0 - 0.5) / 0.5` (i.e., `mean=(0.5,0.5,0.5)`, `std=(0.5,0.5,0.5)`). The current code uses ImageNet stats (`mean=(0.485,0.456,0.406)`, `std=(0.229,0.224,0.225)`). For best accuracy with Paddle models, adjust the normalization in code to Paddle's scheme or re-export to match ImageNet stats.
- Input sizes are typically dynamic with the constraint `H, W % 32 == 0`. Use `--fixed_hw` (e.g., `640x640`) or `--side` to meet that requirement.
- If you see `Unexpected output shape`, your detector might output a different tensor layout. This app handles `[1,1,H,W]`, `[1,H,W,1]`, `[1,H,W]`, and `[H,W]`. If yours differs, inspect the model head or adjust the post-processing accordingly.
💡 If you switch to Paddle normalization, update mean / std in code accordingly.
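As an illustration of the two normalization schemes (a sketch; the constant names below are hypothetical, not from the codebase):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Per-channel normalization of one RGB pixel already scaled to [0,1].
// ImageNet stats are what the current code uses; Paddle stats are what
// PP-OCR detectors typically expect.
std::array<float, 3> normalize(std::array<float, 3> rgb,
                               const std::array<float, 3>& mean,
                               const std::array<float, 3>& stdev) {
    for (int c = 0; c < 3; ++c) rgb[c] = (rgb[c] - mean[c]) / stdev[c];
    return rgb;
}

constexpr std::array<float, 3> kImageNetMean{0.485f, 0.456f, 0.406f};
constexpr std::array<float, 3> kImageNetStd{0.229f, 0.224f, 0.225f};
constexpr std::array<float, 3> kPaddleMean{0.5f, 0.5f, 0.5f};
constexpr std::array<float, 3> kPaddleStd{0.5f, 0.5f, 0.5f};
```

With the Paddle stats the formula collapses to `2*x - 1`, mapping [0,1] onto [-1,1].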
💡 For highest stability in batch/production (hundreds of images): combine IOBinding (`--bind_io 1`) with a fixed input size (`--fixed_hw WxH`) and keep ORT threads small (`--threads 1–2`) while scaling tiles via OpenMP (`--tile_omp`).
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | — | Path to ONNX detector (DBNet / PP-OCR det). |
| `--image` | string | — | Path to input image. |
| `--out` | string | `out.png` | Output image with drawn boxes. |
| `--bin_thresh` | float | `0.3` | Threshold for binarizing probability map (0..1). |
| `--box_thresh` | float | `0.6` | Filter boxes by mean probability inside polygon. |
| `--side` | int | `960` | Max side length (dynamic resize, keep aspect; rounded to multiple of 32). Ignored if `--fixed_hw` is set. |
| `--threads` | int | `0→1` | ONNX Runtime intra-op threads per tile. Use 1–2 with tiling. |
| `--unclip` | float | `1.5` | Morphological “inflate” before contours (DB-style). |
| `--apply_sigmoid` | 0/1 | `0` | Apply sigmoid if model outputs logits (not in [0,1]). |
| `--tiles` | `RxC` | — | Enable tiling (e.g., `3x3`). Each tile runs inference separately. |
| `--tile_overlap` | float | `0.10` | Fractional overlap for tiles (0..0.5) to avoid cut words. |
| `--nms_iou` | float | `0.30` | Polygon NMS IoU threshold to drop duplicates between tiles. |
| `--tile_omp` | int | `0→env/auto` | OpenMP threads for tile-level parallelism. |
| `--omp_places` | string | `cores` | Sets `OMP_PLACES` (e.g., `cores`, `threads`, `sockets`, or custom `{…}`). |
| `--omp_bind` | string | `close` | Sets `OMP_PROC_BIND` (`close`, `spread`, `master`, `true`, `false`). |
| `--bind_io` | 0/1 | `0` | Enable IOBinding (reuses buffers; no per-frame allocations). |
| `--fixed_hw` | `WxH` | — | Fixed input size (e.g., `640x640`, rounded to /32). Great with `--bind_io`. |
| `--bench` | int | — | Run benchmark for N iterations (p50/p90/p99). |
| `--warmup` | int | `20` | Warmup iterations (excluded from stats). |
| `--no_draw` | 0/1 | `0` | In bench mode, disable drawing/saving to keep timings clean. |
| `-h, --help` | — | — | Show usage. |
Each detection is printed to stdout as one line:
```
x0,y0 x1,y1 x2,y2 x3,y3
```
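If you consume the tool’s stdout from another program, a detection line can be parsed like this (a sketch that assumes well-formed input):

```cpp
#include <array>
#include <cassert>
#include <sstream>
#include <string>

struct Quad { std::array<float, 8> xy; };  // x0,y0 ... x3,y3

// Parse one detection line of the form "x0,y0 x1,y1 x2,y2 x3,y3".
Quad parse_quad(const std::string& line) {
    Quad q{};
    std::istringstream in(line);
    for (int i = 0; i < 4; ++i) {
        char comma;  // consumes the ',' between x and y
        in >> q.xy[2 * i] >> comma >> q.xy[2 * i + 1];
    }
    return q;
}
```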
Demo script:
```bash
chmod +x ./scripts/run.sh
./scripts/run.sh
```
Basic (no tiling):
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --threads 4 --side 640 --bin_thresh 0.3 --box_thresh 0.6
```
Model that outputs logits (no final Sigmoid):
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --threads 4 --apply_sigmoid 1 --bin_thresh 0.3 --box_thresh 0.3
```
Tiling on a big server (e.g., 96 cores):
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --threads 2 --tile_omp 8 --omp_places cores --omp_bind close
```
- Keep ORT intra-op small (`--threads 1–2`).
- Use lots of OpenMP threads for tiles (`--tile_omp`).
IOBinding + fixed size (best reuse, hundreds of images):
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --bind_io 1 --fixed_hw 640x640 --threads 4
```
Tiling + IOBinding + fixed size (stable latency under load):
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --bind_io 1 --fixed_hw 640x640 --threads 2 --tile_omp 8 --omp_places cores --omp_bind close
```
- Two levels of parallelism:
  - OpenMP (outer) = `--tile_omp` (or `OMP_NUM_THREADS`) → parallel tiles.
  - ONNX Runtime (inner) = `--threads` → parallel inside a tile.
- Avoid oversubscription: on large CPUs, prefer many tiles (`--tile_omp`) and few ORT threads (`--threads 1–2`).
- Pin threads for cache locality: `--omp_places cores` + `--omp_bind close` is a safe default.
  - Dual-socket NUMA? Try `--omp_bind spread`.
- IOBinding:
  - Enable `--bind_io 1`; ideally combine with `--fixed_hw WxH` (multiple of 32) to never re-bind.
- Thresholds:
  - `--bin_thresh` usually 0.2–0.4, `--box_thresh` 0.5–0.7.
  - For small text, increase `--side` or use tiling with overlap `0.10–0.20`.
Measure end-to-end latency with warmup and tail-latency percentiles:
```bash
./build/text_det --model ./models/ch_PP-OCRv4_det.onnx --image ./images/test.jpg --tiles 3x3 --tile_overlap 0.15 --nms_iou 0.3 --bind_io 1 --fixed_hw 640x640 --threads 2 --tile_omp 8 --bench 200 --warmup 50 --no_draw 1
```
The report includes (on stderr):
- `total_ms`: avg, p50, p90, p99 (entire pipeline)
- `infer_ms`: p50, p90, p99 (sum of ORT time across tiles)
- `fps@p50`: quick throughput estimate at the median

💡 Tip: For consistent numbers, disable drawing/saving (`--no_draw 1`) and keep shapes fixed (`--fixed_hw`).
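For reference, a nearest-rank percentile over collected latency samples can be computed like this (a sketch; the tool’s exact percentile convention may differ slightly):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over latency samples in milliseconds.
double percentile(std::vector<double> ms, double p) {
    std::sort(ms.begin(), ms.end());
    size_t rank = static_cast<size_t>(std::ceil(p / 100.0 * ms.size()));
    if (rank > 0) --rank;  // 1-based rank -> 0-based index
    return ms[rank];
}
```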
What it is: binding ONNX input / output tensors directly to your pre-allocated buffers.
Why it matters: eliminates per-frame allocations & copies, improving latency stability.
Best practice:
- Set `--bind_io 1`.
- Use fixed shapes with `--fixed_hw WxH` (rounded to /32).
- With tiling, each OpenMP worker gets its own binding context (no locks).

💡 Without `--fixed_hw`, the code probes once per new size (first call), binds, and then reuses the binding for that WxH in that worker.
- `--tiles RxC` splits the image into a grid and runs inference per tile.
- `--tile_overlap` avoids cutting words at tile borders.
- After stitching, polygon NMS removes duplicate boxes across tiles using IoU (typical `0.2–0.4`).
💡 For heavy servers: tiling scales extremely well with OpenMP (outer) threads. Keep ORT threads small.
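The tile-grid computation behind `--tiles` / `--tile_overlap` can be sketched as follows (a simplified illustration, not the exact code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Rect { int x, y, w, h; };

// Split a W×H image into an R×C grid whose tiles are expanded by
// `overlap` (fraction of tile size) on each side, clamped to the image,
// so words at tile borders appear fully in at least one tile.
std::vector<Rect> make_tiles(int W, int H, int R, int C, float overlap) {
    std::vector<Rect> tiles;
    const int tw = W / C, th = H / R;
    const int ox = static_cast<int>(tw * overlap);
    const int oy = static_cast<int>(th * overlap);
    for (int r = 0; r < R; ++r) {
        for (int c = 0; c < C; ++c) {
            int x0 = std::max(0, c * tw - ox);
            int y0 = std::max(0, r * th - oy);
            int x1 = std::min(W, (c + 1) * tw + ox);
            int y1 = std::min(H, (r + 1) * th + oy);
            tiles.push_back({x0, y0, x1 - x0, y1 - y0});
        }
    }
    return tiles;
}
```

Boxes found inside a tile are then offset by `(x, y)` back into image coordinates before the polygon NMS pass.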
- `onnxruntime_cxx_api.h: No such file or directory`
  Make sure ONNX Runtime is installed and its headers are visible to Meson (e.g., `/usr/local/include` on Linux, `/opt/homebrew/opt/onnxruntime/include` on macOS).
- `Unexpected output shape`
  This tool supports `[1,1,H,W]`, `[1,H,W,1]`, `[1,H,W]`, `[H,W]`. If your model differs, verify your export and the final layers. If outputs are logits (not in [0,1]), pass `--apply_sigmoid 1`.
- Performance flatlines when increasing threads
  Likely oversubscription. Lower `--threads` (ORT) to 1–2, increase `--tile_omp`, and pin threads: `--omp_places cores --omp_bind close`.
- Boxes are weak, or too many false positives
  Tune `--bin_thresh`, `--box_thresh`, `--unclip`. If the model lacks a final sigmoid, set `--apply_sigmoid 1`.
Q: Can I speed up by feeding grayscale instead of RGB?
Not unless the model itself is changed to accept `[1,1,H,W]`. Feeding one channel into `[1,3,H,W]` doesn’t reduce compute. Changing the first conv to 1-channel helps only a little overall, and accuracy may drop.
Q: How are coordinates printed?
Each detection line on stdout: x0,y0 x1,y1 x2,y2 x3,y3 (ordered clockwise).
Q: Does the tool support dynamic sizes?
Yes. The dynamic path uses `--side`. For best latency and zero re-binding, prefer `--fixed_hw WxH` with `--bind_io 1`.
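The multiple-of-32 input constraint can be met by rounding up, e.g. (a sketch; whether the tool rounds a given side up or down is an implementation detail):

```cpp
#include <cassert>

// Round a side length up to the nearest multiple of 32, as required by
// typical DB / PP-OCR detector inputs.
int round_to_32(int side) {
    return ((side + 31) / 32) * 32;
}
```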
- Optional AABB/connected-components fast postprocess mode
- Optional micro-batch tiling (pack multiple tiles into a single `N×C×H×W` run)
- Built-in accuracy eval (precision/recall/F1) against custom annotation formats
- ...
MIT.
This project uses OpenCV, OpenMP and ONNX Runtime. Model families supported include DBNet and PP-OCR det models exported to ONNX.
👾 Happy detecting! 👾
