JCB-OCR is an interactive document-understanding platform that mixes layout-recognition and content-recognition models within a single workflow. Built with Streamlit, it ships with the `local_vl_utils` toolkit—unified clients, prompt templates, and post-processing helpers for multiple VLM providers—so you can run hybrid OCR pipelines locally or against remote APIs with minimal effort.
- Hybrid necessity: A single OCR model rarely excels at layout analysis, table/formula understanding, and high-quality text extraction simultaneously. JCB-OCR offers an extensible architecture that stitches different capabilities together through a consistent interface.
- Rapid validation & comparison: Interactive test/comparison panels let you switch configurations, observe logs, and save multiple runs without leaving the UI, dramatically shortening verification cycles.
- Standardized outputs: Built-in converters normalize diverse model outputs into a single JSON schema (plus visual previews), reducing integration effort for downstream business systems.
- Decoupled model pairing: Combine Qwen-VL, DeepSeek-VL, Hunyuan-VL, MinerU, or custom clients independently for layout vs. content extraction.
- Unified JSON representation: `format_converter` plus post-processors transform raw outputs into a shared structure (blocks + metadata) ready for ingestion.
- Visual & comparative analysis: Inspect logs, JSON, and overlays in real time; leverage comparison tabs to contrast configurations on identical inputs.
- Batch & async pipelines: `local_vl_utils` wraps HTTP and vLLM clients with retry/async support, enabling large-scale processing or low-latency applications.
- Traceable logging: API requests, diagnostics, and downloadable reports are built in for troubleshooting and auditing.
- Interactive workbench: Upload images/PDFs, choose model combos instantly, watch logs, and export JSON or visualized results with a click.
- Recognition statistics dashboard: Automatically summarizes block counts, averages, and per-type distributions, with charts to monitor coverage and accuracy.
- Comparison mode: Preserve multiple runs, generate Spans PDFs, and view side-by-side metrics to identify the best-performing configuration quickly.
- Unified output adapters: Ensure every model’s layout/content data conforms to the same schema, simplifying downstream integration.
- Pluggable client system: Register custom OCR/VLM clients via `client_factory`, configured through `.env` or runtime parameters.
- Rich post-processing chain: Table reconstruction, equation fixes, handwriting handling, and more can be composed as needed.
- API/log monitoring: HTTP clients capture detailed request/response traces and store them under `logs/api_calls/` for postmortem analysis.
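The equation-fixing step mentioned above can be sketched as a couple of small normalizers. This is a minimal illustration of the idea, not the project's actual `post_process` API—all function names here are hypothetical:

```python
import re

# Hypothetical equation fixers in the spirit of the post-processing chain
# described above; the project's real module layout and names may differ.

_DOUBLE_SUB = re.compile(r"_([A-Za-z0-9])_([A-Za-z0-9])")
_LEFT = re.compile(r"\\left(?![A-Za-z])")    # \left but not \leftarrow
_RIGHT = re.compile(r"\\right(?![A-Za-z])")  # \right but not \rightarrow


def fix_double_subscripts(latex: str) -> str:
    """Brace a chained pair of single-character subscripts,
    turning invalid 'a_1_2' into valid 'a_{1_2}'."""
    return _DOUBLE_SUB.sub(r"_{\1_\2}", latex)


def balance_left_right(latex: str) -> str:
    """Pad mismatched \\left/\\right pairs with invisible '.' delimiters."""
    n_left = len(_LEFT.findall(latex))
    n_right = len(_RIGHT.findall(latex))
    if n_left > n_right:
        latex += r"\right." * (n_left - n_right)
    elif n_right > n_left:
        latex = r"\left." * (n_right - n_left) + latex
    return latex


def clean_equation(latex: str) -> str:
    """Compose the fixers, mirroring how a post-processing chain stacks up."""
    return balance_left_right(fix_double_subscripts(latex))
```

For example, `clean_equation(r"\left( \frac{x}{y}")` appends the missing `\right.` so the expression compiles.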
- Text: Paragraphs, headings, lists, etc., with precise coordinates.
- Tables: Detects structures, merged cells, and exports JSON/HTML.
- Images & seals: Distinguishes pictures, stamps, and seals for targeted post-processing.
- Equations: Multiple LaTeX fixers address double subscripts, left/right mismatches, etc.
- Handwriting: Dedicated block type for signatures and handwritten notes.
All types can be previewed visually or consumed via the standardized JSON output.
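To make the standardized output concrete, here is a rough sketch of what a list of blocks might look like, together with the kind of per-type summary a statistics dashboard computes. The field names (`type`, `bbox`, `content`) are assumptions for illustration, not the project's actual schema:

```python
from collections import Counter

# Hypothetical standardized blocks; field names are illustrative only and
# may differ from the project's real JSON schema.
blocks = [
    {"type": "text",     "bbox": [72, 90, 520, 130],  "content": "1. Introduction"},
    {"type": "table",    "bbox": [72, 160, 520, 380], "content": "<table>...</table>"},
    {"type": "equation", "bbox": [72, 400, 300, 440], "content": r"E = mc^2"},
    {"type": "text",     "bbox": [72, 460, 520, 500], "content": "As shown above..."},
]


def block_type_distribution(blocks):
    """Count blocks per type, similar to a per-type distribution chart."""
    return Counter(b["type"] for b in blocks)


print(block_type_distribution(blocks))
```

Because every model's output is normalized into one shape, downstream code like this summary never needs to know which model produced a given block.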
```
.
├─ app/                  # Streamlit interface
│  ├─ streamlit_app.py   # UI + workflow logic
│  └─ requirements.txt   # App dependencies
└─ local_vl_utils/       # Client toolkit & helpers
   ├─ base_client.py
   ├─ client_factory.py
   ├─ config_loader.py
   ├─ prompt_library.py
   ├─ vlm_client/        # HTTP/vLLM implementations
   ├─ post_process/      # Post-processing modules
   └─ examples/          # JSON conversion demos
```
- Install Python 3.10+ (virtualenv recommended).
- Install dependencies: `pip install -r app/requirements.txt`
- Configure model credentials in `vlm_client/.env` (or the project `.env`):

  ```ini
  QWEN_MODEL_NAME=Qwen2-VL-72B-Instruct
  QWEN_SERVER_URL=https://api.example.com
  QWEN_API_KEY=sk-xxx
  DEFAULT_TIMEOUT=600
  DEFAULT_MAX_RETRIES=3
  ```

  Repeat for DEEPSEEK/HUNYUAN/MINERU with their prefixes; each supports its own timeout/retry overrides.
- Ensure `.env` is populated and target endpoints are reachable.
- Launch Streamlit from the repo root: `streamlit run app/streamlit_app.py`
- In the UI:
  - Pick layout/content models from the sidebar.
  - Upload images or PDFs; enable “Comparison Mode” to store multiple runs.
  - Review logs, JSON output, and visual previews.
  - Generate Spans PDFs from the comparison tab when needed.
- Client registration: Plug in custom OCR/VLM backends via `register_custom_client`.
- Batch/async flows: `BaseOCRClient.aio_batch_predict` lets you script large-scale or concurrent jobs.
- Post-processing pipeline: Extend or mix modules under `post_process` for domain-specific needs.
- Logging: API call logs live in `logs/api_calls/`, and the UI exposes download buttons for audits.
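The batch/async pattern can be sketched with a self-contained mock client. `MockOCRClient` below stands in for a real `BaseOCRClient` subclass; the method signatures and return shape are assumptions for illustration, not the toolkit's actual API:

```python
import asyncio

# Mock client illustrating the fan-out/gather pattern behind batch
# prediction; a real BaseOCRClient subclass would issue HTTP or vLLM calls,
# and its actual signatures may differ.
class MockOCRClient:
    async def aio_predict(self, image_path: str) -> dict:
        await asyncio.sleep(0)  # placeholder for the real network round-trip
        return {"source": image_path, "blocks": []}

    async def aio_batch_predict(self, image_paths) -> list[dict]:
        # One request per page, awaited concurrently.
        return await asyncio.gather(
            *(self.aio_predict(p) for p in image_paths)
        )


results = asyncio.run(
    MockOCRClient().aio_batch_predict(["page1.png", "page2.png"])
)
```

Because the requests are awaited with `asyncio.gather`, total latency approaches that of the slowest single page rather than the sum of all pages.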
We welcome contributions! Before submitting:
- Follow the style guide: use `loguru` for logging, avoid stray prints, and document non-trivial code.
- Branch & commit cleanly: isolate each feature/fix and use descriptive messages (e.g., `feat: add mineru helper`).
- Validate changes: run `streamlit run app/streamlit_app.py` (or equivalent tests) and mention results in your PR.
- Sync dependencies: update the appropriate `requirements` file for any new libraries and ensure compatibility.
Open an Issue to discuss ideas or jump straight into a Pull Request—let’s build a robust document-understanding framework together.
This project uses the JCB-OCR Attribution License 1.0 (LICENSE). Any use, modification, or redistribution must retain the copyright notice and include a visible attribution such as “Derived from JCB-OCR” with a link to the repository.
For commercial collaborations or additional licensing terms, please contact the maintainers.





