Native Rust Inference Engine for GLiNER2
gliner2-rs is a high-performance, Zero-Python inference engine designed to execute GLiNER2 models using ONNX Runtime. It allows for extracting Named Entities (NER), Relations, and Global Classifications natively in Rust with maximum speed, supporting both CPU and NVIDIA GPU (CUDA) via hardware-accelerated Tensor operations.
This crate completely replicates the advanced sub-word tokenization and prompt-generation logic of GLiNER2's processor.py internally, using the official tokenizers crate for zero-overhead BPE tokenization.
Copyright 2026 Dario Finardi, Semplifica s.r.l.
Licensed under Apache License 2.0
- Zero-Copy PCIe bypass: Replaces CPU manipulations with `Gather`, `ArgMax`, and `MatMul` operations fused directly into the ONNX graphs. Data now stays inside GPU/NPU VRAM, speeding up performance by ~30% (currently tested on NVIDIA RTX GPUs and AMD Ryzen CPUs).
- Automatic Engine Facade: `Gliner2Engine` acts as an intelligent wrapper. It detects whether the model folder contains V1 or V2 files, automatically switching to the optimal execution pipeline. No code changes are required to use V2!
- Smart HF Downloader: `Gliner2Engine::from_pretrained` now detects your OS. On CUDA/ROCm platforms it downloads the `_iobinding` variants, while on macOS (Apple Silicon/CoreML) it safely downloads the standard `_fp16` fallback. This halves bandwidth and disk usage!
- New V2 ONNX Exporter: We provide `export_gliner2_onnx_fragments_v2.py` which automatically generates `fp32`, `fp16`, and `fp16_iobinding` (Full IO Types) variants of the fusions.
- End-to-End Execution: Full recreation of the GLiNER2 inference loop natively in Rust.
- Multi-Task Extraction: Supports Entity Extraction, Relation Extraction, and Text Classifications in a single forward pass.
- Hardware Accelerated: Dynamically selects the QNN (Qualcomm NPU), CoreML (Apple Silicon), OpenVINO (Intel/AMD), or CUDA execution provider (when an NVIDIA GPU is available), falling back to optimized XNNPACK/CPU execution.
- FP16 & FP32 Support: Fully compatible with Half-Precision (Float16) ONNX exports to cut memory footprints in half.
- Zero-Copy Tensor Flow: Direct injection of raw hidden states across multiple neural network slices without CPU-GPU memory swaps.
- Built-in NMS: Automatic Non-Maximum Suppression (NMS) removes overlapping fictitious entities based on their probabilities.
Tested on complex text extraction tasks spanning up to 62 classes. Total Inference Time per Sentence is the primary metric, because it allows a fair comparison across frameworks, devices, and languages.
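A minimal sketch of how a per-sentence average of this kind could be measured. The helper `avg_ms_per_sentence` and the `run_once` closure are hypothetical names used for illustration; `run_once` stands in for a single extraction call and is not part of the crate.

```rust
use std::time::{Duration, Instant};

/// Calls `run_once` on every sentence, `runs` times over, and returns
/// (total elapsed time, average milliseconds per sentence).
/// `run_once` is a hypothetical stand-in for one inference call.
fn avg_ms_per_sentence<F: FnMut(&str)>(
    sentences: &[&str],
    runs: usize,
    mut run_once: F,
) -> (Duration, f64) {
    let start = Instant::now();
    for _ in 0..runs {
        for &s in sentences {
            run_once(s);
        }
    }
    let total = start.elapsed();
    // Guard against division by zero when there is nothing to run.
    let calls = (runs * sentences.len()).max(1) as f64;
    (total, total.as_secs_f64() * 1000.0 / calls)
}
```

Timing the whole loop rather than each call keeps the timer overhead out of the per-sentence figure.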
Comparison of a 50-run continuous benchmark on x86_64 architecture with NVIDIA GPUs.
| Language | Engine (Hardware) | Total Time (50 runs) | Avg Time / Sentence | Avg Time / Entity (15-17) |
|---|---|---|---|---|
| Python 3.10 | PyTorch (RTX 4090) | ~0.88 s | 4.40 ms | 1.17 ms |
| Python 3.10 | PyTorch (RTX 3090) | ~0.90 s | 4.52 ms | 1.20 ms |
| Rust (V1) | ONNX Runtime CUDA (RTX 4090) | ~8.18 s | 40.90 ms | 10.90 ms |
| Rust (V2) (\*) | ONNX Runtime CUDA (RTX 4090) | ~5.91 s ⚡ | 29.59 ms | 6.96 ms |
| Rust (V1) | ONNX Runtime CUDA (RTX 3090) | ~8.59 s | 42.97 ms | 11.45 ms |
| Rust (V2) (\*) | ONNX Runtime CUDA (RTX 3090) | ~6.13 s ⚡ | 30.68 ms | 7.21 ms |
| Python 3.10 | PyTorch (Ryzen 5900XT CPU) | ~7.26 s | 36.33 ms | 9.68 ms |
| Rust (V1) | ONNX Runtime (Ryzen 5900XT CPU) | ~13.75 s | 68.76 ms | 18.33 ms |
(\*) V2 IOBinding Engine: The new V2 implementation eliminates the PCIe bottleneck by fusing operations (`Gather`, `ArgMax`, `MatMul`) inside the ONNX graph and keeping tensors entirely in VRAM (Zero-Copy) using ORT's `IoBinding`. On the RTX 4090 this lowers the average time per sentence from 40.90 ms to 29.59 ms.
Understanding the GPU Gap: Why is PyTorch still faster than V2? While V2 IOBinding successfully eliminates the PCIe data transfer bottleneck (tensors now stay in VRAM), Python/PyTorch remains ~6x faster on discrete GPUs. This is due to the Fragmentation Penalty:
- Kernel Launch & Orchestration Overhead: Because GLiNER2's architecture relies on dynamic loops (e.g. iterating over an unknown number of schema tasks and varying predicted entity counts), it cannot be exported as a single monolithic ONNX graph. It must be split into 8 separate ONNX sessions. The Rust host CPU must orchestrate the execution of these 8 fragments sequentially. Even though the data stays in VRAM, the control flow (calling `.run()` multiple times per sentence) incurs severe CUDA kernel launch overhead and forces continuous CPU-GPU synchronization.
- Lack of Global Graph Fusion: PyTorch executes the entire model inside a single unified context, allowing its backend to fuse kernels across the entire architecture. ONNX Runtime can only optimize and fuse operations within the hard boundaries of each individual fragment.
- Dynamic Shapes: ONNX Runtime achieves peak performance (e.g., via TensorRT) with static shapes. GLiNER2 is highly dynamic (varying sequence lengths, changing number of entities), which prevents ORT from locking in optimal execution paths, a scenario where PyTorch's native dynamic execution naturally excels.
Conclusion: Rust ONNX V2 represents the upper limit of optimization for a fragmented pipeline. While PyTorch wins on raw continuous throughput on discrete GPUs, Rust ONNX completely dominates PyTorch in Cold Start scenarios (loading in ~2s vs ~10s) and is the absolute winner for Unified Memory Architectures (Apple Silicon / ARM Snapdragon NPU) and edge deployments.
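The relative figures quoted in this section can be rechecked from the per-sentence averages in the table. The two helpers below are an illustrative sketch; their names are not part of the crate.

```rust
/// Relative latency reduction, in percent, when going from
/// `before_ms` to `after_ms`.
fn reduction_pct(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

/// How many times slower `slow_ms` is compared with `fast_ms`.
fn slowdown_factor(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}
```

With the RTX 4090 rows (V1 40.90 ms, V2 29.59 ms, PyTorch 4.40 ms), `reduction_pct(40.90, 29.59)` gives about 27.7% and `slowdown_factor(29.59, 4.40)` gives about 6.7, which matches the rounded ~30% and ~6x figures.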
Comparison between native Rust ONNX execution and standard Python PyTorch inference on the same ARM hardware. Note: Benchmarks were executed while plugged in (Max Performance profile). The test case is the extraction of 51 target entities.
| Environment | Hardware (Backend) | Precision (Model) | Startup Time | Total Inference Time (Sentence) | Time / Entity |
|---|---|---|---|---|---|
| Rust (V1) | CPU ARM64 (Oryon) | fp32 | ~3.64 s | 0.43 s | ~8.53 ms |
| Rust (V2) (\*) | NPU (QNN) | fp16_v2 | ~2.28 s | 0.65 s ✨ | ~12.88 ms |
| Rust (V2) (\*) | CPU ARM64 (Oryon) | fp16_v2 | ~1.96 s ⚡ | 0.66 s | ~13.10 ms |
| Rust (V1) | CPU ARM64 (Oryon) | fp16 | ~1.82 s | 0.68 s | ~13.43 ms |
| Rust (V1) | NPU (QNN) | fp16 | ~2.12 s | 0.71 s | ~14.11 ms |
| Python 3.12 | CPU ARM64 (PyTorch) | SemplificaAI/gliner2-multi-v1 | ~12.74 s 🐢 | 0.31 s | ~15.03 ms |
| Python 3.12 | CPU ARM64 (PyTorch) | fastino/gliner2-multi-v1 | ~8.76 s 🐢 | 0.36 s | ~24.51 ms |
Takeaways:
- The FP32 Surprise: Instructing the Rust ONNX runtime to load full FP32 precision models allows the Snapdragon ARM64 Oryon CPU to skip expensive hardware/software downcasting. It slashes inference time to 0.43s per sentence, completely crushing FP16 times and heavily outperforming the limited NPU drivers.
- V2 IOBinding is Consistent: At matched FP16 precision, the fused V2 consistently beats the standard V1 architecture, both on CPU and NPU.
- Rust = Cold Start Speed & Reproducibility: Rust boots in ~1.8-3.6s (depending on precision) and flawlessly extracts the exact overlapping entities without implicit filtering. Python struggles for ~9-12s just to load tensors and forces unrequested NMS flat_ner filtering which artificially alters the output count.
- For this project, use `ort = 2.0.0-rc.9`.
- Newer release candidates tested during migration (`2.0.0-rc.11` / `2.0.0-rc.12`) can hang during session initialization or inference on our target environments.
- Keep the dependency pinned until upstream stability is confirmed.
Example (`rust_component/Cargo.toml`):

    ort = { version = "=2.0.0-rc.9", features = ["load-dynamic", "qnn", "cuda", "rocm", "coreml", "openvino", "directml", "tensorrt", "xnnpack", "half"] }

- Clone this repository or add `gliner2-rs` to your `Cargo.toml`.
- Ensure you have the `onnxruntime` C/C++ libraries available on your system path.
- Export the GLiNER2 models to ONNX fragmented versions.
Because of GLiNER2's dynamic architecture (which cycles dynamically over a sequence of JSON prompts rather than acting as a static FeedForward layer), the PyTorch model must be exported into a fragmented pipeline. We provide two architectures:
V2 (IOBinding): Fuses data manipulation operations directly into the ONNX graph. Tensors stay inside the GPU/NPU VRAM, yielding a ~30% performance boost. Generates 8 ONNX files plus the tokenizer:

- `encoder...`
- `token_gather...`
- `span_rep...`
- `schema_gather...`
- `count_pred_argmax...`
- `count_lstm_fixed...`
- `scorer...`
- `classifier...`
- `tokenizer.json`

(Export script: `onnx_conversion_scripts/export_gliner2_onnx_fragments_v2.py`)

V1 (standard): Standard PyTorch export into 5 ONNX files plus the tokenizer. Slower on discrete GPUs due to PCIe transfers, but completely stable on older hardware.

- `encoder...`
- `span_rep...`
- `count_pred...`
- `count_lstm...`
- `classifier...`
- `tokenizer.json`

(Export script: `onnx_conversion_scripts/export_gliner2_onnx.py`)
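The crate's actual V1/V2 detection logic is not shown in this document. The following is only a sketch of one way such a check could work, assuming that a V2 folder can be recognized by a file whose name starts with one of the fragment prefixes that appear only in the V2 list above (`token_gather`, `schema_gather`, `scorer`). The `Pipeline` enum and `detect_pipeline` function are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical pipeline tag; not the crate's actual type.
#[derive(Debug, PartialEq)]
enum Pipeline {
    V1,
    V2,
}

/// Returns `Pipeline::V2` if any file in `models_dir` starts with a
/// V2-only fragment prefix, otherwise `Pipeline::V1`.
/// Assumption: these three prefixes never occur in a V1 export.
fn detect_pipeline(models_dir: &Path) -> io::Result<Pipeline> {
    const V2_ONLY_PREFIXES: [&str; 3] = ["token_gather", "schema_gather", "scorer"];
    for entry in fs::read_dir(models_dir)? {
        let file_name = entry?.file_name();
        let name = file_name.to_string_lossy();
        if V2_ONLY_PREFIXES.iter().any(|p| name.starts_with(*p)) {
            return Ok(Pipeline::V2);
        }
    }
    Ok(Pipeline::V1)
}
```

Matching on prefixes rather than full names keeps the sketch independent of the precision suffix in each file name.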
When downloading a model via `Gliner2Engine::from_pretrained("SemplificaAI/gliner2-multi-v1-onnx", Some("fp16_v2"), ...)`, the Rust engine uses an OS-Aware Smart Downloader to fetch only the optimal variant:

- Windows/Linux: Downloads the `_fp16_iobinding.onnx` variants to maximize CUDA/ROCm/TensorRT performance.
- macOS/iOS: Automatically falls back to standard `_fp16.onnx` to ensure compatibility with Apple CoreML.
This mechanism cuts bandwidth and disk usage by ~50% while delivering the best possible performance out of the box!
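The selection rule described above can be sketched as a small pure function. This is an illustration, not the crate's implementation: the function names are hypothetical, and treating every OS other than Linux/Windows as a standard `_fp16` target is an assumption. The `GLINER2_NO_IOBINDING` variable is the override mentioned in this document's release notes; accepting only the exact value `1` is also an assumption.

```rust
/// Picks the ONNX file suffix for an FP16 V2 download.
/// `os` is expected to be a value of `std::env::consts::OS`.
/// Assumption: any OS other than Linux/Windows gets the standard files.
fn fp16_variant_suffix(os: &str, no_iobinding: bool) -> &'static str {
    match os {
        "linux" | "windows" if !no_iobinding => "_fp16_iobinding.onnx",
        _ => "_fp16.onnx",
    }
}

/// Reads the manual override from the environment.
/// Assumption: only the exact value "1" enables it.
fn iobinding_disabled() -> bool {
    std::env::var("GLINER2_NO_IOBINDING")
        .map(|v| v == "1")
        .unwrap_or(false)
}
```

A caller would combine the two as `fp16_variant_suffix(std::env::consts::OS, iobinding_disabled())`.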
    use gliner2_inference::{Gliner2Engine, Gliner2Config, SchemaTask, ModelType};

    fn main() -> anyhow::Result<()> {
        // Initialize ONNX Runtime environment (automatically binds to available NPUs/GPUs)
        ort::init().with_name("GLiNER2_Engine").commit()?;

        // Configure engine
        let config = Gliner2Config {
            models_dir: "./models/fastino_gliner2_multi_v1_fp16".to_string(),
            max_width: 8, // Maximum tokens per span
            model_type: ModelType::HuggingFace, // Automatically routes tensors correctly
        };

        // Load and build session
        let engine = Gliner2Engine::new(config)?;

        let text = "Mario Rossi works at Apple in Cupertino.";

        // Create schema tasks dynamically
        let tasks = vec![
            SchemaTask::Entities(vec![
                "person".to_string(),
                "organization".to_string(),
                "location".to_string()
            ]),
            SchemaTask::Relations("works_at".to_string(), vec![
                "head".to_string(),
                "tail".to_string()
            ]),
            SchemaTask::Classifications("sentiment".to_string(), vec![
                "positive".to_string(),
                "negative".to_string()
            ])
        ];

        // Extract features
        let (entities, relations, classifications) = engine.extract(text, &tasks)?;

        for entity in entities {
            println!("Found: {} (Label: {} - Score: {:.2}%)", entity.text, entity.label, entity.score * 100.0);
        }

        Ok(())
    }

- Target model: `SemplificaAI/gliner2-privacy-filter-PII-multi`
- For this repository, the model is served as ONNX V2 fragments under `fp16_v2` / `fp32_v2`.
- To load from HuggingFace with this crate, use:

      let engine = Gliner2Engine::from_pretrained(
          "SemplificaAI/gliner2-privacy-filter-PII-multi",
          Some("fp16_v2"),
          ModelType::HuggingFace,
      )?;

- A dedicated gate example is available in `rust_component/examples/test_pii_anonymization_gate.rs`. It emits `needs_anonymization` and a `redacted_text` generated from detected PII spans.
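The gate example itself is not reproduced here. As a rough sketch of the idea, the function below turns a list of detected spans into a `needs_anonymization` flag and a `redacted_text`. The `PiiSpan` struct, the byte-offset convention, and the `[LABEL]` placeholder format are all assumptions made for illustration and may differ from what the real example does.

```rust
/// Hypothetical PII span: byte offsets into the original text plus a label.
struct PiiSpan {
    start: usize,
    end: usize, // exclusive
    label: String,
}

/// Replaces each span with `[LABEL]` and returns
/// (needs_anonymization, redacted_text).
/// Spans that are malformed, fall off a UTF-8 character boundary, or
/// overlap an already redacted span are skipped.
fn redact(text: &str, spans: &[PiiSpan]) -> (bool, String) {
    let mut sorted: Vec<&PiiSpan> = spans.iter().collect();
    sorted.sort_by_key(|s| s.start);

    let mut out = String::with_capacity(text.len());
    let mut cursor = 0;
    let mut redacted = false;
    for s in sorted {
        let valid = s.start < s.end
            && s.start >= cursor
            && text.is_char_boundary(s.start)
            && text.is_char_boundary(s.end);
        if !valid {
            continue;
        }
        out.push_str(&text[cursor..s.start]);
        out.push('[');
        out.push_str(&s.label.to_uppercase());
        out.push(']');
        cursor = s.end;
        redacted = true;
    }
    out.push_str(&text[cursor..]);
    (redacted, out)
}
```

Sorting by start offset lets the text be rebuilt in a single left-to-right pass.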
- Type: `ModelType::HuggingFace`
- Source: `fastino/gliner2-multi-v1` from HuggingFace
- Usage: Free for testing and development
- Performance: Good baseline, trained on general data
- Type: `ModelType::PyTorch`
- Access: Proprietary fine-tuned weights
- Performance: Superior accuracy on domain-specific entities
Licensed under the Apache License, Version 2.0.
This project was developed by Dario Finardi at Semplifica s.r.l.
Introduced the InferenceParams struct to the extract() function, allowing per-request control over inference behavior without rebuilding the engine:
- `threshold`: Controls the confidence score threshold (default `0.5`).
- `flat_ner`: When `false` (default), overlapping entities with different labels are allowed (e.g. "Apple Inc." as `organization` and "Apple" as `company`). When `true`, strict greedy NMS removes any overlap, regardless of label.
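The two parameters can be illustrated with a small greedy NMS sketch. It is not the crate's code: the `Candidate` struct and `greedy_nms` function are hypothetical, and the behavior for `flat_ner = false` (overlaps are still removed when both candidates share the same label) is an assumption inferred from the description above.

```rust
/// Hypothetical candidate entity: inclusive token span, label and score.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    start: usize,
    end: usize, // inclusive
    label: String,
    score: f32,
}

/// Greedy NMS sketch: visit candidates in descending score order and keep
/// each one unless it conflicts with a candidate that was already kept.
/// `flat_ner = true`: any overlap is a conflict.
/// `flat_ner = false`: only same-label overlaps are conflicts (assumption).
fn greedy_nms(mut cands: Vec<Candidate>, threshold: f32, flat_ner: bool) -> Vec<Candidate> {
    cands.retain(|c| c.score >= threshold);
    cands.sort_by(|a, b| b.score.total_cmp(&a.score));

    let mut kept: Vec<Candidate> = Vec::new();
    for c in cands {
        let conflict = kept.iter().any(|k| {
            let overlap = c.start <= k.end && k.start <= c.end;
            overlap && (flat_ner || k.label == c.label)
        });
        if !conflict {
            kept.push(c);
        }
    }
    kept
}
```

With "Apple Inc." as `organization` (score 0.9) and "Apple" as `company` (score 0.8), both survive when `flat_ner` is `false`, and only the higher-scoring `organization` span survives when it is `true`.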
You may notice that max_width (the maximum length of an entity in tokens) is not part of InferenceParams but remains in Gliner2Config at engine initialization.
Why isn't it dynamic? In the high-performance V2 IOBinding architecture, the span representation layer is fused directly into the ONNX computational graph. During export, the dimension for max_width is hard-baked into the model tensors (e.g., [batch, num_words, 8, hidden_size]). Changing max_width at runtime in V2 would cause an immediate ONNX shape mismatch error. Thus, it remains a structural configuration parameter.
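To make the role of `max_width` concrete, the sketch below enumerates candidate spans for a sentence and compares them with the fixed grid of `num_words * max_width` slots implied by a shape such as `[batch, num_words, 8, hidden_size]`. The function names are illustrative, and how the exported model treats slots that run past the end of the text is not stated in this document.

```rust
/// Enumerates candidate spans as inclusive (start, end) word indices,
/// keeping only spans of at most `max_width` words that fit in the text.
fn enumerate_spans(num_words: usize, max_width: usize) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    for start in 0..num_words {
        for width in 1..=max_width {
            let end = start + width - 1;
            if end >= num_words {
                break; // span would run past the last word
            }
            spans.push((start, end));
        }
    }
    spans
}

/// Number of slots in a fixed [num_words, max_width] grid, including
/// the slots that run past the end of the text.
fn span_slots(num_words: usize, max_width: usize) -> usize {
    num_words * max_width
}
```

For 3 words and `max_width = 2` this yields 5 valid spans in a grid of 6 slots. Because the grid width is part of the tensor shape, a different `max_width` changes the shape itself rather than only the number of valid entries.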
- OS-Aware Model Downloader: `from_pretrained` logic has been heavily optimized. It now parses `std::env::consts::OS` to selectively download only the `_fp16_iobinding` variants for Linux/Windows (CUDA/ROCm) and standard `_fp16` for macOS (CoreML). This drops the V2 download size from 1.2GB to ~600MB.
- Manual IOBinding Override: Introduced `GLINER2_NO_IOBINDING=1` environment variable to force fallback to standard FP16 execution even on supported hardware.
- Hugging Face Model Card: Generated the optimal `README_HF.md` to properly showcase the V2 capabilities on the Hub.
- Automated V2 Uploads: Included `upload_v2_to_hf.py` inside `onnx_conversion_scripts` to streamline uploading the double V2 variants (`fp16_v2` and `fp32_v2`) to the Hugging Face ecosystem.
- Performance: Up to 30% reduction in inference latency (currently tested and verified on NVIDIA RTX GPUs and AMD Ryzen CPUs).
- ONNX Graph Fusion: Ported previously CPU-bound operations (`Gather` for Token/Schema representations, `ArgMax` for prediction counts, and `MatMul` replacing Einsum for the Scorer) directly into the ONNX session.
- IOBinding Bypass: Data now remains fully encapsulated within the VRAM buffer, avoiding expensive PCIe bus transactions.
- Facade Auto-detect: Built an intelligent `Gliner2Engine` wrapper to automatically detect whether to use V1 CPU-slicing logic or V2 IOBinding without breaking changes to the consumer code.
- Advanced Multitask Extraction: Expanded `test_hf_download.rs` to demonstrate concurrent extraction of Entities, Relations, and Classifications (Sentiment/Topic).
- Relations Schema Fix: Corrected the relations schema mapping to properly use `head` and `tail` node identifiers.
- Internationalization: Translated remaining Italian logs and comments to English for broader accessibility.
- HuggingFace Hub Auto-Download: Added `Gliner2Engine::from_pretrained()` to dynamically download ONNX models (FP16/FP32) directly from HuggingFace via the official `hf-hub` crate.
- Download Stats Tracking: Native API calls inject the required `User-Agent` HTTP headers (`<library_name>/<version>; rust/unknown; <os_name>/unknown`), respecting HuggingFace's model download statistics policies.
- Dynamic Execution Lengths (CountLSTM): Replaced `CompileSafeGRU` loop unrolling in PyTorch with a fully dynamic native `nn.GRU` during ONNX export. The `Gather` out-of-bounds error on variable-length texts is now permanently resolved!
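The `User-Agent` layout quoted above can be assembled with a one-line formatter. This is only an illustration of the string format; the function name is hypothetical and the way the crate actually attaches the header is not shown in this document.

```rust
/// Builds a User-Agent string in the layout
/// `<library_name>/<version>; rust/unknown; <os_name>/unknown`.
fn user_agent(library_name: &str, version: &str, os_name: &str) -> String {
    format!("{library_name}/{version}; rust/unknown; {os_name}/unknown")
}
```

The OS component could, for example, be taken from `std::env::consts::OS`.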
- Removed obsolete dependencies and hardcoded references to the old `lmo3` checkpoints.
- Removed arbitrary length caps and fixed Python export logic for sequence counts, avoiding invalid loop unrolling.
- Optimized and refactored standard examples (`test_simple.rs`, `run_inference.rs`) and added `test_hf_download.rs`.
- Initial functional release supporting basic PyTorch-converted fragments with local paths.