Production-grade ASP.NET Core middleware for AI safety — a zero-allocation YARP reverse proxy with local ONNX inference for real-time LLM payload inspection, prompt injection detection, and semantic sanitization.
TensorGate is an out-of-process containerized sidecar that intercepts, evaluates, and sanitizes Large Language Model (LLM) traffic in real time. It sits between your application and upstream LLM providers as a YARP reverse proxy, running local INT8-quantized ONNX classification models to detect prompt injections and adversarial payloads within a strict sub-50ms latency budget on pure CPU hardware.
- Zero-Allocation Pipeline: From raw HTTP bytes to ONNX tensor evaluation, the hot path avoids managed heap allocations using `Span<T>`, `ArrayPool<T>`, and `Utf8JsonReader`/`Utf8JsonWriter` to eliminate GC pauses under high concurrency.
- CPU-Only Inference: INT8 statically quantized `all-MiniLM-L6-v2` achieves 8–12ms classification latency via AVX-512 VNNI, fitting entirely within L3 cache (~23 MB).
- SSE Stream Preservation: Transparent forwarding of `text/event-stream` responses without buffering, maintaining real-time token streaming from upstream providers.
- Lock-Free Hot Reload: Atomic reference-counted model swapping via the `RefCountDisposable` pattern enables zero-downtime weight updates without race conditions or access violations.
- NIST AI RMF Alignment: Architecture maps directly to the Govern, Map, Measure, and Manage pillars of NIST AI 600-1.
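The lock-free hot reload described above can be sketched with a reference count guarded by a compare-and-swap loop. This is a minimal illustration under stated assumptions, not TensorGate's implementation: `RefCounted<T>` and `ModelSlot<T>` are hypothetical names, and `T` stands in for whatever disposable resource holds the model weights (for example, an inference session).

```csharp
using System;
using System.Threading;

// Hypothetical helper: wraps a disposable value with an atomic reference count.
public sealed class RefCounted<T> where T : class, IDisposable
{
    private readonly T _value;
    private int _count = 1; // starts at 1: the reference owned by the slot

    public RefCounted(T value) => _value = value;

    public T Value => _value;

    // Adds a reference unless the count already reached zero (value disposed).
    public bool TryAddRef()
    {
        while (true)
        {
            int current = Volatile.Read(ref _count);
            if (current == 0) return false;
            if (Interlocked.CompareExchange(ref _count, current + 1, current) == current)
                return true;
        }
    }

    // Drops one reference; whoever brings the count to zero disposes the value.
    public void Release()
    {
        if (Interlocked.Decrement(ref _count) == 0) _value.Dispose();
    }
}

// Hypothetical holder for the currently published model.
public sealed class ModelSlot<T> where T : class, IDisposable
{
    private RefCounted<T> _current;

    public ModelSlot(T initial) => _current = new RefCounted<T>(initial);

    // Leases the current model. If a swap retired the snapshot between the
    // read and the AddRef, the loop retries against the newly published one.
    public RefCounted<T> Acquire()
    {
        while (true)
        {
            RefCounted<T> snapshot = Volatile.Read(ref _current);
            if (snapshot.TryAddRef()) return snapshot;
        }
    }

    // Publishes a new model and releases the slot's reference to the old one.
    // The old model is disposed only after its last in-flight lease is released.
    public void Swap(T next)
    {
        RefCounted<T> old = Interlocked.Exchange(ref _current, new RefCounted<T>(next));
        old.Release();
    }
}
```

A request would call `Acquire()`, run inference against `Value`, and call `Release()` in a `finally` block, so a concurrent `Swap` never disposes a model that is still in use.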
┌─────────────┐     ┌──────────────────────────────────────────┐     ┌──────────────┐
│ Application │────▶│            TensorGate Sidecar            │────▶│ LLM Provider │
│ (Internal)  │◀────│                                          │◀────│  (Upstream)  │
└─────────────┘     │  ┌────────┐  ┌───────────┐  ┌─────────┐  │     └──────────────┘
                    │  │  YARP  │─▶│ Tokenizer │─▶│  ONNX   │  │
                    │  │ Proxy  │  │  (Zero-   │  │ Runtime │  │
                    │  │        │  │  Alloc)   │  │ (INT8)  │  │
                    │  └────────┘  └───────────┘  └─────────┘  │
                    └──────────────────────────────────────────┘
- Network Interception: YARP captures outbound LLM API traffic via `AddRequestTransform`
- Zero-Alloc JSON Parsing: `Utf8JsonReader` state machine extracts prompt fields directly from the byte stream
- Tokenization: `Microsoft.ML.Tokenizers` (BertTokenizer/WordPiece) encodes over `ReadOnlySpan<char>` without intermediate string allocations
- Tensor Binding: `ArrayPool<long>` leased buffers are pinned and bound to `OrtValue.CreateTensorValueFromMemory`
- Classification: Single forward pass through INT8 MiniLM yields Safe/Malicious probability in 8–12ms
- Decision Gate: Malicious payloads are blocked synchronously; safe payloads stream through unmodified
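The JSON parsing step above can be illustrated with a small `Utf8JsonReader` loop that copies a prompt string out of a UTF-8 payload without creating a `string`. This is a sketch under assumptions: the top-level property name `prompt` and the `PromptExtractor` helper are hypothetical, and a real OpenAI-compatible payload may carry prompts in other fields.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not part of TensorGate.
public static class PromptExtractor
{
    // Copies the unescaped UTF-8 value of the top-level "prompt" property
    // into destination. Returns the number of bytes written, or -1 when the
    // property is missing or is not a JSON string. CopyString throws an
    // ArgumentException if destination is too small for the value.
    public static int TryCopyPrompt(ReadOnlySpan<byte> utf8Json, Span<byte> destination)
    {
        var reader = new Utf8JsonReader(utf8Json);
        while (reader.Read())
        {
            // Depth 1 means "directly inside the root object", so nested
            // objects that happen to contain a "prompt" key are ignored.
            if (reader.TokenType == JsonTokenType.PropertyName
                && reader.CurrentDepth == 1
                && reader.ValueTextEquals("prompt"u8))
            {
                if (!reader.Read() || reader.TokenType != JsonTokenType.String)
                    return -1;
                return reader.CopyString(destination);
            }
        }
        return -1;
    }
}
```

In a proxy, the destination would typically be a buffer rented from `ArrayPool<byte>` and sized to the request body, since an unescaped JSON string is never longer than its escaped form.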
| Layer | Technology | Purpose |
|---|---|---|
| Reverse Proxy | YARP | Traffic interception and SSE stream forwarding |
| JSON Processing | `Utf8JsonReader` / `Utf8JsonWriter` | Zero-allocation payload parsing |
| Tokenization | Microsoft.ML.Tokenizers | Allocation-free BPE/WordPiece encoding |
| Inference | ONNX Runtime | INT8 quantized CPU inference |
| Model | all-MiniLM-L6-v2 | Sequence classification (22.7M params) |
| Concurrency | `Interlocked` / `Volatile` / CAS loops | Lock-free reference counting |
| Validation | HarmBench | Adversarial red-team evaluation |
| Metric | Target | Mechanism |
|---|---|---|
| End-to-end latency | < 50ms | INT8 quantization + AVX-512 VNNI |
| Inference latency | 8–12ms | Static quantization, L3 cache residency |
| Heap allocations | 0 bytes on hot path | Span<T>, ArrayPool<T>, Utf8JsonReader |
| Model memory | ~23 MB | INT8 weight compression |
| Model hot-reload | Zero downtime | Atomic RefCountDisposable double buffering |
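One way to check the "0 bytes on hot path" target is to measure per-thread managed allocations around a pooled-buffer routine. The sketch below is illustrative only: `HotPathProbe`, `PadAndSum`, and `MeasureAllocatedBytes` are hypothetical names, and the routine only pads token IDs into a rented `long[]` as a stand-in for tensor binding; it makes no ONNX Runtime call.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not part of TensorGate.
public static class HotPathProbe
{
    // Copies token ids into a pooled buffer, zero-padded (or truncated) to
    // maxLength, and returns their sum so the work has an observable result.
    public static long PadAndSum(ReadOnlySpan<int> tokenIds, int maxLength)
    {
        // Rent may return a larger array, so only the first maxLength slots are used.
        long[] rented = ArrayPool<long>.Shared.Rent(maxLength);
        try
        {
            Span<long> input = rented.AsSpan(0, maxLength);
            input.Clear(); // pooled arrays can contain stale data

            int count = Math.Min(tokenIds.Length, maxLength);
            for (int i = 0; i < count; i++) input[i] = tokenIds[i];

            long sum = 0;
            foreach (long id in input) sum += id;
            return sum;
        }
        finally
        {
            ArrayPool<long>.Shared.Return(rented);
        }
    }

    // Returns the managed bytes allocated on the current thread while
    // running action once.
    public static long MeasureAllocatedBytes(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

Calling the routine once before measuring lets the pool allocate its first array, so the measured call reflects steady-state behavior rather than warm-up cost. A benchmark harness with a memory diagnoser would give more rigorous numbers than this single-shot probe.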
Sprint 1 scaffolding is in progress: the solution builds, YARP proxies /v1/* to a
configurable OpenAI-compatible upstream, and /health is exposed for orchestration probes.
cd ~/TensorGate
./scripts/setup-local-dev.sh
./scripts/smoke-yarp.sh # mock upstream on :9090, proxy on :8080
dotnet run --project src/TensorGate.Proxy

This project is under active development following a structured sprint cadence:
| Sprint | Focus | Duration |
|---|---|---|
| Sprint 1 | Foundational Scaffolding & Proxy Mechanics | Days 1–14 |
| Sprint 2 | Memory Optimization & Inference Engines | Days 15–28 |
| Sprint 3 | Concurrency, Hot-Swapping & Validation | Days 29–42 |
Track progress on the TensorGate Project Board.
Prerequisites: .NET 10.0 SDK (LTS), Docker (optional for sidecar deployment)
Language policy: TensorGate tracks the latest stable C# language version via central build settings.
# Clone the repository
git clone https://github.com/TensorGateLabs/TensorGate.git
cd TensorGate
# Build
dotnet build
# Run tests
dotnet test
# Run the sidecar
dotnet run --project src/TensorGate.Proxy

Contributions are welcome. Please read the Contributing Guidelines before submitting a pull request.
Engineering runbooks in this repository:
- Operating Pipeline — issue/PR flow and quality gates
- Phase 2 Intelligent Orchestration — issue/PR workflow design
- Agentic Shipping Research and Playbook — local-first shipping practices
- Phase 2.3 Traceability and Evidence — requirement-to-merge automation
- Agentic Development Research 2026 — maturity recommendations
- Workflow Cost Optimization Policy — CI minute policy
- Repository Organization — org board and automation
This project is licensed under the MIT License — see the LICENSE file for details.