LLM Transport & Efficiency Layer
Make every LLM call cheaper, faster, and safer, without changing your model.
LATTICE sits between your application and any LLM provider. It compresses prompts, caches responses, manages concurrency (TACC), supports a native binary protocol, and routes coding agents through one self-hosted proxy. Your app sends standard OpenAI-format requests; LATTICE makes them smaller, faster, and cache-friendlier.
It is not a router. LATTICE never changes your model, never falls back between providers, never guesses. One provider per request. LATTICE optimises transport and execution.
- Installation
- Quick Start
- Architecture
- Novel Technology
- Compression Pipeline
- Safety
- Observability
- Supported Providers
- CLI Reference
- Agent Integration
- Development
- Migrating from v0.x
- Documentation
- License
pip install lattice-transport

Optional extras:

pip install "lattice-transport[redis]" # Multi-process session store
pip install "lattice-transport[mcp]" # MCP tool support
pip install "lattice-transport[all]" # Everything

Requirements: Python 3.10+. No external services required for single-process mode.
# Start the proxy
lattice proxy run --port 8787
# Point any OpenAI SDK at it
export OPENAI_BASE_URL=http://localhost:8787/v1
# Or route an agent through it
lattice lace claude

from lattice import LatticeClient
client = LatticeClient()
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain transport protocols"}],
)
print(response.choices[0].message.content)

Every request is automatically compressed, cached, and optimized in proxy mode, with zero application code changes.
Application
│ OpenAI / Anthropic API format
▼
LATTICE Proxy :8787
│
├── state/ Session, segments, SemanticCache
├── planner/ RequestClassifier → UnifiedPlanner → ExecutionPlan
├── pipeline/ Pipeline.compress() — IR-native transforms + gates
├── telemetry/ Metrics, downgrade, cost, agent stats
└── providers/ adapters/ (17) + transport/ (HTTP pool, TACC, streaming)
│
▼
LLM Provider (exactly one per request)
- Client sends POST /v1/chat/completions (or Anthropic /v1/messages).
- SessionManager creates or loads a session (CAS versioning).
- content_profiler builds a semantic profile; UnifiedPlanner produces an ExecutionPlan.
- Pipeline.compress() runs transforms with policy, budget, risk, and MILV gates.
- SemanticCache checks exact hash, then approximate fingerprint.
- On miss, the provider adapter sends via HTTP/2 pool; TACC manages admission.
- Response reverse pass + x-lattice-* headers via LatticeHeaderMiddleware.
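The cache step in the flow above (exact hash first, then approximate fingerprint) can be illustrated with a minimal sketch. This is not LATTICE's SemanticCache: the class name `TwoTierCache`, the word-set fingerprint, the Jaccard similarity, and the 0.9 threshold are all assumptions chosen to keep the example short.

```python
import hashlib
import json


def exact_key(request: dict) -> str:
    """SHA-256 of the canonical JSON form, so identical requests share a key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fingerprint(request: dict) -> frozenset:
    """Hypothetical approximate fingerprint: lowercased word set of all message text."""
    words = set()
    for message in request.get("messages", []):
        words.update(str(message.get("content", "")).lower().split())
    return frozenset(words)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class TwoTierCache:
    """Illustrative cache: exact lookup first, then a similarity scan."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # assumed similarity cutoff
        self._exact = {}
        self._approx = []  # (model, fingerprint, response) tuples

    def put(self, request: dict, response) -> None:
        self._exact[exact_key(request)] = response
        self._approx.append((request.get("model"), fingerprint(request), response))

    def get(self, request: dict):
        hit = self._exact.get(exact_key(request))
        if hit is not None:
            return hit, "exact"
        candidate = fingerprint(request)
        for model, stored, response in self._approx:
            # Approximate hits are only considered within the same model.
            if model == request.get("model") and jaccard(candidate, stored) >= self.threshold:
                return response, "approximate"
        return None, "miss"
```

The exact tier costs one hash and one dictionary lookup, so it runs first; the approximate tier is a linear scan here, which a real implementation would presumably replace with an index.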
LATTICE applies classical systems techniques to LLM workloads — transport and execution, not model features.
| Capability | Summary | Deep dive |
|---|---|---|
| TACC | Token-aware AIMD congestion control | docs/novel/tacc.md |
| Binary framing | 15-byte headers, 17 frame types, CRC32 | docs/novel/binary-framing.md |
| Delta encoding | Turn 2+ sends deltas only; CAS sessions | docs/novel/delta-encoding.md |
| Streaming | Stall detection, resume tokens, multiplex | docs/novel/streaming.md |
| Batching | Groups compatible requests | docs/novel/batching-speculation.md |
| Speculation | Rule-based next-turn precompute | docs/novel/batching-speculation.md |
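To make the TACC row concrete, here is a minimal sketch of token-aware AIMD (additive increase, multiplicative decrease) admission control. It is an illustration of the general technique, not the TACC implementation: the class name `TokenAimdWindow`, the parameter defaults, and the choice of congestion signal are assumptions.

```python
class TokenAimdWindow:
    """Admission window measured in tokens in flight rather than request count."""

    def __init__(self, initial=8000, floor=1000, ceiling=200_000,
                 increase=500, decrease=0.5):
        self.window = initial      # current token budget
        self.floor = floor         # never shrink below this
        self.ceiling = ceiling     # never grow above this
        self.increase = increase   # tokens added after each success
        self.decrease = decrease   # multiplier applied on congestion
        self.in_flight = 0

    def try_admit(self, tokens: int) -> bool:
        """Admit the request only if it fits in the remaining window."""
        if self.in_flight + tokens > self.window:
            return False
        self.in_flight += tokens
        return True

    def on_success(self, tokens: int) -> None:
        """Additive increase: grow the window by a fixed step."""
        self.in_flight = max(0, self.in_flight - tokens)
        self.window = min(self.ceiling, self.window + self.increase)

    def on_congestion(self, tokens: int) -> None:
        """Multiplicative decrease, e.g. after a rate-limit response or timeout."""
        self.in_flight = max(0, self.in_flight - tokens)
        self.window = max(self.floor, int(self.window * self.decrease))
```

Counting tokens instead of requests means one very large prompt can fill the window that many small prompts would otherwise share, which is the sense in which the control is token-aware.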
Batching overhead reduction and long-conversation dedup savings are measured in the canonical benchmark suite; see Claim traceability [1][2].
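The binary framing row mentions 15-byte headers, 17 frame types, and CRC32. The sketch below shows one way a header of that size could be packed and verified. The field layout (version, type, flags, stream id, payload length, CRC32 of the payload) is purely hypothetical; the real wire format is described in docs/novel/binary-framing.md and may differ in every field.

```python
import struct
import zlib

# Assumed layout, big-endian, no padding: 1 + 1 + 1 + 4 + 4 + 4 = 15 bytes.
HEADER = struct.Struct(">BBBIII")
FRAME_TYPE_COUNT = 17  # from the table above; the 0..16 numbering is an assumption


def encode_frame(frame_type: int, stream_id: int, payload: bytes,
                 flags: int = 0, version: int = 1) -> bytes:
    """Prefix the payload with a 15-byte header carrying its CRC32."""
    if not 0 <= frame_type < FRAME_TYPE_COUNT:
        raise ValueError(f"unknown frame type: {frame_type}")
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(version, frame_type, flags, stream_id, len(payload), crc) + payload


def decode_frame(data: bytes):
    """Parse one frame and reject truncated or corrupted payloads."""
    if len(data) < HEADER.size:
        raise ValueError("frame shorter than header")
    version, frame_type, flags, stream_id, length, crc = HEADER.unpack_from(data)
    payload = bytes(data[HEADER.size:HEADER.size + length])
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch")
    return frame_type, stream_id, payload
```

A round trip such as `decode_frame(encode_frame(2, 7, b"hello"))` returns the frame type, stream id, and payload unchanged, while flipping any payload byte raises `ValueError`.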
LATTICE ships 20 transforms (list_transform_names() in lattice.transforms.registry). Six run in the default pipeline; three are execution-only (batching, speculative, delta); the rest are planner-selected or off by default. Every transform is safety-classified and risk-gated.
| P | Transform | Safety | What it does | Default |
|---|---|---|---|---|
| 1 | content_profiler | SAFE | Classifies content, computes semantic risk score | yes |
| 2 | runtime_contract | SAFE | Per-transform budget and timeout | yes |
| 2 | speculative | SAFE | Speculative token generation | exec |
| 3 | batching | SAFE | Request batching for multi-turn workloads | exec |
| 5 | delta_encoder | SAFE | Session-based delta encoding | exec |
| 9 | cache_arbitrage | SAFE | KV-cache alignment reorder | yes |
| 9 | causal_chain | SAFE | Causal chain extraction | no |
| 15 | message_dedup | CONDITIONAL | Exact/near-duplicate turn removal | no |
| 17 | diagnostic_rle | SAFE | Diagnostic repetition RLE | no |
| 18 | context_selector | SAFE | Submodular context selection | no |
| 19 | columnar_pack | SAFE | Columnar table packing | no |
| 20 | reference_sub | CONDITIONAL | UUID/URL/hash → short refs | yes |
| 21 | json_shape | SAFE | JSON shape factoring | no |
| 22 | extractive_compress | SAFE | Extractive compression | no |
| 22 | rate_distortion | CONDITIONAL | Rate-distortion semantic compression | no |
| 23 | path_prefix | SAFE | Filesystem path prefix compression | no |
| 25 | format_conversion | CONDITIONAL | Table/JSON format conversion | no |
| 29 | tool_projection | SAFE | Query-aware tool field projection | no |
| 30 | tool_filter | SAFE | Tool output filtering | yes |
| 40 | output_cleanup | SAFE | Response-side whitespace/JSON cleanup | yes |
exec = execution-only transform (outside default pipeline list).
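As an illustration of what a reversible transform such as reference_sub might do, the sketch below replaces UUIDs with short references and restores them on the reverse pass. It is a simplified stand-in, not the shipped transform: the `<rN>` reference format and the function names are assumptions, and only UUIDs are handled (the table also lists URLs and hashes).

```python
import re

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def substitute_refs(text: str):
    """Replace each distinct UUID with a short reference; return text and mapping."""
    ref_for = {}   # original UUID -> short ref
    mapping = {}   # short ref -> original UUID, used by the reverse pass

    def replace(match):
        original = match.group(0)
        if original not in ref_for:
            ref = f"<r{len(ref_for) + 1}>"  # hypothetical reference format
            ref_for[original] = ref
            mapping[ref] = original
        return ref_for[original]

    return UUID_RE.sub(replace, text), mapping


def restore_refs(text: str, mapping: dict) -> str:
    """Reverse pass: expand short references back to the original values."""
    for ref, original in mapping.items():
        text = text.replace(ref, original)
    return text
```

The sketch also hints at why such a transform would be CONDITIONAL rather than SAFE: if the input already contains text that looks like a short reference, the reverse pass can no longer tell the two apart without extra escaping.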
Headline compression on the canonical feature suite (ollama-cloud / kimi-k2.6:cloud): 40.3% average reduction [3]. Pipeline latency ~36 ms [4].
→ Transform reference · Claim traceability · Feature parity checklist (61 rows)
Transforms are classified SAFE, CONDITIONAL, or DANGEROUS. A 0–100 semantic risk score gates lossy transforms; expansion guards cap token growth.
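A gate of this kind could look roughly like the sketch below. The thresholds (`max_risk=40`, `max_growth=1.0`), the decision to treat CONDITIONAL transforms as the lossy ones, and the rule that DANGEROUS transforms are never applied are assumptions made for illustration; the Safety guide is the authority on the actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    apply: bool
    reason: str


def gate_transform(safety: str, risk_score: int, tokens_before: int,
                   tokens_after: int, max_risk: int = 40,
                   max_growth: float = 1.0) -> GateDecision:
    """Decide whether a transform's output may replace its input."""
    if not 0 <= risk_score <= 100:
        raise ValueError("risk score must be in 0..100")
    if safety == "DANGEROUS":
        # Assumption: dangerous transforms are rejected outright.
        return GateDecision(False, "dangerous transform")
    if safety == "CONDITIONAL" and risk_score > max_risk:
        # Risk gate: lossy transforms are skipped on semantically risky content.
        return GateDecision(False, "risk gate")
    if tokens_after > tokens_before * max_growth:
        # Expansion guard: a compression step must not grow the prompt.
        return GateDecision(False, "expansion guard")
    return GateDecision(True, "ok")
```

For example, `gate_transform("CONDITIONAL", 80, 1000, 700)` is rejected by the risk gate, while the same call with `"SAFE"` passes because the risk score only constrains lossy transforms in this sketch.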
→ Safety guide · SIG · RATS · PSG · MILV
curl http://localhost:8787/stats | jq
curl http://localhost:8787/metrics

- /stats: transforms, sessions, pools, TACC, maintenance, downgrades
- /metrics: Prometheus counters and histograms
- Response headers: x-lattice-compression, x-lattice-session-id, x-lattice-delta, x-lattice-cost-usd, x-lattice-provider, x-lattice-transforms-applied
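On the client side, the x-lattice-* response headers can be collected with a few lines of Python. The helper below is a sketch that works on any header mapping; the function name is hypothetical, and the values are returned as raw strings because their exact formats are not specified here.

```python
def lattice_headers(headers) -> dict:
    """Return the x-lattice-* headers with the prefix stripped, keys lowercased."""
    prefix = "x-lattice-"
    return {
        name.lower()[len(prefix):]: value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }


# Illustrative input only; real values come from a proxied response.
sample = {
    "Content-Type": "application/json",
    "X-Lattice-Provider": "openai",
    "x-lattice-session-id": "example-session",
}
print(lattice_headers(sample))
```

Header names are matched case-insensitively because HTTP header names are case-insensitive and client libraries differ in how they normalize them.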
17 direct adapters. No routing — one provider per request.
| Provider | Prefix | HTTP/2 | Streaming |
|---|---|---|---|
| OpenAI | openai/ | yes | SSE |
| Anthropic | anthropic/, claude- | yes | SSE |
| Azure | azure/ | yes | SSE |
| Bedrock | bedrock/ | yes | SSE |
| Gemini | gemini/, google/ | yes | SSE |
| Vertex AI | vertex/ | yes | SSE |
| Groq | groq/ | yes | SSE |
| DeepSeek | deepseek/ | yes | SSE |
| Mistral | mistral/ | yes | SSE |
| Cohere | cohere/ | yes | SSE |
| Ollama | ollama/ | — | SSE |
| Ollama Cloud | ollama-cloud/ | yes | SSE |
| OpenRouter | openrouter/ | yes | SSE |
| Fireworks | fireworks/ | yes | SSE |
| Together | together/ | yes | SSE |
| Perplexity | perplexity/ | yes | SSE |
| AI21 | ai21/ | yes | SSE |
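The prefix column above implies a simple lookup from model string to provider. The sketch below shows how such a lookup might work, using the prefixes from the table; the function name and the longest-prefix-first matching order are assumptions, not a description of LATTICE's internal resolver.

```python
PREFIXES = {
    "openai/": "OpenAI",
    "anthropic/": "Anthropic",
    "claude-": "Anthropic",
    "azure/": "Azure",
    "bedrock/": "Bedrock",
    "gemini/": "Gemini",
    "google/": "Gemini",
    "vertex/": "Vertex AI",
    "groq/": "Groq",
    "deepseek/": "DeepSeek",
    "mistral/": "Mistral",
    "cohere/": "Cohere",
    "ollama/": "Ollama",
    "ollama-cloud/": "Ollama Cloud",
    "openrouter/": "OpenRouter",
    "fireworks/": "Fireworks",
    "together/": "Together",
    "perplexity/": "Perplexity",
    "ai21/": "AI21",
}


def resolve_provider(model: str) -> str:
    """Map a model string to exactly one provider, or fail loudly."""
    # Longest prefix first, so a more specific prefix always wins.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if model.startswith(prefix):
            return PREFIXES[prefix]
    # No fallback and no guessing: an unknown prefix is an error.
    raise ValueError(f"no provider for model {model!r}")


print(resolve_provider("openai/gpt-4o"))
print(resolve_provider("ollama-cloud/kimi-k2.6:cloud"))
```

Raising on an unknown prefix, rather than falling back to a default provider, mirrors the one-provider-per-request rule stated above.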
lattice proxy run --port 8787
lattice proxy start|stop|restart|status
lattice init
lattice lace|unlace <agent>
lattice info|config|status|health|doctor
lattice benchmark --suite feature # wraps benchmarks/evals/cli.py

lattice lace claude # Claude Code
lattice lace codex # OpenAI Codex
lattice lace cursor # Cursor
lattice lace opencode # OpenCode
lattice lace copilot # GitHub Copilot

lattice doctor (no args) checks all five agents. lattice init applies durable config; lattice lace uses transient routing + tunnel sidecar.
git clone https://github.com/Harsh-Daga/lattice
cd lattice
uv sync
uv run pytest tests/ -q # 2039 collected, 1824 passed (215 skipped)
uv run pytest tests/contract/ -q
uv run ruff check src/ tests/ benchmarks/
uv run ruff format --check src/ tests/ benchmarks/
uv run mypy src/lattice/
uv run python benchmarks/evals/cli.py --suite all \
--providers ollama-cloud \
--provider-model ollama-cloud=kimi-k2.6:cloud \
--iterations 3 --warmup 1

→ AGENTS.md for AI agent contributors
Internal Python imports changed in 1.0.0. CLI and HTTP are stable.
| Section | Documents |
|---|---|
| Getting Started | Quick Start · Installation · CLI |
| Architecture | Runtime · Safety · SDK |
| Novel Tech | TACC · Binary framing · Delta · Streaming |
| Compression | Transforms · Caching |
| Providers | 17 providers |
| Operations | Agent integrations |
MIT © Harsh Daga
GitHub · Issues · PyPI · Changelog