JIT LLM Proxy

FastAPI-based proxy that implements cross-model speculative decoding. A fast draft model produces a full answer, then a larger verification model corrects (or accepts) the draft to cut cost and latency.

How it works

  1. Draft model generates a complete response.
  2. Verification model checks the draft and rewrites only if needed.
  3. The proxy returns the corrected final response.
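
The three steps above can be sketched as a single async function. The helper names (`generate_draft`, `verify_draft`) are illustrative stand-ins, not the proxy's actual internal API:

```python
import asyncio

# Minimal sketch of the draft -> verify flow, assuming a verifier that
# returns a verdict plus an optional rewrite. Names here are hypothetical.

async def speculative_complete(prompt, generate_draft, verify_draft):
    """Run the draft + verification steps and return the final answer."""
    draft = await generate_draft(prompt)                    # 1. fast model drafts a full answer
    verdict, corrected = await verify_draft(prompt, draft)  # 2. large model checks the draft
    # 3. return the draft unchanged if accepted, otherwise the verifier's rewrite
    return draft if verdict == "accept" else corrected
```

Because the draft is generated in full before verification, the expensive model mostly confirms text rather than producing it, which is where the cost and latency savings come from.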

Features

  • OpenAI-compatible endpoints: POST /v1/chat/completions and POST /v1/responses
  • Draft + verify speculative decoding across providers
  • Provider adapters for OpenAI, Anthropic, Gemini, and OpenRouter
  • Optional Redis-backed draft cache
  • Basic in-memory metrics collection

Requirements

  • Python 3.10+
  • Provider API keys for the models you plan to use

Install

pip install fastapi uvicorn httpx pyyaml redis

Run locally

uvicorn app.main:app --reload --port 8080

Quick request (chat)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai:gpt-5.1",
    "messages": [{"role": "user", "content": "Explain speculative decoding."}],
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "openai",
    "jit_verify_model": "gpt-5.1"
  }'

Quick request (responses)

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic:claude-opus-4.6",
    "input": "Summarize speculative decoding.",
    "jit_mode": "speculative",
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "anthropic",
    "jit_verify_model": "claude-opus-4.6"
  }'

Model naming and routing

You can prefix the model to pick a provider explicitly:

  • openai:gpt-5
  • openai:gpt-5.1
  • anthropic:claude-opus-4.6
  • gemini:gemini-1.5-pro
  • openrouter:zai-org/GLM-4.7-Flash
  • openrouter:writer/palmyra-x5

If no prefix is provided, the default provider is openrouter unless overridden by config.
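
The routing rule above amounts to splitting on the first colon and falling back to the default provider. A minimal sketch, with `parse_model` as a hypothetical helper name:

```python
# Sketch of provider-prefix routing, assuming the four providers listed above
# and openrouter as the unprefixed default. Not the proxy's real function.

KNOWN_PROVIDERS = {"openai", "anthropic", "gemini", "openrouter"}
DEFAULT_PROVIDER = "openrouter"

def parse_model(model: str, default: str = DEFAULT_PROVIDER) -> tuple[str, str]:
    """Split 'provider:model' on the first colon; unprefixed names use the default."""
    prefix, sep, rest = model.partition(":")
    if sep and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return default, model
```

Splitting only on the first colon matters because OpenRouter model names contain slashes (e.g. `openrouter:zai-org/GLM-4.7-Flash`) and must survive intact after the prefix is removed.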

Request overrides (jit_*)

Use jit_* fields to control speculative decoding per request:

  • jit_mode: speculative or direct
  • jit_draft_provider, jit_draft_model
  • jit_verify_provider, jit_verify_model
  • jit_verify_mode: rewrite (default) or check_only (if supported)
  • jit_acceptance_min: minimum draft acceptance rate; below this threshold the proxy falls back to full generation
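
The `jit_acceptance_min` fallback can be sketched as a simple threshold check. `choose_output` is a hypothetical name; the proxy's real decision logic may differ:

```python
# Sketch of acceptance-rate gating: if the verifier accepts less than
# acceptance_min of the draft, discard it and regenerate from scratch.

def choose_output(verified: str, acceptance_rate: float,
                  acceptance_min: float, full_generate):
    """Return the verified draft, or a full regeneration when acceptance is too low."""
    if acceptance_rate < acceptance_min:
        return full_generate()   # fallback path: the draft was mostly wrong
    return verified              # normal path: keep the (possibly rewritten) draft
```

With the default `acceptance_min` of 0.0, the fallback never triggers and the verified draft is always returned.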

Policy configuration

Edit config/policy.yaml to set defaults or per-model overrides:

default:
  mode: speculative
  draft_provider: openrouter
  draft_model: zai-org/GLM-4.7-Flash
  verify_provider: openai
  verify_mode: rewrite
  acceptance_min: 0.0
models:
  gpt-5.1:
    verify_provider: openai
  claude-opus-4.6:
    verify_provider: anthropic
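
Resolution of this file is effectively a merge of the `default` section with any matching `models` entry. A minimal sketch, assuming the YAML has already been parsed into a dict (`resolve_policy` is a hypothetical helper, not the proxy's loader):

```python
# Sketch of policy resolution: per-model entries override the defaults.

def resolve_policy(policy: dict, model: str) -> dict:
    """Merge the 'default' section with the matching 'models' entry, if any."""
    merged = dict(policy.get("default", {}))          # start from global defaults
    merged.update(policy.get("models", {}).get(model, {}))  # per-model overrides win
    return merged
```

So with the config above, `claude-opus-4.6` keeps `mode: speculative` from the defaults but gets `verify_provider: anthropic` from its override.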

Environment

Copy env.example and set keys as needed:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GEMINI_API_KEY
  • OPENROUTER_API_KEY

Optional:

  • JIT_DRAFT_MODEL_DEFAULT
  • JIT_VERIFY_MODEL_DEFAULT
  • JIT_DRAFT_PROVIDER_DEFAULT
  • JIT_VERIFY_PROVIDER_DEFAULT
  • JIT_VERIFY_MODE_DEFAULT (e.g. rewrite or check_only)
  • JIT_POLICY_DEFAULT (e.g. speculative or direct)
  • JIT_POLICY_CONFIG (override policy file path)
  • JIT_REDIS_URL (enable Redis cache)
  • JIT_MAX_INPUT_CHARS (cap request size; default 20000)
  • JIT_MAX_MESSAGES (cap message count; default 50)
  • OPENROUTER_HTTP_REFERER, OPENROUTER_TITLE
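
A hypothetical `.env` sketch combining the variables above; every value here is a placeholder, so substitute the keys and models you actually use:

```shell
# Provider credentials (set only the ones you need)
OPENROUTER_API_KEY=sk-or-...
OPENAI_API_KEY=sk-...

# Speculative-decoding defaults (optional)
JIT_DRAFT_PROVIDER_DEFAULT=openrouter
JIT_DRAFT_MODEL_DEFAULT=zai-org/GLM-4.7-Flash
JIT_VERIFY_PROVIDER_DEFAULT=openai
JIT_VERIFY_MODE_DEFAULT=rewrite
JIT_POLICY_DEFAULT=speculative

# Request-size caps (optional; shown at their documented defaults)
JIT_MAX_INPUT_CHARS=20000
JIT_MAX_MESSAGES=50
```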

Development

python3 -m pytest

Notes and limitations

  • Streaming is not supported yet.
  • True check-only verification depends on provider support. The default path uses a rewrite strategy.
  • Token alignment is a naive string-prefix match and can be improved with provider tokenizers.
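
The naive string-prefix alignment noted above can be sketched as the shared-prefix fraction of the draft; `acceptance_rate` is a hypothetical name, and a tokenizer-aware version would compare token IDs instead of characters:

```python
import os

# Sketch of character-level prefix alignment: the accepted fraction is the
# length of the common prefix divided by the draft length.

def acceptance_rate(draft: str, final: str) -> float:
    """Fraction of the draft that survives verification as a common prefix."""
    if not draft:
        return 0.0
    prefix = os.path.commonprefix([draft, final])
    return len(prefix) / len(draft)
```

This measure is coarse (a single early edit zeroes out credit for everything after it), which is why the note suggests provider tokenizers as an improvement.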
