FastAPI-based proxy that implements cross-model speculative decoding. A fast draft model produces a full answer, then a larger verification model corrects (or accepts) the draft to cut cost and latency.
- Draft model generates a complete response.
- Verification model checks the draft and rewrites only if needed.
- The proxy returns the corrected final response.
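Conceptually, the draft/verify loop can be sketched as below; `speculative_complete`, `draft_fn`, and `verify_fn` are illustrative names for this sketch, not the proxy's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SpecResult:
    text: str
    accepted: bool  # True if the verifier accepted the draft unchanged

def speculative_complete(
    prompt: str,
    draft_fn: Callable[[str], str],
    verify_fn: Callable[[str, str], Optional[str]],
) -> SpecResult:
    """Generate with the fast draft model, then let the verifier either
    accept the draft (by returning None) or return a rewritten answer."""
    draft = draft_fn(prompt)
    rewrite = verify_fn(prompt, draft)
    if rewrite is None:
        return SpecResult(text=draft, accepted=True)
    return SpecResult(text=rewrite, accepted=False)
```

When the verifier accepts, only the cheap draft call produced tokens; when it rewrites, you pay for both calls, which is why acceptance rate drives the cost savings.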
- OpenAI-compatible endpoints: `POST /v1/chat/completions` and `POST /v1/responses`, both with draft + verify speculative decoding across providers
- Provider adapters for OpenAI, Anthropic, Gemini, and OpenRouter
- Optional Redis-backed draft cache
- Basic in-memory metrics collection
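The Redis-backed draft cache could key cached drafts by a stable hash of the request; the key scheme below is an assumption for illustration, not the proxy's actual format:

```python
import hashlib
import json

def draft_cache_key(messages: list, draft_model: str) -> str:
    """Deterministic cache key derived from the chat messages and draft model.
    sort_keys ensures equivalent requests hash identically."""
    payload = json.dumps({"messages": messages, "model": draft_model}, sort_keys=True)
    return "jit:draft:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```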
- Python 3.10+
- Provider API keys for the models you plan to use
Install dependencies:

```bash
pip install fastapi uvicorn httpx pyyaml redis
```

Start the server:

```bash
uvicorn app.main:app --reload --port 8080
```

Example request against `/v1/chat/completions`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai:gpt-5.1",
    "messages": [{"role": "user", "content": "Explain speculative decoding."}],
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "openai",
    "jit_verify_model": "gpt-5.1"
  }'
```

Example request against `/v1/responses`:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic:claude-opus-4.6",
    "input": "Summarize speculative decoding.",
    "jit_mode": "speculative",
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "anthropic",
    "jit_verify_model": "claude-opus-4.6"
  }'
```

You can prefix the model to pick a provider explicitly:
- `openai:gpt-5`
- `openai:gpt-5.1`
- `anthropic:claude-opus-4.6`
- `gemini:gemini-1.5-pro`
- `openrouter:zai-org/GLM-4.7-Flash`
- `openrouter:writer/palmyra-x5`

If no prefix is provided, the default provider is `openrouter` unless overridden by config.
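A minimal sketch of this prefix resolution; `split_model` and the provider whitelist are illustrative, not the proxy's actual implementation:

```python
KNOWN_PROVIDERS = {"openai", "anthropic", "gemini", "openrouter"}

def split_model(model: str, default_provider: str = "openrouter") -> tuple:
    """Split 'provider:model' into (provider, model name).
    Unprefixed names fall back to the default provider."""
    provider, sep, name = model.partition(":")
    if sep and provider in KNOWN_PROVIDERS:
        return provider, name
    return default_provider, model
```

Note that OpenRouter model names may themselves contain `/` (e.g. `zai-org/GLM-4.7-Flash`), which is why only the first `:` is treated as the provider separator.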
Use `jit_*` fields to control speculative decoding per request:

- `jit_mode`: `speculative` or `direct`
- `jit_draft_provider`, `jit_draft_model`
- `jit_verify_provider`, `jit_verify_model`
- `jit_verify_mode`: `rewrite` (default) or `check_only` (if supported)
- `jit_acceptance_min`: minimum acceptance rate before falling back to full generation
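The `jit_acceptance_min` fallback could be implemented with a rolling acceptance tracker along these lines; the class name and warmup behavior are assumptions for this sketch:

```python
class AcceptanceTracker:
    """Tracks how often the verifier accepts drafts; once the observed
    rate falls below min_rate, drafting is skipped in favor of full generation."""

    def __init__(self, min_rate: float, warmup: int = 5):
        self.min_rate = min_rate
        self.warmup = warmup  # always draft for the first few requests
        self.accepted = 0
        self.total = 0

    def record(self, accepted: bool) -> None:
        self.total += 1
        if accepted:
            self.accepted += 1

    def should_draft(self) -> bool:
        if self.total < self.warmup:
            return True
        return self.accepted / self.total >= self.min_rate
```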
Edit `config/policy.yaml` to set defaults or per-model overrides:

```yaml
default:
  mode: speculative
  draft_provider: openrouter
  draft_model: zai-org/GLM-4.7-Flash
  verify_provider: openai
  verify_mode: rewrite
  acceptance_min: 0.0
models:
  gpt-5.1:
    verify_provider: openai
  claude-opus-4.6:
    verify_provider: anthropic
```

Copy `env.example` and set keys as needed:
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GEMINI_API_KEY`
- `OPENROUTER_API_KEY`
Optional:
- `JIT_DRAFT_MODEL_DEFAULT`
- `JIT_VERIFY_MODEL_DEFAULT`
- `JIT_DRAFT_PROVIDER_DEFAULT`
- `JIT_VERIFY_PROVIDER_DEFAULT`
- `JIT_VERIFY_MODE_DEFAULT` (e.g. `rewrite` or `check_only`)
- `JIT_POLICY_DEFAULT` (e.g. `speculative` or `direct`)
- `JIT_POLICY_CONFIG` (override policy file path)
- `JIT_REDIS_URL` (enable Redis cache)
- `JIT_MAX_INPUT_CHARS` (cap request size; default `20000`)
- `JIT_MAX_MESSAGES` (cap message count; default `50`)
- `OPENROUTER_HTTP_REFERER`, `OPENROUTER_TITLE`
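As an illustration of how these optional variables might be resolved, a sketch using `os.getenv` (the function and dict shape are assumptions; the `20000` and `50` fallbacks come from the list above):

```python
import os

def env_defaults() -> dict:
    """Resolve optional JIT_* settings from the environment,
    falling back to the documented defaults."""
    return {
        "draft_model": os.getenv("JIT_DRAFT_MODEL_DEFAULT"),
        "verify_mode": os.getenv("JIT_VERIFY_MODE_DEFAULT", "rewrite"),
        "policy": os.getenv("JIT_POLICY_DEFAULT", "speculative"),
        "max_input_chars": int(os.getenv("JIT_MAX_INPUT_CHARS", "20000")),
        "max_messages": int(os.getenv("JIT_MAX_MESSAGES", "50")),
    }
```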
Run the tests:

```bash
python3 -m pytest
```

Limitations:

- Streaming is not supported yet.
- True check-only verification depends on provider support. The default path uses a rewrite strategy.
- Token alignment is a naive string-prefix match and can be improved with provider tokenizers.
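The naive character-prefix alignment amounts to something like the sketch below; a tokenizer-aware version would compare token ids instead of characters:

```python
from os.path import commonprefix

def acceptance_rate(draft: str, verified: str) -> float:
    """Fraction of the draft that survives verification, measured as the
    length of the shared character prefix over the draft length."""
    if not draft:
        return 0.0
    matched = len(commonprefix([draft, verified]))
    return matched / len(draft)
```

This over- or under-counts whenever a tokenizer would split the boundary differently, which is exactly the improvement the note above suggests.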