JIT LLM Proxy

FastAPI-based proxy that implements cross-model speculative decoding. A fast draft model produces a full answer, then a larger verification model corrects (or accepts) the draft to cut cost and latency.

How it works

  1. Draft model generates a complete response.
  2. Verification model checks the draft and rewrites only if needed.
  3. The proxy returns the corrected final response.
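
The three steps above can be sketched as a single async function. The helper names (`generate_draft`, `verify_draft`) are illustrative stand-ins, not the proxy's actual internal API:

```python
import asyncio

# Minimal sketch of the draft -> verify flow, assuming a verifier that
# returns a verdict plus an optional rewrite. Names here are hypothetical.

async def speculative_complete(prompt, generate_draft, verify_draft):
    """Run the draft + verification steps and return the final answer."""
    draft = await generate_draft(prompt)                    # 1. fast model drafts a full answer
    verdict, corrected = await verify_draft(prompt, draft)  # 2. large model checks the draft
    # 3. return the draft unchanged if accepted, otherwise the verifier's rewrite
    return draft if verdict == "accept" else corrected
```

Because the draft is generated in full before verification, the expensive model mostly confirms text rather than producing it, which is where the cost and latency savings come from.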

Features

  • OpenAI-compatible endpoints: POST /v1/chat/completions and POST /v1/responses
  • Draft + verify speculative decoding across providers
  • Provider adapters for OpenAI, Anthropic, Gemini, and OpenRouter
  • Optional Redis-backed draft cache
  • Basic in-memory metrics collection

Requirements

  • Python 3.10+
  • Provider API keys for the models you plan to use

Install

pip install fastapi uvicorn httpx pyyaml redis

Run locally

uvicorn app.main:app --reload --port 8080

Quick request (chat)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai:gpt-5.1",
    "messages": [{"role": "user", "content": "Explain speculative decoding."}],
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "openai",
    "jit_verify_model": "gpt-5.1"
  }'

Quick request (responses)

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic:claude-opus-4.6",
    "input": "Summarize speculative decoding.",
    "jit_mode": "speculative",
    "jit_draft_provider": "openrouter",
    "jit_draft_model": "zai-org/GLM-4.7-Flash",
    "jit_verify_provider": "anthropic",
    "jit_verify_model": "claude-opus-4.6"
  }'

Model naming and routing

You can prefix the model to pick a provider explicitly:

  • openai:gpt-5
  • openai:gpt-5.1
  • anthropic:claude-opus-4.6
  • gemini:gemini-1.5-pro
  • openrouter:zai-org/GLM-4.7-Flash
  • openrouter:writer/palmyra-x5

If no prefix is provided, the default provider is openrouter unless overridden by config.
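
The routing rule above amounts to splitting on the first colon and falling back to the default provider. A minimal sketch, with `parse_model` as a hypothetical helper name:

```python
# Sketch of provider-prefix routing, assuming the four providers listed above
# and openrouter as the unprefixed default. Not the proxy's real function.

KNOWN_PROVIDERS = {"openai", "anthropic", "gemini", "openrouter"}
DEFAULT_PROVIDER = "openrouter"

def parse_model(model: str, default: str = DEFAULT_PROVIDER) -> tuple[str, str]:
    """Split 'provider:model' on the first colon; unprefixed names use the default."""
    prefix, sep, rest = model.partition(":")
    if sep and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return default, model
```

Splitting only on the first colon matters because OpenRouter model names contain slashes (e.g. `openrouter:zai-org/GLM-4.7-Flash`) and must survive intact after the prefix is removed.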

Request overrides (jit_*)

Use jit_* fields to control speculative decoding per request:

  • jit_mode: speculative or direct
  • jit_draft_provider, jit_draft_model
  • jit_verify_provider, jit_verify_model
  • jit_verify_mode: rewrite (default) or check_only (if supported)
  • jit_acceptance_min: minimum draft acceptance rate; below this threshold the proxy falls back to full generation
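
The `jit_acceptance_min` fallback can be sketched as a simple threshold check. `choose_output` is a hypothetical name; the proxy's real decision logic may differ:

```python
# Sketch of acceptance-rate gating: if the verifier accepts less than
# acceptance_min of the draft, discard it and regenerate from scratch.

def choose_output(verified: str, acceptance_rate: float,
                  acceptance_min: float, full_generate):
    """Return the verified draft, or a full regeneration when acceptance is too low."""
    if acceptance_rate < acceptance_min:
        return full_generate()   # fallback path: the draft was mostly wrong
    return verified              # normal path: keep the (possibly rewritten) draft
```

With the default `acceptance_min` of 0.0, the fallback never triggers and the verified draft is always returned.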

Policy configuration

Edit config/policy.yaml to set defaults or per-model overrides:

default:
  mode: speculative
  draft_provider: openrouter
  draft_model: zai-org/GLM-4.7-Flash
  verify_provider: openai
  verify_mode: rewrite
  acceptance_min: 0.0
models:
  gpt-5.1:
    verify_provider: openai
  claude-opus-4.6:
    verify_provider: anthropic
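
Resolution of this file is effectively a merge of the `default` section with any matching `models` entry. A minimal sketch, assuming the YAML has already been parsed into a dict (`resolve_policy` is a hypothetical helper, not the proxy's loader):

```python
# Sketch of policy resolution: per-model entries override the defaults.

def resolve_policy(policy: dict, model: str) -> dict:
    """Merge the 'default' section with the matching 'models' entry, if any."""
    merged = dict(policy.get("default", {}))          # start from global defaults
    merged.update(policy.get("models", {}).get(model, {}))  # per-model overrides win
    return merged
```

So with the config above, `claude-opus-4.6` keeps `mode: speculative` from the defaults but gets `verify_provider: anthropic` from its override.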

Environment

Copy env.example and set keys as needed:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GEMINI_API_KEY
  • OPENROUTER_API_KEY

Optional:

  • JIT_DRAFT_MODEL_DEFAULT
  • JIT_VERIFY_MODEL_DEFAULT
  • JIT_DRAFT_PROVIDER_DEFAULT
  • JIT_VERIFY_PROVIDER_DEFAULT
  • JIT_VERIFY_MODE_DEFAULT (e.g. rewrite or check_only)
  • JIT_POLICY_DEFAULT (e.g. speculative or direct)
  • JIT_POLICY_CONFIG (override policy file path)
  • JIT_REDIS_URL (enable Redis cache)
  • JIT_MAX_INPUT_CHARS (cap request size; default 20000)
  • JIT_MAX_MESSAGES (cap message count; default 50)
  • OPENROUTER_HTTP_REFERER, OPENROUTER_TITLE
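
A hypothetical `.env` sketch combining the variables above; every value here is a placeholder, so substitute the keys and models you actually use:

```shell
# Provider credentials (set only the ones you need)
OPENROUTER_API_KEY=sk-or-...
OPENAI_API_KEY=sk-...

# Speculative-decoding defaults (optional)
JIT_DRAFT_PROVIDER_DEFAULT=openrouter
JIT_DRAFT_MODEL_DEFAULT=zai-org/GLM-4.7-Flash
JIT_VERIFY_PROVIDER_DEFAULT=openai
JIT_VERIFY_MODE_DEFAULT=rewrite
JIT_POLICY_DEFAULT=speculative

# Request-size caps (optional; shown at their documented defaults)
JIT_MAX_INPUT_CHARS=20000
JIT_MAX_MESSAGES=50
```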

Development

python3 -m pytest

Notes and limitations

  • Streaming is not supported yet.
  • True check-only verification depends on provider support. The default path uses a rewrite strategy.
  • Token alignment is a naive string-prefix match and can be improved with provider tokenizers.
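
The naive string-prefix alignment noted above can be sketched as the shared-prefix fraction of the draft; `acceptance_rate` is a hypothetical name, and a tokenizer-aware version would compare token IDs instead of characters:

```python
import os

# Sketch of character-level prefix alignment: the accepted fraction is the
# length of the common prefix divided by the draft length.

def acceptance_rate(draft: str, final: str) -> float:
    """Fraction of the draft that survives verification as a common prefix."""
    if not draft:
        return 0.0
    prefix = os.path.commonprefix([draft, final])
    return len(prefix) / len(draft)
```

This measure is coarse (a single early edit zeroes out credit for everything after it), which is why the note suggests provider tokenizers as an improvement.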
