Summary
POST /v1/chat/completions resolves every request through the legacy
slot resolver and dispatches to the primary slot, regardless of what
model the request body specifies.
Observed in production logs (2026-05-28):
2026-05-28T06:51:24 [info] dispatch.decision [hal0-dispatch]
cache_state=legacy
latency_ms=0.157
model=qwen3-coder-reap-25b-a3b-q5km ← caller asked for agent-hermes' model
resolution_path=legacy_slot:primary ← but we sent it to primary
upstream=primary
The agent-hermes slot was loaded with qwen3-coder-reap-25b-a3b-q5km
on port 8002. The primary slot was loaded with the 40b coder on 8001.
A chat request asking for the 25b model was still forwarded to primary.
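The mismatch in the log entry above can be checked mechanically. The following is a minimal sketch, not part of hal0: `expected_slots` and `is_misrouted` are hypothetical helpers, and the `loaded` map lists only the slot whose model name appears in this report.

```python
# Hypothetical slot -> loaded-model map. Only agent-hermes is listed because
# the report does not give the exact name of the 40b model on primary.
loaded = {"agent-hermes": "qwen3-coder-reap-25b-a3b-q5km"}

# Fields copied from the dispatch.decision log entry.
decision = {
    "model": "qwen3-coder-reap-25b-a3b-q5km",
    "resolution_path": "legacy_slot:primary",
    "upstream": "primary",
}


def expected_slots(model: str, loaded: dict[str, str]) -> list[str]:
    """Return the slots that currently have `model` loaded."""
    return sorted(slot for slot, name in loaded.items() if name == model)


def is_misrouted(decision: dict[str, str], loaded: dict[str, str]) -> bool:
    """True when some slot has the model loaded but the request went elsewhere."""
    owners = expected_slots(decision["model"], loaded)
    return bool(owners) and decision["upstream"] not in owners


owners = expected_slots(decision["model"], loaded)
misrouted = is_misrouted(decision, loaded)
```

A check along these lines could run over dispatch.decision entries to count how often a request goes to a slot that does not hold the requested model.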
Root cause
Dispatcher.dispatch() (src/hal0/dispatcher/router.py) reaches Step 4
(legacy heuristics) because:
- The model isn't in the upstream registry (Lemonade-loaded models don't auto-register).
- No upstream's cached /v1/models advertises it (or the cache is cold).
resolve_slot() in dispatcher/proxy.py matches by path, not by
model name, and /v1/chat/completions always resolves to primary.
So the dropdown in the WebUI suggesting "talk to agent-hermes" is
effectively cosmetic for chat requests — they all land on primary.
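To illustrate the failure mode, here is a minimal sketch of a path-only resolver. It is an assumption about the shape of the logic, not the actual code in dispatcher/proxy.py; `PATH_TO_SLOT` and `resolve_slot_by_path` are hypothetical names.

```python
# Hypothetical path -> slot table; the real mapping may contain other entries.
PATH_TO_SLOT = {"/v1/chat/completions": "primary"}


def resolve_slot_by_path(path: str, body: dict) -> str:
    """Illustrative path-only resolver: `body` is accepted but never read."""
    return PATH_TO_SLOT.get(path, "primary")


request_for_25b = {"model": "qwen3-coder-reap-25b-a3b-q5km", "messages": []}
request_for_other = {"model": "some-other-model", "messages": []}

# Both requests resolve to the same slot because the model field is ignored.
slot_a = resolve_slot_by_path("/v1/chat/completions", request_for_25b)
slot_b = resolve_slot_by_path("/v1/chat/completions", request_for_other)
```

Under this assumption, any value of `model` produces the same result, which matches the `resolution_path=legacy_slot:primary` entry in the log.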
Impact
- Users can't route chat to specific slots by model name.
- Multi-slot setups (primary + agent-hermes) effectively share the
primary slot for all /v1/chat/completions traffic.
- Will compound once we expose more chat-capable slots (NPU, FLM, etc.).
Proposed direction (not in scope of this issue — defer)
Either:
- Auto-register Lemonade-loaded models into the model registry on slot transition to READY, so Step 1 finds them; OR
- Make resolve_slot() consult slot manifests' [model] default + models lists when the path is /v1/chat/completions.
Deferred from a debug session on 2026-05-28 where we fixed the
swap-window 503 race; see related branch fix/swap-window-503.
Related
- ADR-0006 (Lemonade migration) — registry/catalog drift was noted but not closed
- Memory [[hal0_lemonade_hf_cache_gotchas]] — model catalog surfaces