fix: Lazy tokenizer init in StructuredOutputManager to prevent semaphore leak
GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.
This change makes tokenizer initialization lazy in StructuredOutputManager:
- The tokenizer is loaded only when `grammar_init()` is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand
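The change described above can be sketched as follows. This is a minimal illustration of the lazy-initialization pattern, not the actual vLLM code: the class and method names mirror the commit message, but the constructor signature, the `load_count` counter, and the stand-in tokenizer loader are hypothetical.

```python
# Sketch of lazy tokenizer init (illustrative names; the real loader is
# the expensive transformers path that may build merges on the fly).

class StructuredOutputManager:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self._tokenizer = None   # deferred: nothing loaded at construction
        self.load_count = 0      # for illustration only

    def _get_tokenizer(self):
        # Load on first use. Most requests never reach this path, so the
        # EngineCore subprocess usually never triggers the multiprocessing
        # primitives used by build_merges_on_the_fly.
        if self._tokenizer is None:
            self.load_count += 1
            self._tokenizer = f"tokenizer({self.model_name})"  # stand-in
        return self._tokenizer

    def grammar_init(self, request: str):
        tokenizer = self._get_tokenizer()
        return (tokenizer, request)


mgr = StructuredOutputManager("phi-3.5-mini")
assert mgr._tokenizer is None   # no tokenizer loaded at startup
mgr.grammar_init("json_schema")
mgr.grammar_init("regex")
assert mgr.load_count == 1      # loaded once, on first structured request
```

Because construction no longer touches the tokenizer, the EngineCore subprocess only pays the initialization cost (and risks the semaphore leak) when a structured-output request actually arrives.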
The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully
Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).