[Fix] Auto-detect XGrammar compiler threads based on CPU cores. #17737

Open · wants to merge 4 commits into main

19 changes: 19 additions & 0 deletions docs/source/features/structured_outputs.md
@@ -273,3 +273,22 @@ print(outputs[0].outputs[0].text)
```

Full example: <gh-file:examples/offline_inference/structured_outputs.py>

## Performance Tuning: XGrammar Compiler Threads

When using the xgrammar backend for structured outputs, the number of threads
used by the grammar compiler can affect the time to first token, especially for
complex schemas.

You can control the thread pool size of the xgrammar grammar compiler with the
environment variable `VLLM_XGRAMMAR_COMPILER_THREADS`, which defaults to `8`.
Increasing this value may reduce time to first token in some environments,
particularly when compiling large or complex schemas. For example:

```bash
export VLLM_XGRAMMAR_COMPILER_THREADS=16
```

Set this variable before starting vLLM. Note that a higher thread count can
increase CPU usage during grammar compilation and does not always improve
performance.
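
For offline runs, the variable can also be set from Python before vLLM initializes. This is a minimal sketch, not part of this diff; the model name and the `guided_decoding_backend` value are illustrative:

```python
import os

# Set the thread count before vLLM is imported so the engine picks it up.
# "16" is only an example value; tune it for your host.
os.environ["VLLM_XGRAMMAR_COMPILER_THREADS"] = "16"

from vllm import LLM

# Model name and backend selection are illustrative.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          guided_decoding_backend="xgrammar")
```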
6 changes: 6 additions & 0 deletions vllm/envs.py
@@ -114,6 +114,7 @@
VLLM_ALLOW_INSECURE_SERIALIZATION: bool = False
VLLM_NIXL_SIDE_CHANNEL_HOST: str = "localhost"
VLLM_NIXL_SIDE_CHANNEL_PORT: int = 5557
VLLM_XGRAMMAR_COMPILER_THREADS: int = 8


def get_default_cache_root():
@@ -757,6 +758,11 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
# Port used for NIXL handshake between remote agents.
"VLLM_NIXL_SIDE_CHANNEL_PORT":
lambda: int(os.getenv("VLLM_NIXL_SIDE_CHANNEL_PORT", "5557")),

# Number of threads to use for the XGrammar grammar compiler.
# Defaults to 8 if the environment variable is not set.
"VLLM_XGRAMMAR_COMPILER_THREADS":
lambda: int(os.getenv("VLLM_XGRAMMAR_COMPILER_THREADS", "8")),
}

# end-env-vars-definition
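
The PR title mentions auto-detecting the thread count from CPU cores, while the diff above hard-codes a default of `8`. A rough sketch of how a core-count-based default could be expressed in the same lambda style; the helper name and the cap of 8 are assumptions, not part of this change:

```python
import os

def _default_xgrammar_threads() -> int:
    # Hypothetical helper: use the host's core count, floored at 1 and capped
    # at 8 so a single engine does not claim every core of a large machine.
    return max(1, min(8, os.cpu_count() or 1))

# Same getenv-with-default pattern used in vllm/envs.py above, shown as a
# standalone dict rather than a patch to environment_variables.
environment_variables_sketch = {
    "VLLM_XGRAMMAR_COMPILER_THREADS":
    lambda: int(os.getenv("VLLM_XGRAMMAR_COMPILER_THREADS",
                          str(_default_xgrammar_threads()))),
}

print(environment_variables_sketch["VLLM_XGRAMMAR_COMPILER_THREADS"]())
```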
2 changes: 1 addition & 1 deletion vllm/v1/structured_output/backend_xgrammar.py
@@ -87,7 +87,7 @@ def __init__(self, vllm_config: VllmConfig):
)
self.compiler = xgr.GrammarCompiler(
tokenizer_info,
max_threads=8,
max_threads=vllm.envs.VLLM_XGRAMMAR_COMPILER_THREADS,
cache_enabled=True,
cache_limit_bytes=vllm.envs.VLLM_XGRAMMAR_CACHE_MB * 1024 * 1024,
)
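
Outside vLLM, the same knob can be exercised directly against xgrammar's compiler. A minimal sketch, assuming `TokenizerInfo.from_huggingface` is available in the installed xgrammar version; the model name is illustrative:

```python
import os

import xgrammar as xgr
from transformers import AutoTokenizer

# Mirror the backend: read the env var, falling back to the diff's default of 8.
max_threads = int(os.getenv("VLLM_XGRAMMAR_COMPILER_THREADS", "8"))

# Model name is illustrative; from_huggingface may take extra args in some versions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)

compiler = xgr.GrammarCompiler(tokenizer_info,
                               max_threads=max_threads,
                               cache_enabled=True)
```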