Description
Anything you want to discuss about vllm.
Hi,
I am running some benchmarks on the vllm.entrypoints.openai.api_server, measuring latency and throughput with different numbers of concurrent requests.
Specs:
- H100 80GB
- qwen-1.5-14B-chat
I am sending 1000 requests with random prompts of token length 512. These are the results I get (see attached image):
Guided_json
- ~100 running requests
- ~70 generation tokens per second
- ~1700 ms median token time (TPOT)
Non-guided_json
- ~100 running requests
- ~800 generation tokens per second
- ~75 ms median token time (TPOT)
At 10 concurrent requests (GPU utilization << 100%):
- Non-guided_json: ~20 ms median token time
- Guided_json: ~160 ms median token time
Currently, the application I am building relies heavily on guided_json. However, to put it into an online setting, I would like to ask: 1) are the numbers I am seeing sensible, and 2) what can be done to improve performance in the guided_json paradigm?
I am debating whether I should try to prompt my way to structured outputs and thus avoid constrained decoding altogether.
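For reference, here is a minimal sketch of how each benchmark request is issued, toggling guided_json on or off through the OpenAI-compatible endpoint (passed via extra_body, which is how I understand vLLM accepts it). The schema, prompt, base_url, and max_tokens below are placeholders, not my actual benchmark harness:

```python
# Minimal sketch of a single benchmark request against vLLM's
# OpenAI-compatible server. Schema, prompt, and base_url are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder JSON schema; the real one comes from my application.
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "score": {"type": "number"},
    },
    "required": ["answer", "score"],
}

def send_request(prompt: str, guided: bool):
    """Send one chat completion; include guided_json only in the guided case."""
    extra_body = {"guided_json": SCHEMA} if guided else {}
    return client.chat.completions.create(
        model="qwen-1.5-14B-chat",  # model name as served by the api_server
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        extra_body=extra_body,
    )

# Example: one guided and one non-guided request with the same prompt.
resp_guided = send_request("Summarize the following text as JSON: ...", guided=True)
resp_plain = send_request("Summarize the following text as JSON: ...", guided=False)
print(resp_guided.choices[0].message.content)
print(resp_plain.choices[0].message.content)
```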
