Description
Anything you want to discuss about vllm.
Hi,
I am running some benchmarks on the vllm.entrypoints.openai.api_server, measuring latency and throughput with different numbers of concurrent requests.
Specs:
- H100 80GB
- qwen-1.5-14B-chat
I am sending 1000 requests with random prompts of token length 512. These are the results I get (see attached image):
Guided_json
- ~100 running requests
- ~70 generation tokens per second
- ~1700 ms median token time (TPOT)
Non-guided_json
- ~100 running requests
- ~800 generation tokens per second
- ~75 ms median token time (TPOT)
At 10 concurrent requests (GPU utilization << 100%):
- Non-guided_json: ~20 ms median token time
- Guided_json: ~160 ms median token time
Currently, the application I am building relies heavily on guided_json. However, to put it into an online setting, I would like to ask: 1) are the numbers I am seeing sensible, and 2) what can be done to improve performance in the guided_json paradigm?
I am debating whether I should try to prompt my way to structured outputs and thus avoid constrained decoding altogether.
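For reference, here is a minimal sketch of how each benchmark request is issued, toggling guided_json on or off through the OpenAI-compatible endpoint (passed via extra_body, which is how I understand vLLM accepts it). The schema, prompt, base_url, and max_tokens below are placeholders, not my actual benchmark harness:

```python
# Minimal sketch of a single benchmark request against vLLM's
# OpenAI-compatible server. Schema, prompt, and base_url are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder JSON schema; the real one comes from my application.
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "score": {"type": "number"},
    },
    "required": ["answer", "score"],
}

def send_request(prompt: str, guided: bool):
    """Send one chat completion; include guided_json only in the guided case."""
    extra_body = {"guided_json": SCHEMA} if guided else {}
    return client.chat.completions.create(
        model="qwen-1.5-14B-chat",  # model name as served by the api_server
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        extra_body=extra_body,
    )

# Example: one guided and one non-guided request with the same prompt.
resp_guided = send_request("Summarize the following text as JSON: ...", guided=True)
resp_plain = send_request("Summarize the following text as JSON: ...", guided=False)
print(resp_guided.choices[0].message.content)
print(resp_plain.choices[0].message.content)
```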
