
[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization #3567

Closed as not planned

Description

@jens-create

Anything you want to discuss about vllm.

Hi,

I am running some benchmarks on vllm.entrypoints.openai.api_server, measuring latency and throughput with different numbers of concurrent requests.

Specs:

  • H100 80GB
  • qwen-1.5-14B-chat

I am sending 1000 requests with random prompts of token length 512 (a sketch of the load generator follows the results). These are the results I get (see the attached screenshot):

Guided_json

  • ~100 running requests
  • ~70 generation tokens per second
  • ~1700 ms median time per output token (TPOT)

Non-guided_json

  • ~100 running requests
  • ~800 generation tokens per second
  • ~75 ms median time per output token (TPOT)

At 10 concurrent requests (GPU utilization << 100%):

  • Non-guided_json: ~20 ms median TPOT
  • guided_json: ~160 ms median TPOT
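For reference, here is a minimal sketch of the load generator (the port, model name, and `max_tokens` are illustrative; it assumes the `openai` Python client pointed at the local vLLM server, and in the real benchmark the prompts are random ~512-token strings):

```python
import asyncio
import statistics
import time

from openai import AsyncOpenAI

# Illustrative setup: vLLM's OpenAI-compatible server on localhost:8000.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def timed_request(prompt: str) -> float:
    """Stream one completion and return its median inter-token gap (TPOT)."""
    stream = await client.completions.create(
        model="qwen-1.5-14B-chat",
        prompt=prompt,
        max_tokens=256,
        stream=True,
    )
    gaps, last = [], time.perf_counter()
    async for _ in stream:
        now = time.perf_counter()
        gaps.append(now - last)
        last = now
    # Drop the first gap: it measures time-to-first-token, not TPOT.
    return statistics.median(gaps[1:]) if len(gaps) > 1 else float("nan")

async def main(concurrency: int = 100, total: int = 1000) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> float:
        async with sem:
            return await timed_request(prompt)

    prompts = ["..."] * total  # random ~512-token prompts in the real benchmark
    tpots = await asyncio.gather(*(bounded(p) for p in prompts))
    print(f"median TPOT: {statistics.median(tpots) * 1000:.1f} ms")

asyncio.run(main())
```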

Currently, the application I am building relies heavily on guided_json. However, to put it in an online setting, I would like to ask: 1) are the numbers I am seeing sensible, and 2) what can be done to improve performance in the guided_json paradigm?

I am debating whether I should try to prompt my way to structured outputs and thus avoid constrained decoding entirely.
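For concreteness, the two variants I am comparing look roughly like this (the schema is illustrative; as I understand it, vLLM's OpenAI-compatible server accepts `guided_json` through `extra_body`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative schema; the real one is larger.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# Constrained decoding: the server masks logits so output must match the schema.
guided = client.chat.completions.create(
    model="qwen-1.5-14B-chat",
    messages=[{"role": "user", "content": "Extract the person mentioned."}],
    extra_body={"guided_json": schema},
)

# Prompt-only alternative: ask for JSON and validate (and retry) client-side.
prompted = client.chat.completions.create(
    model="qwen-1.5-14B-chat",
    messages=[{
        "role": "user",
        "content": "Extract the person mentioned. "
                   f"Reply with JSON matching this schema: {schema}",
    }],
)
```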

[Screenshot: benchmark results, 2024-03-22]
