Perf: analyze and reach parity with MLPerf LoadGen, vLLM benchmark_serving (and backend metric server) #8

@nvzhihanj

Description

We need to make sure that what we measure reflects the real endpoint performance.

We need to cross-compare the inference endpoint's reported performance/latency against the following tools, all running against the same endpoint:

  • MLPerf LoadGen (using the 5.1 submission harness)
  • benchmark_serving.py used by SemiAnalysis
  • (Tentative) AIPerf
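A comparison across these harnesses is only meaningful if every tool is reduced to the same metric definitions (TTFT, TPOT, end-to-end latency) computed from per-request timestamps. Below is a minimal sketch of such a normalization layer; the record fields, function names, and the choice of p99 are illustrative assumptions, not taken from any of the harnesses above.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    """Per-request timestamps in seconds; field names are illustrative."""
    sent: float          # when the request was issued
    first_token: float   # when the first output token arrived
    done: float          # when the last output token arrived
    n_output_tokens: int

def ttft(r: RequestRecord) -> float:
    """Time to first token."""
    return r.first_token - r.sent

def tpot(r: RequestRecord) -> float:
    """Time per output token, excluding the first token."""
    if r.n_output_tokens <= 1:
        return 0.0
    return (r.done - r.first_token) / (r.n_output_tokens - 1)

def e2e(r: RequestRecord) -> float:
    """End-to-end request latency."""
    return r.done - r.sent

def p99(values):
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    return quantiles(values, n=100)[-1]

def summarize(records):
    """Collapse a run into the headline metrics used for cross-tool parity."""
    return {
        "ttft_p99": p99([ttft(r) for r in records]),
        "tpot_p99": p99([tpot(r) for r in records]),
        "e2e_p99": p99([e2e(r) for r in records]),
    }
```

Feeding each harness's raw per-request log through the same `summarize` (rather than trusting each tool's own aggregates) removes one common source of apparent disagreement: differing percentile methods and differing definitions of when a request "starts".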

Metadata

Assignees

Labels

  • area: core-engine (Load generator, scheduler, async utils)
  • priority: ShowStopper (Drop everything: critical blocker, all hands on deck)
  • type: performance (Performance regression or improvement)
