We need to make sure that what we measure reflects real endpoint performance. To do so, we will cross-compare the inference endpoint's reported perf/latency with the following harnesses, run against the same endpoint:
- MLPerf LoadGen (use 5.1 submission harness)
- benchmark_serving.py used by SemiAnalysis
- (Tentative) AIPerf
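One subtle source of disagreement between harnesses is the percentile convention itself: nearest-rank versus linear-interpolation p99 can differ noticeably on small sample counts, independent of the endpoint. The helper below is a minimal sketch of a nearest-rank summary we could apply uniformly to raw per-request latencies from each harness; it is an illustrative assumption, not code taken from LoadGen, benchmark_serving.py, or AIPerf.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: sort samples, take index ceil(pct/100 * n) - 1.

    `samples` is a list of per-request latencies (e.g. seconds).
    """
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank - 1, 0)]


def summarize(latencies):
    """Summary stats to compare across harnesses on identical raw samples."""
    return {
        "mean": sum(latencies) / len(latencies),
        "p50": percentile_nearest_rank(latencies, 50),
        "p99": percentile_nearest_rank(latencies, 99),
    }
```

If we export raw per-request latencies from each harness and feed them through one shared summary like this, any remaining p99 gap is attributable to the load pattern or measurement point, not the statistics.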