We need to make sure that what we measure reflects real endpoint performance. To do so, we will cross-compare the inference endpoint's reported perf/latency with the following harnesses, run against the same endpoint:
- MLPerf LoadGen (use 5.1 submission harness)
- benchmark_serving.py used by SemiAnalysis
- (Tentative) AIPerf
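One subtle source of disagreement between harnesses is the percentile convention itself: nearest-rank versus linear-interpolation p99 can differ noticeably on small sample counts, independent of the endpoint. The helper below is a minimal sketch of a nearest-rank summary we could apply uniformly to raw per-request latencies from each harness; it is an illustrative assumption, not code taken from LoadGen, benchmark_serving.py, or AIPerf.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: sort samples, take index ceil(pct/100 * n) - 1.

    `samples` is a list of per-request latencies (e.g. seconds).
    """
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank - 1, 0)]


def summarize(latencies):
    """Summary stats to compare across harnesses on identical raw samples."""
    return {
        "mean": sum(latencies) / len(latencies),
        "p50": percentile_nearest_rank(latencies, 50),
        "p99": percentile_nearest_rank(latencies, 99),
    }
```

If we export raw per-request latencies from each harness and feed them through one shared summary like this, any remaining p99 gap is attributable to the load pattern or measurement point, not the statistics.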