Aggregate trial statistics #112

Merged: Tabrizian merged 4 commits into main from imant-throughput-latency on Jun 8, 2022
Conversation

@Tabrizian (Member) commented Jun 2, 2022

Aggregate trial statistics to report the average across trials instead of reporting only the last trial.
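
For illustration only (this is not the PR's actual implementation; the TrialMeasurement struct and AggregateTrials helper are hypothetical), a minimal C++ sketch of the idea: average each trial's throughput and latency instead of keeping only the last trial's numbers. The inputs are the three pass measurements from the ensemble example below.

```cpp
#include <iostream>
#include <vector>

// Hypothetical per-trial record, not PA's real data structure.
struct TrialMeasurement {
  double throughput_infer_per_sec;
  double avg_latency_usec;
};

// Average the per-trial throughput and latency across all trials.
TrialMeasurement AggregateTrials(const std::vector<TrialMeasurement>& trials) {
  TrialMeasurement merged{0.0, 0.0};
  for (const auto& t : trials) {
    merged.throughput_infer_per_sec += t.throughput_infer_per_sec;
    merged.avg_latency_usec += t.avg_latency_usec;
  }
  merged.throughput_infer_per_sec /= trials.size();
  merged.avg_latency_usec /= trials.size();
  return merged;
}

int main() {
  // Pass [1..3] numbers from the ensemble "After" example below.
  std::vector<TrialMeasurement> trials{
      {269.4, 3705}, {267.4, 3733}, {268.6, 3714}};
  const TrialMeasurement merged = AggregateTrials(trials);
  // Prints roughly: 268.467 infer/sec, 3717.33 usec avg latency,
  // matching the aggregated client numbers reported below.
  std::cout << merged.throughput_infer_per_sec << " infer/sec, "
            << merged.avg_latency_usec << " usec avg latency\n";
}
```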

After

Ensemble Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 269.4 infer/sec. Avg latency: 3705 usec (std 220 usec)
  Pass [2] throughput: 267.4 infer/sec. Avg latency: 3733 usec (std 239 usec)
  Pass [3] throughput: 268.6 infer/sec. Avg latency: 3714 usec (std 228 usec)
  Client:
    Request count: 4027
    Throughput: 268.467 infer/sec
    Avg latency: 3717 usec (standard deviation 229 usec)
    p50 latency: 3736 usec
    p90 latency: 3983 usec
    p95 latency: 4010 usec
    p99 latency: 4049 usec
    Avg HTTP time: 3699 usec (send 148 usec + response wait 3550 usec + receive 1 usec)
  Server:
    Inference count: 4809
    Execution count: 4809
    Successful request count: 4809
    Avg request latency: 3089 usec (overhead 297 usec + queue 188 usec + compute 2604 usec)

  Composing models:
  add_sub_1, version:
      Inference count: 4809
      Execution count: 4809
      Successful request count: 4809
      Avg request latency: 1505 usec (overhead 167 usec + queue 87 usec + compute input 165 usec + compute infer 797 usec + compute output 288 usec)

  add_sub_2, version:
      Inference count: 4809
      Execution count: 4809
      Successful request count: 4809
      Avg request latency: 1624 usec (overhead 170 usec + queue 101 usec + compute input 164 usec + compute infer 813 usec + compute output 375 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 268.467 infer/sec, latency 3717 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,268.467,148,956,6,330,1610,664,1,3736,3983,4010,4049,3717,149,3550

Sequence Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 1546 infer/sec. Avg latency: 625 usec (std 37 usec)
  Pass [2] throughput: 1525.6 infer/sec. Avg latency: 632 usec (std 40 usec)
  Pass [3] throughput: 1512.8 infer/sec. Avg latency: 636 usec (std 55 usec)
  Client:
    Request count: 22922
    Sequence count: 1144 (76.2667 seq/sec)
    Throughput: 1528.13 infer/sec
    Avg latency: 631 usec (standard deviation 45 usec)
    p50 latency: 625 usec
    p90 latency: 653 usec
    p95 latency: 694 usec
    p99 latency: 825 usec
    Avg HTTP time: 596 usec (send 34 usec + response wait 562 usec + receive 0 usec)
  Server:
    Inference count: 27542
    Execution count: 27542
    Successful request count: 27542
    Avg request latency: 277 usec (overhead 57 usec + queue 58 usec + compute input 66 usec + compute infer 74 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1528.13 infer/sec, latency 631 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,1528.13,34,376,58,66,74,21,0,625,653,694,825,631,34,562

Normal Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 489.8 infer/sec. Avg latency: 2033 usec (std 206 usec)
  Pass [2] throughput: 486 infer/sec. Avg latency: 2050 usec (std 172 usec)
  Pass [3] throughput: 490.8 infer/sec. Avg latency: 2030 usec (std 200 usec)
  Client:
    Request count: 7333
    Throughput: 488.867 infer/sec
    Avg latency: 2038 usec (standard deviation 194 usec)
    p50 latency: 2090 usec
    p90 latency: 2236 usec
    p95 latency: 2279 usec
    p99 latency: 2342 usec
    Avg HTTP time: 2000 usec (send 142 usec + response wait 1857 usec + receive 1 usec)
  Server:
    Inference count: 8806
    Execution count: 8806
    Successful request count: 8806
    Avg request latency: 1395 usec (overhead 167 usec + queue 80 usec + compute input 157 usec + compute infer 785 usec + compute output 206 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 488.867 infer/sec, latency 2038 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,488.867,142,664,80,157,785,206,1,2090,2236,2279,2342,2038,143,1857

Before

Ensemble Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 269.8 infer/sec. Avg latency: 3697 usec (std 230 usec)
  Pass [2] throughput: 266.8 infer/sec. Avg latency: 3739 usec (std 271 usec)
  Pass [3] throughput: 269.4 infer/sec. Avg latency: 3703 usec (std 267 usec)
  Client:
    Request count: 1347
    Throughput: 269.4 infer/sec
    Avg latency: 3703 usec (standard deviation 267 usec)
    p50 latency: 3740 usec
    p90 latency: 3986 usec
    p95 latency: 4005 usec
    p99 latency: 4070 usec
    Avg HTTP time: 3671 usec (send 141 usec + response wait 3529 usec + receive 1 usec)
  Server:
    Inference count: 1616
    Execution count: 1616
    Successful request count: 1616
    Avg request latency: 3075 usec (overhead 294 usec + queue 190 usec + compute 2591 usec)

  Composing models:
  add_sub_1, version:
      Inference count: 1616
      Execution count: 1616
      Successful request count: 1616
      Avg request latency: 1492 usec (overhead 166 usec + queue 88 usec + compute input 164 usec + compute infer 787 usec + compute output 286 usec)

  add_sub_2, version:
      Inference count: 1616
      Execution count: 1616
      Successful request count: 1616
      Avg request latency: 1622 usec (overhead 167 usec + queue 102 usec + compute input 164 usec + compute infer 815 usec + compute output 374 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 269.4 infer/sec, latency 3703 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,269.4,141,962,6,328,1603,660,1,3740,3986,4005,4070,3703,142,3529

Sequence Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 1541.8 infer/sec. Avg latency: 626 usec (std 33 usec)
  Pass [2] throughput: 1515.2 infer/sec. Avg latency: 636 usec (std 51 usec)
  Pass [3] throughput: 1510.4 infer/sec. Avg latency: 638 usec (std 49 usec)
  Client:
    Request count: 7552
    Sequence count: 379 (75.8 seq/sec)
    Throughput: 1510.4 infer/sec
    Avg latency: 638 usec (standard deviation 49 usec)
    p50 latency: 628 usec
    p90 latency: 677 usec
    p95 latency: 710 usec
    p99 latency: 877 usec
    Avg HTTP time: 601 usec (send 36 usec + response wait 565 usec + receive 0 usec)
  Server:
    Inference count: 9071
    Execution count: 9071
    Successful request count: 9071
    Avg request latency: 278 usec (overhead 58 usec + queue 59 usec + compute input 65 usec + compute infer 74 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1510.4 infer/sec, latency 638 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,1510.4,36,380,59,65,74,21,0,628,677,710,877,638,36,565

Normal Model

PA Output

*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 492 infer/sec. Avg latency: 2024 usec (std 212 usec)
  Pass [2] throughput: 490.8 infer/sec. Avg latency: 2030 usec (std 174 usec)
  Pass [3] throughput: 495 infer/sec. Avg latency: 2012 usec (std 212 usec)
  Client:
    Request count: 2475
    Throughput: 495 infer/sec
    Avg latency: 2012 usec (standard deviation 212 usec)
    p50 latency: 2063 usec
    p90 latency: 2221 usec
    p95 latency: 2272 usec
    p99 latency: 2332 usec
    Avg HTTP time: 1978 usec (send 141 usec + response wait 1836 usec + receive 1 usec)
  Server:
    Inference count: 2969
    Execution count: 2969
    Successful request count: 2969
    Avg request latency: 1376 usec (overhead 165 usec + queue 79 usec + compute input 156 usec + compute infer 772 usec + compute output 204 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 495 infer/sec, latency 2012 usec

Verbose CSV

Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency,Avg latency,request/response,response wait
1,495,141,656,79,156,772,204,1,2063,2221,2272,2332,2012,142,1836

@Tabrizian Tabrizian force-pushed the imant-throughput-latency branch from c3937a4 to 634a579 Compare June 2, 2022 22:41
@Tabrizian Tabrizian force-pushed the imant-throughput-latency branch from 598cb3c to 77f1f4b Compare June 3, 2022 18:11
@Tabrizian Tabrizian marked this pull request as ready for review June 3, 2022 18:59
@tanmayv25 (Contributor)

What is the motivation for this change?
Instead of reporting inference statistics from a single trial (count_window or time_window), it looks like we are now reporting stats aggregated over trials × count_window or time_window. Wouldn't the latest trial run be the most stable, with everything warmed up?

@tanmayv25 tanmayv25 requested a review from GuanLuo June 3, 2022 19:23
@Tabrizian (Member, Author) commented Jun 3, 2022

Instead of reporting inference statistics from a single trial (count_window or time_window), it looks like we are now reporting stats aggregated over trials × count_window or time_window.

Exactly. The motivation is that we have data from multiple trials that we are not utilizing. By combining the information from all of the trials we get better results, since we have collected more samples.

Wouldn't the latest trial run be the most stable, with everything warmed up?

I think you are raising an important point. Do you think we can assume that the loaded models are warmed up using the server warm-up feature? Otherwise, I agree that we should hold off on these changes until an alternative warmup mechanism is introduced in PA.

CC @nv-braf

@tanmayv25 (Contributor) commented Jun 3, 2022

Do you think we can assume that the loaded models are warmed up using the server warm-up feature?

Models may be warmed up using the server warmup feature, but other parts of the inference pipeline still need warming up: the inference threads in the server endpoint, the messages in the buckets (also in the endpoint), and even the perf_analyzer worker threads. Not all worker threads will be up and running from the get-go, so it can take some time to reach a consistent concurrency.

until an alternative warmup mechanism is introduced in PA.

The point of trials with a stability threshold was to do exactly that. The user provides a count_window or time_window over which we must measure the statistics. The first window run is most likely the least stable; the second window run stabilizes a little more. We use the last 3 (default value) trial windows to detect when stability has been achieved, so that we can report the stats from what should be the most stable measurement window:

(N-2)th window, (N-1)th window, Nth window

If the average latency and throughput are within the acceptable noise threshold, we report the Nth window's measurement. If we instead report our throughput and latency from all of these windows, I don't think the definition of the measurement window holds any meaning.

In an ideal (noise-free) scenario these numbers should be identical across all three windows, so it makes no difference whether we report one of them or the average of all of them. But if PA is still warming up the pipeline, the latest window should be the most stable and least noisy.
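
As an editorial aside, here is a minimal C++ sketch of the stability check described above. It is not the real inference_profiler code; the Window struct, the IsStable helper, and the 10% threshold are illustrative assumptions. The last n windows are considered stable when each window's throughput and average latency stay within a noise percentage of their mean.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical per-window summary, not PA's real record layout.
struct Window {
  double throughput;
  double avg_latency;
};

// Stable if every one of the last n windows is within noise_pct of the mean
// of those n windows, for both throughput and average latency.
bool IsStable(const std::vector<Window>& windows, std::size_t n,
              double noise_pct) {
  if (windows.size() < n) return false;
  double tp_mean = 0.0, lat_mean = 0.0;
  for (std::size_t i = windows.size() - n; i < windows.size(); ++i) {
    tp_mean += windows[i].throughput;
    lat_mean += windows[i].avg_latency;
  }
  tp_mean /= n;
  lat_mean /= n;
  for (std::size_t i = windows.size() - n; i < windows.size(); ++i) {
    if (std::fabs(windows[i].throughput - tp_mean) / tp_mean > noise_pct ||
        std::fabs(windows[i].avg_latency - lat_mean) / lat_mean > noise_pct) {
      return false;
    }
  }
  return true;
}

int main() {
  // Pass [1..3] from the sequence-model "After" example, 10% example threshold.
  std::vector<Window> windows{{1546, 625}, {1525.6, 632}, {1512.8, 636}};
  std::cout << std::boolalpha << IsStable(windows, 3, 0.10) << "\n";  // true
}
```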

@tgerdesnv (Collaborator)

If we are determining stability based on the last N windows, then we want to include all N of those windows in the calculation (not ALL windows). If those N windows happen to cover every window, then we are stable enough and it is OK to include what might be considered 'warmup'. Longer term there is another story to explicitly determine warmup, which will at least always exclude the first window.

@Tabrizian (Member, Author) commented Jun 6, 2022

@tgerdesnv We are only including the last 3 stable windows in this PR. I think @tanmayv25's point is that there can be cases where model warmup doesn't break stability but still affects the throughput/latency numbers; by reporting only the last trial, we report the most warmed-up perf numbers.

@Tabrizian (Member, Author)

@tanmayv25 I chatted with Brian about the warmup issue. I think the only risk is that we could under-report perf numbers for models that stabilize AND require a warm-up; most warmup issues will be caught by the run failing to stabilize. @nv-braf mentioned some cases where, using this PR, we see much more stable results than we currently get from PA.

@tanmayv25 (Contributor)

@nv-braf mentioned some cases where, using this PR, we see much more stable results than we currently get from PA.

Merging the trials is equivalent to having a larger window size, and a larger sample size is expected to produce more stable averages. I am fine with making this change if you really need it, but we must document the change in the treatment of measurement windows: our reported values are now not for the given measurement_window but for "trials * measurement window". Most of the time the user might not even care, but we must keep the description accurate.
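
A hedged sketch of this "larger window" equivalence (the Window struct and MergeWindows helper are hypothetical, not PA's real data structures): summing the raw per-window totals and recomputing throughput and average latency is the same as measuring one window that is trials times as long. The request counts below are derived from each pass's throughput times the 5-second window in the ensemble "After" example near the top of this PR; the merged numbers reproduce the reported 268.467 infer/sec and ~3717 usec.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical raw per-window totals.
struct Window {
  std::uint64_t request_count;
  double duration_sec;
  double latency_sum_usec;  // sum of per-request latencies in the window
};

// Concatenate k windows; deriving throughput and average latency from the
// summed totals is equivalent to measuring one window k times as long.
Window MergeWindows(const std::vector<Window>& windows) {
  Window merged{0, 0.0, 0.0};
  for (const auto& w : windows) {
    merged.request_count += w.request_count;
    merged.duration_sec += w.duration_sec;
    merged.latency_sum_usec += w.latency_sum_usec;
  }
  return merged;
}

int main() {
  // Three 5-second windows: request counts = throughput * 5 sec,
  // latency sums reconstructed as count * per-pass average latency.
  std::vector<Window> windows{{1347, 5.0, 1347 * 3705.0},
                              {1337, 5.0, 1337 * 3733.0},
                              {1343, 5.0, 1343 * 3714.0}};
  const Window m = MergeWindows(windows);
  // Prints roughly: 268.467 infer/sec, 3717.3 usec avg latency
  std::cout << m.request_count / m.duration_sec << " infer/sec, "
            << m.latency_sum_usec / m.request_count << " usec avg latency\n";
}
```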

@tgerdesnv (Collaborator)

@tanmayv25 @Tabrizian Where are the docs that need to be updated? Just what is printed from --help, or is there somewhere else as well? @debermudez is already about to update some of the --help docs in #111.

@Tabrizian (Member, Author)

@tgerdesnv I took a quick look but couldn't find where the docs need to be updated. @tanmayv25, could you please point me to the locations that need updating so that I can fix them?

@tanmayv25 (Contributor)

It's all over this file. Some places I found:
https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/perf_analyzer.cc#L62
https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/perf_analyzer.cc#L359
https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/perf_analyzer.cc#L490
All in all, take a close look at the places where we talk about latencies and throughput. Also, when checking against the latency threshold, should we use the average of all three trials or still the latency of the latest window?

@nv-braf (Contributor) commented Jun 7, 2022

Here's some data backing up my claim. This was done using a version of PA provided by Matt with his throughput fix. I ran resnet50 ten times with BS=256, C=256, MRC=2560:

  • Avg. throughput reported was 925 +/- 20 (range from 899-963)
  • Mean of last 3 windows avg. throughput was 914 +/- 5 (range from 908-923)

You can clearly see how much tighter the reported throughput range is using the mean of the last 3 windows.

@Tabrizian (Member, Author)

All in all, take a close look at the places where we talk about latencies and throughput. Also, when checking against the latency threshold, should we use the average of all three trials or still the latency of the latest window?

Right now we are using the average.

Regarding the documentation update, I think it might be better to keep the definition of "measurement" and "trial" the same as before. I added a paragraph explaining that the numbers reported by PA are the average of the last three trials.

@Tabrizian Tabrizian force-pushed the imant-throughput-latency branch from 151e514 to b20370d Compare June 8, 2022 17:00
@Tabrizian Tabrizian force-pushed the imant-throughput-latency branch from b20370d to b66ec66 Compare June 8, 2022 17:02
@matthewkotila (Contributor) left a comment

edit: just testing GitHub, ignore

@matthewkotila matthewkotila self-requested a review June 8, 2022 17:24
@tanmayv25 (Contributor) left a comment

Minor comments; otherwise the code looks clean to me.

Review comments (resolved) on:
src/c++/perf_analyzer/inference_profiler.h
src/c++/perf_analyzer/perf_analyzer.cc
@Tabrizian Tabrizian requested a review from tanmayv25 June 8, 2022 18:19
@Tabrizian Tabrizian merged commit 82986cf into main Jun 8, 2022
@Tabrizian Tabrizian deleted the imant-throughput-latency branch June 8, 2022 20:35
mc-nv pushed a commit that referenced this pull request Jun 13, 2022
* Aggregate trial statistics

* Fix merging for ensemble models

* Add documentation

* review edit