Aggregate trial statistics #112
Conversation
What is the motivation for this change?
Exactly. The motivation is that we have more data from multiple trials that we are not utilizing. By combining the information from all the trials we would get better results, since we have collected more samples.
I think you are raising an important point. Do you think we can assume that the loaded models are warmed up using the server warmup feature? Otherwise, I agree that we should hold off on these changes until an alternative warmup mechanism is introduced in PA. CC @nv-braf
Models may be warmed up using the server warmup feature, but there are other aspects of the inference pipeline that still need warming up: the inference threads in the server endpoint, the messages in the buckets (also in the endpoint), and even the perf_analyzer worker threads. Not all worker threads would be up and running from the get-go, so it might take some time to reach a consistent concurrency.
The point of trials with a stability threshold was to do exactly that. The user provides us a count_window or time_window over which we must measure the statistics. The first window run is most likely the most unstable. The second window run will stabilize a little bit. We use the last 3 (default value) trial windows to detect when stability is achieved, so that we can report the stats from what appears to be the most stable measurement window.
If the average latency and throughput are within the acceptable noise threshold, then we report the Nth window measurement. In an ideal scenario (no noise) these numbers would be identical across all three windows, so it wouldn't matter whether we report one of them or the average of all of them. But if PA was still warming up the pipeline, then the latest window is expected to be the most stable and the most noise-free.
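For illustration, here is a minimal sketch of the stability check described above, written in Python with hypothetical names and defaults (3 windows, 10% threshold); it is not the actual perf_analyzer implementation. The last N window measurements are compared against their mean, and the run counts as stable only when every one of them stays within the noise threshold for both latency and throughput.

```python
# Minimal sketch of the stability check discussed above. The function name,
# data layout, and default values are assumptions for illustration only.

def is_stable(window_metrics, num_windows=3, noise_threshold=0.10):
    """Return True when the last `num_windows` measurements are stable.

    `window_metrics` is a list of (avg_latency, throughput) tuples, one per
    completed measurement window (trial), oldest first.
    """
    if len(window_metrics) < num_windows:
        return False

    recent = window_metrics[-num_windows:]
    latencies = [m[0] for m in recent]
    throughputs = [m[1] for m in recent]

    def within_threshold(values):
        mean = sum(values) / len(values)
        return all(abs(v - mean) <= noise_threshold * mean for v in values)

    # Both latency and throughput must stay within the noise band.
    return within_threshold(latencies) and within_threshold(throughputs)
```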
If we are determining stability based on the last N windows, then we want to include all N windows in the calculation (not ALL windows). If N includes all windows, then we are stable enough and it is OK to include what might be considered 'warmup'. Longer term there is another story to explicitly determine warmup, which will at least always exclude the first window.
@tgerdesnv We are only including the last 3 stable windows in this PR. I think @tanmayv25's point is that there can be cases where the model warmup doesn't break stability but still affects the throughput/latency numbers. Thus, by reporting the last trial we report the perf numbers that are the most warmed-up.
@tanmayv25 I chatted with Brian about the warmup issue. I think the only risk is that we could be under-reporting perf numbers for models that stabilize AND require a warm-up. Most of the warmup issues will be caught by failing to stabilize. @nv-braf mentioned some cases where, using this PR, we could see much more stable results compared to what we currently get from PA.
Merging the trials is equivalent to having a larger window size, and larger sample sizes are expected to have more stable averages. I am fine with making this change if you really need it, but we must document the change in the treatment of measurement windows, as our reported values are now not for the given measurement_window but for "trial * measurement window". Most of the time the user might not even care about it, but we must keep an accurate description.
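To make the "trial * measurement window" point concrete, here is a hypothetical Python sketch (not the actual perf_analyzer code) of what merging the last N trial windows could look like: throughput is recomputed over the combined duration and latency becomes a request-weighted average, which is what measuring over a single window N times larger would give. For windows of roughly equal size this is close to the plain mean of the per-window values.

```python
# Hypothetical sketch of merging trial windows into one aggregate measurement.
# Field and function names are illustrative assumptions, not perf_analyzer's API.

from dataclasses import dataclass

@dataclass
class WindowStats:
    request_count: int      # requests completed in this window
    duration_s: float       # wall-clock length of the window in seconds
    avg_latency_ms: float   # mean request latency observed in the window

def merge_windows(windows):
    """Combine trial windows as if they were one larger measurement window."""
    total_requests = sum(w.request_count for w in windows)
    total_duration = sum(w.duration_s for w in windows)

    # Throughput over the combined window rather than a mean of per-window rates.
    throughput = total_requests / total_duration

    # Latency averaged with each window weighted by the requests it observed.
    avg_latency = (
        sum(w.avg_latency_ms * w.request_count for w in windows) / total_requests
    )
    return throughput, avg_latency

# e.g. merge_windows(all_windows[-3:]) to aggregate the last three trials
```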
@tanmayv25 @Tabrizian Where are the docs to be updated? Just what is printed out from
@tgerdesnv I took a quick look but I couldn't find where the docs need to be updated. @tanmayv25, could you please point me to the location where the docs need updating so that I can fix them?
It's all over in this file. Some points I found:
Here's some data backing up my claim. This was done using a version of PA provided by Matt with his throughput fix. I ran resnet50 ten times with BS=256, C=256, MRC=2560:
You can clearly see how much tighter the reported throughput range is using the mean of the last 3 windows.
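A hypothetical helper for quantifying that comparison (no real measurements are included here; the variable names are placeholders): collect the throughput each repeated run reports under both strategies and compare the range and standard deviation.

```python
# Hypothetical helper for comparing the spread of reported throughput across
# repeated runs; the input lists are placeholders for your own measurements.

import statistics

def spread(reported_throughputs):
    """Return (range, stdev) of per-run reported throughput (needs >= 2 runs)."""
    lo, hi = min(reported_throughputs), max(reported_throughputs)
    return hi - lo, statistics.stdev(reported_throughputs)

# Usage: spread(last_window_runs) vs. spread(mean_of_last_3_runs);
# a tighter range and smaller stdev for the second list supports the claim.
```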
Right now we are using the average. Regarding the documentation update, I think it might be better to keep the definition of "measurement" and "trial" the same as before. I added a paragraph explaining that the numbers reported by PA are the average of the last three trials. |
Minor comments; otherwise the code looks clean to me.
* Aggregate trial statistics
* Fix merging for ensemble models
* Add documentation
* Review edit
Aggregate trial statistics so that PA reports the average of the trials instead of only the last trial.
After

* Ensemble Model: PA Output, Verbose CSV
* Sequence Model: PA Output, Verbose CSV
* Normal Model: PA Output, Verbose CSV

Before

* Ensemble Model: PA Output, Verbose CSV
* Sequence Model: PA Output, Verbose CSV
* Normal Model: PA Output, Verbose CSV