x/perf/cmd/benchstat: tips or quickstart for newcomers #23471

Open
mvdan opened this issue Jan 18, 2018 · 18 comments
Labels: Documentation, NeedsInvestigation

@mvdan
Member

mvdan commented Jan 18, 2018

benchstat is a very useful tool, but if you're not familiar with what it does, it can be very confusing to use for the first time.

One such example is the question of how many times the benchmarks should be run. Someone used to benchcmp, running each benchmark exactly once before and after a change, will try benchstat and get confusing output like:

name                old time/op    new time/op    delta
Decode-4               2.20s ± 0%     1.54s ± 0%   ~     (p=1.000 n=1+1)

The answer here is that the user should be running the benchmark more times - at least 3 or 4 to get p-values low enough for a result.

However, neither benchstat -h nor the godoc page is very clear on this, and neither has a "quickstart" guide. The godoc page does show an example input with many runs, and does talk about "a number of runs" and p-values, but it's not very clear if you're not familiar with statistics and benchmarking.

I believe that a quick guide would greatly improve the usability of the tool - for example:

$ go test -bench=. -count 5 >old.txt
$ <apply source changes>
$ go test -bench=. -count 5 >new.txt
$ benchstat old.txt new.txt

I think it should also introduce other best practices, such as:

  • Using higher -count values if the benchmark numbers aren't stable
  • Using -benchmem to also get stats on allocated objects and space
  • Running the benchmarks on an idle machine not running on battery (and with power management off?)
  • Adding -run='$^' or -run=- to each go test command to avoid running the tests too
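
For illustration, the quickstart above with those flags folded in would look something like:

$ go test -run='^$' -bench=. -benchmem -count=5 >old.txt
$ <apply source changes>
$ go test -run='^$' -bench=. -benchmem -count=5 >new.txt
$ benchstat old.txt new.txt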

I realise that some of these tips are more about benchmarking than about benchstat itself. But I think it's fine to have them all in one place, since in general you're going to be using benchstat as part of that workflow anyway.

/cc @rsc @ALTree @aclements @AlekSi

@aclements
Member

These all sound like good things to include in the docs to me. If I don't get around to this myself, I'd be happy to review a CL.

I'd recommend doing at least 10 runs, if not 20 (where practical). Otherwise you might be tempted to compare them, see that the p-values are too high, and then get more samples, falling prey to the multiple testing problem. (I'd love for benchstat to support some form of sequential hypothesis testing, but alas.)

@AlekSi
Contributor

AlekSi commented Jan 18, 2018

Daniel, thank you for creating this issue!

May I also request a short warning on stderr if the number of runs is not enough? Similar to the "no test files" message from go test.

@ALTree
Member

ALTree commented Jan 18, 2018

Where would we put this? I'd say the README.md file in the repo may be a good place, since it's already quite wordy but at the moment I think it's not terribly useful as an introduction to benchstat. It could be re-organized into a small tutorial-style document.

@mvdan
Member Author

mvdan commented Jan 18, 2018

Well, I was thinking just the package godoc. That's where the current information on the tool is.

@mvdan
Member Author

mvdan commented Jan 18, 2018

@AlekSi that might work, but it's trickier than it may sound. What should the minimum be? If the numbers are very stable (for example, the number of allocations for a deterministic program tends to be), 5 runs may be enough. Or you might not get any low p-values even if you do 20 runs.

However, such a warning could be added for 1 run, as that seems to be a common mistake.

@aclements 10 runs is probably better general advice. My default is 6 simply because I tend to split the benchmarks into multiple sub-benchmarks, meaning that even just 6 runs stalls me for a good minute. And I tend to go after substantial performance wins, in which case fewer than 10 runs is often enough.

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

@aclements
Member

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with -count 5 than once with -count 10, because with the former you are more likely to get wrong results?

Benchmarking twice with -count 5 is fine, as long as you don't look at the results after the first run. :) (In fact, just in terms of running benchmarks, it's better to alternate individual runs back and forth in case there are systemic effects.)

The multiple testing problem arises when you look at the results and then decide whether to run more iterations based on the p-value. Since you'll be happy and stop if the p-value is small (or sad, if it got worse, but the same argument applies :), you bias toward accepting a "significant" result. But with a p-value of 0.05, you have a 5% chance that any random result will be "significant". In the extreme, if you imagine that you just keep running iterations until p < 0.05, you'll eventually get what you're looking for, even if it's not real.

There are ways to deal with this. For example, you can keep raising the p-value threshold each time you look at the results (the "Bonferroni correction"). But this sort of thing gets really complicated really fast, so the easiest rule is just to pick an n and stick with it. :)

In fact! The exact same problem arises if you run a whole bunch of separate benchmarks and look at whether any of them changed. You should expect 5% of them to show a statistically significant change, even if it's just noise. I don't think all of the sequential testing stuff I said above is worth explaining in the benchstat docs, but it might be worth mentioning this.
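
To put a rough number on that last point: with a significance threshold of 0.05, the chance of at least one false positive across k independent benchmark comparisons is about 1 - (1 - 0.05)^k, so for a set of 10 benchmarks that's roughly 1 - 0.95^10 ≈ 40% even when nothing actually changed. (That's just the standard family-wise error rate back-of-the-envelope, nothing specific to benchstat.)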

@mvdan
Member Author

mvdan commented Jan 18, 2018

You should expect 5% of them to show a statistically significant change, even if it's just noise.

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input. That would also make it easier to use higher -count values.

This should definitely be an item in the best practices list :)

@dgryski
Contributor

dgryski commented Jan 18, 2018

I find git stash helps the workflow while benchmarking old vs new code.

@mvdan
Member Author

mvdan commented Jan 18, 2018

I use git stash for this too, but I have found that it can be confusing for someone who isn't used to it. Perhaps it could be another one of the tips, rather than part of the quickstart.
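
For reference, a minimal sketch of that workflow, assuming the changes being measured are still uncommitted in the working tree:

$ go test -run='^$' -bench=. -count=5 >new.txt   # benchmark the modified tree
$ git stash                                      # set the changes aside
$ go test -run='^$' -bench=. -count=5 >old.txt   # benchmark the original tree
$ git stash pop                                  # bring the changes back
$ benchstat old.txt new.txt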

@aclements
Member

Welp, I'm likely doing this wrong now then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input.

That's not necessarily wrong, you just have to understand that there's always a false positive rate. Even if it's just a single benchmark, there's a 5% chance of benchstat reporting noise as a significant change. It's just that if you have a whole benchmark set and each individual benchmark has a 5% chance of a false positive, then there's a higher collective chance that some benchmark in the set is going to have a false positive.

@mvdan
Member Author

mvdan commented Jan 20, 2018

Another piece of advice could be whether to do go test -bench=... -count=N or for i in {1..N}; do go test -bench=...; done.

I assume calling go test more times is slower, though. I wonder if an option to alternate benchmark runs would be a good idea, as -count just runs the same benchmark multiple times in a row at the moment.

@dgryski
Contributor

dgryski commented Jan 20, 2018

Rather than calling go test multiple times, run go test -c and run the binary multiple times. If you have an original test binary you don't need to worry about swapping back and forth between source code changes with git stash.
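
A rough sketch of that approach, combined with alternating the runs as suggested earlier (binary and file names are just placeholders, and git stash is only used once here, to build the old binary):

$ git stash && go test -c -o bench.old && git stash pop
$ go test -c -o bench.new
$ rm -f old.txt new.txt
$ for i in {1..10}; do ./bench.old -test.run='^$' -test.bench=. >>old.txt; ./bench.new -test.run='^$' -test.bench=. >>new.txt; done
$ benchstat old.txt new.txt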

@josharian
Contributor

I wonder if an option to alternate benchmark runs would be a good idea, as -count just runs the same benchmark multiple times in a row at the moment.

I believe that e.g. -cpu=8,8,8,8 should cycle through the benchmarks four times. Compare with -count=4, which runs each benchmark four times before moving on to the next.

Or write a driver script, a la compilebench. See also my abandoned attempts at writing tooling for a more generic driver script at #10930.

run go test -c and run the binary multiple times

And don't forget -run=NONE.

@mvdan
Member Author

mvdan commented Jan 29, 2018

Thanks for the tips regarding cycling through benchmarks. Whatever the best solution is at the moment, we should document it. And if a better tool or method surfaces at a later point, we can just change the docs.

And don't forget -run=NONE.

Already mentioned in the list in my initial post, though I prefer -run=^$ or -run=- since those leave no possibility of actually matching a test.

quasilyte added a commit to quasilyte/go-contributing-ru that referenced this issue Apr 1, 2018
New tasks include:
golang/go#19675 cmd/vet: report uses of -0 in float32/64 context
golang/go#19683 cmd/compile: eliminate usages of global lineno
golang/go#19670 x/tools/go/ssa: make opaqueType less annoying to use
golang/go#19636 encoding/base64: decoding is slow
golang/go#23471 x/perf/cmd/benchstat: tips or quickstart for newcomers
golang/go#19577 test: errorcheck support for intraline errors
golang/go#19490 cmd/vet: reduce the amount of false positives for -shadow mode
golang/go#19042 cmd/internal/obj: optimize wrapper method prologue for branch prediction
golang/go#19013 cmd/compile: add tool for understanding/debugging SSA rules
@gopherbot
Contributor

Change https://golang.org/cl/188438 mentions this issue: cmd/benchstat: add example benchmarking workflow to package comment

@nsajko
Contributor

nsajko commented Aug 3, 2019

An idea for a benchmarking suggestion that should maybe be put into benchstat's documentation in https://go-review.googlesource.com/c/perf/+/188438 :

Some computers, notebooks for example, have insufficient cooling, which forces the kernel to lower CPU frequencies at unpredictable times during benchmarking. This obviously affects benchmarking results in a hard-to-predict way. But I think there is a way to counteract that effect: disable CPU frequency scaling! That is, set the hardware minimum frequency as the maximum frequency, so that all cores always run at the minimum possible frequency.

Is this a good practice for benchmarking? One possible drawback is prolonged total benchmarking time.

EDIT: indeed, the LLVM documentation kind of agrees with me: https://llvm.org/docs/Benchmarking.html. There are a bunch of other good suggestions there.
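
For anyone who wants to try this, a rough sketch on Linux using the cpupower utility (the frequency values below are machine-specific placeholders; check cpupower frequency-info for the real limits on your hardware):

$ cpupower frequency-info                                # show the supported frequency range and current governor
$ sudo cpupower frequency-set --min 1.2GHz --max 1.2GHz  # pin to a fixed (e.g. minimum) frequency
$ # ... run the benchmarks ...
$ sudo cpupower frequency-set --min 800MHz --max 3.4GHz  # restore the original limits afterwards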

NewbMiao added a commit to NewbMiao/opa-koans that referenced this issue Jul 10, 2020
howardjohn added a commit to howardjohn/istio that referenced this issue Jul 13, 2020
Remove the million endpoint test (we are so far away from scaling that
high that it doesn't really matter) because it constantly leads to CPU
throttling, both in CI and locally, which leads to inaccurate results.

Increase count to 5 per recommendation of
golang/go#23471
istio-testing pushed a commit to istio/istio that referenced this issue Jul 13, 2020
@missinglink

Thank you, reading this issue description once had a ± 0% impact on my understanding of the output, reading it three times resulted in a ± 100% gain in my understanding.

@gopherbot
Contributor

Change https://golang.org/cl/309969 mentions this issue: cmd/benchstat: new version of benchstat

zchee pushed a commit to zchee/golang-perf that referenced this issue Nov 28, 2021
gopherbot pushed a commit to golang/perf that referenced this issue Jan 13, 2023
This is a complete rewrite of benchstat. Basic usage remains the same,
as does the core idea of showing statistical benchmark summaries and
A/B comparisons in a table, but there are several major improvements.

The statistics are now more robust. Previously, benchstat used
IQR-based outlier rejection, showed the mean of the reduced sample,
its range, and did a non-parametric difference-of-distribution test on
reduced samples. Any form of outlier rejection must start with
distributional assumptions, in this case assuming normality, which is
generally not sound for benchmark data. Hence, now benchstat does not
do any outlier rejection. As a result, it must use robust summary
statistics as well, so benchstat now uses the median and confidence
interval of the median as summary statistics. Benchstat continues to
use the same Mann-Whitney U-test for the delta, but now does it on the
full samples since the U-test is already non-parametric, hence
increasing the power of this test.

As part of these statistical improvements, benchstat now detects and
warns about several common mistakes, such as having too few samples
for meaningful statistical results, or having incomparable geomeans.

The output format is more consistent. Previously, benchstat
transformed units like "ns/op" into a metric like "time/op", which it
used as a column header; and a numerator like "sec", which it used to
label each measurement. This was easy enough for the standard units
used by the testing framework, but was basically impossible to
generalize to custom units. Now, benchstat does unit scaling, but
otherwise leaves units alone. The full (scaled) unit is used as a
column header and each measurement is simply a scaled value shown with
an SI prefix. This also means that the text and CSV formats can be
much more similar while still allowing the CSV format to be usefully
machine-readable.

Benchstat will also now do A/B comparisons even if there are more than
two inputs. It shows a comparison to the base in the second and all
subsequent columns. This approach is consistent for any number of
inputs.

Benchstat now supports the full Go benchmark format, including
sophisticated control over exactly how it structures the results into
rows, columns, and tables. This makes it easy to do meaningful
comparisons across benchmark data that isn't simply structured into
two input files, and gives significantly more control over how results
are sorted. The default behavior is still to turn each input file into
a column and each benchmark into a row.

Fixes golang/go#19565 by showing all results, even if the benchmark
sets don't match across columns, and warning when geomean sets are
incompatible.

Fixes golang/go#19634 by no longer doing outlier rejection and clearly
reporting when there are not enough samples to do a meaningful
difference test.

Updates golang/go#23471 by providing more thorough command
documentation. I'm not sure it quite fixes this issue, but it's much
better than it was.

Fixes golang/go#30368 because benchstat now supports filter
expressions, which can also filter down units.

Fixes golang/go#33169 because benchstat now always shows file
configuration labels.

Updates golang/go#43744 by integrating unit metadata to control
statistical assumptions into the main tool that implements those
assumptions.

Fixes golang/go#48380 by introducing a way to override labels from the
command line rather than always using file names.

Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f
Reviewed-on: https://go-review.googlesource.com/c/perf/+/309969
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@seankhliao added the NeedsInvestigation label on Jul 13, 2024