x/perf/cmd/benchstat: tips or quickstart for newcomers #23471
Comments
These all sound like good things to include in the docs to me. If I don't get around to this myself, I'd be happy to review a CL. I'd recommend doing at least 10 runs, if not 20 (where practical). Otherwise you might be tempted to compare them, see that the p-values are too high, and then get more samples, falling prey to the multiple testing problem. (I'd love for benchstat to support some form of sequential hypothesis testing, but alas.)
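In practice that advice amounts to deciding the count before running anything and sticking with it; a minimal sketch (the package path and output file name are placeholders):

```sh
# decide on the number of runs up front and stick with it
go test -run='$^' -bench=. -count=10 ./mypkg | tee old.txt
```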
Daniel, thank you for creating this issue! May I also request a short warning on stderr if the number of runs is not enough? Similar to the "no test files" note from `go test`.
Where would we put this? I'd say the […]
Well, I was thinking just the package godoc. That's where the current information on the tool is.
@AlekSi that might work, but it's trickier than it may sound. What should the minimum be? If the numbers are very stable (for example, the number of allocations for a deterministic program tends to be), 5 runs may be enough. Or you might not get any low p-values even if you do 20 runs. However, such a warning could be added for 1 run, as that seems to be a common mistake.

@aclements 10 runs is probably better general advice. My default is 6 simply because I tend to split the benchmarks into multiple sub-benchmarks, meaning that even just 6 runs stalls me for a good minute. And I tend to go after substantial performance wins, in which case fewer than 10 runs is often enough.

I didn't know about the multiple testing problem. Is that basically saying that it's much worse to benchmark twice with […]
Benchmarking twice with […]

The multiple testing problem arises when you look at the results and then decide whether to run more iterations based on the p-value. Since you'll be happy and stop if the p-value is small (or sad, if it got worse, but the same argument applies :), you bias toward accepting a "significant" result. But with a p-value threshold of 0.05, you have a 5% chance that any random result will be "significant". In the extreme, if you imagine that you just keep running iterations until p < 0.05, you'll eventually get what you're looking for, even if it's not real.

There are ways to deal with this. For example, you can keep tightening the p-value threshold each time you look at the results (the "Bonferroni correction"). But this sort of thing gets really complicated really fast, so the easiest rule is just to pick an n and stick with it. :)

In fact! The exact same problem arises if you run a whole bunch of separate benchmarks and look at whether any of them changed. You should expect 5% of them to show a statistically significant change, even if it's just noise. I don't think all of the sequential testing stuff I said above is worth explaining in the benchstat docs, but it might be worth mentioning this.
Welp, I'm likely doing this wrong now, then. Instead of having five benchmarks with different inputs for a parser, perhaps I should just have a single benchmark with a mix of all the kinds of input. That would also make it easier to use higher `-count` values. This should definitely be an item in the best practices list :)
I find `git stash` handy for this.
I use `git stash` for this too, but I have found that it can be confusing for someone who isn't used to it. Perhaps it could be another one of the tips, instead of putting it as part of the quickstart.
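For reference, a sketch of that workflow when the optimization is an uncommitted change in the working tree (the file names and count are arbitrary):

```sh
git stash                                           # set the optimization aside
go test -run='$^' -bench=. -count=10 | tee old.txt  # benchmark the baseline
git stash pop                                       # bring the optimization back
go test -run='$^' -bench=. -count=10 | tee new.txt  # benchmark the change
benchstat old.txt new.txt
```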
That's not necessarily wrong; you just have to understand that there's always a false positive rate. Even if it's just a single benchmark, there's a 5% chance of benchstat reporting noise as a significant change. It's just that if you have a whole benchmark set and each individual benchmark has a 5% chance of a false positive, then there's a higher collective chance that some benchmark in the set is going to have a false positive.
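To put a rough number on that collective chance (purely illustrative, assuming independent benchmarks and the usual 0.05 threshold): the probability of at least one spurious "significant" result among N benchmarks is 1 - 0.95^N.

```sh
# chance of at least one false positive across N independent benchmarks at p < 0.05
awk 'BEGIN { for (n = 1; n <= 20; n *= 2) printf "N=%2d  %2.0f%%\n", n, (1 - 0.95^n) * 100 }'
# N= 1   5%
# N= 2  10%
# N= 4  19%
# N= 8  34%
# N=16  56%
```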
Another piece of advice could be whether to do a single `go test -count=N` run or several separate `go test` runs. I assume calling `go test` more times is slower, though. I wonder if an option to alternate benchmark runs would be a good idea, as `-count` just runs the same benchmark multiple times in a row at the moment.
Rather than calling […]
I believe that e.g. […]. Or write a driver script, a la compilebench. See also my abandoned attempts at writing tooling for a more generic driver script at #10930.
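Until such tooling exists, a hand-rolled driver can at least interleave the two sets of runs so that slow drift (thermal throttling, background load) is spread across both samples. A rough sketch, assuming two pre-built test binaries (the binary names, package path, and count are placeholders):

```sh
# build one test binary per version (check out or stash the change between builds)
go test -c -o old.test ./mypkg
go test -c -o new.test ./mypkg

# alternate runs instead of doing all of one and then all of the other
for i in $(seq 1 10); do
    ./old.test -test.run='$^' -test.bench=. >>old.txt
    ./new.test -test.run='$^' -test.bench=. >>new.txt
done
benchstat old.txt new.txt
```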
And don't forget […]
Thanks for the tips regarding cycling through benchmarks. Whatever the best solution at the moment would be, we should document it. And if a better tool or method surfaces at a later point, we can just change the docs.
Already mentioned in the list in my initial post, though I prefer […]
New tasks include:

- golang/go#19675 cmd/vet: report uses of -0 in float32/64 context
- golang/go#19683 cmd/compile: eliminate usages of global lineno
- golang/go#19670 x/tools/go/ssa: make opaqueType less annoying to use
- golang/go#19636 encoding/base64: decoding is slow
- golang/go#23471 x/perf/cmd/benchstat: tips or quickstart for newcomers
- golang/go#19577 test: errorcheck support for intraline errors
- golang/go#19490 cmd/vet: reduce the amount of false positives for -shadow mode
- golang/go#19042 cmd/internal/obj: optimize wrapper method prologue for branch prediction
- golang/go#19013 cmd/compile: add tool for understanding/debugging SSA rules
Change https://golang.org/cl/188438 mentions this issue.
An idea for a benchmarking suggestion that should maybe be put into benchstat's documentation in https://go-review.googlesource.com/c/perf/+/188438: some computers, notebooks for example, have insufficient cooling, which requires the kernel to lower CPU frequencies at variable times during benchmarking. This obviously affects benchmarking results in a hard-to-predict way.

But I think there is a way to counteract that effect: disable CPU frequency scaling! That is, set the hardware minimum frequency as the maximum frequency, so that all cores always run at the minimum possible frequency. Is this a good practice for benchmarking? One possible drawback is prolonged total benchmarking time.

EDIT: indeed, the LLVM documentation kind of agrees with me: https://llvm.org/docs/Benchmarking.html. There are a bunch of other good suggestions there.
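On Linux, one way to try this via the cpufreq sysfs interface looks roughly like the following; the exact paths and available knobs vary by kernel, driver, and governor, so treat it as a sketch rather than a recipe:

```sh
# pin each core's maximum frequency to its hardware minimum (requires root)
for policy in /sys/devices/system/cpu/cpufreq/policy*; do
    cat "$policy/cpuinfo_min_freq" | sudo tee "$policy/scaling_max_freq"
done

# undo it afterwards by writing cpuinfo_max_freq back into scaling_max_freq
```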
Remove the million endpoint test (we are so far away from scaling that high that it doesn't really matter) because it constantly leads to CPU throttling, both in CI and locally, which leads to inaccurate results. Increase the count to 5 per the recommendation in golang/go#23471.
Thank you, reading this issue description once had a […]
Change https://golang.org/cl/309969 mentions this issue.
This is a complete rewrite of benchstat. Basic usage remains the same, as does the core idea of showing statistical benchmark summaries and A/B comparisons in a table, but there are several major improvements.

The statistics are now more robust. Previously, benchstat used IQR-based outlier rejection, showed the mean of the reduced sample and its range, and did a non-parametric difference-of-distribution test on the reduced samples. Any form of outlier rejection must start with distributional assumptions, in this case assuming normality, which is generally not sound for benchmark data. Hence, benchstat now does not do any outlier rejection. As a result, it must use robust summary statistics as well, so benchstat now uses the median and the confidence interval of the median as summary statistics. Benchstat continues to use the same Mann-Whitney U-test for the delta, but now does it on the full samples since the U-test is already non-parametric, thus increasing the power of this test. As part of these statistical improvements, benchstat now detects and warns about several common mistakes, such as having too few samples for meaningful statistical results, or having incomparable geomeans.

The output format is more consistent. Previously, benchstat transformed units like "ns/op" into a metric like "time/op", which it used as a column header, and a numerator like "sec", which it used to label each measurement. This was easy enough for the standard units used by the testing framework, but was basically impossible to generalize to custom units. Now, benchstat does unit scaling but otherwise leaves units alone. The full (scaled) unit is used as a column header and each measurement is simply a scaled value shown with an SI prefix. This also means that the text and CSV formats can be much more similar while still allowing the CSV format to be usefully machine-readable.

Benchstat will also now do A/B comparisons even if there are more than two inputs. It shows a comparison to the base in the second and all subsequent columns. This approach is consistent for any number of inputs.

Benchstat now supports the full Go benchmark format, including sophisticated control over exactly how it structures the results into rows, columns, and tables. This makes it easy to do meaningful comparisons across benchmark data that isn't simply structured into two input files, and gives significantly more control over how results are sorted. The default behavior is still to turn each input file into a column and each benchmark into a row.

Fixes golang/go#19565 by showing all results, even if the benchmark sets don't match across columns, and warning when geomean sets are incompatible.
Fixes golang/go#19634 by no longer doing outlier rejection and clearly reporting when there are not enough samples to do a meaningful difference test.
Updates golang/go#23471 by providing more thorough command documentation. I'm not sure it quite fixes this issue, but it's much better than it was.
Fixes golang/go#30368 because benchstat now supports filter expressions, which can also filter down units.
Fixes golang/go#33169 because benchstat now always shows file configuration labels.
Updates golang/go#43744 by integrating unit metadata to control statistical assumptions into the main tool that implements those assumptions.
Fixes golang/go#48380 by introducing a way to override labels from the command line rather than always using file names.
Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f
Reviewed-on: https://go-review.googlesource.com/c/perf/+/309969
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
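As an aside, the multi-input comparison and label override described in the CL make invocations like the following possible; the file and label names are made up, and the label=file form is my reading of the new command line, so check the tool's documentation for the exact syntax:

```sh
# compare a baseline against two candidates in one table; the second and
# later columns are each compared against the first (the base)
benchstat base=old.txt inline=new-inline.txt cache=new-cache.txt
```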
`benchstat` is a very useful tool, but if you're not familiar with what it does, it can be very confusing to use for the first time.

One such example is "how many times should my benchmarks run?". If one has used `benchcmp` before, running each benchmark before and after exactly once, trying to use `benchstat` will result in something confusing like: […]

The answer here is that the user should be running the benchmark more times - at least 3 or 4 to get p-values low enough for a result.

However, neither `benchstat -h` nor the godoc page are very clear on this, nor do they have a "quickstart" guide. The godoc page does show an example input with many runs, and does talk about "a number of runs" and p-values, but it's not very clear if you're not familiar with statistics and benchmarking.

I believe that a quick guide would greatly improve the usability of the tool - for example:
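For instance, the quickstart could walk through something along these lines (the package path, file names, and count here are just placeholders):

```sh
# run the benchmarks a fixed number of times on the old code
go test -run='$^' -bench=. -count=10 ./mypkg | tee old.txt

# ... apply the change, then run them again ...
go test -run='$^' -bench=. -count=10 ./mypkg | tee new.txt

# compare the two sets of runs
benchstat old.txt new.txt
```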
I think it should also introduce other best practices, such as:

- Higher `-count` values if the benchmark numbers aren't stable
- `-benchmem` to also get stats on allocated objects and space
- `-run='$^'` or `-run=-` added to each `go test` command to avoid running the tests as well

I realise that some of these tips are more about benchmarking than `benchstat` itself. But I think it's fine to have it all there, as in general you're going to be using that tool anyway.

/cc @rsc @ALTree @aclements @AlekSi