Description
I've been playing with the raindrops exercise from Exercism in both Rust and Julia. In Julia the minimum time seems to be quite stable, by which I mean that rerunning the benchmark doesn't change the minimum much at all. However, in Rust with criterion my benchmarks vary quite significantly from run to run. I've already increased the confidence level to 0.99 and decreased the significance level to 0.01, but I still regularly get criterion telling me that I have regressions or improvements even though nothing has changed.
Maybe I've screwed up something in the implementation, or this is some kind of fundamental difference between Julia and Rust, but I suspect it has to do with the way criterion estimates central tendency. I have three ideas for what might help.
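For reference, this is roughly my setup (the `raindrops` body here is just a stand-in sketch, not my exact solution):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for my actual solution; the exact body doesn't matter here.
fn raindrops(n: u32) -> String {
    let mut out = String::new();
    if n % 3 == 0 { out.push_str("Pling"); }
    if n % 5 == 0 { out.push_str("Plang"); }
    if n % 7 == 0 { out.push_str("Plong"); }
    if out.is_empty() { out = n.to_string(); }
    out
}

fn bench(c: &mut Criterion) {
    c.bench_function("raindrops 105", |b| b.iter(|| raindrops(black_box(105))));
}

// The tightened statistical settings mentioned above.
criterion_group! {
    name = benches;
    config = Criterion::default()
        .confidence_level(0.99)
        .significance_level(0.01);
    targets = bench
}
criterion_main!(benches);
```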
Use the minimum as an estimator
I haven't figured out yet how to get criterion to show me and use the minimum. Maybe this is already implemented and I'm just screwing up.
We have a distribution that has a lower bound and seems to have a fairly fat tail. This combination makes the sample mean converge very slowly to the true mean. By comparison, because there isn't really any effect that could artificially decrease the benchmark time, the minimum should be the estimator most resistant to noise. Of course, it doesn't tell you what the expected time to complete some workload would be, but that's not the point of this kind of microbenchmark anyway. Usually, you're interested in whether some change you made has improved or worsened performance.
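To make the fat-tail point concrete, here's a tiny self-contained simulation (plain Rust, no criterion): timings are modeled as a fixed true cost plus purely one-sided noise. Across repeated runs the minimum barely moves while the mean jumps around.

```rust
// Toy simulation: one-sided noise makes the sample mean jump around
// between runs while the sample minimum stays put.
fn main() {
    let mut seed: u64 = 0x243F6A8885A308D3;
    // Tiny xorshift PRNG so the example needs no external crates.
    let mut next = || {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        seed
    };

    for run in 0..5 {
        let samples: Vec<f64> = (0..1_000)
            .map(|_| {
                let true_time = 100.0; // ns, the "real" cost of the function
                // ~1% of samples get hit by a large one-sided spike
                // (interrupt, frequency change, cache eviction, ...).
                let spike = if next() % 100 == 0 {
                    (next() % 10_000) as f64
                } else {
                    (next() % 5) as f64 // small jitter, still one-sided
                };
                true_time + spike
            })
            .collect();

        let mean = samples.iter().sum::<f64>() / samples.len() as f64;
        let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
        println!("run {run}: mean = {mean:.1} ns, min = {min:.1} ns");
    }
}
```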
Don't remove low outliers
With the exception of timing errors, which are tiny compared to other sources of noise, no source of noise can lower the measured time. So I'm not sure such a thing as a low outlier even exists in this case.
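As far as I can tell, criterion currently classifies outliers with Tukey fences on both sides. A one-sided version might look something like this sketch (illustration only, not criterion's actual code):

```rust
// Sketch of one-sided outlier classification: only times above the
// upper Tukey fence count as outliers; nothing is flagged on the low
// side, since noise can only make a run slower.
fn high_outlier_fence(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Crude nearest-rank quantile; good enough for illustration.
    let q = |p: f64| samples[((samples.len() - 1) as f64 * p) as usize];
    let (q1, q3) = (q(0.25), q(0.75));
    q3 + 1.5 * (q3 - q1) // upper fence only; no lower fence
}

fn main() {
    let mut samples = vec![100.0, 100.4, 99.5, 100.2, 100.1, 450.0, 100.3];
    let fence = high_outlier_fence(&mut samples);
    let kept: Vec<f64> = samples.iter().copied().filter(|&t| t <= fence).collect();
    println!("fence = {fence:.1}, kept {} of {} samples", kept.len(), samples.len());
}
```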
Take more measurements with fewer samples per measurement
The more iterations you batch into a single measurement, the higher the probability that noise pushes up the measured time. Of course, there's a tradeoff here, because each measurement needs to take long enough that you can time it accurately. But I think you could push up the number of measurements quite a bit before that becomes an issue. This way, fewer measurements would have their time pulled up by a few very slow samples inflating the whole measurement's mean. Then the minimum and the median would be better estimators, too.
Again, I think this point basically boils down to sample means being bad estimators of the true mean for fat-tailed distributions. The more samples you take per measurement, the more you rely on the mean as a good estimate of the central tendency of all the samples in that measurement.
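If I'm reading the docs right, the closest existing knobs are `sample_size`, `measurement_time`, and `sampling_mode`. Here's a sketch of how I'd try to push the measurement count up (same stand-in `raindrops` as above):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion, SamplingMode};
use std::time::Duration;

// Stand-in for the real implementation, as above.
fn raindrops(n: u32) -> String { n.to_string() }

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("raindrops");
    // Many more individual measurements than the default 100...
    group.sample_size(1_000);
    // ...within a similar wall-clock budget, so each measurement
    // batches fewer iterations of the function under test.
    group.measurement_time(Duration::from_secs(5));
    // Flat sampling gives every measurement the same iteration count.
    group.sampling_mode(SamplingMode::Flat);
    group.bench_function("105", |b| b.iter(|| raindrops(black_box(105))));
    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```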
Hopefully this is useful. I'm new to this package and Rust, so there's a good chance I missed something. Let me know what you think :)