Description
Is your feature request related to a problem? Please describe.
Here is a small benchmark I'll use to describe the issue:
#include <benchmark/benchmark.h>
#include <string>
#include <cstring>
static void BM_Strlen(benchmark::State& state)
{
std::string s;
for (int i = 0; i < state.range(0); i++)
s += 'a';
benchmark::DoNotOptimize(s.data());
benchmark::ClobberMemory();
size_t size;
for (auto _ : state)
{
benchmark::DoNotOptimize(size = strlen(s.c_str()));
benchmark::ClobberMemory();
}
}
BENCHMARK(BM_Strlen)->RangeMultiplier(2)->Range(8, 8<<10);
static void BM_StdString_Length(benchmark::State& state)
{
std::string s;
for (int i = 0; i < state.range(0); i++)
s += 'a';
benchmark::DoNotOptimize(s.data());
benchmark::ClobberMemory();
size_t size;
for (auto _ : state)
{
benchmark::DoNotOptimize(size = s.length());
benchmark::ClobberMemory();
}
}
BENCHMARK(BM_StdString_Length)->RangeMultiplier(2)->Range(8, 8<<10);
BENCHMARK_MAIN();
The idea here is to measure how long it takes to call strlen rather than std::string::length to compute the length of an std::string object.
I'll use compare.py to compare the two runs, with 10 repetitions to get the p-value associated with each of my test points (in the following example, only the 8192-character string length runs are shown for brevity):
$ compare.py filters ./example BM_Strlen BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
RUNNING: ./example --benchmark_display_aggregates_only=true --benchmark_repetitions=10 --benchmark_filter=BM_Strlen --benchmark_out=/tmp/tmpvtuhj3ix
[...]
BM_Strlen/8192_mean 89.8 ns 89.8 ns 10
BM_Strlen/8192_median 89.8 ns 89.8 ns 10
BM_Strlen/8192_stddev 0.514 ns 0.502 ns 10
[...]
BM_StdString_Length/8192_mean 0.271 ns 0.271 ns 10
BM_StdString_Length/8192_median 0.273 ns 0.273 ns 10
BM_StdString_Length/8192_stddev 0.005 ns 0.005 ns 10
[...]
[BM_Strlen vs. BM_StdString_Length]/8192_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
[BM_Strlen vs. BM_StdString_Length]/8192_mean -0.9970 -0.9970 90 0 90 0
[BM_Strlen vs. BM_StdString_Length]/8192_median -0.9970 -0.9970 90 0 90 0
[BM_Strlen vs. BM_StdString_Length]/8192_stddev -0.9911 -0.9909 1 0 1 0
I can, with high confidence, say that std::string::length is faster than strlen.
However, something bothers me. The BM_Strlen repetitions are all launched in a row, and then the BM_StdString_Length repetitions are also launched in a row. What if something happens on my system during only one of these runs? This could skew all my results.
To convince you, here is an example where I compare the results of BM_StdString_Length with itself:
$ compare.py filters ./example BM_StdString_Length BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
[...]
BM_StdString_Length/1024_mean 0.270 ns 0.270 ns 10
BM_StdString_Length/1024_median 0.267 ns 0.267 ns 10
BM_StdString_Length/1024_stddev 0.007 ns 0.007 ns 10
[...]
BM_StdString_Length/1024_mean 0.401 ns 0.400 ns 10
BM_StdString_Length/1024_median 0.385 ns 0.384 ns 10
BM_StdString_Length/1024_stddev 0.072 ns 0.070 ns 10
[...]
[BM_StdString_Length vs. BM_StdString_Length]/1024_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
[BM_StdString_Length vs. BM_StdString_Length]/1024_mean +0.4831 +0.4793 0 0 0 0
[BM_StdString_Length vs. BM_StdString_Length]/1024_median +0.4384 +0.4370 0 0 0 0
[BM_StdString_Length vs. BM_StdString_Length]/1024_stddev +8.8632 +8.6321 0 0 0 0
In this example, I get a p-value similar to the one in my previous comparison. Does that mean I can reject the hypothesis that std::string::length has the same performance as itself? I guess not.
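A possible explanation for why both comparisons report exactly 0.0002: if compare.py computes the p-value with the asymptotic (normal-approximation) Mann-Whitney U test, then any two fully separated samples of 10 repetitions each hit the extreme U statistic and yield the same minimal p-value of roughly 0.0002, regardless of how large the actual difference is. Below is a sketch of that calculation in pure Python; it assumes the asymptotic form without continuity or tie correction, and the exact value may differ slightly depending on the scipy version compare.py uses.

```python
import math

def u_test_pvalue(n: int, m: int, u: float) -> float:
    """Two-sided p-value for the Mann-Whitney U statistic, using the
    normal approximation (no continuity or tie correction)."""
    mu = n * m / 2.0                                   # mean of U under H0
    sigma = math.sqrt(n * m * (n + m + 1) / 12.0)      # stddev of U under H0
    z = abs(u - mu) / sigma
    # two-sided tail probability of the standard normal
    return math.erfc(z / math.sqrt(2.0))

# Two fully separated samples of 10 repetitions each: U hits its extreme
# value of 0, so the p-value bottoms out near 0.0002 -- the same number
# whether the samples differ by a factor of 300 (strlen vs. length) or
# by only ~50% (length vs. itself).
print(round(u_test_pvalue(10, 10, 0), 4))  # → 0.0002
```

In other words, with 10 repetitions per side, 0.0002 is the floor of this test: it tells you the two samples do not overlap, not how trustworthy the underlying measurements are.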
Describe the solution you'd like
A way to prevent this issue is to mix the execution of the different test points: one batch of BM_Strlen, then one batch of BM_StdString_Length, then another batch of BM_Strlen, and so on. That way, if something happens on the system during the execution of the benchmarks, either only one batch is affected (in the case of a small slowdown), or both BM_Strlen and BM_StdString_Length are affected, which will show up in the corresponding p-value.
It could be interesting to add an option to mix the execution of the different test points.
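In the meantime, this interleaving can be approximated from outside the library with a small driver that alternates single-repetition invocations of the binary. A hypothetical sketch follows: the flags used (--benchmark_filter, --benchmark_repetitions, --benchmark_out, --benchmark_out_format) are real benchmark flags, but the driver function interleaved_invocations is an illustrative name, not part of the library or of compare.py.

```python
import subprocess  # only needed when actually launching the binary

def interleaved_invocations(binary, filters, repetitions):
    """Build one command line per (repetition, filter) pair, alternating
    the filters so each repetition of one benchmark family is immediately
    followed by a repetition of the other."""
    cmds = []
    for rep in range(repetitions):
        for f in filters:
            cmds.append([
                binary,
                f"--benchmark_filter=^{f}/",
                "--benchmark_repetitions=1",
                "--benchmark_out_format=json",
                f"--benchmark_out={f}.rep{rep}.json",
            ])
    return cmds

cmds = interleaved_invocations("./example",
                               ["BM_Strlen", "BM_StdString_Length"], 10)
# for cmd in cmds:
#     subprocess.run(cmd, check=True)
```

The drawback is that the per-repetition JSON files then have to be merged back into one result set per family before compare.py can consume them, which is part of why built-in support for interleaved execution would be more convenient.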
Describe alternatives you've considered
It is possible to run the benchmarks multiple times to double-check the results are correct.