
[FR] Provide an option to mix the execution of the different test points #1026

@pierretallotte

Description


Is your feature request related to a problem? Please describe.
Here is a small benchmark I'll use to describe the issue:

#include <benchmark/benchmark.h>

#include <string>
#include <cstring>

static void BM_Strlen(benchmark::State& state)
{
    // Build a string of state.range(0) 'a' characters outside the timed loop.
    std::string s;

    for (int i = 0; i < state.range(0); i++)
        s += 'a';

    benchmark::DoNotOptimize(s.data());
    benchmark::ClobberMemory();
    size_t size;

    // Measure the cost of computing the length with strlen.
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(size = strlen(s.c_str()));
        benchmark::ClobberMemory();
    }
}

BENCHMARK(BM_Strlen)->RangeMultiplier(2)->Range(8, 8<<10);

static void BM_StdString_Length(benchmark::State& state)
{
    // Same setup as BM_Strlen: build a string of state.range(0) 'a' characters.
    std::string s;

    for (int i = 0; i < state.range(0); i++)
        s += 'a';

    benchmark::DoNotOptimize(s.data());
    benchmark::ClobberMemory();
    size_t size;

    // Measure the cost of computing the length with std::string::length.
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(size = s.length());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_StdString_Length)->RangeMultiplier(2)->Range(8, 8<<10);

BENCHMARK_MAIN();

The idea here is to measure how long it takes to compute the length of a std::string object with strlen rather than with std::string::length.

I'll use compare.py to compare the two runs, performing 10 repetitions to get the p-value associated with each of my test points (in the following example, only the results for the 8192-character string are shown for brevity):

$ compare.py filters ./example BM_Strlen BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
RUNNING: ./example --benchmark_display_aggregates_only=true --benchmark_repetitions=10 --benchmark_filter=BM_Strlen --benchmark_out=/tmp/tmpvtuhj3ix
[...]
BM_Strlen/8192_mean         89.8 ns         89.8 ns           10
BM_Strlen/8192_median       89.8 ns         89.8 ns           10
BM_Strlen/8192_stddev      0.514 ns        0.502 ns           10
[...]
BM_StdString_Length/8192_mean        0.271 ns        0.271 ns           10
BM_StdString_Length/8192_median      0.273 ns        0.273 ns           10
BM_StdString_Length/8192_stddev      0.005 ns        0.005 ns           10
[...]
[BM_Strlen vs. BM_StdString_Length]/8192_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
[BM_Strlen vs. BM_StdString_Length]/8192_mean                  -0.9970         -0.9970            90             0            90             0
[BM_Strlen vs. BM_StdString_Length]/8192_median                -0.9970         -0.9970            90             0            90             0
[BM_Strlen vs. BM_StdString_Length]/8192_stddev                -0.9911         -0.9909             1             0             1             0

I can say, with high confidence, that std::string::length is faster than strlen.

However, something bothers me. The BM_Strlen benchmarks are all launched in a row, and then the BM_StdString_Length benchmarks are also launched in a row. What if something happens on my system during one of these runs only? This can skew all my results.

To convince you, here is an example where I compare the results of BM_StdString_Length with itself:

$ compare.py filters ./example BM_StdString_Length BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
[...]
BM_StdString_Length/1024_mean        0.270 ns        0.270 ns           10
BM_StdString_Length/1024_median      0.267 ns        0.267 ns           10
BM_StdString_Length/1024_stddev      0.007 ns        0.007 ns           10
[...]
BM_StdString_Length/1024_mean        0.401 ns        0.400 ns           10
BM_StdString_Length/1024_median      0.385 ns        0.384 ns           10
BM_StdString_Length/1024_stddev      0.072 ns        0.070 ns           10
[...]
[BM_StdString_Length vs. BM_StdString_Length]/1024_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
[BM_StdString_Length vs. BM_StdString_Length]/1024_mean                  +0.4831         +0.4793             0             0             0             0
[BM_StdString_Length vs. BM_StdString_Length]/1024_median                +0.4384         +0.4370             0             0             0             0
[BM_StdString_Length vs. BM_StdString_Length]/1024_stddev                +8.8632         +8.6321             0             0             0             0

In this example, I get a p-value similar to the one in my previous comparison. Does that mean I can reject the hypothesis that std::string::length has the same performance as std::string::length? I guess not.

Describe the solution you'd like
A way to prevent this issue is to mix the execution of the different test points: one batch of BM_Strlen, then one batch of BM_StdString_Length, then another batch of BM_Strlen, and so on. That way, if something happens on the system during the execution of the benchmarks, either only one batch is affected (in the case of a small slowdown), or both BM_Strlen and BM_StdString_Length are affected, which will show up in the corresponding p-value.

It could be interesting to add an option to mix the execution of the different test points.
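
For illustration, here is a rough sketch of how such an interleaved schedule could be approximated today by driving the binary from a small external script. The binary name ./example, the output paths, and the merging step are assumptions on my side; only existing command-line flags (--benchmark_filter, --benchmark_repetitions, --benchmark_out, --benchmark_out_format) are used:

#!/usr/bin/env python3
# Sketch: approximate interleaved repetitions by alternating the filter on
# every repetition, so that a transient slowdown on the machine hits both series.
import json
import subprocess

BINARY = "./example"  # assumption: path to the benchmark binary
FILTERS = ["BM_Strlen", "BM_StdString_Length"]
REPETITIONS = 10

results = {f: [] for f in FILTERS}

for rep in range(REPETITIONS):
    # One batch of BM_Strlen, then one batch of BM_StdString_Length, repeated.
    for f in FILTERS:
        out_file = "/tmp/{}_{}.json".format(f, rep)
        subprocess.run(
            [
                BINARY,
                "--benchmark_filter={}".format(f),
                "--benchmark_repetitions=1",
                "--benchmark_out={}".format(out_file),
                "--benchmark_out_format=json",
            ],
            check=True,
        )
        with open(out_file) as fh:
            results[f].extend(json.load(fh)["benchmarks"])

# results[f] now holds one sample per repetition and per test point, which could
# be fed to a U test (e.g. scipy.stats.mannwhitneyu) the same way compare.py does.

Having this done by the library itself (behind an option) would of course be cleaner, since the external script pays the process start-up and warm-up cost on every repetition.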

Describe alternatives you've considered
It is possible to run the benchmarks multiple times to double-check that the results are correct.
