
[FR] Provide an option to mix the execution of the different test points #1026

@pierretallotte

Description


Is your feature request related to a problem? Please describe.
Here is a small benchmark I'll use to describe the issue:

#include <benchmark/benchmark.h>

#include <string>
#include <cstring>

static void BM_Strlen(benchmark::State& state)
{
    // Build a string of state.range(0) 'a' characters outside the timed loop.
    std::string s;

    for (int i = 0; i < state.range(0); i++)
        s += 'a';

    benchmark::DoNotOptimize(s.data());
    benchmark::ClobberMemory();
    size_t size;

    // Measure the cost of computing the length with strlen.
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(size = strlen(s.c_str()));
        benchmark::ClobberMemory();
    }
}

BENCHMARK(BM_Strlen)->RangeMultiplier(2)->Range(8, 8<<10);

static void BM_StdString_Length(benchmark::State& state)
{
    // Same setup as BM_Strlen: build a string of state.range(0) 'a' characters.
    std::string s;

    for (int i = 0; i < state.range(0); i++)
        s += 'a';

    benchmark::DoNotOptimize(s.data());
    benchmark::ClobberMemory();
    size_t size;

    // Measure the cost of computing the length with std::string::length.
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(size = s.length());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_StdString_Length)->RangeMultiplier(2)->Range(8, 8<<10);

BENCHMARK_MAIN();

The idea here is to measure how long it takes to compute the length of a std::string object with strlen rather than with std::string::length.

I'll use compare.py to compare the two runs, performing 10 repetitions to get the p-value associated with each of my test points (in the following example, only the results for the 8192-character string are shown for brevity):

$ compare.py filters ./example BM_Strlen BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
RUNNING: ./example --benchmark_display_aggregates_only=true --benchmark_repetitions=10 --benchmark_filter=BM_Strlen --benchmark_out=/tmp/tmpvtuhj3ix
[...]
BM_Strlen/8192_mean         89.8 ns         89.8 ns           10
BM_Strlen/8192_median       89.8 ns         89.8 ns           10
BM_Strlen/8192_stddev      0.514 ns        0.502 ns           10
[...]
BM_StdString_Length/8192_mean        0.271 ns        0.271 ns           10
BM_StdString_Length/8192_median      0.273 ns        0.273 ns           10
BM_StdString_Length/8192_stddev      0.005 ns        0.005 ns           10
[...]
[BM_Strlen vs. BM_StdString_Length]/8192_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
[BM_Strlen vs. BM_StdString_Length]/8192_mean                  -0.9970         -0.9970            90             0            90             0
[BM_Strlen vs. BM_StdString_Length]/8192_median                -0.9970         -0.9970            90             0            90             0
[BM_Strlen vs. BM_StdString_Length]/8192_stddev                -0.9911         -0.9909             1             0             1             0

I can say, with high confidence, that std::string::length is faster than strlen.

However, something bothers me. The BM_Strlen benchmarks are all launched in a row, and then the BM_StdString_Length benchmarks are also launched in a row. What if something happens on my system during one of these runs only? This can skew all my results.

To convince you, here is an example where I compare the results of BM_StdString_Length with itself:

$ compare.py filters ./example BM_StdString_Length BM_StdString_Length --benchmark_display_aggregates_only=true --benchmark_repetitions=10
[...]
BM_StdString_Length/1024_mean        0.270 ns        0.270 ns           10
BM_StdString_Length/1024_median      0.267 ns        0.267 ns           10
BM_StdString_Length/1024_stddev      0.007 ns        0.007 ns           10
[...]
BM_StdString_Length/1024_mean        0.401 ns        0.400 ns           10
BM_StdString_Length/1024_median      0.385 ns        0.384 ns           10
BM_StdString_Length/1024_stddev      0.072 ns        0.070 ns           10
[...]
[BM_StdString_Length vs. BM_StdString_Length]/1024_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
[BM_StdString_Length vs. BM_StdString_Length]/1024_mean                  +0.4831         +0.4793             0             0             0             0
[BM_StdString_Length vs. BM_StdString_Length]/1024_median                +0.4384         +0.4370             0             0             0             0
[BM_StdString_Length vs. BM_StdString_Length]/1024_stddev                +8.8632         +8.6321             0             0             0             0

In this example, I get a p-value similar to the one in my previous comparison. Does that mean I can reject the hypothesis that std::string::length has the same performance as std::string::length? I guess not.

Describe the solution you'd like
A way to prevent this issue is to mix the execution of the different test points: one batch of BM_Strlen, then one batch of BM_StdString_Length, then another batch of BM_Strlen, and so on. That way, if something happens on the system during the execution of the benchmarks, either only one batch is affected (in the case of a small slowdown), or both BM_Strlen and BM_StdString_Length are affected, which will show up in the corresponding p-value.

It could be interesting to add an option to mix the execution of the different test points.
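
For illustration, here is a rough sketch of how such an interleaved schedule could be approximated today by driving the binary from a small external script. The binary name ./example, the output paths, and the merging step are assumptions on my side; only existing command-line flags (--benchmark_filter, --benchmark_repetitions, --benchmark_out, --benchmark_out_format) are used:

#!/usr/bin/env python3
# Sketch: approximate interleaved repetitions by alternating the filter on
# every repetition, so that a transient slowdown on the machine hits both series.
import json
import subprocess

BINARY = "./example"  # assumption: path to the benchmark binary
FILTERS = ["BM_Strlen", "BM_StdString_Length"]
REPETITIONS = 10

results = {f: [] for f in FILTERS}

for rep in range(REPETITIONS):
    # One batch of BM_Strlen, then one batch of BM_StdString_Length, repeated.
    for f in FILTERS:
        out_file = "/tmp/{}_{}.json".format(f, rep)
        subprocess.run(
            [
                BINARY,
                "--benchmark_filter={}".format(f),
                "--benchmark_repetitions=1",
                "--benchmark_out={}".format(out_file),
                "--benchmark_out_format=json",
            ],
            check=True,
        )
        with open(out_file) as fh:
            results[f].extend(json.load(fh)["benchmarks"])

# results[f] now holds one sample per repetition and per test point, which could
# be fed to a U test (e.g. scipy.stats.mannwhitneyu) the same way compare.py does.

Having this done by the library itself (behind an option) would of course be cleaner, since the external script pays the process start-up and warm-up cost on every repetition.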

Describe alternatives you've considered
It is possible to run the benchmarks multiple times to double-check that the results are correct.
