
# sightglass-recorder

## Calibration

Is sightglass-recorder accurate? The mechanisms used for measuring and the API for capturing portions of a Wasm benchmark (`bench.start`, `bench.end`) introduce overhead to our benchmarking — but how much?

First, we examine the overhead of leaving the Wasm engine through the `bench.start` and `bench.end` imported functions. The `noop` benchmark calls these functions back-to-back. We run `noop` using both the `cycles` and `perf-counters` measurement types and aggregate the results:

```console
$ cargo run -- benchmark --processes 1 --iterations-per-process 30 --engine wasmtime --output-format csv benchmarks/noop/benchmark.wasm --measure cycles | grep Execution | cargo run -- analyze --input-format csv --output-format csv
x86_64,wasmtime,benchmarks/noop/benchmark.wasm,Execution,cycles,680,1062,808.2,61.506666666666675
```

```console
$ cargo run -- benchmark --processes 1 --iterations-per-process 30 --engine wasmtime --output-format csv benchmarks/noop/benchmark.wasm --measure perf-counters | grep Execution | cargo run -- analyze --input-format csv --output-format csv
x86_64,wasmtime,benchmarks/noop/benchmark.wasm,Execution,cache-accesses,16,186,45.0,17.866666666666667
x86_64,wasmtime,benchmarks/noop/benchmark.wasm,Execution,instructions-retired,470,470,470.0,0.0
x86_64,wasmtime,benchmarks/noop/benchmark.wasm,Execution,cache-misses,0,4,1.8,1.1733333333333333
x86_64,wasmtime,benchmarks/noop/benchmark.wasm,Execution,cpu-cycles,1257,3184,1614.3666666666666,248.21555555555557
```

*Note*: the CSV columns are architecture, engine, Wasm benchmark, phase, event, min, max, mean, deviation.
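The aggregation performed by `analyze` can be sketched as follows. This is not sightglass's actual implementation; in particular, the deviation column is assumed here to be the mean absolute deviation (which is consistent with the zero deviation reported for the constant `instructions-retired` samples above), but the exact statistic is not specified in this document.

```rust
/// Summarize raw per-iteration samples into the (min, max, mean, deviation)
/// columns shown in the CSV output. "Deviation" is assumed to be the mean
/// absolute deviation from the mean.
fn summarize(samples: &[f64]) -> (f64, f64, f64, f64) {
    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let deviation = samples.iter().map(|s| (s - mean).abs()).sum::<f64>()
        / samples.len() as f64;
    (min, max, mean, deviation)
}

fn main() {
    // Hypothetical cycle samples, not taken from a real run.
    let (min, max, mean, dev) = summarize(&[680.0, 800.0, 1062.0]);
    println!("{min},{max},{mean:.1},{dev:.2}");
}
```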

The above data indicate that `bench.start` and `bench.end` execute with some overhead, somewhere between ~800±60 cycles (measured with RDTSC) and ~1600±250 cycles (measured with perf). These measurements were observed on an Intel i7-7700K running a 5.8.8 Linux kernel. The overhead is most likely due to the instructions required to switch from executing native code inside the Wasmtime runtime to the measurement code imported by `bench.start` and `bench.end`, plus any overhead of the measurement itself — the latter is likely why the perf results are higher.
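For illustration, the RDTSC-based side of this measurement can be approximated natively as below. This is a minimal x86_64-only sketch, not sightglass's measurement code: timing an empty closure approximates the back-to-back `bench.start`/`bench.end` calls in `noop`, minus the Wasm-to-host transition cost that dominates the numbers above.

```rust
use core::arch::x86_64::_rdtsc;

/// Return the elapsed TSC ticks around `f`, read via RDTSC.
fn cycles<F: FnOnce()>(f: F) -> u64 {
    // SAFETY: RDTSC only reads the timestamp counter; it has no
    // memory-safety requirements.
    let start = unsafe { _rdtsc() };
    f();
    let end = unsafe { _rdtsc() };
    end.wrapping_sub(start)
}

fn main() {
    // An empty measured region, analogous to the noop benchmark's body.
    let elapsed = cycles(|| {});
    println!("empty region: {elapsed} cycles");
}
```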

Overestimating this overhead at 2000 CPU cycles, what effect would it have on the execution of a 1-second workload? Running the current version of `shootout-ackermann` takes only ~0.08s on the above system, but even so, the 2000-cycle overhead is relatively small:

```console
$ cargo run -- benchmark --processes 1 --iterations-per-process 30 --engine wasmtime --output-format csv benchmarks/shootout-ackermann/benchmark.wasm --measure cycles | grep Execution | cargo run -- analyze -i csv -o csv
x86_64,wasmtime,benchmarks/shootout-ackermann/benchmark.wasm,Execution,cycles,5072731,6466552,5795681.366666666,463938.97333333344
```

```console
$ cargo run -- benchmark --processes 1 --iterations-per-process 30 --engine wasmtime --output-format csv benchmarks/shootout-ackermann/benchmark.wasm --measure perf-counters | grep Execution | cargo run -- analyze -i csv -o csv
x86_64,wasmtime,benchmarks/shootout-ackermann/benchmark.wasm,Execution,cache-accesses,4020,16565,5107.066666666667,850.8933333333331
x86_64,wasmtime,benchmarks/shootout-ackermann/benchmark.wasm,Execution,cache-misses,73,1573,519.9666666666667,369.68888888888904
x86_64,wasmtime,benchmarks/shootout-ackermann/benchmark.wasm,Execution,cpu-cycles,5423074,9778831,5581209.3,279841.4466666665
x86_64,wasmtime,benchmarks/shootout-ackermann/benchmark.wasm,Execution,instructions-retired,17333381,17333382,17333381.2,0.31999999955296515
```

Using RDTSC we measure ackermann at ~5.80M CPU cycles; using perf, at ~5.58M CPU cycles. Against those cycle counts, our estimated 2000-cycle overhead is in the ~0.03% range. Recall that ackermann ran for only ~0.08s; if we run benchmarks for 10x that time, any measurement overhead will be indistinguishable from noise. Certainly 2000 cycles fits well within the observed mean deviation of ~±460,000 cycles (RDTSC) or ~±280,000 cycles (perf) — though that level of deviation is itself slightly suspicious: has there been a context switch?
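The back-of-the-envelope arithmetic behind the ~0.03% figure, using the mean cycle counts from the runs above:

```rust
/// Overhead as a percentage of the total measured cycle count.
fn overhead_pct(overhead: f64, total: f64) -> f64 {
    100.0 * overhead / total
}

fn main() {
    // 2000-cycle overhead vs. the mean ackermann cycle counts above.
    println!("{:.3}%", overhead_pct(2000.0, 5_795_681.0)); // RDTSC mean: prints "0.035%"
    println!("{:.3}%", overhead_pct(2000.0, 5_581_209.0)); // perf mean:  prints "0.036%"
}
```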