We should create and maintain a suite of benchmarks for evaluating Wasmtime and Cranelift. The benchmark programs should be representative of real world programs, and they should be diverse such that they collectively represent many different kinds of programs. Finally, we must ensure that our analyses are statistically sound and that our experimental design avoids measurement bias.
This benchmark suite has two primary goals:
- Automatically detecting unintended performance regressions
- Evaluating the speedup of proposed performance optimizations and the overhead of security mitigations
Without diligent monitoring, unintended regressions will sneak past us. Manual review alone is not sufficient to catch them, as the very success of continuous integration testing demonstrates. While our testing and fuzzing infrastructure will generally catch correctness regressions, we have nothing in place for catching performance regressions.
The benchmark suite should be run once per day (or as often as is feasible, ideally it would run for every commit) and report any regressions in execution time or memory overhead.
The decision of whether to include a proposed optimization must balance maintainability, compile times, and runtime improvement. Maintainability is subjective, but our new benchmark should provide quantitative data for the latter two: how does the new optimization affect compile times, and what kind of speed ups can we expect? Does it shrink emitted code size or impose memory overhead? The answers to these questions will help us evaluate a proposed performance optimization.
Similarly, these same types of questions appear when evaluating a security mitigation. How much overhead does, for example, emitting retpolines add to WebAssembly execution time?
Experiments like these are invaluable for engineers and researchers working with Wasmtime and Cranelift.
It is also worth mentioning this explicit nongoal: we do not intend to develop a general-purpose WebAssembly benchmark suite, used to compare between different WebAssembly compilers and runtimes. We don't intend to trigger a WebAssembly benchmarking war, reminiscent of JavaScript benchmarking wars in Web browsers. Doing so would make the benchmark suite's design high stakes, because engineers would be incentivized to game the benchmarks, and would additionally impose cross-engine portability constraints on the benchmark runner. We only intend to compare the performance of various versions of Wasmtime and Cranelift, where we don't need the cross-engine portability in the benchmark runner, and where gaming the benchmarks isn't incentivized.
Furthermore, general-purpose WebAssembly benchmarking must include WebAssembly on the Web. Doing that well requires including interactions with the rest of the Web browser: JavaScript, rendering, and the DOM. Building and integrating a full Web browser is overkill for our purposes, and represents significant additional complexity that we would prefer to avoid.
We should design a benchmark suite that contains realistic, representative programs and diverse programs that collectively represent many different kinds of program properties and behavior. Running the full benchmark suite should take a reasonable amount of time. We should take precautions to avoid measurement bias from uncontrolled variables and other common pitfalls, and the analysis of the benchmark results should be statistically sound. Finally, the benchmark suite should have good developer experience: it should be easy to set up and run on your own machine and reproduce someone else's results.
The result for the Wasmtime and Cranelift community, ultimately, is that we will have a single, shared, canonical, high-quality benchmark suite: one where we all have confidence that performance improvements to it, or regressions on it, are meaningful. We won't be concerned that poor statistical analysis clouds our results, and we won't dismiss results as "not meaningful" because the benchmark programs aren't representative.
Making this a reality requires roughly five things:
- Consensus on what operations to measure and which metrics to record.
- A corpus of candidate benchmark programs to choose from.
- A benchmark runner and runtime that helps us avoid measurement bias.
- Sound statistical analysis of the results.
- Good developer ergonomics locally and remotely.
The life cycle of a WebAssembly module consists of three main operations, and our benchmark runner should measure all of them:
- Compilation
- Instantiation
- Execution
For each of these operations, there are many different metrics we can record:
- Wall time
- Instructions retired
- Code size
- Cycles
- Cache misses
- Branch misses
- Max RSS
- TLB misses
- Page faults
- Etc...
The benchmark runner should be capable of measuring all of these metrics, so that we can conduct targeted experiments, but by default we should focus on just the three most important metrics:
- Wall time: The actual time it takes for the operation to complete. This is the thing we ultimately care about most of the time, so we should measure it.
- Instructions retired: Unfortunately, wall time is noisy because so many different variables contribute to it. Instructions retired correlates with wall time but, by contrast, has significantly less variation across runs. This means we can often get a more precise idea of an optimization's effect on wall time from fewer samples. This isn't a silver bullet, which is why we measure instructions retired in addition to, rather than instead of, wall time: the correlation can break down in the face of parallelism, or when some instructions are more or less expensive than others (e.g. a memory store versus an add).
- Max RSS: The maximum resident set size (the amount of virtual memory paged into physical memory) for the operation. This gives us coarse-grained insight into the operation's memory overhead.
We will automatically report regressions only for these default metrics, and optimization proposals should always report them. Even if an optimization specifically targets some other metric, it should still report the default set in addition to that other metric. Focusing on just these default metrics avoids information overload and makes results easier for engineers to interpret. It also reduces the noise of frequent automatic regression notifications for "secondary" metrics that aren't significant in isolation. If engineers come to perceive these warnings as false alarms, they will stop investigating them, and real, meaningful regressions will sneak past. Maintaining a healthy culture of performance requires avoiding this phenomenon!
We intend to collect a corpus of candidate programs for inclusion in the benchmark suite, quantify these candidates with various static code metrics and dynamic execution metrics, and finally use principal component analysis (PCA) to choose a diverse subset of candidates that are collectively representative of the whole corpus. The use of PCA to select representative and diverse benchmark programs is an established technique, used by both the DaCapo [1] and Renaissance [2] benchmark suites. It has also been used to show that the SPEC CPU benchmarks include duplicate workloads whose removal would make SPEC CPU smaller and quicker to run [8].
The full set of candidates, the metrics instrumentation code, and the PCA scripts should all be saved in the benchmarks repository. This way we can periodically add a new batch of candidates, recharacterize our corpus, and choose a new subset to act as our benchmark suite. We may want to do this, for example, once interface types are supported in toolchains and Wasmtime, so we can keep an eye on our interface types performance. The full set of candidates could also be useful for training PGO builds of Wasmtime and as a corpus from which we harvest left-hand sides for offline superoptimization.
- Candidates should be real, widely used programs, or at least extracted kernels of such programs. These programs are ideally taken from domains where Wasmtime and Cranelift are currently used, or domains where they are intended to be a good fit (e.g. serverless compute, game plugins, client Web applications, server Web applications, audio plugins, etc.).
- A candidate program must be deterministic (modulo Wasm nondeterminism like `memory.grow` failure).
- A candidate program must have two associated input workloads: one small and one large. The small workload may be used by developers locally to get quick, ballpark numbers for whether further investment in an optimization is worth it, without waiting for the full, thorough benchmark suite to complete.
- Each workload must have an expected result, so that we can validate executions and avoid accepting "fast" but incorrect results.
- Compiling and instantiating the candidate program and then executing its workload should take roughly one to six seconds total.

  Napkin math: We want the full benchmark suite to run in a reasonable amount of time, say twenty to thirty minutes, and we want somewhere around ten to twenty programs altogether in the benchmark suite to balance diversity, simplicity, and time spent in execution versus compilation and instantiation. Additionally, for good statistical analyses, we need at least 30 samples (ideally more like 100) from each benchmark program. That leaves an average of about one to six seconds for each benchmark program to compile, instantiate, and execute its workload.

- Inputs should be given through I/O and results reported through I/O. This ensures that the compiler cannot optimize the benchmark program away.
- Candidate programs should only import WASI functions. They should not depend on any other non-standard imports, hooks, or runtime environment.
- Candidate programs must be open source under a license that allows redistributing, modifying, and redistributing modified versions. This makes distributing the benchmark suite easy, allows us to rebuild Wasm binaries as new versions are released, and lets us do source-level analysis of benchmark programs when necessary.
- Repeated executions of a candidate program must yield independent samples (ignoring priming Wasmtime's code cache). If execution times keep getting longer and longer, or exhibit harmonics, the samples are not independent, and this can invalidate any statistical analyses of the results we perform. We can easily check for this property with either the chi-squared test or Fisher's exact test.
- The corpus of candidates should include programs that use a variety of languages, compilers, and toolchains.
Choosing diverse and representative programs with PCA requires that we have metrics by which we characterize our candidates. We propose two categories of characterization metrics:
- Wasm-level properties (like Wasm instruction mix)
- Execution-level properties recorded via performance counters
We should write a tool to characterize Wasm programs both by how many of each different kind of Wasm instruction statically appears in the program, and by how many are dynamically executed. Measuring these as absolute values would make larger, longer-executing programs look very different from smaller, shorter-executing programs that effectively do the same thing. As a trivial example, a program with all of its loop bodies unrolled N times will look roughly N times more interesting than the original program without loop unrolling. Therefore we should normalize these static and dynamic counts. The static counts should be normalized by the total number of Wasm instructions in the program, and the dynamic counts normalized by the total number of Wasm instructions executed.
Proposed buckets of Wasm instructions:
- Integer instructions (e.g. `i32.add`, `i64.popcnt`)
- Float instructions (e.g. `f64.sqrt`, `f32.mul`)
- Control instructions (e.g. `br`, `loop`)
- `call` instructions
- `call_indirect` instructions
- Memory load instructions
- Memory store instructions
- `memory.grow` instructions
- Bulk memory instructions (e.g. `memory.fill`, `memory.copy`)
- Table instructions (e.g. `table.get`, `table.set`, `table.grow`)
- Miscellaneous instructions
This tool should additionally record the Wasm program's initial memory size, its static maximum memory size, and its dynamic maximum memory size on its given workload.
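As a rough illustration of what this characterization tool could look like, here is a minimal sketch of computing the normalized static instruction mix. It assumes the `wasmparser` crate; the `Bucket` enum and `classify` function are hypothetical helpers and only map a handful of operators into the buckets listed above.

```rust
use std::collections::HashMap;
use wasmparser::{BinaryReaderError, Operator, Parser, Payload};

/// Buckets from the list above (abbreviated for the sketch).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Bucket {
    Integer,
    Float,
    Control,
    Call,
    CallIndirect,
    MemoryGrow,
    Misc,
}

fn classify(op: &Operator) -> Bucket {
    match op {
        Operator::I32Add | Operator::I64Popcnt => Bucket::Integer,
        Operator::F32Mul | Operator::F64Sqrt => Bucket::Float,
        Operator::Br { .. } | Operator::Loop { .. } => Bucket::Control,
        Operator::Call { .. } => Bucket::Call,
        Operator::CallIndirect { .. } => Bucket::CallIndirect,
        Operator::MemoryGrow { .. } => Bucket::MemoryGrow,
        // ... the remaining operators would be classified here ...
        _ => Bucket::Misc,
    }
}

/// Count static instructions per bucket and normalize by the total number of
/// instructions, so that program size doesn't dominate the characterization.
fn static_mix(wasm: &[u8]) -> Result<HashMap<Bucket, f64>, BinaryReaderError> {
    let mut counts: HashMap<Bucket, u64> = HashMap::new();
    let mut total = 0u64;
    for payload in Parser::new(0).parse_all(wasm) {
        if let Payload::CodeSectionEntry(body) = payload? {
            let mut ops = body.get_operators_reader()?;
            while !ops.eof() {
                *counts.entry(classify(&ops.read()?)).or_default() += 1;
                total += 1;
            }
        }
    }
    Ok(counts
        .into_iter()
        .map(|(bucket, n)| (bucket, n as f64 / total.max(1) as f64))
        .collect())
}
```

The dynamic counts would use the same buckets and the same normalization, but would be gathered by instrumenting or interpreting the program while it runs its workload.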
The second set of metrics we should record are hardware and software performance counters (these can be recorded via `perf`, or equivalent):
- Instructions retired
- L1 cache misses (icache and dcache)
- LLC load and store misses
- Branch misses
- TLB misses
- Page faults
- Ratio of time spent in the kernel vs. user space
Depending on the counter, these can be noisy across runs, so we should take the mean of multiple samples. Additionally, these counters should all be normalized by the number of cycles it took to execute the program (other than the ratio of time spent in the kernel vs. user space).
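As a sketch of how the runner might gather these counters on Linux, the following shells out to `perf stat` for a single benchmark execution and normalizes every counter by the cycle count. The event list, the `wasmtime` invocation, and the parsing of `perf`'s CSV output (`-x ,`) are simplifying assumptions, not a final design.

```rust
use std::process::Command;

/// Run one benchmark under `perf stat` and return (event, value) pairs,
/// normalized by the cycle count (except for cycles itself).
fn perf_counters(wasm_path: &str) -> std::io::Result<Vec<(String, f64)>> {
    let output = Command::new("perf")
        .args([
            "stat",
            "-x", ",", // machine-readable CSV output
            "-e",
            "instructions,cycles,cache-misses,branch-misses,dTLB-load-misses,page-faults",
            "--", "wasmtime", wasm_path,
        ])
        .output()?;

    // perf writes its counter lines to stderr as `value,unit,event,...`.
    let mut raw = Vec::new();
    for line in String::from_utf8_lossy(&output.stderr).lines() {
        let mut fields = line.split(',');
        if let (Some(value), _, Some(event)) = (fields.next(), fields.next(), fields.next()) {
            if let Ok(v) = value.trim().parse::<f64>() {
                raw.push((event.trim().to_string(), v));
            }
        }
    }

    // Normalize every counter by the cycle count, as described above.
    let cycles = raw
        .iter()
        .find(|(e, _)| e == "cycles")
        .map(|(_, v)| *v)
        .unwrap_or(1.0);
    Ok(raw
        .into_iter()
        .map(|(e, v)| if e == "cycles" { (e, v) } else { (e, v / cycles) })
        .collect())
}
```

In practice we would take the mean of several such runs, as noted above, and average the normalized values.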
Here is an initial brainstorm of potential candidate benchmark programs:
- The wasmboy Game Boy emulator
- The `rustfmt` Rust formatter
- A markdown-to-HTML translator
- Resizing images
- Transcoding video
- `bzip2` and/or `brotli` compression
- Some of the libraries that Firefox is sandboxing with Lucet and Cranelift
- A series of queries against a SQLite database
- The Twiggy size profiler for WebAssembly
- The `wasm-smith` WebAssembly test case generator
- The `wasmparser` WebAssembly parser
- The `wat` WebAssembly assembler
- The `wasmprinter` WebAssembly disassembler
- The Clang C compiler
- The `gcc-loops-wasm` benchmark program from JetStream 2 (adopted, in turn, from LLVM's test suite)
- A WebAssembly interpreter
- Something to do with YoWASP
- Something to do with OpenCV
- Something to do with cryptography
- Some kind of audio filter, processor, decoder, encoder, or synthesizer
- Something to do with regular expressions
- Something to do with protobufs
- Something to do with JSON
- Something to do with XML
This list is in no way complete, and we also might not use any of these! Its sole purpose is to help kickstart the process of coming up with candidates.
NOTE: We should not block this RFC on coming up with a complete list. Finding candidate programs can happen after coming to consensus that we should maintain a benchmark suite, agree on a candidate selection methodology, and merge this RFC.
Seemingly innocuous changes — the order in which object files are passed to the linker, the username of the current OS user, the contents of unix environment variables that aren't ever read by the program — can result in significant performance differences [3]. Often these differences come down to the process's address space layout and its effects on CPU caches and branch predictors. It might seem like a source change sped up a process, but the actual cause might be that two branch instructions no longer live at addresses that share an entry in the CPU's branch target buffer. Our "speed up" is, in other words, accidental: nothing stops us from losing it in the next commit, and nothing guarantees that it is even visible on another machine!
An effective approach to mitigating these effects, introduced by the Stabilizer [5] tool, is to repeatedly randomize code and heap layout while the program is running. This way we avoid getting a single "lucky" or "unlucky" layout and its associated measurement bias. Stabilizer, unfortunately, requires both a Clang compiler plugin and a runtime, and hasn't been updated for the latest LLVM releases. We can, however, take inspiration from it.
The following measures are not perfect (for example, we are missing a way to randomize static data), and there are almost assuredly "unknown unknowns" that aren't addressed. Nonetheless, if we implement all of these mitigations, I think we can have a relatively high degree of confidence in our measurements.
We should write a "shuffling allocator" wrapper that gives (most) heap allocations' addresses a uniform distribution. This helps us avoid measurement bias related to the accidental interaction between allocation order and caches. The Stabilizer paper shows that this can be done by maintaining a 256-element buffer for each size class. When an allocation is requested, make the underlying `malloc` call, choose a random integer `0 <= i < 256`, swap `buffer[i]` with the result of the `malloc`, and return the previous `buffer[i]`. Freeing a heap allocation involves choosing a random integer `0 <= i < 256`, swapping `buffer[i]` with the pointer we were given to free, and then freeing the previous `buffer[i]`. Implementing this is straightforward and I have a prototype already.
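A minimal sketch of that scheme, for a single size class, might look like the following. A real wrapper would hook the global allocator, keep one buffer per size class, and drain the buffers on shutdown; all of that bookkeeping is elided here, and the `rand` crate is assumed.

```rust
use rand::Rng;
use std::alloc::{alloc, dealloc, Layout};

struct ShufflingSizeClass {
    layout: Layout,
    buffer: [*mut u8; 256],
}

impl ShufflingSizeClass {
    /// Pre-fill the buffer with fresh allocations so that every slot holds a
    /// valid pointer to swap with.
    unsafe fn new(layout: Layout) -> Self {
        let mut buffer = [std::ptr::null_mut(); 256];
        for slot in buffer.iter_mut() {
            *slot = alloc(layout);
        }
        ShufflingSizeClass { layout, buffer }
    }

    /// Allocate: make the underlying allocation, swap it into a random slot,
    /// and hand out whichever pointer previously lived in that slot.
    unsafe fn alloc(&mut self) -> *mut u8 {
        let i = rand::thread_rng().gen_range(0..self.buffer.len());
        let fresh = alloc(self.layout);
        std::mem::replace(&mut self.buffer[i], fresh)
    }

    /// Free: swap the pointer being freed into a random slot and release the
    /// pointer that was evicted.
    unsafe fn free(&mut self, ptr: *mut u8) {
        let i = rand::thread_rng().gen_range(0..self.buffer.len());
        let evicted = std::mem::replace(&mut self.buffer[i], ptr);
        dealloc(evicted, self.layout);
    }
}
```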
For Wasmtime's dynamically mapped JIT code, we should investigate adding a Cargo feature that randomizes which pages we map for JIT code as well as the order of JIT'd function bodies placed within those pages.
We should investigate adding a Cargo feature to Cranelift that adds a random amount of extra, unused space to each function's stack frame. This padding would always be a multiple of 16 bytes to keep the stack aligned, and would be at most 4096 bytes in size. This helps us avoid measurement bias related to incidental stack layout.
The benchmark runner should define new random environment variables for each benchmark program execution. This helps us avoid measurement bias related to incidental address space layout effects from environment variables.
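For instance, the runner could pad each child process's environment with variables of random lengths and contents before spawning it. The variable names below are hypothetical, and the `rand` crate is assumed.

```rust
use rand::{distributions::Alphanumeric, Rng};
use std::process::Command;

/// Add `how_many` random, never-read environment variables to a benchmark
/// command so that the size and layout of the environment block differs
/// from sample to sample.
fn add_random_env(cmd: &mut Command, how_many: usize) {
    let mut rng = rand::thread_rng();
    for i in 0..how_many {
        let len: usize = rng.gen_range(1..=4096);
        let value: String = (&mut rng)
            .sample_iter(&Alphanumeric)
            .take(len)
            .map(char::from)
            .collect();
        cmd.env(format!("BENCH_RANDOM_PADDING_{}", i), value);
    }
}
```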
We should investigate building a "shuffling linker". Similar to how the shuffling allocator adds randomness by interposing between `malloc` calls and the underlying allocator, this would add randomness to the order in which object files are passed to the linker by interposing between `rustc` and `ld`. Our shuffling linker will look at the CLI arguments passed to it from `rustc`, shuffle the order of object files, shuffle object files within `.rlib`s (which are `.a` files that also contain a metadata file), and finally invoke the underlying linker. Before running benchmarks, we will create a few different `wasmtime` binaries, each with a different, randomized object file link order, and choose among them when sampling different benchmark executions. This will help us avoid measurement bias related to getting "lucky" or "unlucky" with a single code layout due to the order of object files passed to the linker.
Note that this approach is more coarsely grained than what Stabilizer achieves. Stabilizer randomized code layout at the function granularity, while this approach works at the object file granularity.
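To make the idea concrete, here is a hypothetical `shuffle-ld` wrapper sketch. It assumes the `rand` crate and that the underlying linker can simply be invoked as `ld`; shuffling the members inside `.rlib` archives is left out.

```rust
use rand::seq::SliceRandom;
use std::process::Command;

fn main() {
    let mut args: Vec<String> = std::env::args().skip(1).collect();

    // Find the object-file arguments and shuffle only those, leaving flags
    // and libraries exactly where the compiler driver put them.
    let positions: Vec<usize> = args
        .iter()
        .enumerate()
        .filter(|(_, a)| a.ends_with(".o"))
        .map(|(i, _)| i)
        .collect();
    let mut objects: Vec<String> = positions.iter().map(|&i| args[i].clone()).collect();
    objects.shuffle(&mut rand::thread_rng());
    for (&slot, obj) in positions.iter().zip(objects) {
        args[slot] = obj;
    }

    // Delegate to the real linker with the shuffled argument order.
    let status = Command::new("ld")
        .args(&args)
        .status()
        .expect("failed to invoke underlying linker");
    std::process::exit(status.code().unwrap_or(1));
}
```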
If we are running benchmarks with two different versions of Wasmtime, version A and version B, we should not take all samples from version A and then all samples from version B. We should, instead, intersperse executions of version A and version B. While we should always ensure that the CPU frequency scaling governor is set to "performance" when running benchmarks, this additionally helps us avoid some measurement bias from CPU state transitions that aren't constrained within the duration of process execution, like dynamic CPU throttling due to overheating.
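A sketch of that interleaving, where `run_once` is a hypothetical helper that performs a single benchmark execution with a given `wasmtime` binary and returns its wall time:

```rust
use std::path::Path;
use std::time::Duration;

fn interleaved_samples(
    wasmtime_a: &Path,
    wasmtime_b: &Path,
    samples_per_version: usize,
    run_once: impl Fn(&Path) -> Duration,
) -> (Vec<Duration>, Vec<Duration>) {
    let mut a = Vec::with_capacity(samples_per_version);
    let mut b = Vec::with_capacity(samples_per_version);
    for _ in 0..samples_per_version {
        // Alternating A and B spreads slowly drifting machine state (thermal
        // throttling, background load) evenly across both versions.
        a.push(run_once(wasmtime_a));
        b.push(run_once(wasmtime_b));
    }
    (a, b)
}
```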
- We should set the `LD_BIND_NOW` environment variable.
- We should disable hyperthreading and NUMA.
The worst kind of analysis we can do is something like comparing averages with and without our proposed optimization, and using that to conclude whether the optimization "works". Or, equivalently, eyeballing two bars in a graph and deciding which one looks better. We don't know whether the difference is meaningful or just noise.
A better approach is to use a test of statistical significance to answer the question as to whether the difference is meaningful or not. For comparing two means we would use Student's t-test and for comparing three or more means we would use analysis of variance (ANOVA). These tests allow us to justify statements like "we are 95% confident that change X produces a meaningful difference in execution time".
We aren't, however, solely interested in whether execution time is different or not. We want to know how much of a speed up an optimization yields! A significance test, while an improvement over eyeballing results, is still "asking the wrong question". This leads us to the best approach for analyzing performance results: effect size confidence intervals. This analysis justifies statements like "we are 95% confident that change X produces a 5.5% (+/- 0.8%) speed up compared to the original system". See the "Challenge of Summarizing Results" and "Measuring Speedup" sections of Rigorous Benchmarking in Reasonable Time [4] for further details.
Our benchmark results analysis code should report effect size along with its confidence interval. The primary data visualization we produce should be a bar graph showing speed up or slowdown normalized to the control, with the confidence intervals overlaid. The R language has an off-the-shelf package for working with effect size confidence intervals, but we can also always implement this analysis ourselves should we choose another statistics and plotting language or framework.
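Should we implement the analysis ourselves, a percentile bootstrap is one simple (if less rigorous) way to attach a confidence interval to the measured speedup. The sketch below assumes the `rand` crate and non-empty sample vectors; it is a stand-in for, not a reproduction of, the method described in Rigorous Benchmarking in Reasonable Time [4].

```rust
use rand::Rng;

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Returns (point estimate, lower bound, upper bound) for the speedup
/// mean(a) / mean(b) at the requested confidence level (e.g. 0.95).
fn bootstrap_speedup_ci(a: &[f64], b: &[f64], confidence: f64) -> (f64, f64, f64) {
    let mut rng = rand::thread_rng();
    let resamples = 10_000;
    let mut ratios = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        // Resample each group with replacement and record the ratio of means.
        let ra: Vec<f64> = (0..a.len()).map(|_| a[rng.gen_range(0..a.len())]).collect();
        let rb: Vec<f64> = (0..b.len()).map(|_| b[rng.gen_range(0..b.len())]).collect();
        ratios.push(mean(&ra) / mean(&rb));
    }
    ratios.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let alpha = (1.0 - confidence) / 2.0;
    let lo = ratios[(alpha * resamples as f64) as usize];
    let hi = ratios[((1.0 - alpha) * resamples as f64) as usize - 1];
    (mean(a) / mean(b), lo, hi)
}
```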
For optimal developer ergonomics, it should be possible to run the benchmark suite remotely by leaving a special comment on a GitHub pull request:
@bytecodealliance-bot bench
This should trigger a GitHub Action that runs the benchmark suite on both the `main` branch and the pull request's branch, and leaves a comment on the pull request with the results. This will require dedicated hardware. We will also likely want to restrict this capability to users on an explicit allow list.
For greater control over knobs, the number of samples taken, which cargo features are enabled, etc., developers may want to run the benchmark suite locally. Doing so should be as easy as cloning the repository and executing `cargo run`:
$ git clone https://github.com/bytecodealliance/benchmarks
$ cd benchmarks/
$ cargo run
The developer could also pass extra CLI arguments to `cargo run` to specify, for example, the path to a local Wasmtime checkout and a branch name, that only the small benchmark workloads should be used, or that only a particular benchmark should be run.
Finally, the benchmark runner should remain usable even in situations where the user's OS doesn't expose APIs for reading hardware performance counters (i.e. macOS). It shouldn't be all or nothing; the runner should still be able to report wall times at minimum.
What is described above is a significant amount of work. Like all such engineering efforts, we should gain utility along the way, and shouldn't have to wait until everything is 100% complete before we receive any return on our engineering investment. Below is a dependency graph depicting chunks of work and milestones. Milestones are shown in rectangular boxes, while work chunks are shown in ellipses. An edge from A to B means that A blocks B.
Here is a short description of each milestone:
- MVP: In order to reach an MVP where we are able to get some kind of return on our engineering investments, we'll need an initial benchmark runner, an initial set of candidate programs, and finally a simple analysis that can tell us whether a change is statistically significant or not. This is enough to be useful for some ad-hoc local experiments. The MVP does not mitigate measurement bias, it does not have a representative and diverse corpus, and it does not integrate with GitHub Actions or automatically detect regressions on the `main` branch.

  Note that the MVP is the only point where we need to implement multiple chunks of work before we start getting any returns on our engineering investment. Every additional chunk of work after the MVP will immediately start providing benefits as it is implemented, and the other milestones only exist to organize and categorize efforts.

- Measurement Bias Mitigated: This group of work covers everything required to implement our measurement bias mitigations. Once completed, we can be confident that source changes are the cause of any speed up or slowdown, and not any incidental code, data, or heap layout changes.

- Excellent Developer Experience: This group of work covers everything we need for an excellent developer experience and integrated workflow. When completed, we will automatically be notified of any accidental performance regressions, and we will be able to test specific PRs if we suspect they might have an impact on performance.

- Representative and Diverse Corpus: This group of work covers everything required to build a representative set of candidate benchmark programs to choose from and to select a subset of them for inclusion in the benchmark suite. We would choose the subset such that it doesn't contain duplicate workloads but still covers the range of workloads represented in the full set of candidates. Upon completion we can be sure that our benchmarks make efficient use of experiment time, and that speed ups on our benchmark suite translate into speed ups in real world applications.

- Statistically Sound and Rigorous Analysis: The MVP analysis will only do a simple test of significance for the difference in performance with and without a given change. This will tell us whether any difference really exists or is just noise, but it won't tell us what we really want to know: how much faster or slower did it get?! Implementing an effect size confidence interval analysis will give us exactly that information.

- Benchmark Suite Complete: All done! Once completed, all of our benchmark suite goals will have been met. We can safely kick our feet back and relax until we want to update the benchmark candidate pool to include programs that use features like interface types.
While doing the MVP first is really the only hard requirement on the ordering of milestones (and of work items within milestones), I recommend focusing efforts in this order: the MVP first; excellent developer experience second, so that developers get "hooked"; then the representative and diverse corpus and the statistically sound and rigorous analysis, in either order; and finally the measurement bias mitigations.
There are a handful of existing WebAssembly benchmark suites:
These benchmark suites focus on the Web and, even when they don't require a full Web browser, they assume a JavaScript host. They assume they can import Web or JavaScript functions and their benchmark runners are written in JavaScript. This is not suitable for a standalone WebAssembly runtime like Wasmtime, and depending on these extra moving parts would harm developer ergonomics.
The Sightglass benchmark suite has been used to benchmark Lucet's (and, by extension, Cranelift's) generated code. It uses the C implementations of the Computer Language Benchmarks Game (formerly the Great Programming Language Shootout) compiled to WebAssembly and then compiled to native code with Lucet. These programs are admittedly toy programs and do not cover the variety of real world programs we intend to support. It is an open question whether we can or should evolve Sightglass's runner for this new benchmark suite.
Finally, it is not clear that any of these suites are representative and diverse since, to the best of my knowledge, none used a statistical approach in their benchmark program selection methodology.
Why not contribute to the standard `webassembly/benchmarks` suite?
While the `webassembly/benchmarks` suite currently assumes a Web environment, or at least a JavaScript host, we could conceivably send pull requests upstream to make some portion of the benchmarks run in standalone WASI environments. But this is additional work on our part that ultimately does not bring additional benefit. Finally, in the roughly year and a half since the project started, only a single benchmark program has been added to the repository.
Aside: I'd like to thank Ben Titzer for collecting and sharing the references pointed to in `webassembly/benchmarks`. They have been super inspiring and helpful!
SPEC CPU is a commonly used benchmark suite of C, C++, and Fortran programs. It is praised for having large, real world programs. However, it also has a few downsides:
- It is not open source, and distributing the suite is legally questionable. A license for SPEC CPU costs $1000.
- It would likely require additional porting effort to target WebAssembly and WASI.
- It takes quite a long time to run the full suite, and research has shown that it effectively contains duplicate workloads, meaning that the extra run time isn't time well spent [8].
Because wall time is what we ultimately care about. If we only measured instructions retired, we wouldn't be able to measure potential speed ups from, for example, adding parallelism.
The intention is for Lucet to merge into Wasmtime, so we won't need to benchmark Lucet at all eventually.
- Are there additional metrics we should consider, either when characterizing candidates to select which programs we want to include in the suite, or when measuring a benchmark program's performance?
- What other candidate benchmark programs should we consider? (Once again: we shouldn't block merging this RFC on answering this question, but we should agree on selection methodology.)
- Are there additional requirements that candidate benchmark programs should meet to be considered for inclusion in the benchmark suite? Anything missing, overlooked, or left implicit?
- What should the interface between a candidate benchmark program and the benchmark runner be? This opens up a lot of inter-related questions. Should we just measure `wasmtime benchmark.wasm` at a process level? If so, how do we separate compilation, instantiation, and execution? Do we want to allow benchmark programs to set up and tear down state between samples, so that an OpenCV candidate, for example, could exclude the I/O required to read its model from the measurement? This would imply that the benchmark program either doesn't export a `main` function and instead exports `setup`, `tear_down`, and `run` functions, or that it imports `"bench" "start"` and `"bench" "end"` functions. If we do this, do we want to allow multiple samples from the same process? (Rigorous Benchmarking in Reasonable Time [4] can help us decide.)
- Are there additional points we've missed where we should add randomization to mitigate another source of potential measurement bias?
- Can, and should, we evolve Sightglass's benchmark runner into what we need for this new benchmark suite?
- Should we replace wall time with cycles? This way our measurements should be less affected by CPU frequency scaling. On the other hand, wall time is what we ultimately care about.
- Should the benchmark runner disable hyperthreading? This will lead to less noisy results, but also a potentially less realistic benchmarking environment.
- Cranelift already has some timing infrastructure to measure which passes compile time is spent in. Can we enable this infrastructure when benchmarking compile times and integrate its results into our analyses?
- We already noted under the "Developer Ergonomics" section that the benchmark runner should remain usable even in environments where it can't record all the same metrics it might be able to on a different OS or architecture. But which platforms will have first-class automatic regression testing, and how much work will we do to add support for measuring a particular metric across platforms? One potential (and likely?) answer is that we will only support reading hardware performance counters on x86-64 Linux and will only run automatic performance regression testing on x86-64 Linux. All other platform and architecture combinations will support measuring wall time for developers to run local experiments, but won't have automatic regression testing.
1. Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century, by Blackburn et al. Describes the motivation, design, and benchmark program selection methodology of the DaCapo benchmark suite for the JVM. They recorded metrics for each candidate benchmark program and then used principal component analysis to choose a statistically representative and diverse subset for inclusion in the benchmark suite.

2. Renaissance: Benchmarking Suite for Parallel Applications on the JVM, by Prokopec et al. Describes the motivation, design, and benchmark program selection methodology of the Renaissance benchmark suite for the JVM, designed to fill gaps that were found in DaCapo roughly a decade later. They also used principal component analysis, but chose more modern characterization metrics with an eye towards parallelism and concurrency.

3. Producing Wrong Data Without Doing Anything Obviously Wrong!, by Mytkowicz et al. Shows that seemingly innocuous, uncontrolled variables (e.g. the order of object files passed to the linker, the starting address of the stack, the size of unix environment variables) can produce large amounts of measurement bias, invalidating experiments.

4. Rigorous Benchmarking in Reasonable Time, by Kalibera et al. Presents a method to design statistically rigorous experiments that use experimental time efficiently. Provides answers to questions like: how many iterations should I do within a process execution? How many process executions should I do with the same binary? How many times should I recompile the binary with a different code layout?

5. Stabilizer: Statistically Sound Performance Evaluation, by Curtsinger et al. Presents a runtime and LLVM compiler plugin for avoiding measurement bias from fixed address space layouts by randomizing code layout, static data placement, and heap allocations.

6. SPEC CPU v8 Benchmark Search Program, by the Standard Performance Evaluation Corporation. The detailed, step-by-step SPEC CPU benchmark program submission process, as well as an enumeration of the properties they are searching for in a candidate and the potential application areas they are interested in.

7. Virtual Machine Warmup Blows Hot and Cold, by Barrett et al. Questions the common belief that programs exhibit a slower warm-up phase, where the JIT observes dynamic behavior or shared libraries are loaded and initialized, followed by a faster steady-state phase. It turns out that many programs never reach a steady state! This leads to samples that are not statistically independent, invalidating analyses.

8. A Workload Characterization of the SPEC CPU2017 Benchmark Suite, by Limaye et al. Characterizes the instruction mix, execution performance, and branch and cache behaviors of the SPEC CPU2017 benchmark suite. Finds that many workloads are effectively identical, and suggests ways for researchers to use PCA to choose a representative subset of the suite.