@@ -16,7 +16,7 @@ performance, as well as introduce additional non-determinism into measurements d
to parallelism.

This document describes what we did to build continuous benchmarking
- websites<sup>[1]</sup> that take a controlled experiment approach to running the
+ websites<sup>[1][^1]</sup> that take a controlled experiment approach to running the
operf-micro and sandmark benchmarking suites against tracked git branches of
the OCaml compiler.

@@ -43,7 +43,7 @@ future work that could follow.

### Survey of OCaml benchmarks

- #### Operf-micro<sup>[2]</sup>
+ #### Operf-micro<sup>[2][^2]</sup>

Operf-micro collates a collection of micro benchmarks originally put together
to help with the development of flambda. This tool compiles a micro-benchmark
@@ -58,9 +58,9 @@ iteration of the function.
The tool is designed to have minimal external dependencies and can be run
against a test compiler binary without additional packages. The experimental
method of running for multiple embedded iterations is also used by Jane
- Street’s `core_bench`<sup>[3]</sup> and Haskell’s criterion<sup>[4]</sup>.
+ Street’s `core_bench`<sup>[3][^3]</sup> and Haskell’s criterion<sup>[4][^4]</sup>.
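
To make the embedded-iterations approach concrete, here is a minimal OCaml
sketch of the idea (an illustration only, not the operf-micro or `core_bench`
API): the function under test is run in a tight loop at several iteration
counts, and the per-call cost is estimated from how elapsed time grows with
the count.

```ocaml
(* Illustrative sketch of embedded-iteration timing; not the operf-micro API.
   [bench] is a stand-in workload; [Sys.opaque_identity] keeps the compiler
   from optimising the repeated calls away. *)
let time_batch f n =
  let start = Sys.time () in
  for _i = 1 to n do
    ignore (Sys.opaque_identity (f ()))
  done;
  Sys.time () -. start

let () =
  let bench () = List.fold_left ( + ) 0 (List.init 1_000 (fun i -> i)) in
  List.iter
    (fun n -> Printf.printf "%8d iterations: %.6fs\n" n (time_batch bench n))
    [ 1_000; 10_000; 100_000 ]
```

Fitting elapsed time against the iteration count then gives a per-iteration
estimate that is less sensitive to timer resolution and one-off costs.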

- #### operf-macro<sup>[5]</sup> and sandmark<sup>[6]</sup>
+ #### operf-macro<sup>[5][^5]</sup> and sandmark<sup>[6][^6]</sup>

Operf-macro provides a framework to define and run macro-benchmarks. The
benchmarks themselves are OPAM packages and the compiler versions are OPAM
@@ -95,9 +95,9 @@ The intention is to expand the set of multicore specific tests to include larger

#### Python (CPython and PyPy)

- The Python community has continuous benchmarking both for their CPython<sup>[7]</sup> and
- PyPy<sup>[8]</sup> runtimes. The benchmarking data is collected by running the Python
- Performance Benchmark Suite<sup>[9]</sup>. The open-source web application Codespeed<sup>[10]</sup>
+ The Python community has continuous benchmarking both for their CPython<sup>[7][^7]</sup> and
+ PyPy<sup>[8][^8]</sup> runtimes. The benchmarking data is collected by running the Python
+ Performance Benchmark Suite<sup>[9][^9]</sup>. The open-source web application Codespeed<sup>[10][^10]</sup>
provides a front end to navigate and visualize the results. Codespeed is
written in Python on Django and provides views into the results via a revision
table, a timeline by benchmark and comparison over all benchmarks between
@@ -110,7 +110,7 @@ LLVM has a collection of micro-benchmarks in their C/C++ compiler test suite.
These micro-benchmarks are built on the google-benchmark library and produce
statistics that can be easily fed downstream. They also support external tests
(for example SPEC CPU 2006). LLVM have performance tracking software called LNT
- which drives a continuous monitoring site<sup>[11]</sup>. While the LNT software is
+ which drives a continuous monitoring site<sup>[11][^11]</sup>. While the LNT software is
packaged for people to use in other projects, we could not find another project
using it for visualizing performance data, and at first glance it did not look easy
to reuse.
@@ -119,14 +119,14 @@ to reuse.

The Haskell community have performance regression tests that have hardcoded
values which trip a continuous-integration failure. This method has proved
- painful for them<sup>[12]</sup> and they have been looking to change it to a more data
- driven approach<sup>[13]</sup>. At this time they did not seem to have infrastructure
+ painful for them<sup>[12][^12]</sup> and they have been looking to change it to a more data
+ driven approach<sup>[13][^13]</sup>. At this time they did not seem to have infrastructure
running to help them.

#### Rust

Rust have built their own tools to collect benchmark data and present it in a
- web app<sup>[14]</sup>. This tool has some interesting features:
+ web app<sup>[14][^14]</sup>. This tool has some interesting features:

* It measures both compile time and runtime performance.
* They are committing the data from their experiments into a github repo to
@@ -147,7 +147,7 @@ OCaml Labs put together an initial multicore benchmarking site
[http://ocamllabs.io/multicore](http://ocamllabs.io/multicore). This built on
the OCamlPro flambda site by (i) implementing visualization in a single
javascript library; (ii) upgrading to OPAM v2; (iii) updating the
- macro-benchmarks to more recent versions.<sup>[15]</sup> The work presented here
+ macro-benchmarks to more recent versions.<sup>[15][^15]</sup> The work presented here
incorporates the experience of building that site and builds on it.

### What we put together
@@ -167,15 +167,15 @@ data.

Mapping a git branch to a linear timeline requires a bit of care. We use the
following mapping: we take a branch tag and ask for commits using the
- first-parent option<sup>[16]</sup>. We then use the commit date from this as the
+ first-parent option<sup>[16][^16]</sup>. We then use the commit date from this as the
timestamp for a commit. In most repositories this will give a linear code order
that makes sense, even though the development process has been decentralised.
It works well for GitHub-style pull-request development.
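
To illustrate the mapping (the branch name below is a placeholder and the
exact invocation may differ from what our scripts run), git can produce the
linearised commit list and the commit dates directly:

```sh
# List first-parent commits of the tracked branch, oldest first, with the
# commit hash and the commit date used as the timestamp for each point.
git log --first-parent --reverse --pretty=format:'%H %cI' ocaml-trunk
```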

#### Experimental setup

We wanted to try to remove sources of noise in the performance measurements
- where possible. We configured our x86 Linux machines as follows:<sup>[17]</sup>
+ where possible. We configured our x86 Linux machines as follows:<sup>[17][^17]</sup>

* **Hyperthreading**: Hyperthreading was configured off in the machine BIOS
to avoid cross-talk and resource sharing within a CPU core.
@@ -256,7 +256,7 @@ give the same result on a single run. User binaries in the wild will likely run
with ASLR on, but we have configured our environment to make benchmarks
repeatable without the need to take the average of many samples. It remains an
open area to benchmark in reasonable time and allow easy developer interaction
- while investigating performance in the presence of ASLR<sup>[18]</sup>.
+ while investigating performance in the presence of ASLR<sup>[18][^18]</sup>.
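
For reference, the ASLR part of that configuration looks like the sketch
below; these are common Linux knobs rather than necessarily the exact settings
on our machines, and `./benchmark.exe` is a placeholder binary.

```sh
# Disable ASLR system-wide (persists until reboot).
sudo sysctl -w kernel.randomize_va_space=0

# Or disable address-space randomization for a single benchmark run only.
setarch $(uname -m) -R ./benchmark.exe
```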

#### Code layout can matter in surprising ways

@@ -273,13 +273,13 @@ benchmarking suite on the hardware we used, we didn’t see any performance
instability coming from CPU interactions with the instruction or data cache or
main memory. The instruction path memory bottleneck may still be worth
optimizing for and we expect that changes which do optimize this area will be
- seen in the sandmark macro-benchmarks<sup>[19]</sup>. There is a collection of
+ seen in the sandmark macro-benchmarks<sup>[19][^19]</sup>. There is a collection of
approaches being used to improve code layout for instruction memory latency
- using profile layout<sup>[20]</sup>.
+ using profile layout<sup>[20][^20]</sup>.

We were surprised by the size of the performance impact that code layout can
have due to how instructions pass through the front end of an x86 processor on
- their way to being issued<sup>[21]</sup>. On x86, alignment of blocks of instructions and
+ their way to being issued<sup>[21][^21]</sup>. On x86, alignment of blocks of instructions and
how decoded uops are packed into decode stream buffers (DSB) can impact the
performance of smaller hot loops in code. Examples of some of the mechanical
effects are:
@@ -298,13 +298,13 @@ there is a good LLVM dev talk given by Zia Ansari (Intel)

We came across some of these hot-code placement effects in our OCaml
micro-benchmarks when diving into performance swings that we didn’t understand
- by looking at the commit diff. We have an example that can be downloaded<sup>[22]</sup>
+ by looking at the commit diff. We have an example that can be downloaded<sup>[22][^22]</sup>
which alters the alignment of a hot loop in an OCaml microbenchmark (without
changing the individual instructions in the hot loop) that led to a ~15% range
in observed performance on our Xeon E5-2430L machine. From a compiler writer’s
perspective this can be a difficult area; we don’t have any simple solutions
and other compiler communities also struggle with how to tame these
- effects<sup>[23]</sup>.
+ effects<sup>[23][^23]</sup>.

It is important to realise that a code layout change can lead to changes in the
performance of hot loops. The best way to proceed when you see an unexpected
@@ -377,7 +377,7 @@ would be a little work to present visualizations of the data.

LLVM and GCC have a collection of options available to alter the x86 code
alignment. Some of these get enabled at higher optimization levels and other
- options help with debugging.<sup>[24]</sup> For example, it might be useful to be able to
+ options help with debugging.<sup>[24][^24]</sup> For example, it might be useful to be able to
align all jump targets; code size would increase and performance decrease, but
it might allow you to isolate a performance swing as being due to
microstructure alignment when doing a deep dive.
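
As an illustration of the kind of options meant here (GCC shown; the flags and
file name are chosen for the example rather than taken from our setup),
alignment can be forced to a fixed boundary so that a deep dive can separate
alignment luck from a genuine code-generation change:

```sh
# Force 32-byte alignment of functions, loops, jump targets and labels
# when compiling a C test case, to take alignment out of the comparison.
gcc -O2 -falign-functions=32 -falign-loops=32 -falign-jumps=32 \
    -falign-labels=32 -c hot_loop.c -o hot_loop.o
```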