
Commit 6ed8084

try footnote style
1 parent 40bf3be commit 6ed8084

1 file changed: +21 -21 lines changed

apr19.md

Lines changed: 21 additions & 21 deletions
@@ -16,7 +16,7 @@ performance, as well as introduce additional non-determinism into measurements d
 to parallelism.

 This document describes what we did to build continuous benchmarking
-websites[^1] that take a controlled experiment approach to running the
+websites<sup>[1][1]</sup> that take a controlled experiment approach to running the
 operf-micro and sandmark benchmarking suites against tracked git branches of
 the OCaml compiler.

@@ -43,7 +43,7 @@ future work that could follow.

 ### Survey of OCaml benchmarks

-#### Operf-micro[^2]
+#### Operf-micro<sup>[2][2]</sup>

 Operf-micro collates a collection of micro benchmarks originally put together
 to help with the development of flambda. This tool compiles a micro-benchmark
@@ -58,9 +58,9 @@ iteration of the function.
 The tool is designed to have minimal external dependencies and can be run
 against a test compiler binary without additional packages. The experimental
 method of running for multiple embedded iterations is also used by Jane
-Street’s `core_bench`[^3] and Haskell’s criterion[^4].
+Street’s `core_bench`<sup>[3][3]</sup> and Haskell’s criterion<sup>[4][4]</sup>.

-#### operf-macro[^5] and sandmark[^6]
+#### operf-macro<sup>[5][5]</sup> and sandmark<sup>[6][6]</sup>

 Operf-macro provides a framework to define and run macro-benchmarks. The
 benchmarks themselves are OPAM packages and the compiler versions are OPAM
@@ -95,9 +95,9 @@ The intention is to expand the set of multicore specific tests to include larger

 #### Python (CPython and PyPy)

-The Python community has continuous benchmarking both for their CPython[^7] and
-PyPy[^8] runtimes. The benchmarking data is collected by running the Python
-Performance Benchmark Suite[^9]. The open-source web application Codespeed[^10]
+The Python community has continuous benchmarking both for their CPython<sup>[7][7]</sup> and
+PyPy<sup>[8][8]</sup> runtimes. The benchmarking data is collected by running the Python
+Performance Benchmark Suite<sup>[9][9]</sup>. The open-source web application Codespeed<sup>[10][10]</sup>
 provides a front end to navigate and visualize the results. Codespeed is
 written in Python on Django and provides views into the results via a revision
 table, a timeline by benchmark and comparison over all benchmarks between
@@ -110,7 +110,7 @@ LLVM has a collection of micro-benchmarks in their C/C++ compiler test suite.
 These micro-benchmarks are built on the google-benchmark library and produce
 statistics that can be easily fed downstream. They also support external tests
 (for example SPEC CPU 2006). LLVM have performance tracking software called LNT
-which drives a continuous monitoring site[^11]. While the LNT software is
+which drives a continuous monitoring site<sup>[11][11]</sup>. While the LNT software is
 packaged for people to use in other projects, we could not find another project
 using it for visualizing performance data and at first glance did not look easy
 to reuse.
@@ -119,14 +119,14 @@ to reuse.

 The Haskell community have performance regression tests that have hardcoded
 values which trip a continuous-integration failure. This method has proved
-painful for them[^12] and they have been looking to change it to a more data
-driven approach[^13]. At this time they did not seem to have infrastructure
+painful for them<sup>[12][12]</sup> and they have been looking to change it to a more data
+driven approach<sup>[13][13]</sup>. At this time they did not seem to have infrastructure
 running to help them.

 #### Rust

 Rust have built their own tools to collect benchmark data and present it in a
-web app[^14]. This tool has some interesting features:
+web app<sup>[14][14]</sup>. This tool has some interesting features:

 * It measures both compile time and runtime performance.
 * They are committing the data from their experiments into a github repo to
@@ -147,7 +147,7 @@ OCaml Labs put together an initial multicore benchmarking site
 [http://ocamllabs.io/multicore](http://ocamllabs.io/multicore). This built on
 the OCamlPro flambda site by (i) implementing visualization in a single
 javascript library; (ii) upgrading to OPAM v2; (iii) updating the
-macro-benchmarks to more recent version.[^15] The work presented here
+macro-benchmarks to more recent version.<sup>[15][15]</sup> The work presented here
 incorporates the experience of building that site and builds on it.

 ### What we put together
@@ -167,15 +167,15 @@ data.

 Mapping a git branch to a linear timeline requires a bit of care. We use the
 following mapping, we take a branch tag and ask for commits using the
-first-parent option[^16]. We then use the commit date from this as the
+first-parent option<sup>[16][16]</sup>. We then use the commit date from this as the
 timestamp for a commit. In most repositories this will give a linear code order
 that makes sense, even though the development process has been decentralised.
 It works well for Github style pull-requests development.

 #### Experimental setup

 We wanted to try to remove sources of noise in the performance measurements
-where possible. We configured our x86 Linux machines as follows:[^17]
+where possible. We configured our x86 Linux machines as follows:<sup>[17][17]</sup>

 * **Hyperthreading**: Hyperthreading was configured off in the machine BIOS
 to avoid cross-talk and resource sharing for a CPU.
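
The first-parent mapping described in the hunk above can be sketched roughly as follows; this is a minimal illustration, assuming a placeholder branch name `trunk`, not the site's actual tooling:

```python
import subprocess

# Minimal sketch: walk the first-parent chain of a branch and pair each
# commit with its committer date, which serves as that commit's point on
# the linear timeline. "trunk" is a placeholder branch name.
log = subprocess.run(
    ["git", "log", "--first-parent", "--format=%H %cI", "trunk"],
    capture_output=True, text=True, check=True,
).stdout

for line in log.splitlines():
    commit, date = line.split(" ", 1)
    print(date, commit)  # the commit date becomes the timestamp for this commit
```
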
@@ -256,7 +256,7 @@ give the same result on a single run. User binaries in the wild will likely run
 with ASLR on, but we have configured our environment to make benchmarks
 repeatable without the need to take the average of many sample. It remains an
 open area to benchmark in reasonable time and allow easy developer interaction
-while investigating performance under the presence of ASLR[^18].
+while investigating performance under the presence of ASLR<sup>[18][18]</sup>.

 #### Code layout can matter in surprising ways

273273
instability coming from CPU interactions with the instruction or data cache or
274274
main memory. The instruction path memory bottleneck may still be worth
275275
optimizing for and we expect that changes which do optimize this area will be
276-
seen in the sandmark macro-benchmarks[^19]. There are a collection of
276+
seen in the sandmark macro-benchmarks<sup>[19][19]</sup>. There are a collection of
277277
approaches being used to improve code layout for instruction memory latency
278-
using profile layout[^20].
278+
using profile layout<sup>[20][20]</sup>.
279279

280280
We were surprised by the size of the performance impact that code layout can
281281
have due to how instructions pass through the front end of an x86 processor on
282-
their way to being issued[^21]. On x86 alignment of blocks of instructions and
282+
their way to being issued<sup>[21][21]</sup>. On x86 alignment of blocks of instructions and
283283
how decoded uops are packed into decode stream buffers (DSB) can impact the
284284
performance of smaller hot loops in code. Examples of some of the mechanical
285285
effects are:
@@ -298,13 +298,13 @@ there is a good LLVM dev talk given by Zia Ansari (Intel)
298298

299299
We came across some of these hot-code placement effects in our OCaml
300300
micro-benchmarks when diving into performance swings that we didn’t understand
301-
by looking at the commit diff. We have an example that can be downloaded[^22]
301+
by looking at the commit diff. We have an example that can be downloaded<sup>[22][22]</sup>
302302
which alters the alignment of a hot loop in an OCaml microbenchmark (without
303303
changing the individual instructions in the hot loop) that lead to a ~15% range
304304
in observed performance on our Xeon E5-2430L machine. From a compiler writer’s
305305
perspective this can be a difficult area, we don’t have any simple solutions
306306
and other compiler communities also struggle with how to tame these
307-
effects[^23].
307+
effects<sup>[23][23]</sup>.
308308

309309
It is important to realise that a code layout change can lead to changes in the
310310
performance of hot loops. The best way to proceed when you see an unexpected
@@ -377,7 +377,7 @@ would be a little work to present visualizations of the data.
377377

378378
LLVM and GCC have a collection of options available to alter the x86 code
379379
alignment. Some of these get enabled at higher optimization levels and other
380-
options help with debugging.[^24] For example, it might be useful to be able to
380+
options help with debugging.<sup>[24][24]</sup> For example, it might be useful to be able to
381381
align all jump targets; code size would increase and performance decrease, but
382382
it might allow you to isolate a performance swing to being due to
383383
microstructure alignment when doing a deep dive.
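
As a rough illustration of the alignment experiment suggested in the hunk above, the sketch below rebuilds one binary with and without GCC's alignment flags and compares timings; `bench.c`, the flag values, and the single-run timing are placeholders for illustration, not part of the original document:

```python
import subprocess
import time

# Illustrative sketch: build the same benchmark twice, once at -O2 and once
# with extra code-alignment flags (real GCC options), then compare wall-clock
# time to check whether a performance swing comes from layout/alignment.
builds = {
    "default": ["-O2"],
    "aligned": ["-O2", "-falign-functions=32", "-falign-loops=32", "-falign-jumps=32"],
}

for name, flags in builds.items():
    subprocess.run(["gcc", *flags, "-o", f"bench-{name}", "bench.c"], check=True)
    t0 = time.perf_counter()
    subprocess.run([f"./bench-{name}"], check=True)
    print(f"{name}: {time.perf_counter() - t0:.3f}s")
```
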

0 commit comments
