Merge pull request #489 from aalexand/fast-83
Publish abseil.io/fast/{62,72,79,83}; other minor fixes.
manshreck authored Sep 12, 2024
2 parents 73fa238 + 1f2d4da commit d01ab40
Showing 10 changed files with 1,233 additions and 47 deletions.
6 changes: 3 additions & 3 deletions _posts/2023-03-02-fast-39.md
@@ -159,9 +159,9 @@ There are a number of things that commonly go wrong when writing benchmarks. The
following is a non-exhaustive list:

* Data being resident. Workloads have large footprints, a small footprint may
- be instruction bound, whereas the true workload could be memory bound.
- There's a trade-off between adding instructions to save some memory costs vs
- placing data in memory to save instructions.
+ be instruction bound, whereas the true workload could be
+ [memory bound](/fast/62). There's a trade-off between adding instructions to
+ save some memory costs vs placing data in memory to save instructions.
* Small instruction cache footprint. Google codes typically have large
instruction footprints. Benchmarks are often cache resident. The `memcmp`
and TCMalloc examples go directly to this.
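The "data being resident" pitfall above is easiest to see with a hot/cold pair of microbenchmarks, the same pattern the fast/75 changes further down describe for SwissMap. Below is a minimal sketch, assuming the Google Benchmark framework and `absl::flat_hash_set`; the table counts and sizes are illustrative rather than taken from any of the posts.

<pre class="prettyprint code">
#include <cstdint>
#include <random>
#include <vector>

#include "absl/container/flat_hash_set.h"
#include "benchmark/benchmark.h"

// "Hot" variant: a single table that stays cache resident across iterations.
void BM_LookupHot(benchmark::State& state) {
  absl::flat_hash_set<int64_t> table;
  for (int64_t i = 0; i < 1000; ++i) table.insert(i);
  int64_t key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(table.contains(key));
    key = (key + 1) % 1000;
  }
}
BENCHMARK(BM_LookupHot);

// "Cold" variant: many tables, with a random one picked per iteration, so
// lookups are more likely to miss in cache, like a large-footprint workload.
void BM_LookupCold(benchmark::State& state) {
  std::vector<absl::flat_hash_set<int64_t>> tables(1024);
  for (auto& table : tables) {
    for (int64_t i = 0; i < 1000; ++i) table.insert(i);
  }
  std::minstd_rand rng;
  for (auto _ : state) {
    auto& table = tables[rng() % tables.size()];
    benchmark::DoNotOptimize(table.contains(static_cast<int64_t>(rng() % 1000)));
  }
}
BENCHMARK(BM_LookupCold);
</pre>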
51 changes: 26 additions & 25 deletions _posts/2023-09-14-fast-7.md
@@ -25,29 +25,30 @@ in TCMalloc, put protocol buffers into other protocol buffers, or to handle
branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
- servers are, that is, how much useful work they accomplish per CPU-second, byte
- of RAM, disk IOPS, or by using hardware accelerators. While measuring a job's
- resource consumption is easy, it's harder to tell just how much useful work it's
- accomplishing without help.
-
- A task's CPU usage going up could mean it suffered a performance regression or
- that it's simply busier. Consider a plot of a service's CPU usage against time,
- breaking down the total CPU usage of two versions of the binary. We cannot
- determine from casual inspection what caused the increase in CPU usage, whether
- this is from an increase in workload (serving more videos per unit time) or a
- decrease in efficiency (some added, needless protocol conversion per video).
-
- To determine what is really happening we need a productivity metric which
+ servers are, that is, how much useful work they accomplish per CPU-second,
+ byte-second of RAM, disk operation, or by using hardware accelerators. While
+ measuring a job's resource consumption is easy, it's harder to tell just how
+ much useful work it's accomplishing without help.
+
+ A task's CPU usage going up could mean the task has suffered a performance
+ regression or that it's simply busier. Consider a plot of a service's CPU usage
+ against time, breaking down the total CPU usage of two versions of the binary.
+ We cannot determine from casual inspection what caused the increase in CPU
+ usage, whether this is from an increase in workload (serving more videos per
+ unit time) or a decrease in efficiency (some added, needless protocol conversion
+ per video).
+
+ To determine what is really happening, we need a productivity metric which
captures the amount of real work completed. If we know the number of cat videos
- processed we can easily determine whether we are getting more, or less, real
- work done per CPU-second (or byte of RAM, disk operation, or hardware
+ processed, we can easily determine whether we are getting more, or less, real
+ work done per CPU-second (or byte-second of RAM, disk operation, or hardware
accelerator time). These metrics are referred to as *application productivity
metrics*, or *APMs*.

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well-represented by existing metrics:

- * **Application speedups through core library changes**:
+ * **Application speedups through core infrastructure changes**:

As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
"one classical approach is to increase the efficiency of an allocator to
@@ -79,21 +80,21 @@ optimizations* that are not well-represented by existing metrics:
In future hardware generations, we expect to replace calls to memcpy with
microcode-optimized `rep movsb` instructions that are faster than any
handwritten assembly sequence we can come up with. We expect `rep movsb` to
- have low IPC: It's a single instruction that replaces an entire copy loop of
- instructions!
+ have low IPC (instructions per cycle): It's a single instruction that
+ replaces an entire copy loop of instructions!

Using these new instructions can be triggered by optimizing the source code
or through compiler enhancements that improve vectorization.

- Focusing on MIPS or IPC would cause us to prefer any implementation that
- executes a large number of instructions, even if those instructions take
- longer to execute to copy `n` bytes.
+ Focusing on MIPS (millions of instructions per second) or IPC would cause us
+ to prefer any implementation that executes a large number of instructions,
+ even if those instructions take longer to execute to copy `n` bytes.

In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
- `--march=haswell` shows a MIPS regression while simultaneously improving
- *application productivity improvement*. These instructions can do more work
- per instruction, however, replacing several low latency instructions may
- mean that *average* instruction latency increases. If we had 10 million
+ `--march=haswell` shows a MIPS regression while simultaneously *improving
+ application productivity*. These instructions can do more work per
+ instruction, however, replacing several low latency instructions may mean
+ that *average* instruction latency increases. If we had 10 million
instructions and 10 ms per query, we may now have 8 million instructions
taking only 9 ms per query. QPS is up and MIPS would go down.

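To make the closing example concrete, here is the arithmetic implied by the figures quoted in that paragraph (our own worked numbers, not part of the post):

<pre class="prettyprint code">
Before: 10,000,000 instructions/query, 10 ms/query
        queries per second per core = 1 / 0.010 s = 100
        MIPS = 10,000,000 * 100 / 1,000,000       = 1,000
After:   8,000,000 instructions/query,  9 ms/query
        queries per second per core = 1 / 0.009 s ~ 111
        MIPS =  8,000,000 * 111 / 1,000,000       ~   889
</pre>

QPS rises by roughly 11% while MIPS falls by roughly 11%, which is exactly the kind of improvement an application productivity metric captures and an instruction-rate metric misses.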
2 changes: 1 addition & 1 deletion _posts/2023-09-30-fast-52.md
@@ -151,7 +151,7 @@ code base, keeping that choice optimal over time.

For some uses, this strategy is infeasible. `my::super_fast_string` will
probably never replace `std::string` because the latter is so entrenched and the
- impedence mismatch of living in an independent string ecosystem exceeds the
+ impedance mismatch of living in an independent string ecosystem exceeds the
benefits. Multiple
[vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)
suffer from impedance mismatch--costly interconversions can overwhelm the
12 changes: 6 additions & 6 deletions _posts/2023-10-20-fast-70.md
@@ -105,12 +105,12 @@ developing optimizations. For example,
[limitations](/fast/39), but as long as we're mindful of those pitfalls,
they can get us directional information much more quickly.
* PMU counters can tell us rich details about [bottlenecks in code](/fast/53)
- such as cache misses or branch mispredictions. Seeing changes in these
- metrics can be a *proxy* that helps us understand the effect. For example,
- inserting software prefetches can reduce cache miss events, but in a memory
- bandwidth-bound program, the prefetches can go no faster than the "speed of
- light" of the memory bus. Similarly, eliminating a stall far off the
- critical path might have little bearing on the application's actual
+ such as [cache misses](/fast/62) or branch mispredictions. Seeing changes in
+ these metrics can be a *proxy* that helps us understand the effect. For
+ example, inserting software prefetches can reduce cache miss events, but in
+ a memory bandwidth-bound program, the prefetches can go no faster than the
+ "speed of light" of the memory bus. Similarly, eliminating a stall far off
+ the critical path might have little bearing on the application's actual
performance.

If we expect to improve an application's performance, we might start by taking a
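As a rough illustration of the prefetch point in the hunk above: a software prefetch issued a few elements ahead of an indirect access pattern can hide cache-miss latency, but it cannot move data faster than the memory bus. This sketch uses the GCC/Clang `__builtin_prefetch` builtin; the function name and prefetch distance are hypothetical, not from the post.

<pre class="prettyprint code">
#include <cstddef>
#include <cstdint>

// Sums values reached through an index array, prefetching a few elements
// ahead. In a latency-bound loop this can reduce cache-miss stalls; in a
// bandwidth-bound loop it mostly reshuffles when the misses are paid.
int64_t SumIndirect(const int32_t* index, const int64_t* values, size_t n) {
  constexpr size_t kPrefetchDistance = 8;  // Tuning knob, workload dependent.
  int64_t total = 0;
  for (size_t i = 0; i < n; ++i) {
    if (i + kPrefetchDistance < n) {
      __builtin_prefetch(&values[index[i + kPrefetchDistance]]);
    }
    total += values[index[i]];
  }
  return total;
}
</pre>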
9 changes: 5 additions & 4 deletions _posts/2023-11-10-fast-74.md
@@ -63,11 +63,12 @@ trying to reduce easily understood costs would have led to a worse outcome.
### Artificial costs in TCMalloc

Starting from 2016, work commenced to reduce TCMalloc's cost. Much of this early
- work involved making things generally faster, by removing instructions, avoiding
- cache misses, and shortening lock critical sections.
+ work involved making things generally faster, by removing instructions,
+ [avoiding cache misses](/fast/62), and shortening lock critical sections.

- During this process, a prefetch was added on its fast path. GWP even indicates
- that 70%+ of cycles in the `malloc` fastpath are
+ During this process, a prefetch was added on its fast path. Our
+ [fleet-wide profiling](https://research.google/pubs/google-wide-profiling-a-continuous-profiling-infrastructure-for-data-centers/)
+ even indicates that 70%+ of cycles in the `malloc` fastpath are
[spent on that prefetch](/fast/39)! Guided by the costs we could easily
understand, we might be tempted to remove it. TCMalloc's fast path would appear
cheaper, but other code somewhere else would experience a cache miss and
16 changes: 8 additions & 8 deletions _posts/2023-11-10-fast-75.md
@@ -227,11 +227,11 @@ when the processor's execution
## Understanding the speed of light

Before embarking too far on optimizing the `ParseVarint32` routine, we might
- want to identify the "speed of light" of the hardware. For varint parsing, this
- is *approximately* `memcpy`, since we are reading serialized bytes and writing
- the (mostly expanded) bytes into the parsed data structure. While this is not
- quite the operation we're interested in, it's readily available off the shelf
- without much effort.
+ want to identify the ["speed of light"](/fast/72) of the hardware. For varint
+ parsing, this is *approximately* `memcpy`, since we are reading serialized bytes
+ and writing the (mostly expanded) bytes into the parsed data structure. While
+ this is not quite the operation we're interested in, it's readily available off
+ the shelf without much effort.

<pre class="prettyprint code">
void BM_Memcpy(benchmark::State& state) {
@@ -432,9 +432,9 @@ This pattern is also used in
example, it has a "hot" SwissMap benchmark that performs its operations
(lookups, etc.) against a single instance and a "cold" SwissMap benchmark where
we randomly pick a table on each iteration. The latter makes it more likely that
- we'll incur a cache miss. Hardware counters and the benchmark framework's
- [support for collecting them](/fast/53) can help diagnose and explain
- performance differences.
+ we'll [incur a cache miss](/fast/62). Hardware counters and the benchmark
+ framework's [support for collecting them](/fast/53) can help diagnose and
+ explain performance differences.

Even though the extremes are not representative, they can help us frame how to
tackle the problem. We might find an optimization for one extreme and then work
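For the "speed of light" comparison in this file, a `memcpy` throughput benchmark is the off-the-shelf baseline; the post's own `BM_Memcpy` is truncated in the hunk above. The following is only a sketch of what such a baseline might look like, assuming the Google Benchmark framework; the buffer sizes and bytes-processed bookkeeping are our choices.

<pre class="prettyprint code">
#include <cstdint>
#include <cstring>
#include <vector>

#include "benchmark/benchmark.h"

// Copies a buffer of state.range(0) bytes per iteration and reports
// throughput, giving a rough upper bound to compare parsing against.
void BM_MemcpyBaseline(benchmark::State& state) {
  const size_t size = state.range(0);
  std::vector<char> src(size, 'x');
  std::vector<char> dst(size);
  for (auto _ : state) {
    memcpy(dst.data(), src.data(), size);
    benchmark::DoNotOptimize(dst.data());
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * size);
}
BENCHMARK(BM_MemcpyBaseline)->Range(64, 1 << 20);
</pre>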