Publish abseil.io/fast/{62,72,79,83}; other minor fixes.
aalexand committed Sep 12, 2024
1 parent 73fa238 commit 1f2d4da
Showing 10 changed files with 1,233 additions and 47 deletions.
6 changes: 3 additions & 3 deletions _posts/2023-03-02-fast-39.md
@@ -159,9 +159,9 @@ There are a number of things that commonly go wrong when writing benchmarks. The
following is a non-exhaustive list:

* Data being resident. Workloads have large footprints, a small footprint may
-  be instruction bound, whereas the true workload could be memory bound.
-  There's a trade-off between adding instructions to save some memory costs vs
-  placing data in memory to save instructions.
+  be instruction bound, whereas the true workload could be
+  [memory bound](/fast/62). There's a trade-off between adding instructions to
+  save some memory costs vs placing data in memory to save instructions.
* Small instruction cache footprint. Google codes typically have large
instruction footprints. Benchmarks are often cache resident. The `memcmp`
and TCMalloc examples go directly to this.
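A minimal sketch of a benchmark that sidesteps the data-residency pitfall above by sweeping the working-set size, assuming the open-source Google Benchmark framework; the element counts and the random-access pattern are illustrative, not taken from the posts:

<pre class="prettyprint code">
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;numeric&gt;
#include &lt;random&gt;
#include &lt;vector&gt;

#include "benchmark/benchmark.h"

// Loads from random positions in a buffer whose size is a benchmark argument.
// With a small working set the benchmark stays cache resident; with a large
// one most iterations miss in cache, which is closer to the true workload.
void BM_RandomAccess(benchmark::State& state) {
  const size_t num_elements = static_cast&lt;size_t&gt;(state.range(0));
  std::vector&lt;uint32_t&gt; data(num_elements, 1);
  std::vector&lt;uint32_t&gt; order(num_elements);
  std::iota(order.begin(), order.end(), 0);
  std::mt19937 rng(42);
  std::shuffle(order.begin(), order.end(), rng);

  size_t i = 0;
  uint64_t sum = 0;
  for (auto _ : state) {
    sum += data[order[i]];  // Random position within the working set.
    i = (i + 1 == num_elements) ? 0 : i + 1;
    benchmark::DoNotOptimize(sum);
  }
}
// 4 Ki elements stay cache resident; 64 Mi elements (hundreds of MiB across
// the two vectors) cannot, so most loads go to DRAM.
BENCHMARK(BM_RandomAccess)->Range(4 << 10, 64 << 20);
</pre>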
51 changes: 26 additions & 25 deletions _posts/2023-09-14-fast-7.md
@@ -25,29 +25,30 @@ in TCMalloc, put protocol buffers into other protocol buffers, or to handle
branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
-servers are, that is, how much useful work they accomplish per CPU-second, byte
-of RAM, disk IOPS, or by using hardware accelerators. While measuring a job's
-resource consumption is easy, it's harder to tell just how much useful work it's
-accomplishing without help.
-
-A task's CPU usage going up could mean it suffered a performance regression or
-that it's simply busier. Consider a plot of a service's CPU usage against time,
-breaking down the total CPU usage of two versions of the binary. We cannot
-determine from casual inspection what caused the increase in CPU usage, whether
-this is from an increase in workload (serving more videos per unit time) or a
-decrease in efficiency (some added, needless protocol conversion per video).
-
-To determine what is really happening we need a productivity metric which
+servers are, that is, how much useful work they accomplish per CPU-second,
+byte-second of RAM, disk operation, or by using hardware accelerators. While
+measuring a job's resource consumption is easy, it's harder to tell just how
+much useful work it's accomplishing without help.
+
+A task's CPU usage going up could mean the task has suffered a performance
+regression or that it's simply busier. Consider a plot of a service's CPU usage
+against time, breaking down the total CPU usage of two versions of the binary.
+We cannot determine from casual inspection what caused the increase in CPU
+usage, whether this is from an increase in workload (serving more videos per
+unit time) or a decrease in efficiency (some added, needless protocol conversion
+per video).
+
+To determine what is really happening, we need a productivity metric which
captures the amount of real work completed. If we know the number of cat videos
-processed we can easily determine whether we are getting more, or less, real
-work done per CPU-second (or byte of RAM, disk operation, or hardware
+processed, we can easily determine whether we are getting more, or less, real
+work done per CPU-second (or byte-second of RAM, disk operation, or hardware
accelerator time). These metrics are referred to as *application productivity
metrics*, or *APMs*.
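As a simplified illustration, an APM boils down to a ratio of completed work to resources consumed. The sketch below divides a work counter by the process's CPU time; the counter name and the `printf`-style reporting are assumptions for the example, not the actual APM interface.

<pre class="prettyprint code">
#include &lt;sys/resource.h&gt;

#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;cstdio&gt;

// Hypothetical work counter, incremented once per cat video processed.  The
// name and the plain-printf reporting are illustrative only.
std::atomic&lt;int64_t&gt; videos_processed{0};

// Total CPU-seconds (user + system) consumed by this process so far.
double CpuSecondsUsed() {
  rusage usage{};
  getrusage(RUSAGE_SELF, &usage);
  return usage.ru_utime.tv_sec + usage.ru_utime.tv_usec * 1e-6 +
         usage.ru_stime.tv_sec + usage.ru_stime.tv_usec * 1e-6;
}

void ReportProductivity() {
  const double cpu_seconds = CpuSecondsUsed();
  if (cpu_seconds <= 0) return;
  // Work per CPU-second stays comparable across releases even when the
  // offered load changes, unlike raw CPU usage.
  std::printf("videos per CPU-second: %.2f\n",
              videos_processed.load() / cpu_seconds);
}
</pre>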

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well-represented by existing metrics:

-* **Application speedups through core library changes**:
+* **Application speedups through core infrastructure changes**:

As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
"one classical approach is to increase the efficiency of an allocator to
@@ -79,21 +79,21 @@ optimizations* that are not well-represented by existing metrics:
In future hardware generations, we expect to replace calls to memcpy with
microcode-optimized `rep movsb` instructions that are faster than any
handwritten assembly sequence we can come up with. We expect `rep movsb` to
-  have low IPC: It's a single instruction that replaces an entire copy loop of
-  instructions!
+  have low IPC (instructions per cycle): It's a single instruction that
+  replaces an entire copy loop of instructions!

Using these new instructions can be triggered by optimizing the source code
or through compiler enhancements that improve vectorization.

-  Focusing on MIPS or IPC would cause us to prefer any implementation that
-  executes a large number of instructions, even if those instructions take
-  longer to execute to copy `n` bytes.
+  Focusing on MIPS (millions of instructions per second) or IPC would cause us
+  to prefer any implementation that executes a large number of instructions,
+  even if those instructions take longer to execute to copy `n` bytes.

In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
-  `--march=haswell` shows a MIPS regression while simultaneously improving
-  *application productivity improvement*. These instructions can do more work
-  per instruction, however, replacing several low latency instructions may
-  mean that *average* instruction latency increases. If we had 10 million
+  `--march=haswell` shows a MIPS regression while simultaneously *improving
+  application productivity*. These instructions can do more work per
+  instruction, however, replacing several low latency instructions may mean
+  that *average* instruction latency increases. If we had 10 million
instructions and 10 ms per query, we may now have 8 million instructions
taking only 9 ms per query. QPS is up and MIPS would go down.
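  Plugging in those numbers makes the divergence concrete (the per-core QPS
  figures assume one query in flight per core): 10 million instructions at
  10 ms per query is 1,000 MIPS and 100 QPS per core, while 8 million
  instructions at 9 ms per query is roughly 889 MIPS and 111 QPS per core, so
  throughput rises while MIPS falls.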

2 changes: 1 addition & 1 deletion _posts/2023-09-30-fast-52.md
@@ -151,7 +151,7 @@ code base, keeping that choice optimal over time.

For some uses, this strategy is infeasible. `my::super_fast_string` will
probably never replace `std::string` because the latter is so entrenched and the
-impedence mismatch of living in an independent string ecosystem exceeds the
+impedance mismatch of living in an independent string ecosystem exceeds the
benefits. Multiple
[vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)
suffer from impedance mismatch--costly interconversions can overwhelm the
12 changes: 6 additions & 6 deletions _posts/2023-10-20-fast-70.md
@@ -105,12 +105,12 @@ developing optimizations. For example,
[limitations](/fast/39), but as long as we're mindful of those pitfalls,
they can get us directional information much more quickly.
* PMU counters can tell us rich details about [bottlenecks in code](/fast/53)
-  such as cache misses or branch mispredictions. Seeing changes in these
-  metrics can be a *proxy* that helps us understand the effect. For example,
-  inserting software prefetches can reduce cache miss events, but in a memory
-  bandwidth-bound program, the prefetches can go no faster than the "speed of
-  light" of the memory bus. Similarly, eliminating a stall far off the
-  critical path might have little bearing on the application's actual
+  such as [cache misses](/fast/62) or branch mispredictions. Seeing changes in
+  these metrics can be a *proxy* that helps us understand the effect. For
+  example, inserting software prefetches can reduce cache miss events, but in
+  a memory bandwidth-bound program, the prefetches can go no faster than the
+  "speed of light" of the memory bus. Similarly, eliminating a stall far off
+  the critical path might have little bearing on the application's actual
performance.
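
As a concrete illustration of the software-prefetch idea in the bullet above, here is a minimal sketch using the GCC/Clang `__builtin_prefetch` intrinsic; the data layout and prefetch distance are assumptions made for the example:

<pre class="prettyprint code">
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Each element is reached through a pointer, so without prefetching every
// iteration can stall on a cache miss.  The prefetch distance below is a
// guess to be tuned empirically, not a recommended value.
int64_t SumIndirect(const std::vector&lt;const int64_t*&gt;& items) {
  constexpr size_t kPrefetchDistance = 8;
  int64_t sum = 0;
  for (size_t i = 0; i < items.size(); ++i) {
    if (i + kPrefetchDistance < items.size()) {
      // Hint the hardware to start fetching a future element now (read-only,
      // low temporal locality).  This hides latency but consumes bandwidth.
      __builtin_prefetch(items[i + kPrefetchDistance], /*rw=*/0, /*locality=*/0);
    }
    sum += *items[i];
  }
  return sum;
}
</pre>

As the bullet notes, fewer cache-miss events from such a prefetch are only a proxy: in a bandwidth-bound loop, or off the critical path, it may not move end-to-end performance at all.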

If we expect to improve an application's performance, we might start by taking a
9 changes: 5 additions & 4 deletions _posts/2023-11-10-fast-74.md
@@ -63,11 +63,12 @@ trying to reduce easily understood costs would have led to a worse outcome.
### Artificial costs in TCMalloc

Starting from 2016, work commenced to reduce TCMalloc's cost. Much of this early
-work involved making things generally faster, by removing instructions, avoiding
-cache misses, and shortening lock critical sections.
+work involved making things generally faster, by removing instructions,
+[avoiding cache misses](/fast/62), and shortening lock critical sections.

-During this process, a prefetch was added on its fast path. GWP even indicates
-that 70%+ of cycles in the `malloc` fastpath are
+During this process, a prefetch was added on its fast path. Our
+[fleet-wide profiling](https://research.google/pubs/google-wide-profiling-a-continuous-profiling-infrastructure-for-data-centers/)
+even indicates that 70%+ of cycles in the `malloc` fastpath are
[spent on that prefetch](/fast/39)! Guided by the costs we could easily
understand, we might be tempted to remove it. TCMalloc's fast path would appear
cheaper, but other code somewhere else would experience a cache miss and
16 changes: 8 additions & 8 deletions _posts/2023-11-10-fast-75.md
@@ -227,11 +227,11 @@ when the processor's execution
## Understanding the speed of light

Before embarking too far on optimizing the `ParseVarint32` routine, we might
want to identify the "speed of light" of the hardware. For varint parsing, this
is *approximately* `memcpy`, since we are reading serialized bytes and writing
the (mostly expanded) bytes into the parsed data structure. While this is not
quite the operation we're interested in, it's readily available off the shelf
without much effort.
want to identify the ["speed of light"](/fast/72) of the hardware. For varint
parsing, this is *approximately* `memcpy`, since we are reading serialized bytes
and writing the (mostly expanded) bytes into the parsed data structure. While
this is not quite the operation we're interested in, it's readily available off
the shelf without much effort.

<pre class="prettyprint code">
void BM_Memcpy(benchmark::State& state) {
@@ -432,9 +432,9 @@ This pattern is also used in
example, it has a "hot" SwissMap benchmark that performs its operations
(lookups, etc.) against a single instance and a "cold" SwissMap benchmark where
we randomly pick a table on each iteration. The latter makes it more likely that
-we'll incur a cache miss. Hardware counters and the benchmark framework's
-[support for collecting them](/fast/53) can help diagnose and explain
-performance differences.
+we'll [incur a cache miss](/fast/62). Hardware counters and the benchmark
+framework's [support for collecting them](/fast/53) can help diagnose and
+explain performance differences.
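
A minimal sketch of that hot/cold pattern, assuming the open-source Google Benchmark framework and `absl::flat_hash_map` (the open-source SwissMap); the table count and sizes are illustrative guesses rather than the values the real benchmarks use:

<pre class="prettyprint code">
#include &lt;cstdint&gt;
#include &lt;random&gt;
#include &lt;vector&gt;

#include "absl/container/flat_hash_map.h"
#include "benchmark/benchmark.h"

// "Hot": every lookup hits the same table, so it stays cache resident.
void BM_LookupHot(benchmark::State& state) {
  absl::flat_hash_map&lt;int64_t, int64_t&gt; table;
  for (int64_t i = 0; i < 1000; ++i) table[i] = i;
  std::mt19937 rng(42);
  for (auto _ : state) {
    benchmark::DoNotOptimize(table.find(rng() % 1000));
  }
}
BENCHMARK(BM_LookupHot);

// "Cold": picking one of many tables on each iteration makes it likely that
// the accessed buckets are no longer in cache.
void BM_LookupCold(benchmark::State& state) {
  constexpr int kTables = 1024;  // Illustrative count, not the real value.
  std::vector&lt;absl::flat_hash_map&lt;int64_t, int64_t&gt;&gt; tables(kTables);
  for (auto& table : tables) {
    for (int64_t i = 0; i < 1000; ++i) table[i] = i;
  }
  std::mt19937 rng(42);
  for (auto _ : state) {
    auto& table = tables[rng() % kTables];
    benchmark::DoNotOptimize(table.find(rng() % 1000));
  }
}
BENCHMARK(BM_LookupCold);
</pre>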

Even though the extremes are not representative, they can help us frame how to
tackle the problem. We might find an optimization for one extreme and then work