
btf: introduce caching string table to speed up ext info loading #1809


Merged · 2 commits · Jun 24, 2025

Conversation

@paulcacheux (Contributor) commented Jun 20, 2025

#1772 introduced lazy loading of strings from string tables. It helps a lot when querying types, but when parsing line infos, we end up querying the same offset multiple times (especially for the file name), and we currently create a new string each time.

This PR deduplicates these allocations by caching strings when parsing line info.

Benchmarks:

$ go test -exec=sudo -benchmem -run=^\$ -bench=^BenchmarkELF github.com/cilium/ebpf

With the ELF file present in this repo (testdata/loader-el.elf):

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf
            │ before.txt  │           after.txt           │
            │   sec/op    │   sec/op     vs base          │
ELFLoader-4   379.4µ ± 4%   377.9µ ± 2%  ~ (p=0.912 n=10)

            │  before.txt  │              after.txt              │
            │     B/op     │     B/op      vs base               │
ELFLoader-4   187.8Ki ± 0%   186.8Ki ± 0%  -0.49% (p=0.000 n=10)

            │ before.txt  │             after.txt              │
            │  allocs/op  │  allocs/op   vs base               │
ELFLoader-4   1.470k ± 0%   1.402k ± 0%  -4.63% (p=0.000 n=10)

On an 18 MiB ELF file:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf
            │ before.txt  │             after.txt              │
            │   sec/op    │   sec/op     vs base               │
ELFLoader-4   83.81m ± 2%   79.17m ± 2%  -5.54% (p=0.000 n=10)

            │  before.txt  │              after.txt               │
            │     B/op     │     B/op      vs base                │
ELFLoader-4   88.81Mi ± 0%   67.67Mi ± 0%  -23.80% (p=0.000 n=10)

            │  before.txt  │              after.txt              │
            │  allocs/op   │  allocs/op   vs base                │
ELFLoader-4   1132.2k ± 0%   834.3k ± 0%  -26.31% (p=0.000 n=10)

@paulcacheux paulcacheux marked this pull request as ready for review June 20, 2025 17:16
@ti-mo (Collaborator) commented Jun 23, 2025

Curious, couldn't package unique help out here? It already implements this kind of cache and should allow us to keep the string header on the stack (Lookup() would return a Handle[string]).

Otherwise, I'd simply put this lookup cache in stringTable itself.

@dylandreimerink (Member):
I tried this locally to see how other benchmarks do as well, specifically those that use a lot of BTF strings. The InspektorGadget benchmark, for example, shows a less favorable result:

pkg: github.com/cilium/ebpf/btf
                   │ before.txt  │             cache.txt              │
                   │   sec/op    │   sec/op     vs base               │
InspektorGadget-20   11.02m ± 3%   11.75m ± 4%  +6.60% (p=0.000 n=10)

                   │  before.txt  │              cache.txt               │
                   │     B/op     │     B/op      vs base                │
InspektorGadget-20   7.169Mi ± 0%   7.976Mi ± 0%  +11.26% (p=0.000 n=10)

                   │ before.txt  │              cache.txt              │
                   │  allocs/op  │  allocs/op   vs base                │
InspektorGadget-20   24.30k ± 0%   20.24k ± 0%  -16.73% (p=0.000 n=10)

We do decrease allocations per-op, but also increase CPU time and memory usage per op.

I think I have a good alternative. The issue we are trying to solve is the copy that happens when we convert the []byte from the table into a string; your patch avoids repeats of it via memoization. But I think we can eliminate it completely by using unsafe.String. With it, no copy of the underlying byte slice happens: the returned string simply points into the string table. This is only valid if the byte slice is immutable, which is the case for these string tables.

The patch is simple:

diff --git a/btf/strings.go b/btf/strings.go
index f4a3957..264a202 100644
--- a/btf/strings.go
+++ b/btf/strings.go
@@ -7,6 +7,7 @@ import (
        "io"
        "maps"
        "strings"
+       "unsafe"
 )
 
 type stringTable struct {
@@ -55,7 +56,7 @@ func (st *stringTable) Lookup(offset uint32) (string, error) {
        }
 
        b, err := st.lookupSlow(offset)
-       return string(b), err
+       return unsafe.String(&b[0], len(b)), err
 }
 
 func (st *stringTable) LookupBytes(offset uint32) ([]byte, error) {

Comparing all three results:

benchstat before.txt cache.txt unsafe.txt
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf
cpu: 13th Gen Intel(R) Core(TM) i7-13800H
             │ before.txt  │           cache.txt           │             unsafe.txt             │
             │   sec/op    │   sec/op     vs base          │   sec/op     vs base               │
ELFLoader-20   206.2µ ± 7%   219.2µ ± 6%  ~ (p=0.075 n=10)   199.6µ ± 4%  -3.19% (p=0.015 n=10)

             │  before.txt  │              cache.txt              │             unsafe.txt              │
             │     B/op     │     B/op      vs base               │     B/op      vs base               │
ELFLoader-20   187.9Ki ± 0%   192.0Ki ± 0%  +2.16% (p=0.000 n=10)   183.5Ki ± 0%  -2.35% (p=0.000 n=10)

             │ before.txt  │             cache.txt              │             unsafe.txt              │
             │  allocs/op  │  allocs/op   vs base               │  allocs/op   vs base                │
ELFLoader-20   1.471k ± 0%   1.359k ± 0%  -7.61% (p=0.000 n=10)   1.259k ± 0%  -14.41% (p=0.000 n=10)

pkg: github.com/cilium/ebpf/btf
                   │ before.txt  │             cache.txt              │          unsafe.txt           │
                   │   sec/op    │   sec/op     vs base               │   sec/op     vs base          │
InspektorGadget-20   11.02m ± 3%   11.75m ± 4%  +6.60% (p=0.000 n=10)   11.06m ± 3%  ~ (p=0.971 n=10)

                   │  before.txt  │              cache.txt               │             unsafe.txt              │
                   │     B/op     │     B/op      vs base                │     B/op      vs base               │
InspektorGadget-20   7.169Mi ± 0%   7.976Mi ± 0%  +11.26% (p=0.000 n=10)   6.985Mi ± 0%  -2.57% (p=0.000 n=10)

                   │  before.txt  │              cache.txt               │             unsafe.txt              │
                   │  allocs/op   │  allocs/op    vs base                │  allocs/op   vs base                │
InspektorGadget-20   24.305k ± 0%   20.238k ± 0%  -16.73% (p=0.000 n=10)   9.134k ± 0%  -62.42% (p=0.000 n=10)

Using the unsafe string instead, we get larger memory savings and no CPU penalty.

@paulcacheux (Contributor, Author):
I would be fine with the unsafe solution if you are confident it's safe.

@lmb (Collaborator) commented Jun 24, 2025

Wouldn't that mean that the backing slice for the string table is kept live, preventing garbage collection?

Dylan: can you tell why the inspektor gadget benchmark slows down? Seems counterintuitive to me.

@dylandreimerink (Member):

Wouldn't that mean that the backing slice for the string table is kept live, preventing garbage collection?

Yes. Though it only matters if these strings are the only reference holding onto that string table. Question is, is it worth it.

Dylan: can you tell why the inspektor gadget benchmark slows down? Seems counterintuitive to me.

pprof without cache: [screenshot omitted]

pprof with cache: [screenshot omitted]

The main caller of stringTable.Lookup during the InspektorGadget benchmark seems to be inflateType. Once a type is inflated, we use it directly instead of re-fetching the string from the string table.

So it seems that in this case we get a lot of strings that are only looked up once: we pay for the map accesses and insertions, but the cache hit rate is too low to give a CPU benefit.

@paulcacheux (Contributor, Author):

I'm not sure I understand: my PR only touches the code that reads ext info/line info. How can this be visible in the inflateType code path?

@dylandreimerink (Member):

I'm not sure I understand: my PR only touches the code that reads ext info/line info. How can this be visible in the inflateType code path?

Ah, I think I may have misrepresented your changes. During my testing I had inlined the caching into the string table directly, as suggested by Timo.

That is not a fair comparison; caching only certain access patterns might indeed make a difference here. Let me re-run with your exact patch set.

Sorry, that is on me.

@dylandreimerink (Member) commented Jun 24, 2025

Re-ran the benchmark with your exact branch. With those changes the InspektorGadget benchmark is totally unaffected, which makes sense in hindsight. Sorry for the confusion and derailment.

goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/btf
cpu: 13th Gen Intel(R) Core(TM) i7-13800H
                   │ before.txt  │           cache.txt           │
                   │   sec/op    │   sec/op     vs base          │
InspektorGadget-20   11.16m ± 4%   11.48m ± 4%  ~ (p=0.247 n=10)

                   │  before.txt  │           cache.txt            │
                   │     B/op     │     B/op      vs base          │
InspektorGadget-20   7.169Mi ± 0%   7.169Mi ± 0%  ~ (p=0.871 n=10)

                   │ before.txt  │            cache.txt            │
                   │  allocs/op  │  allocs/op   vs base            │
InspektorGadget-20   24.30k ± 0%   24.30k ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

@paulcacheux (Contributor, Author) commented Jun 24, 2025

No worries at all, happy that you could confirm it. So yeah, the goal with caching only ext info accesses is to keep the big win from #1772 that significantly sped up vmlinux parsing.

Looking at the code:

  • using unsafe would mean keeping a reference to the full string table alive directly from the line info structure, which is probably not what we want (as @lmb clearly said)
  • using unique would require a change to Line, but since none of its fields are exported today, we could do that.

One thing to note with unique: before Go 1.25, unique.Make(string(someByteArray)) actually allocates a copy to create the string (as opposed to the someMap[string(someByteArray)] pattern, which does not).

This is important, because if you were to add something like

func (st *stringTable) LookupUnique(offset uint32) (unique.Handle[string], error) {
	b, err := st.LookupBytes(offset)
	if err != nil {
		return unique.Make(""), err
	}
	return unique.Make(string(b)), nil
}

this would actually allocate a string on each call anyway today.

@lmb (Collaborator) commented Jun 24, 2025

I'd rather not do the unsafe hackery if it's not necessary.

Couple of observations:

  1. The ext info parser still builds large intermediate slices of btfLineInfo, etc. before transforming them into slices of the actual types we want. Removing this would give another big boost.
  2. You could drive down the cost of memoization by only doing it for strings which we know are highly likely to be repeated: section names and file names. Individual lines will probably only ever exist once in a given file. (Consider introducing LookupCached instead of shadowing Lookup).

(2) seems like an easy win; it'd be nice if you could give that a try. (1) is probably a bit more work, which you could tackle separately if you are interested.

P.S. Not sure unique is worth it vs. just keeping a separate map for the duration of the parse. Seems simpler / faster.

@paulcacheux (Contributor, Author):

Definitely happy to do (2), this was actually something I thought about. I can do (1) in a later PR if that's OK with you.

@dylandreimerink (Member) commented Jun 24, 2025

I'd rather not do the unsafe hackery if it's not necessary.

The whole suggestion of using unsafe.String stemmed from my misunderstanding of what we were doing here. Given that the patch as it stands improves specific cases and does not deteriorate others, I suggest we go ahead with the change as is.

@dylandreimerink (Member):

Not sure unique is worth it vs. just keeping a separate map for the duration of the parse. Seems simpler / faster.

The unique package hashes the whole string, but since we already have a uint32 that identifies the string, a custom cache like this one is much better.

@paulcacheux paulcacheux force-pushed the paulcacheux/cache-string-table branch 2 times, most recently from ffe102b to fb7ed7c on June 24, 2025 12:28
when parsing line infos, we end up querying the same offset multiple
times (especially the file name), and we currently create a new string
each time. This commit deduplicates this allocation by caching the
string.

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
@paulcacheux paulcacheux force-pushed the paulcacheux/cache-string-table branch from fb7ed7c to 928be4b on June 24, 2025 12:30
@paulcacheux (Contributor, Author):

Alright, so I simplified the PR as suggested, moving the caching into stringTable directly; it is only used through LookupCached.
My testing shows that it only really makes sense for the file name (obviously: multiple lines per file), but also for the line content itself. The section name, when parsed in the context of ext infos, is only parsed once (one btf_ext_info_sec per section).

stringTable has to be safe for concurrent use since it is shared
between multiple decoders. Document this and add a lock to LookupCached
to make it safe.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
@lmb (Collaborator) left a comment


Made a small change because stringTable is meant to be concurrency safe.

@paulcacheux (Contributor, Author):

Ah yes, I clearly missed that, thanks!

@lmb lmb merged commit 3de0bd4 into cilium:main Jun 24, 2025
18 checks passed
@lmb (Collaborator) commented Jun 24, 2025

Thanks!
