
btf: introduce caching string table to speed up ext info loading #1809


Merged · 2 commits · Jun 24, 2025

Conversation

@paulcacheux (Contributor) commented Jun 20, 2025

#1772 introduced lazy loading of strings from string tables. It helps a lot when querying types, but when parsing line infos, we end up querying the same offset multiple times (especially for the file name), and we currently create a new string each time.

This PR deduplicates these allocations by caching strings when parsing line info.

Benchmarks:

$ go test -exec=sudo -benchmem -run=^\$ -bench=^BenchmarkELF github.com/cilium/ebpf

With the ELF file present in this repo (testdata/loader-el.elf):

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf
            │ before.txt  │           after.txt           │
            │   sec/op    │   sec/op     vs base          │
ELFLoader-4   379.4µ ± 4%   377.9µ ± 2%  ~ (p=0.912 n=10)

            │  before.txt  │              after.txt              │
            │     B/op     │     B/op      vs base               │
ELFLoader-4   187.8Ki ± 0%   186.8Ki ± 0%  -0.49% (p=0.000 n=10)

            │ before.txt  │             after.txt              │
            │  allocs/op  │  allocs/op   vs base               │
ELFLoader-4   1.470k ± 0%   1.402k ± 0%  -4.63% (p=0.000 n=10)

On an 18 MiB ELF file:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf
            │ before.txt  │             after.txt              │
            │   sec/op    │   sec/op     vs base               │
ELFLoader-4   83.81m ± 2%   79.17m ± 2%  -5.54% (p=0.000 n=10)

            │  before.txt  │              after.txt               │
            │     B/op     │     B/op      vs base                │
ELFLoader-4   88.81Mi ± 0%   67.67Mi ± 0%  -23.80% (p=0.000 n=10)

            │  before.txt  │              after.txt              │
            │  allocs/op   │  allocs/op   vs base                │
ELFLoader-4   1132.2k ± 0%   834.3k ± 0%  -26.31% (p=0.000 n=10)

@paulcacheux paulcacheux marked this pull request as ready for review June 20, 2025 17:16
@ti-mo (Collaborator) commented Jun 23, 2025

Curious, couldn't package unique help out here? It already implements this kind of cache and should allow us to keep the string header on the stack (Lookup() would return a Handle[string]).

Otherwise, I'd simply put this lookup cache in stringTable itself.

@dylandreimerink (Member):
I tried this locally to see how other benchmarks do as well, specifically those that use a lot of BTF strings. The InspektorGadget benchmark, for example, shows a less favorable result:

pkg: github.com/cilium/ebpf/btf
                   │ before.txt  │             cache.txt              │
                   │   sec/op    │   sec/op     vs base               │
InspektorGadget-20   11.02m ± 3%   11.75m ± 4%  +6.60% (p=0.000 n=10)

                   │  before.txt  │              cache.txt               │
                   │     B/op     │     B/op      vs base                │
InspektorGadget-20   7.169Mi ± 0%   7.976Mi ± 0%  +11.26% (p=0.000 n=10)

                   │ before.txt  │              cache.txt              │
                   │  allocs/op  │  allocs/op   vs base                │
InspektorGadget-20   24.30k ± 0%   20.24k ± 0%  -16.73% (p=0.000 n=10)

We do decrease allocations per-op, but also increase CPU time and memory usage per op.

I think I have a good alternative. The issue we are trying to solve is the copy that happens when we convert the []byte from the table into a string; your patch avoids repeats of it via memoization. But I think we can eliminate it completely by using unsafe.String. With it, no copy of the underlying byte slice happens: the returned string simply points into the string table. This is only valid if the byte slice is immutable, which is the case for these string tables.

The patch is simple:

diff --git a/btf/strings.go b/btf/strings.go
index f4a3957..264a202 100644
--- a/btf/strings.go
+++ b/btf/strings.go
@@ -7,6 +7,7 @@ import (
        "io"
        "maps"
        "strings"
+       "unsafe"
 )
 
 type stringTable struct {
@@ -55,7 +56,7 @@ func (st *stringTable) Lookup(offset uint32) (string, error) {
        }
 
        b, err := st.lookupSlow(offset)
-       return string(b), err
+       return unsafe.String(&b[0], len(b)), err
 }
 
 func (st *stringTable) LookupBytes(offset uint32) ([]byte, error) {

Comparing all three results:

benchstat before.txt cache.txt unsafe.txt
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf
cpu: 13th Gen Intel(R) Core(TM) i7-13800H
             │ before.txt  │           cache.txt           │             unsafe.txt             │
             │   sec/op    │   sec/op     vs base          │   sec/op     vs base               │
ELFLoader-20   206.2µ ± 7%   219.2µ ± 6%  ~ (p=0.075 n=10)   199.6µ ± 4%  -3.19% (p=0.015 n=10)

             │  before.txt  │              cache.txt              │             unsafe.txt              │
             │     B/op     │     B/op      vs base               │     B/op      vs base               │
ELFLoader-20   187.9Ki ± 0%   192.0Ki ± 0%  +2.16% (p=0.000 n=10)   183.5Ki ± 0%  -2.35% (p=0.000 n=10)

             │ before.txt  │             cache.txt              │             unsafe.txt              │
             │  allocs/op  │  allocs/op   vs base               │  allocs/op   vs base                │
ELFLoader-20   1.471k ± 0%   1.359k ± 0%  -7.61% (p=0.000 n=10)   1.259k ± 0%  -14.41% (p=0.000 n=10)

pkg: github.com/cilium/ebpf/btf
                   │ before.txt  │             cache.txt              │          unsafe.txt           │
                   │   sec/op    │   sec/op     vs base               │   sec/op     vs base          │
InspektorGadget-20   11.02m ± 3%   11.75m ± 4%  +6.60% (p=0.000 n=10)   11.06m ± 3%  ~ (p=0.971 n=10)

                   │  before.txt  │              cache.txt               │             unsafe.txt              │
                   │     B/op     │     B/op      vs base                │     B/op      vs base               │
InspektorGadget-20   7.169Mi ± 0%   7.976Mi ± 0%  +11.26% (p=0.000 n=10)   6.985Mi ± 0%  -2.57% (p=0.000 n=10)

                   │  before.txt  │              cache.txt               │             unsafe.txt              │
                   │  allocs/op   │  allocs/op    vs base                │  allocs/op   vs base                │
InspektorGadget-20   24.305k ± 0%   20.238k ± 0%  -16.73% (p=0.000 n=10)   9.134k ± 0%  -62.42% (p=0.000 n=10)

Using the unsafe string instead, we get larger memory savings and no CPU penalty.

@paulcacheux (Contributor, Author):
I would be fine with the unsafe solution if you are confident it's safe.

@lmb (Collaborator) commented Jun 24, 2025

Wouldn't that mean that the backing slice for the string table is kept live, preventing garbage collection?

Dylan: can you tell why the inspektor gadget benchmark slows down? Seems counterintuitive to me.

@dylandreimerink (Member):

Wouldn't that mean that the backing slice for the string table is kept live, preventing garbage collection?

Yes. Though it only matters if these strings are the only reference holding onto that string table. Question is, is it worth it.

Dylan: can you tell why the inspektor gadget benchmark slows down? Seems counterintuitive to me.

pprof without cache: [screenshot omitted]

pprof with cache: [screenshot omitted]

The main caller of stringTable.Lookup during the InspektorGadget benchmark seems to be inflateType. Once a type is inflated, we use it directly instead of re-fetching the string from the string table.

So it seems that in this case we get a lot of strings that are only looked up once: we pay for the map accesses and insertions, but the cache hit rate is too low to give a CPU benefit.

@paulcacheux (Contributor, Author):

I'm not sure I understand: my PR only touches the code that reads ext info/line info. How can this be visible in the inflateType code path?

@dylandreimerink (Member):

I'm not sure I understand: my PR only touches the code that reads ext info/line info. How can this be visible in the inflateType code path?

Ah, I think I may have misrepresented your changes. During my testing I had inlined the caching into the string table directly, as suggested by Timo.

That is not a fair comparison; caching only certain access patterns might indeed make a difference here. Let me re-run with your exact patch set.

Sorry, that is on me.

@dylandreimerink (Member) commented Jun 24, 2025

Re-ran the benchmark with your exact branch. With those changes the InspektorGadget benchmark is totally unaffected, which makes sense in hindsight. Sorry for the confusion and derailment.

goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/btf
cpu: 13th Gen Intel(R) Core(TM) i7-13800H
                   │ before.txt  │           cache.txt           │
                   │   sec/op    │   sec/op     vs base          │
InspektorGadget-20   11.16m ± 4%   11.48m ± 4%  ~ (p=0.247 n=10)

                   │  before.txt  │           cache.txt            │
                   │     B/op     │     B/op      vs base          │
InspektorGadget-20   7.169Mi ± 0%   7.169Mi ± 0%  ~ (p=0.871 n=10)

                   │ before.txt  │            cache.txt            │
                   │  allocs/op  │  allocs/op   vs base            │
InspektorGadget-20   24.30k ± 0%   24.30k ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

@paulcacheux (Contributor, Author) commented Jun 24, 2025

No worries at all, happy that you could confirm it. So yeah, the goal with caching only ext info accesses is to keep the big win from #1772 that significantly sped up vmlinux parsing.

Looking at the code:

  • using unsafe would mean keeping a reference to the full string table alive directly from the line info structure, which is probably not what we want (as @lmb clearly said)
  • using unique would require a change to Line, but since none of its fields are exported today, we could do that.

One thing to note with unique: before Go 1.25, unique.Make(string(someByteArray)) actually allocates a copy to create the string (as opposed to the someMap[string(someByteArray)] pattern, which does not).

This is important, because if you were to add something like

func (st *stringTable) LookupUnique(offset uint32) (unique.Handle[string], error) {
	b, err := st.LookupBytes(offset)
	if err != nil {
		return unique.Make(""), err
	}
	return unique.Make(string(b)), nil
}

this would actually allocate a string on each call anyway today.

@lmb (Collaborator) commented Jun 24, 2025

I'd rather not do the unsafe hackery if it's not necessary.

Couple of observations:

  1. The ext info parser still builds large intermediate slices of btfLineInfo, etc. before transforming them into slices of the actual types we want. Removing this would give another big boost.
  2. You could drive down the cost of memoization by only doing it for strings which we know are highly likely to be repeated: section names and file names. Individual lines will probably only ever exist once in a given file. (Consider introducing LookupCached instead of shadowing Lookup).

(2) seems like an easy win; it'd be nice if you could give that a try. (1) is probably a bit more work, which you could tackle separately if you are interested.

P.S. Not sure unique is worth it vs. just keeping a separate map for the duration of the parse. Seems simpler / faster.

@paulcacheux (Contributor, Author):

Definitely happy to do (2), this was actually something I thought about. I can do (1) in a later PR if that's OK with you.

@dylandreimerink (Member) commented Jun 24, 2025

I'd rather not do the unsafe hackery if it's not necessary.

The whole suggestion of using unsafe.String stemmed from my misunderstanding of what we were doing here. Given that the patch as it stands improves specific cases and does not deteriorate others, I suggest we go ahead with the change as is.

@dylandreimerink (Member):

Not sure unique is worth it vs. just keeping a separate map for the duration of the parse. Seems simpler / faster.

The unique package hashes the whole string, but since we already have a uint32 that identifies the string, a custom cache like this one is much better.

@paulcacheux paulcacheux force-pushed the paulcacheux/cache-string-table branch 2 times, most recently from ffe102b to fb7ed7c on June 24, 2025 12:28
when parsing line infos, we end up querying the same offset multiple
times (especially the file name), and we currently create a new string
each time. This commit deduplicates this allocation by caching the
string.

Signed-off-by: Paul Cacheux <paul.cacheux@datadoghq.com>
@paulcacheux paulcacheux force-pushed the paulcacheux/cache-string-table branch from fb7ed7c to 928be4b on June 24, 2025 12:30
@paulcacheux (Contributor, Author):

Alright, so I simplified the PR as suggested, moving the caching into stringTable directly; it is only used through LookupCached.
My testing shows that it only really makes sense for the file name (obviously: multiple lines per file), but also for the line content itself. The section name, when parsed in the context of ext infos, is only parsed once (one btf_ext_info_sec per section).

stringTable has to be safe for concurrent use since it is shared
between multiple decoders. Document this and add a lock to LookupCached
to make it safe.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
@lmb (Collaborator) left a comment


Made a small change because stringTable is meant to be concurrency safe.

@paulcacheux (Contributor, Author):

Ah yes, I clearly missed that, thanks!

@lmb lmb merged commit 3de0bd4 into cilium:main Jun 24, 2025
18 checks passed
@lmb (Collaborator) commented Jun 24, 2025

Thanks!
