
bundle: Preallocate buffers for file contents. #6818

Conversation

@philipaconrad (Contributor) commented on Jun 12, 2024

Why were the changes in this PR made?

This PR reduces the max RSS memory usage of OPA during bundle loading by about 5-6%, and greatly improves GC performance at lower values of GOMAXPROCS.

This particularly benefits container and other low-resource use cases: it lowers the peak of the memory-usage spike during loading and reduces garbage collector overhead.

What are the changes in this PR?

This PR adds logic to preallocate buffers when loading files from both tarballs and on-disk bundle directories. The change results in lower max RSS memory usage at runtime, and better garbage collector performance, especially at lower values of GOMAXPROCS.

For very large bundles (>1 GB in size), this change can lower startup times for OPA by as much as a full second.
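
The pattern, sketched below, is to size each destination buffer once, up front (from the tar header for tarball entries, or file metadata for on-disk files), instead of letting it grow through repeated reallocate-and-copy cycles. This is an illustrative, self-contained sketch of the idea, not the exact code from the PR:

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// readEntry copies one tar entry into a buffer grown once to the size
// reported by the tar header, avoiding the repeated grow-and-copy cycles
// that a plain io.ReadAll would perform on a large file.
func readEntry(tr *tar.Reader, hdr *tar.Header) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	if hdr.Size > 0 {
		buf.Grow(int(hdr.Size)) // single allocation, sized up front
	}
	_, err := io.Copy(&buf, tr)
	return &buf, err
}

func main() {
	// Build a tiny in-memory tarball to read back.
	var archive bytes.Buffer
	tw := tar.NewWriter(&archive)
	payload := []byte(`{"hello": "world"}`)
	if err := tw.WriteHeader(&tar.Header{Name: "data.json", Mode: 0o600, Size: int64(len(payload))}); err != nil {
		log.Fatal(err)
	}
	if _, err := tw.Write(payload); err != nil {
		log.Fatal(err)
	}
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}

	tr := tar.NewReader(&archive)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		buf, err := readEntry(tr, hdr)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d bytes\n", hdr.Name, buf.Len())
	}
}

For directory loading the same trick applies, with the size coming from a stat call on the file rather than from a tar header.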

Benchmarking Details & Instructions

Huge thanks to @charlesdaniels for help getting the Awk script working! It's very handy for the summary reports about what the GC traces mean.

Instructions and scripts are provided in the expandable Details block below.

Scripts used for generating bundle contents/collecting results

Generating worst-case, large bundles

I used the following Python script to generate arbitrarily large, but highly compressible OPA bundles.

genbig.py:

N = 100000
oneKBString = "A" * 1024

if __name__ == "__main__":
    print("{")
    for i in range(0, N):
        print('"{}": "{}",'.format(i, oneKBString))
    print('"{}": "{}"'.format(N, oneKBString))
    print("}")
  • Usage: python3 genbig.py > data.json (Adjust the N variable to dial in the size in KB you want the file to be.)
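
To package the generated data.json into a bundle tarball that opa run -b accepts, any standard tar will do (one way to do it; the exact packaging command isn't shown in the PR):

$ python3 genbig.py > data.json
$ tar czf bundle-1gb.tar.gz data.json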

Getting pprof results quickly from an OPA server

profile.sh

#!/bin/bash

SUFFIX=$1

curl -s -v http://localhost:8181/debug/pprof/heap > heap-${SUFFIX}.prof
curl -s -v http://localhost:8181/debug/pprof/allocs > allocs-${SUFFIX}.prof

go tool pprof -lines -png -output heap-${SUFFIX}.png heap-${SUFFIX}.prof
go tool pprof -lines -png -output allocs-${SUFFIX}.png allocs-${SUFFIX}.prof
  • Usage: ./profile.sh pr-1gb (Generates allocs-pr-1gb.png and heap-pr-1gb.png)

I ran OPA with the pprof endpoints enabled, using the CLI commands shown in the results sections below. After collecting the pprof traces and generating the graphs with profile.sh, I then switched over to the garbage collector trace work.

Collecting and summarizing GC trace results

With some help from @charlesdaniels, I put together this Awk script for reporting the aggregate timing stats for a GC trace log.

summarize-gctrace.awk:

# AWK script for summarizing GODEBUG=gctrace=1 logs.
# Known to work for Go 1.22, but the log format is subject to change.
# The format is the following (pulled from the Go `runtime` package docs):
#   gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # MB stacks, #MB globals, # P
#                ^$5             ^$10                                                               ^$28
# Example:
#   gc 11 @3.169s 1%: 0.025+509+0.017 ms clock, 0.10+0/261/0.81+0.068 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 4 P

BEGIN {
    FS = "[\t +/]+"
}

# Some simple assertions to prevent matching non-gctrace logs.
$1 != "gc" { non_matching_rows++; next; }
$9 != "clock," { non_matching_rows++; next; }

# The main pattern-action block:
{
    num_threads = $28

    clock_stw_sweep_term += $5
    clock_conc_mark_scan += $6
    clock_stw_mark_term += $7
    total_clock += $5 + $6 + $7

    cpu_ms_stw_sweep_term += $10
    cpu_ms_assist_gc += $11
    cpu_ms_background_gc += $12
    cpu_ms_idle_gc += $13
    cpu_ms_stw_mark_term += $14
    total_cpu += $11 + $12 + $13
}

END {
    print  "--- Wall clock time summary -------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", clock_stw_sweep_term
    printf "  Concurrent mark and scan ...... %10.3f ms\n", clock_conc_mark_scan
    printf "  STW mark termination .......... %10.3f ms\n", clock_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s)\n", total_clock, total_clock / 1000
    print  ""
    print  "--- CPU time breakdown ------------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", cpu_ms_stw_sweep_term
    printf "    Mark/Scan assist time ....... %10.3f ms\n", cpu_ms_assist_gc
    printf "    Mark/Scan background GC time  %10.3f ms\n", cpu_ms_background_gc
    printf "    Mark/Scan idle GC time ...... %10.3f ms\n", cpu_ms_idle_gc
    printf "  STW mark termination .......... %10.3f ms\n", cpu_ms_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s) (Does not include STW phases)\n", total_cpu, total_cpu / 1000
    print  ""
    print  "Num threads:", num_threads
    print  "Non-matching rows:", non_matching_rows
}

The script can be run in-line with OPA by tee-ing stderr over to the Awk process, and then pkill'ing OPA with a SIGINT signal. Example with the PR build of OPA:

Terminal 1:

$ GODEBUG=gctrace=1 ./opa-pr run -s -b bundle-1gb.tar.gz --pprof 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2)

Terminal 2:

# After waiting for the OPA startup message to appear in Terminal 1:
$ pkill -SIGINT opa-pr
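
If the process-substitution pipeline is awkward in your shell, an equivalent approach is to capture stderr to a file and summarize it afterwards (the script's filter rules skip the interleaved OPA log lines):

$ GODEBUG=gctrace=1 ./opa-pr run -s -b bundle-1gb.tar.gz --pprof 2> gctrace.log
# Once the server startup line appears in gctrace.log, from another terminal:
$ pkill -SIGINT opa-pr
$ awk -f summarize-gctrace.awk gctrace.log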

Performance Analysis and Benchmarking Results

  • Heap usage increased by about 10% during bundle loading, which made this change look bad at first. Some of the effect appears to be from the Go compiler no longer inlining as far up the call chain during bundle loading (visible in the pprof graphs).
  • Running with GODEBUG=gctrace=1 and varying GOMAXPROCS gives a fuller picture of how preallocation changes performance (see the example invocation after this list): there is much less garbage for the collector, and a noticeable reduction in the wall-clock time the garbage collector burns during bundle loading.
  • For the 1GB bundle example, I collected overall max RSS memory usage stats with /usr/bin/time (GNU time, not the shell builtin time command seen on some platforms!), and also collected the usual stats with pprof.
  • The test bundle (1 GB uncompressed, 5.3 MB gzipped): bundle-1gb.tar.gz
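
For example, here is a run with the Go runtime limited to a single logical CPU (same test bundle as above; GOMAXPROCS is the standard Go runtime knob, not anything specific to this PR):

$ GOMAXPROCS=1 GODEBUG=gctrace=1 /usr/bin/time -v ./opa-pr run -s -b bundle-1gb.tar.gz --pprof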

Results from main:

CLI command: /usr/bin/time -v ./opa-main run -s -b bundle-1gb.tar.gz --pprof

Results
Printout from time (note Maximum resident set size is ~3.8 GB)
	Command being timed: "./opa-main run -s -b bundle-1gb.tar.gz --pprof"
	User time (seconds): 7.74
	System time (seconds): 1.41
	Percent of CPU this job got: 61%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.79
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3834624
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 953708
	Voluntary context switches: 3322
	Involuntary context switches: 66
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
GODEBUG=gctrace=1 logs from main

This is a representative sample of loading a 1GB bundle with GODEBUG=gctrace=1 enabled.

gc 1 @0.019s 2%: 0.14+1.3+0.045 ms clock, 1.1+0.50/1.9/1.4+0.36 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @0.024s 2%: 0.028+4.3+0.039 ms clock, 0.22+0/1.6/5.8+0.31 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @0.034s 2%: 0.058+10+0.016 ms clock, 0.47+0/1.7/12+0.13 ms cpu, 8->8->7 MB, 8 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @0.046s 2%: 0.026+8.1+0.023 ms clock, 0.21+0/1.0/9.3+0.18 ms cpu, 15->15->13 MB, 15 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 5 @0.056s 2%: 0.014+10+0.012 ms clock, 0.11+0/0.63/11+0.096 ms cpu, 29->29->25 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 6 @0.070s 1%: 0.024+22+0.041 ms clock, 0.19+0/0.98/22+0.33 ms cpu, 57->57->49 MB, 57 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 7 @0.100s 1%: 0.034+38+0.003 ms clock, 0.27+0/0.61/38+0.025 ms cpu, 113->113->97 MB, 113 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 8 @0.152s 0%: 0.024+70+0.097 ms clock, 0.19+0/0.58/71+0.78 ms cpu, 225->225->193 MB, 225 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 9 @0.250s 0%: 0.024+134+0.011 ms clock, 0.19+0/0.58/134+0.088 ms cpu, 449->449->385 MB, 449 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 10 @0.433s 0%: 0.023+268+0.053 ms clock, 0.18+0/0.54/268+0.42 ms cpu, 897->897->769 MB, 897 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 11 @0.801s 0%: 0.024+514+0.003 ms clock, 0.19+0/0.73/514+0.031 ms cpu, 1793->1793->1537 MB, 1793 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 12 @3.252s 0%: 0.028+504+0.011 ms clock, 0.22+0/252/0.87+0.090 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 8 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-06-12T18:14:21-04:00"}

Summary stats:

--- Wall clock time summary -------------------
  STW sweep termination .........     0.447 ms
  Concurrent mark and scan ......  1583.700 ms
  STW mark termination ..........     0.354 ms
                           Total = 1584.501 ms (1.585 s)

--- CPU time breakdown ------------------------
  STW sweep termination .........     3.540 ms
  Mark/Scan assist time .......       0.500 ms
  Mark/Scan background GC time      262.850 ms
  Mark/Scan idle GC time ......    1087.370 ms
  STW mark termination ..........     2.840 ms
                           Total = 1350.720 ms (1.351 s) (Does not include STW phases)
Allocs graph from main

[image: allocs-main-1gb-tarball]

Heap graph from main

[image: heap-main-1gb-tarball]


Results from this PR:

CLI command: /usr/bin/time -v ./opa-pr run -s -b bundle-1gb.tar.gz --pprof

Results
Printout from time (note Maximum resident set size is ~3.6 GB)
	Command being timed: "./opa-pr run -s -b bundle-1gb.tar.gz --pprof"
	User time (seconds): 6.51
	System time (seconds): 1.25
	Percent of CPU this job got: 63%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3636100
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 913402
	Voluntary context switches: 2421
	Involuntary context switches: 171
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
GODEBUG=gctrace=1 logs from PR

This is a representative sample of loading a 1GB bundle with GODEBUG=gctrace=1 enabled.

gc 1 @0.009s 3%: 0.023+0.67+0.18 ms clock, 0.18+0.21/1.1/0.86+1.5 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @0.013s 3%: 0.021+0.46+0.015 ms clock, 0.16+0.18/0.74/1.3+0.12 ms cpu, 989->989->989 MB, 989 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @1.696s 0%: 0.030+268+0.014 ms clock, 0.24+0/0.61/128+0.11 ms cpu, 2013->2013->1757 MB, 2013 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @6.114s 0%: 0.030+48+0.014 ms clock, 0.24+0.10/96/167+0.11 ms cpu, 3511->3525->2715 MB, 3515 MB goal, 0 MB stacks, 0 MB globals, 8 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-06-12T18:14:42-04:00"}

Summary stats:

--- Wall clock time summary -------------------
  STW sweep termination .........     0.104 ms
  Concurrent mark and scan ......   317.130 ms
  STW mark termination ..........     0.223 ms
                            Total = 317.457 ms (0.317 s)

--- CPU time breakdown ------------------------
  STW sweep termination .........     0.820 ms
  Mark/Scan assist time .......       0.490 ms
  Mark/Scan background GC time       98.450 ms
  Mark/Scan idle GC time ......     297.160 ms
  STW mark termination ..........     1.840 ms
                            Total = 396.100 ms (0.396 s) (Does not include STW phases)
Allocs graph from PR

[image: allocs-pr-1gb-tarball]

Heap graph from PR

[image: heap-pr-1gb-tarball]

@philipaconrad added the optimization, performance, and go labels on Jun 12, 2024
@philipaconrad philipaconrad self-assigned this Jun 12, 2024
@philipaconrad force-pushed the philip/prealloc-bundle-file-buffers branch from 5839c4b to bb9f8e1 on June 12, 2024 18:17
netlify bot commented on Jun 12, 2024

Deploy Preview for openpolicyagent ready!

🔨 Latest commit: 5839c4b
🔍 Latest deploy log: https://app.netlify.com/sites/openpolicyagent/deploys/6669e5f5a96c860008e37d99
😎 Deploy Preview: https://deploy-preview-6818--openpolicyagent.netlify.app

@philipaconrad philipaconrad marked this pull request as ready for review June 13, 2024 19:16
netlify bot commented on Jun 13, 2024

Deploy Preview for openpolicyagent ready!

🔨 Latest commit: 48a4ea4
🔍 Latest deploy log: https://app.netlify.com/sites/openpolicyagent/deploys/6672ca0d481fe70008e9d5bd
😎 Deploy Preview: https://deploy-preview-6818--openpolicyagent.netlify.app

@johanfylling (Contributor) left a comment


Thanks! 😃

bundle/bundle.go: review thread (outdated, resolved)
bundle/bundle.go: review thread (outdated, resolved)
This commit adds logic to preallocate buffers when loading files from
both tarballs and on-disk bundle directories. The change results in
lower max RSS memory usage at runtime, and better garbage collector
performance, especially when at lower values of GOMAXPROCS.

For very large bundles (>1 GB in size), this change can lower startup
times for OPA by as much as a full second.

The performance analysis was different from most changes: heap
usage increased by about 10% during bundle loading, which made the
change look bad at first. Some of the effect appears to be from the
Go compiler no longer inlining as far up the call chain during bundle
loading (visible in the `pprof` graphs).

Running with `GODEBUG=gctrace=1` and varying GOMAXPROCS gave a fuller
picture of how preallocation changes performance: it leaves much less
garbage for the collector, and noticeably reduces the wall-clock time
the GC burns during bundle loading.

Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
@philipaconrad force-pushed the philip/prealloc-bundle-file-buffers branch from 85ad3a6 to bb22028 on June 17, 2024 15:48
@philipaconrad (Contributor, Author) commented

@johanfylling If you think a Golang benchmark or something would be valuable to add to this PR to help the case for merging, please LMK, and I'll allocate some time for that.

Thanks again for the helpful review comments! 😄

@srenatus previously approved these changes on Jun 18, 2024
@srenatus (Contributor) left a comment


LGTM thanks

@johanfylling previously approved these changes on Jun 18, 2024
@johanfylling (Contributor) left a comment


LGTM 👍

If you have the bandwidth, a benchmark wouldn't hurt, like you suggest. Performance is the major claim of this PR after all. Changes look legit, though, so I won't hold you up if you have more on your plate.

@srenatus (Contributor) commented

The change will improve the performance when loading large (giant) bundles...I think it might be hard to do this in a benchmark. The PR's instructions outline the procedure used to verify it...that might have to be sufficient for now, I suppose.

@philipaconrad philipaconrad dismissed stale reviews from johanfylling and srenatus via 58e355a June 18, 2024 20:56
@philipaconrad force-pushed the philip/prealloc-bundle-file-buffers branch 3 times, most recently from 683167e to be80a64 on June 18, 2024 22:09
Signed-off-by: Philip Conrad <philipaconrad@gmail.com>
@philipaconrad force-pushed the philip/prealloc-bundle-file-buffers branch from be80a64 to 99d6181 on June 18, 2024 22:15
@philipaconrad (Contributor, Author) commented

@johanfylling @srenatus Benchmarks are up! 😄

It took me a little while to find the right way to exercise the changes: I wound up settling on calling Bundle.Read(), since that invokes all of the loader machinery behind the scenes, and allows the resulting buffers to be used directly by their intended downstream consumers.
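
For reference, the benchmarks can be shaped roughly like the sketch below. This is an illustrative sketch rather than the exact benchmark merged here; it assumes the bundle package's NewTarballLoader and NewCustomReader entry points, and builds a synthetic data.json in the spirit of genbig.py:

package bundle_test

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"testing"

	"github.com/open-policy-agent/opa/bundle"
)

// buildTarball returns a gzipped tarball holding a single data.json with
// n one-KB string values, mirroring the genbig.py bundles used above.
func buildTarball(n int) []byte {
	var data bytes.Buffer
	data.WriteString("{")
	for i := 0; i < n; i++ {
		if i > 0 {
			data.WriteString(",")
		}
		fmt.Fprintf(&data, "%q: %q", fmt.Sprint(i), bytes.Repeat([]byte("A"), 1024))
	}
	data.WriteString("}")

	var out bytes.Buffer
	gz := gzip.NewWriter(&out)
	tw := tar.NewWriter(gz)
	if err := tw.WriteHeader(&tar.Header{Name: "/data.json", Mode: 0o600, Size: int64(data.Len())}); err != nil {
		panic(err)
	}
	if _, err := tw.Write(data.Bytes()); err != nil {
		panic(err)
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	if err := gz.Close(); err != nil {
		panic(err)
	}
	return out.Bytes()
}

func BenchmarkTarballLoader(b *testing.B) {
	for _, n := range []int{1000, 10000} {
		tarball := buildTarball(n)
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				// Bundle.Read() drives the loader machinery end to end.
				loader := bundle.NewTarballLoader(bytes.NewReader(tarball))
				if _, err := bundle.NewCustomReader(loader).Read(); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}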

Example CLI invocations of the new benchmarks:

go test -benchmem -run=^$ -bench ^BenchmarkTarballLoader$ github.com/open-policy-agent/opa/bundle -count=30 > tar.txt
go test -benchmem -run=^$ -bench ^BenchmarkDirectoryLoader$ github.com/open-policy-agent/opa/bundle -count=30 > dir.txt
Benchstat results for BenchmarkTarballLoader (N=30 runs each)
$ benchstat main-tar.txt pr-6818-tar.txt
name                    old time/op    new time/op    delta
TarballLoader/1000-8      0.00ns ±59%    0.00ns ±36%     ~     (p=0.159 n=28+30)
TarballLoader/10000-8     0.02ns ±36%    0.01ns ±19%  -69.13%  (p=0.000 n=30+28)
TarballLoader/100000-8    0.09ns ±20%    0.04ns ± 4%  -54.52%  (p=0.000 n=30+27)
TarballLoader/250000-8    0.18ns ± 6%    0.10ns ± 5%  -43.45%  (p=0.000 n=30+28)

name                    old alloc/op   new alloc/op   delta
TarballLoader/1000-8       0.00B          0.00B          ~     (all equal)
TarballLoader/10000-8      0.00B          0.00B          ~     (all equal)
TarballLoader/100000-8     0.00B          0.00B          ~     (all equal)
TarballLoader/250000-8     0.00B          0.00B          ~     (all equal)

name                    old allocs/op  new allocs/op  delta
TarballLoader/1000-8        0.00           0.00          ~     (all equal)
TarballLoader/10000-8       0.00           0.00          ~     (all equal)
TarballLoader/100000-8      0.00           0.00          ~     (all equal)
TarballLoader/250000-8      0.00           0.00          ~     (all equal)
Benchstat results for BenchmarkDirectoryLoader (N=30 runs each)
$ benchstat main-dir.txt pr-6818-dir.txt
name                      old time/op    new time/op    delta
DirectoryLoader/10000-8     0.01ns ±18%    0.00ns ±83%   -67.90%  (p=0.000 n=27+30)
DirectoryLoader/100000-8    0.06ns ± 8%    0.02ns ±15%   -68.65%  (p=0.000 n=29+29)
DirectoryLoader/250000-8    0.11ns ±16%    0.05ns ±13%   -55.78%  (p=0.000 n=29+27)
DirectoryLoader/500000-8    0.21ns ± 6%    0.10ns ±18%   -55.17%  (p=0.000 n=26+27)

name                      old alloc/op   new alloc/op   delta
DirectoryLoader/10000-8      0.00B          0.00B           ~     (all equal)
DirectoryLoader/100000-8     0.00B          0.00B           ~     (all equal)
DirectoryLoader/250000-8     0.00B          0.00B           ~     (all equal)
DirectoryLoader/500000-8     1.00B ± 0%     0.00B       -100.00%  (p=0.000 n=30+30)

name                      old allocs/op  new allocs/op  delta
DirectoryLoader/10000-8       0.00           0.00           ~     (all equal)
DirectoryLoader/100000-8      0.00           0.00           ~     (all equal)
DirectoryLoader/250000-8      0.00           0.00           ~     (all equal)
DirectoryLoader/500000-8      0.00           0.00           ~     (all equal)

@johanfylling (Contributor) left a comment


Thank you :)

@johanfylling johanfylling merged commit 8e7172c into open-policy-agent:main Jun 19, 2024
28 checks passed