
Conversation

@vanja-p commented Dec 12, 2025

Currently, performance is very sensitive to payload size. If the total payload is 32KiB + 1 byte, the code gets (or allocates) a 1MiB buffer, then clears the whole thing but uses only ~3% of it. Adding more tiers between 32KiB and 1MiB makes these requests 30% faster in your benchmarks, and speeds up 256KiB requests by 10% in my benchmarks.
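
For concreteness, a minimal sketch of that behavior against the public mem API, assuming the pre-PR default tier sizes (256B, 4KiB, 16KiB, 32KiB, 1MiB):

package main

import (
	"fmt"

	"google.golang.org/grpc/mem"
)

func main() {
	// Pre-PR default tiers: nothing between 32KiB and 1MiB.
	pool := mem.NewTieredBufferPool(256, 4<<10, 16<<10, 32<<10, 1<<20)

	// One byte over 32KiB falls through to the 1MiB tier, so the pool
	// returns (and clears) a 1MiB buffer of which only ~3% is used.
	buf := pool.Get(32<<10 + 1)
	fmt.Println(len(*buf), cap(*buf)) // 32769 1048576
}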

Just adding the tiers causes a 1% regression for 1-byte requests, but switching to slices.BinarySearch more than recovers this. I confirmed that the two searches behave identically with https://go.dev/play/p/i7Q9ItQVu5T.

Note that both the old and new searches have a potential bug when calling Put with buffers that don't have a power-of-2 capacity. For example, getPool(4097) will return the pool with size 16384, but that pool can't accept the buffer, so it is silently ignored. This would only be a problem if some code reslices the capacity of a buffer.
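
A minimal sketch of that scenario using the public mem API (the drop happens inside the unexported sizedBufferPool, so the final comment restates the internal behavior described above):

package main

import (
	"fmt"

	"google.golang.org/grpc/mem"
)

func main() {
	pool := mem.NewTieredBufferPool(256, 4<<10, 16<<10, 32<<10, 1<<20)

	// An 8KiB request is served from the 16KiB tier.
	buf := pool.Get(8 << 10)

	// Reslice the buffer down to a capacity that matches no tier.
	resliced := (*buf)[:0:4097]

	// Put routes by capacity: getPool(4097) picks the 16KiB pool, which
	// rejects buffers with cap < 16384, so this buffer is silently dropped.
	pool.Put(&resliced)
	fmt.Println("dropped buffer with cap", cap(resliced))
}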

33000 byte request and response benchmarks:

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps        67198        89033    32.49%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    235034.69    188930.90   -19.62%
           Allocs/op       177.73       181.77     2.25%
             ReqT/op 1774027200.00 2350471200.00    32.49%
            RespT/op 1774027200.00 2350471200.00    32.49%
            50th-Lat    131.034µs     94.934µs   -27.55%
            90th-Lat    187.464µs    177.865µs    -5.12%
            99th-Lat    459.922µs    308.028µs   -33.03%
             Avg-Lat    148.645µs    112.131µs   -24.56%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps        77039       116228    50.87%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    312723.99    184410.91   -41.03%
           Allocs/op        61.74        63.88     3.24%
             ReqT/op 2033829600.00 3068419200.00    50.87%
            RespT/op 2033829600.00 3068419200.00    50.87%
            50th-Lat     108.37µs       72.3µs   -33.28%
            90th-Lat    166.412µs    131.114µs   -21.21%
            99th-Lat     526.06µs    270.545µs   -48.57%
             Avg-Lat    129.656µs     85.868µs   -33.77%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps       376536       447734    18.91%
             RecvOps       413423       575729    39.26%
            Bytes/op    103263.42     89198.52   -13.62%
           Allocs/op        27.75        28.12     3.60%
             ReqT/op 9940550400.00 11820177600.00    18.91%
            RespT/op 10914367200.00 15199245600.00    39.26%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

1 byte request and response benchmarks:

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       230977       231843     0.37%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      9235.90      9235.57     0.00%
           Allocs/op       155.07       155.07     0.00%
             ReqT/op    184781.60    185474.40     0.38%
            RespT/op    184781.60    185474.40     0.38%
            50th-Lat      41.15µs     41.109µs    -0.10%
            90th-Lat     55.558µs     55.117µs    -0.79%
            99th-Lat     91.657µs     88.321µs    -3.64%
             Avg-Lat     43.152µs      42.98µs    -0.40%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       394925       395698     0.20%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      1260.37      1260.38     0.00%
           Allocs/op        41.01        41.01     0.00%
             ReqT/op    315940.00    316558.40     0.20%
            RespT/op    315940.00    316558.40     0.20%
            50th-Lat     24.087µs     23.987µs    -0.42%
            90th-Lat     31.861µs     31.861µs     0.00%
            99th-Lat      46.45µs     45.929µs    -1.12%
             Avg-Lat     25.205µs     25.138µs    -0.27%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps     12550620     11837068    -5.69%
             RecvOps     11448882     11193802    -2.23%
            Bytes/op      1075.81      1069.33    -0.56%
           Allocs/op        25.07        25.07     0.00%
             ReqT/op  10040496.00   9469654.40    -5.69%
            RespT/op   9159105.60   8955041.60    -2.23%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

32000 byte benchmarks:

streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       150481       150473    -0.01%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    139452.12    139517.62     0.05%
           Allocs/op        42.38        42.41     0.00%
             ReqT/op 3852313600.00 3852108800.00    -0.01%
            RespT/op 3852313600.00 3852108800.00    -0.01%
            50th-Lat     57.943µs     57.952µs     0.02%
            90th-Lat     89.133µs     89.744µs     0.69%
            99th-Lat    207.272µs    206.561µs    -0.34%
             Avg-Lat       66.3µs     66.293µs    -0.01%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps       774812       547494   -29.34%
             RecvOps       338049       615464    82.06%
            Bytes/op     70072.04     70303.21     0.33%
           Allocs/op        20.97        21.25     4.77%
             ReqT/op 19835187200.00 14015846400.00   -29.34%
            RespT/op 8654054400.00 15755878400.00    82.06%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps       109196       110357     1.06%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    147410.90    147495.33     0.06%
           Allocs/op       160.25       160.28     0.00%
             ReqT/op 2795417600.00 2825139200.00     1.06%
            RespT/op 2795417600.00 2825139200.00     1.06%
            50th-Lat     78.342µs     77.651µs    -0.88%
            90th-Lat    140.242µs    138.779µs    -1.04%
            99th-Lat    267.018µs    264.033µs    -1.12%
             Avg-Lat     91.406µs     90.436µs    -1.06%
           GoVersion     go1.25.5     go1.25.5
         GrpcVersion   1.79.0-dev   1.79.0-dev

RELEASE NOTES:

  • mem: add more tiers to the default buffer pool

codecov bot commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.32%. Comparing base (9fd5d8a) to head (12af9d7).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8770      +/-   ##
==========================================
- Coverage   83.36%   83.32%   -0.04%     
==========================================
  Files         419      419              
  Lines       32548    32548              
==========================================
- Hits        27134    27122      -12     
- Misses       4028     4033       +5     
- Partials     1386     1393       +7     
Files with missing lines    Coverage Δ
mem/buffer_pool.go          96.61% <100.00%> (ø)

... and 20 files with indirect coverage changes

@vanja-p (Author) commented Dec 15, 2025

I should mention that grpc-go users can override the default pool for clients and servers with options, but the same isn't possible for the DefaultBufferPool() calls in the proto encoder. I could rewrite the proto encoder, but it seems better to improve the default performance.

@easwars (Contributor) commented Dec 15, 2025

@arjan-bal

We probably just went with whatever bucket sizes @PapaCharlie came up with when the initial buffer slicing implementation was done. The change looks OK to me, but I'll punt it to you for the time being, to see if you have more insights here. Thanks.

@easwars requested review from arjan-bal and easwars Dec 15, 2025
@easwars added the "Type: Performance" label Dec 15, 2025
@easwars added this to the 1.79 Release milestone Dec 15, 2025
@PapaCharlie (Contributor) commented

LGTM! I think I just reused whatever sizes were previously defined, and having more seems fine. However, I don't think there's a functional difference between sort.Search and slices.BinarySearchFunc. I would be wary of using one or the other, as you could incur unexpected heap allocations, which will kill your perf. It might be worth writing a small benchmark comparing the two, just to see if there's a meaningful difference.

> Note that both the old and new searches have a potential bug when calling Put with buffers that don't have a power-of-2 capacity. For example, getPool(4097) will return the pool with size 16384, but that pool can't accept the buffer, so it is silently ignored. This would only be a problem if some code reslices the capacity of a buffer.

You are 100% right. tieredBufferPool.Put and tieredBufferPool.Get need to use different implementations of getPool: the former needs the pool with the largest buffers that are no bigger than the input buffer's capacity, while the latter needs the pool with the smallest buffers that are at least the requested size. Otherwise, resized buffers get dropped on the floor.
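
A sketch of the two lookup directions under that fix, written against the unexported tieredBufferPool, sizedBufferPool, and fallbackPool fields of the mem package (the getPoolForGet/getPoolForPut names are hypothetical):

package mem

import "sort"

// getPoolForGet returns the smallest pool whose buffers can hold size bytes,
// falling back to the unsized pool when size exceeds the largest tier.
func (p *tieredBufferPool) getPoolForGet(size int) BufferPool {
	i := sort.Search(len(p.sizedPools), func(i int) bool {
		return p.sizedPools[i].defaultSize >= size
	})
	if i == len(p.sizedPools) {
		return &p.fallbackPool
	}
	return p.sizedPools[i]
}

// getPoolForPut returns the largest pool whose buffers are no bigger than
// capacity c, so a buffer resliced to a non-tier capacity is kept by a
// smaller tier instead of being dropped on the floor.
func (p *tieredBufferPool) getPoolForPut(c int) BufferPool {
	i := sort.Search(len(p.sizedPools), func(i int) bool {
		return p.sizedPools[i].defaultSize > c
	})
	if i == 0 {
		return &p.fallbackPool // smaller than the smallest tier
	}
	return p.sizedPools[i-1]
}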

@PapaCharlie (Contributor) commented

package mem

import (
	"slices"
	"sort"
	"testing"
)

func BenchmarkSearch(b *testing.B) {
	p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
	sizes := []int{defaultBufferPoolSizes[0] - 1}
	for _, size := range defaultBufferPoolSizes {
		sizes = append(sizes, size+1)
	}
	b.Run("func=slices.BinarySearch", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				slices.BinarySearchFunc(p.sizedPools, size, func(pool *sizedBufferPool, target int) int {
					return pool.defaultSize - target
				})
			}
		}
	})
	b.Run("func=sort.Search", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				sort.Search(len(p.sizedPools), func(i int) bool {
					return p.sizedPools[i].defaultSize >= size
				})
			}
		}
	})
}

According to this, which should be an apples-to-apples comparison, sort.Search is actually marginally faster, albeit only just:

% go test -bench=BenchmarkSearch -count=10 -benchmem | benchstat -col '/func' -
goos: darwin
goarch: arm64
pkg: whatever
cpu: Apple M4 Max
          │ slices.BinarySearch │            sort.Search             │
          │       sec/op        │   sec/op     vs base               │
Search-16           62.81n ± 2%   60.23n ± 0%  -4.11% (p=0.000 n=10)

          │ slices.BinarySearch │          sort.Search           │
          │        B/op         │    B/op     vs base            │
Search-16            0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

          │ slices.BinarySearch │          sort.Search           │
          │      allocs/op      │ allocs/op   vs base            │
Search-16            0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

So it's totally up to you. If you find slices.BinarySearch to be more legible, then it's a pretty reasonable change, since they do exactly the same thing.

@arjan-bal (Contributor) commented Dec 16, 2025

It's possible to make the search even faster since we're using powers of 2.

package mem

import (
	"math/bits"
	"testing"
)

func BenchmarkSearch(b *testing.B) {
	p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
	sizes := []int{defaultBufferPoolSizes[0] - 1}
	for _, size := range defaultBufferPoolSizes {
		sizes = append(sizes, size+1)
	}
	b.Run("func=slices.BinarySearch", func(b *testing.B) {
		// omitted
	})
	b.Run("func=sort.Search", func(b *testing.B) {
		// omitted
	})

	// Ensure all sizes are powers of 2.
	for _, size := range defaultBufferPoolSizes {
		if bits.OnesCount(uint(size)) != 1 {
			b.Fatalf("size %d is not a power of 2", size)
		}
	}
	// Make a slice that maps from highest set bit to index in the defaultBufferPoolSizes slice.
	// This is equivalent to the sort.Search call, but more efficient.
	upperBound := make([]int, 64)
	for i := 0; i < 64; i++ {
		upperBound[i] = -1
	}
	for i, size := range defaultBufferPoolSizes {
		upperBound[bits.Len(uint(size))] = i
	}
	for i := 62; i >= 0; i-- {
		if upperBound[i] == -1 {
			upperBound[i] = upperBound[i+1]
		}
	}

	b.Run("func=custom", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				query := size - 1
				_ = upperBound[bits.Len(uint(query))+1]
			}
		}
	})
}

Benchmark results with the list of tier sizes from this PR.

$ go test -bench=BenchmarkSearch -count=10 -benchmem | benchstat -col '/func' -
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
          │ slices.BinarySearch │             sort.Search              │               custom                │
          │       sec/op        │    sec/op     vs base                │   sec/op     vs base                │
Search-48         121.850n ± 0%   99.335n ± 0%  -18.48% (p=0.000 n=10)   7.633n ± 2%  -93.74% (p=0.000 n=10)

          │ slices.BinarySearch │          sort.Search           │             custom             │
          │        B/op         │    B/op     vs base            │    B/op     vs base            │
Search-48            0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

          │ slices.BinarySearch │          sort.Search           │             custom             │
          │      allocs/op      │ allocs/op   vs base            │ allocs/op   vs base            │
Search-48            0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

Since the tiered buffer pool constructor doesn't restrict the tier sizes to be powers of 2, we could store a boolean indicating whether all sizes are powers of 2 and add a fast path that uses bit manipulation when they are.
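
A minimal sketch of that fast path, with the upperBound table precomputed as in the benchmark above (the allPowersOf2 and upperBound fields are hypothetical, not part of the current tieredBufferPool):

package mem

import (
	"math/bits"
	"sort"
)

// Assumed hypothetical fields on tieredBufferPool:
//
//	allPowersOf2 bool  // true when every tier size is a power of 2
//	upperBound   []int // bits.Len(size-1)+1 -> index of smallest tier >= size, -1 if none

func (p *tieredBufferPool) getPool(size int) BufferPool {
	if p.allPowersOf2 && size > 0 {
		// Fast path: same result as the sort.Search below, in O(1).
		if j := bits.Len(uint(size-1)) + 1; j < len(p.upperBound) {
			if idx := p.upperBound[j]; idx >= 0 {
				return p.sizedPools[idx]
			}
		}
		return &p.fallbackPool
	}
	// Slow path for arbitrary tier sizes.
	i := sort.Search(len(p.sizedPools), func(i int) bool {
		return p.sizedPools[i].defaultSize >= size
	})
	if i == len(p.sizedPools) {
		return &p.fallbackPool
	}
	return p.sizedPools[i]
}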


We discussed this issue during the maintainers' meeting and believe a better solution is to provide an experimental API that allows users to update the default buffer pool in an init function. This can be achieved by exposing the existing test-only function through the experimental package.

The concern is that adding new tiers to the default pool can lead to lower buffer reuse, as buffers may end up distributed across more tiers. Existing benchmarks don't catch such problems because they use a single payload size for the entire run. By allowing users to configure custom tiers, we can avoid the risk of unintended regressions for the wider user base.

@arjan-bal assigned vanja-p and unassigned arjan-bal Dec 16, 2025
@PapaCharlie (Contributor) commented

Well, you could make the constructor accept a list of powers of 2 instead of arbitrary sizes :) Then you're guaranteed it works!

@arjan-bal (Contributor) commented

I've raised #8775, which adds a tiered buffer pool that supports only powers of 2. That should allow adding new tiers without a performance cost for looking up the required sizedBufferPool.
