mem: Add more tiers to the default buffer pool #8770
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #8770      +/-   ##
==========================================
- Coverage   83.36%   83.32%   -0.04%
==========================================
  Files         419      419
  Lines       32548    32548
==========================================
- Hits        27134    27122      -12
- Misses       4028     4033       +5
- Partials     1386     1393       +7
==========================================
```
---
I should mention that grpc-go users can override the default pool for clients and servers with options, but the same isn't possible for the …

---
We probably just went with whatever bucket sizes @PapaCharlie came up with when the initial buffer slicing implementation was done. The change looks OK to me, but I'll punt it to you for the time being to see if you have more insights here. Thanks.

---
LGTM! I think I just reused whatever sizes were previously defined; having more seems fine. However, is there functionally a difference between `sort.Search` and `slices.BinarySearchFunc`? I would be wary of using one or the other, as you could incur unexpected heap allocations, which will kill your perf. It might be worth writing a small benchmark comparing the two, just to see if there's a meaningful difference.

---
You are 100% right.
```go
func BenchmarkSearch(b *testing.B) {
	p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
	sizes := []int{defaultBufferPoolSizes[0] - 1}
	for _, size := range defaultBufferPoolSizes {
		sizes = append(sizes, size+1)
	}
	b.Run("func=slices.BinarySearch", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				slices.BinarySearchFunc(p.sizedPools, size, func(pool *sizedBufferPool, target int) int {
					return pool.defaultSize - target
				})
			}
		}
	})
	b.Run("func=sort.Search", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				sort.Search(len(p.sizedPools), func(i int) bool {
					return p.sizedPools[i].defaultSize >= size
				})
			}
		}
	})
}
```

According to this, which should be comparing the two apples-to-apples, it seems that … So it's totally up to you. If you find …
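For reference, the two lookups can be compared on a plain sorted slice of tier sizes. This is a self-contained sketch; the tier sizes and helper names are illustrative, not grpc-go's internals:

```go
package main

import (
	"fmt"
	"slices"
	"sort"
)

// poolIndexSort returns the index of the smallest tier >= size using
// sort.Search, or len(tiers) if no tier is large enough.
func poolIndexSort(tiers []int, size int) int {
	return sort.Search(len(tiers), func(i int) bool { return tiers[i] >= size })
}

// poolIndexBinary does the same with slices.BinarySearchFunc: when the
// exact size is absent, the returned insertion point is the index of
// the smallest tier that can hold it.
func poolIndexBinary(tiers []int, size int) int {
	i, _ := slices.BinarySearchFunc(tiers, size, func(tier, target int) int {
		return tier - target
	})
	return i
}

func main() {
	tiers := []int{256, 4 << 10, 16 << 10, 32 << 10, 1 << 20} // illustrative tiers
	for _, size := range []int{1, 256, 257, 32<<10 + 1, 2 << 20} {
		// Both lookups return the same index for every query.
		fmt.Println(size, poolIndexSort(tiers, size), poolIndexBinary(tiers, size))
	}
}
```

Both functions return identical indices for every input; any performance difference comes from how each call is inlined and whether its closure escapes to the heap.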
---
It's possible to make the search even faster, since we're using powers of 2.

```go
func BenchmarkSearch(b *testing.B) {
	p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
	sizes := []int{defaultBufferPoolSizes[0] - 1}
	for _, size := range defaultBufferPoolSizes {
		sizes = append(sizes, size+1)
	}
	b.Run("func=slices.BinarySearch", func(b *testing.B) {
		// omitted
	})
	b.Run("func=sort.Search", func(b *testing.B) {
		// omitted
	})
	// Ensure all sizes are powers of 2.
	for _, size := range defaultBufferPoolSizes {
		if bits.OnesCount(uint(size)) != 1 {
			b.Fatalf("size %d is not a power of 2", size)
		}
	}
	// Make a slice that maps from highest set bit to index in the
	// defaultBufferPoolSizes slice. This is equivalent to the
	// sort.Search call, but more efficient.
	upperBound := make([]int, 64)
	for i := 0; i < 64; i++ {
		upperBound[i] = -1
	}
	for i, size := range defaultBufferPoolSizes {
		upperBound[bits.Len(uint(size))] = i
	}
	for i := 62; i >= 0; i-- {
		if upperBound[i] == -1 {
			upperBound[i] = upperBound[i+1]
		}
	}
	b.Run("func=custom", func(b *testing.B) {
		for b.Loop() {
			for _, size := range sizes {
				query := size - 1
				_ = upperBound[bits.Len(uint(query))+1]
			}
		}
	})
}
```
Benchmark results with the list of tier sizes from this PR:

```
$ go test -bench=BenchmarkSearch -count=10 -benchmem | benchstat -col '/func' -
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
          │ slices.BinarySearch │             sort.Search              │                custom                │
          │       sec/op        │    sec/op     vs base                │   sec/op    vs base                  │
Search-48         121.850n ± 0%    99.335n ± 0%  -18.48% (p=0.000 n=10)   7.633n ± 2%  -93.74% (p=0.000 n=10)

          │ slices.BinarySearch │             sort.Search              │                custom                │
          │        B/op         │    B/op       vs base                │    B/op     vs base                  │
Search-48           0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹   0.000 ± 0%       ~ (p=1.000 n=10) ¹
¹ all samples are equal

          │ slices.BinarySearch │             sort.Search              │                custom                │
          │      allocs/op      │  allocs/op    vs base                │  allocs/op  vs base                  │
Search-48           0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=10) ¹   0.000 ± 0%       ~ (p=1.000 n=10) ¹
¹ all samples are equal
```

Since the tiered buffer pool constructor doesn't restrict the tier sizes to powers of 2, we could store a boolean indicating whether all sizes are powers of 2 and add a fast path that uses bit manipulation.

---

We discussed this issue during the maintainers' meeting and believe a better solution is to provide an experimental API that allows users to update the default buffer pool in an init function. This can be achieved by exposing the existing test-only function through the experimental package. The concern is that adding new tiers to the default pool can lead to lower buffer reuse, as buffers may end up distributed across more tiers. Existing benchmarks don't catch such problems because they use a single payload size for the entire run. By allowing users to configure custom tiers, we can avoid the risk of unintended regressions for the wider user base.
---
Well, you could make the constructor for it accept a list of powers of 2 instead of arbitrary sizes :) then you're guaranteed it works!
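To make the power-of-2 idea concrete, here is a self-contained sketch of the `upperBound` precomputation from the benchmark above, extracted into plain functions. The tier sizes are illustrative, not grpc-go's defaults:

```go
package main

import (
	"fmt"
	"math/bits"
)

// buildUpperBound precomputes, for every bit length, the index of the
// smallest tier whose size has at least that many bits, or -1 if none
// does. All tier sizes must be powers of 2, sorted ascending.
func buildUpperBound(tiers []int) []int {
	upperBound := make([]int, 64)
	for i := range upperBound {
		upperBound[i] = -1
	}
	for i, size := range tiers {
		upperBound[bits.Len(uint(size))] = i
	}
	for i := 62; i >= 0; i-- {
		if upperBound[i] == -1 {
			upperBound[i] = upperBound[i+1]
		}
	}
	return upperBound
}

// tierFor returns the index of the smallest tier that can hold size, or
// -1 if the request is larger than every tier. It runs in O(1): the
// smallest power of 2 >= size has bit length bits.Len(size-1)+1. (Sizes
// near the uint limit would need an extra bounds check, as in the
// original benchmark.)
func tierFor(upperBound []int, size int) int {
	return upperBound[bits.Len(uint(size-1))+1]
}

func main() {
	tiers := []int{1 << 8, 1 << 12, 1 << 16, 1 << 20} // illustrative tiers
	ub := buildUpperBound(tiers)
	for _, size := range []int{1, 256, 257, 1 << 16, 1<<20 + 1} {
		fmt.Println(size, tierFor(ub, size))
	}
}
```

The table costs 512 bytes per pool and replaces the binary search with a single array index, which is where the ~94% speedup in the benchmark comes from.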
---
I've raised #8775, which adds a tiered buffer pool that supports only powers of 2. That should allow adding new tiers without a performance cost for looking up the required pool.
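A powers-of-2-only constructor can reject invalid tier lists up front. This is a minimal sketch of such a validation, not the actual code in #8775:

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// validateTiers checks that tier sizes are ascending and that each is a
// power of 2 (exactly one bit set), so a bit-manipulation lookup is safe.
func validateTiers(sizes []int) error {
	if !sort.IntsAreSorted(sizes) {
		return fmt.Errorf("tier sizes must be ascending: %v", sizes)
	}
	for _, s := range sizes {
		if s <= 0 || bits.OnesCount(uint(s)) != 1 {
			return fmt.Errorf("tier size %d is not a power of 2", s)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTiers([]int{256, 4096, 65536})) // prints <nil>
	fmt.Println(validateTiers([]int{256, 4097}))        // prints an error for 4097
}
```

Failing fast in the constructor guarantees the invariant the O(1) lookup depends on, instead of checking it on every `Get`.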
Currently, performance is very sensitive to payload size. If the total payload is 32KiB + 1 byte, the code gets (or allocates) a 1MiB buffer, then clears the whole thing but only uses ~3% of it. Adding more tiers between 32KiB and 1MiB makes these requests 30% faster in your benchmarks, and speeds up 256KiB requests by 10% in my benchmarks.

Just adding the tiers causes a 1% regression for 1-byte requests, but switching to `slices.BinarySearch` more than recovers this. I confirmed that this has the same behavior with https://go.dev/play/p/i7Q9ItQVu5T.

Note that both the old and new search have a potential bug when calling `Put` with buffers that don't have a power-of-2 capacity. For example, `getPool(4097)` will return the pool with size 16384, but that pool can't accept the buffer, so it will just ignore it. This would only be a problem if some code reslices the capacity of a buffer.

33000 byte request and response benchmarks:
1 byte request and response benchmarks:
32000 byte benchmarks:
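The `Put` mismatch described in the notes above can be reproduced with a toy pool. The types and tier sizes here are illustrative stand-ins, not grpc-go's internals:

```go
package main

import "fmt"

// sizedPool is a toy stand-in for one fixed-size buffer pool tier: it
// only retains buffers whose capacity exactly matches its tier size.
type sizedPool struct {
	size int
	free [][]byte
}

// Put keeps buf only if its capacity matches the tier; otherwise the
// buffer is silently dropped, which is the behavior described above.
func (p *sizedPool) Put(buf []byte) {
	if cap(buf) != p.size {
		return
	}
	p.free = append(p.free, buf)
}

// getPool picks the smallest tier that can hold n bytes (nil if none).
func getPool(tiers []*sizedPool, n int) *sizedPool {
	for _, p := range tiers {
		if p.size >= n {
			return p
		}
	}
	return nil
}

func main() {
	tiers := []*sizedPool{{size: 4096}, {size: 16384}}
	// A buffer whose capacity is not a tier size, e.g. after reslicing.
	buf := make([]byte, 4097)
	p := getPool(tiers, cap(buf))
	p.Put(buf)
	// The 16384 tier is selected, but the 4097-capacity buffer is dropped.
	fmt.Println("tier:", p.size, "retained:", len(p.free))
}
```

The buffer is neither pooled nor reported, so the cost only shows up as extra allocations when callers reslice capacities, matching the caveat in the PR description.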
RELEASE NOTES: