server: fix buffer release timing in processUnaryRPC #7998
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##           master    #7998      +/-   ##
==========================================
+ Coverage   81.94%   82.05%   +0.10%
==========================================
  Files         381      384       +3
  Lines       38539    38757     +218
==========================================
+ Hits        31582    31801     +219
- Misses       5626     5632       +6
+ Partials     1331     1324       -7
```
LGTM, nice catch!
server.go (outdated):

```go
df := func(v any) error {
	defer d.Free()
```
Verified that the last reference to `d` is in the closure, so the change is safe.
@dfawley, the decoding func is called by generated code, and the current version of the generated code always calls it within the first few lines. Do you think there is a chance it may not get called? If so, should we wrap `d.Free()` in a `OnceFunc` and defer-call it outside the closure, to be extra cautious?
Yes, we can't guarantee the decode function will be called: someone using the generic API to register their own handler, instead of the generated code, could have an interceptor-like pattern that checks headers and skips the decoder if that check fails. So yes, I think for correctness reasons we must `defer` a `OnceFunc`-wrapped `d.Free()`. It's not clear this would be a win for all cases, though. @lqs, can you run our benchmarks before/after that change with 1-byte request/response sizes, so that we can be sure it doesn't cause any surprises?
grpc-go/benchmark/benchmain/main.go (lines 24 to 39 in 2d4daf3):

```
go run benchmark/benchmain/main.go -benchtime=10s -workloads=all \
  -compression=gzip -maxConcurrentCalls=1 -trace=off \
  -reqSizeBytes=1,1048576 -respSizeBytes=1,1048576 -networkMode=Local \
  -cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000 -resultFile=result

As a suggestion, when creating a branch, you can run this benchmark and save the
result file with "-resultFile=basePerf"; later, in the middle of the work or when
you finish it, you can get the benchmark result and compare it with the base at
any time. Assume there are two result files, named "basePerf" and "curPerf",
created by adding -resultFile=basePerf and -resultFile=curPerf.
To format curPerf, run:
go run benchmark/benchresult/main.go curPerf
To observe how the performance changes based on a base result, run:
go run benchmark/benchresult/main.go basePerf curPerf
```
Something like this would run just the set of conditions that should matter here:

```
go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
  -compression=off -maxConcurrentCalls=100 -trace=off \
  -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=[BEFORE/AFTER]
```
Here are the benchmark results, taken on a Ryzen 9 7945HX CPU. Each test group was run multiple times, and the fastest result was selected.
before

```go
df := func(v any) error {
	defer d.Free()
```
$ ~/sdk/go1.23.4/bin/go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary -compression=off -maxConcurrentCalls=100 -trace=off -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=before
go1.23.4/grpc1.71.0-dev
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false:
50_Latency: 530.3120µs 90_Latency: 747.7040µs 99_Latency: 1405.5230µs Avg_Latency: 581.5540µs Bytes/op: 10153.135931747744 Allocs/op: 178.97030961211047
Histogram (unit: µs)
Count: 10295689 Min: 110.7 Max: 41317.2 Avg: 581.55
------------------------------------------------------------
[ 110.655000, 110.656000) 1 0.0% 0.0%
[ 110.656000, 110.662016) 0 0.0% 0.0%
[ 110.662016, 110.704228) 0 0.0% 0.0%
[ 110.704228, 111.000400) 0 0.0% 0.0%
[ 111.000400, 113.078425) 0 0.0% 0.0%
[ 113.078425, 127.658446) 0 0.0% 0.0%
[ 127.658446, 229.956072) 164 0.0% 0.0%
[ 229.956072, 947.705647) 9740824 94.6% 94.6% #########
[ 947.705647, 5983.643190) 554125 5.4% 100.0% #
[ 5983.643190, 41317.230000) 574 0.0% 100.0%
[ 41317.230000, 289227.841760) 1 0.0% 100.0%
Number of requests: 10295689 Request throughput: 1.3727585333333334e+06 bit/s
Number of responses: 10295689 Response throughput: 1.3727585333333334e+06 bit/s
after

```go
dataFree := grpcsync.OnceFunc(d.Free)
defer dataFree()
df := func(v any) error {
	defer dataFree()
```
$ ~/sdk/go1.23.4/bin/go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary -compression=off -maxConcurrentCalls=100 -trace=off -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local -resultFile=after
go1.23.4/grpc1.71.0-dev
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false:
50_Latency: 533.2280µs 90_Latency: 744.5700µs 99_Latency: 1415.7970µs Avg_Latency: 584.1710µs Bytes/op: 10225.500515690499 Allocs/op: 181.97781262126145
Histogram (unit: µs)
Count: 10246456 Min: 64.9 Max: 39638.3 Avg: 584.17
------------------------------------------------------------
[ 64.876000, 64.877000) 1 0.0% 0.0%
[ 64.877000, 64.882985) 0 0.0% 0.0%
[ 64.882985, 64.924788) 0 0.0% 0.0%
[ 64.924788, 65.216775) 0 0.0% 0.0%
[ 65.216775, 67.256256) 0 0.0% 0.0%
[ 67.256256, 81.501688) 3 0.0% 0.0%
[ 81.501688, 181.003628) 142 0.0% 0.0%
[ 181.003628, 876.007898) 9584488 93.5% 93.5% #########
[ 876.007898, 5730.495526) 661038 6.5% 100.0% #
[ 5730.495526, 39638.273000) 783 0.0% 100.0%
[ 39638.273000, 276478.380825) 1 0.0% 100.0%
Number of requests: 10246456 Request throughput: 1.3661941333333333e+06 bit/s
Number of responses: 10246456 Response throughput: 1.3661941333333333e+06 bit/s
comparison
$ ~/sdk/go1.23.4/bin/go run benchmark/benchresult/main.go before after
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_nil-sharedWriteBuffer_false
```
Title        Before       After        Percentage
TotalOps     10295689     10246456     -0.48%
SendOps      0            0            NaN%
RecvOps      0            0            NaN%
Bytes/op     10153.14     10225.50     0.71%
Allocs/op    178.97       181.98       1.68%
ReqT/op      1372758.53   1366194.13   -0.48%
RespT/op     1372758.53   1366194.13   -0.48%
50th-Lat     530.312µs    533.228µs    0.55%
90th-Lat     747.704µs    744.57µs     -0.42%
99th-Lat     1.405523ms   1.415797ms   0.73%
Avg-Lat      581.554µs    584.171µs    0.45%
GoVersion    go1.23.4     go1.23.4
GrpcVersion  1.71.0-dev   1.71.0-dev
```
Can you please update this PR with the `OnceFunc` changes (which I believe we need for "correctness"*)? The results you shared aren't bad, but they also aren't great; I'd like to experiment a bit.

\* Technically there is no correctness problem per se, just reuse of memory that is lost. In theory we could make the change without the `OnceFunc`, but then the memory wouldn't get reused if the buffer wasn't decoded, which feels worse.
Updated with the `OnceFunc` in the latest commit.
Some thoughts after looking at this more:

- Async access isn't possible/legal here, so `OnceFunc` is not necessary (just use a `bool` to track whether it's been freed). This was my version:
```diff
diff --git a/server.go b/server.go
index 16065a02..9d5b2884 100644
--- a/server.go
+++ b/server.go
@@ -1360,8 +1360,16 @@ func (s *Server) processUnaryRPC(ctx context.Context, stream *transport.ServerSt
 		}
 		return err
 	}
-	defer d.Free()
+	freed := false
+	dataFree := func() {
+		if !freed {
+			d.Free()
+			freed = true
+		}
+	}
+	defer dataFree()
 	df := func(v any) error {
+		defer dataFree()
 		if err := s.getCodec(stream.ContentSubtype()).Unmarshal(d, v); err != nil {
```
- Running the command above 10x gives me these results (avg latency for the run; lower is better):
  - master: 888.5, stdev: 4.20
  - your branch: 884.8, stdev: 4.34
  - bool instead: 883.3, stdev: 6.50
Both of these are actually performance improvements over `master`, so let's move forward with this. Please also switch to the `bool` instead of `OnceFunc`, since it does seem to be slightly faster, even though the variance is higher and everything is well within the margins of error.
Done.
@dfawley ping on this
Previously, the buffer `d` was held for the entire duration of the handler execution in `processUnaryRPC`, leading to unnecessary memory retention. This PR adjusts the code to release the buffer immediately after the decoding function `df` completes, ensuring the memory is freed as soon as it is no longer needed.

As a result, the workaround in #7972 to adjust the buffer size is no longer required, as this fix resolves the underlying issue.
RELEASE NOTES: N/A