transport: Remove buffer copies while writing HTTP/2 Data frames #8667
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #8667      +/-   ##
==========================================
- Coverage   83.43%   82.88%   -0.56%
==========================================
  Files         415      415
  Lines       32195    32261      +66
==========================================
- Hits        26863    26739     -124
- Misses       3980     4029      +49
- Partials     1352     1493     +141
```
@dfawley: Again moving to your plate if you feel like having a second look.
In internal/transport/controlbuf.go:

```go
}
if dSize > 0 {
	var err error
	l.writeBuf, err = reader.Peek(dSize, l.writeBuf)
```
It seems like this buffer can only grow and never shrinks.

1. What happens if a slice holds a pointer to a huge amount of data? I believe it isn't possible to free it, but am not certain. E.g.

   ```go
   l.writeBuf = [][]byte{nil, nil, nil, nil, nil, nil, make([]byte, 10<<30)} // ~10 GiB
   l.writeBuf = l.writeBuf[:0]
   ```

2. What happens if `cap(l.writeBuf)` grows to a large value and then we never need it to be that large ever again?

I think we need to have some way to scale this buffer back down.
For point 1, I've updated the code to clear the buffer after calling `Write`. This releases the references to all the slices and allows them to be GCed.

For point 2, I've set a limit of 64 on the buffer's length. If a buffer is longer than that, it's freed immediately after use instead of being cached.

Background on the 64-element limit: the `BufferSlice` from the proto codec is a single element, so with a potential gRPC header the length is almost always 2. Custom codecs might produce larger slices, but 64 is a generous limit that covers common cases without caching excessive memory.

This change also mitigates a worst-case memory scenario. Since `Peek()` filters out empty slices, a 16KB HTTP/2 Data frame (the max size) could theoretically be split into 16K (16,384) distinct 1-byte slices. In that case, the memory overhead for the slice headers alone would be 24 bytes * 16,384 ≈ 393KB; with the 64-element limit, the maximum held memory is about 1.5KB (64 * 24 bytes). Also note that the framer already has a data buffer that grows up to 16KB, and after this change that buffer should no longer be used for Data frames.
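A minimal sketch of that caching policy, assuming `l.writeBuf` is a `[][]byte` field reused across Data frames (the name `maxCachedWriteBufLen` and the helper are illustrative, not the actual PR code):

```go
const maxCachedWriteBufLen = 64 // illustrative name for the 64-element limit

// releaseWriteBuf sketches the cleanup step after Write returns.
func (l *loopyWriter) releaseWriteBuf() {
	// Point 1: drop the references so the underlying byte slices can be
	// GCed even while the backing array of writeBuf itself is cached.
	clear(l.writeBuf)
	// Point 2: only cache small backing arrays; unusually large ones are
	// freed so a one-off huge frame doesn't pin memory forever.
	if cap(l.writeBuf) > maxCachedWriteBufLen {
		l.writeBuf = nil
		return
	}
	l.writeBuf = l.writeBuf[:0]
}
```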
In internal/transport/http_util.go (outdated diff):

```go
if len(d) == 0 {
	continue
}
```
Would this be a bug if it were zero? I would have expected it to be.

If it is, then we should delete this check. `Write` should already handle a zero-length buffer as a no-op anyway.
Removed. There should not be any empty buffers in the list, since `Peek()` filters them out. This was an artifact from the time I spent root-causing unexpected behavior in the local benchmarks with large payloads.
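To illustrate the invariant being relied on, here is a hedged sketch of a `Peek`-style helper that skips empty slices (the real `Peek` lives in gRPC's `mem` package; this standalone version only mirrors the filtering behavior):

```go
// peek appends up to n bytes' worth of slices from src to dst without
// consuming them, skipping zero-length slices so callers never see
// empty buffers in the returned list.
func peek(src [][]byte, n int, dst [][]byte) [][]byte {
	for _, b := range src {
		if n <= 0 {
			break
		}
		if len(b) == 0 {
			continue // empty slices are filtered out here
		}
		if len(b) > n {
			b = b[:n] // take only the bytes still needed
		}
		dst = append(dst, b)
		n -= len(b)
	}
	return dst
}
```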
In internal/transport/controlbuf.go (outdated diff):

```go
// This must never happen since the reader must have at least dSize
// bytes.
clear(l.writeBuf)
l.writeBuf = nil
```
If this is impossible then:

- `logger.Error` seems like a good idea, unless the caller already does that with what we return.
- We probably don't need to bother with the clear/nil (and surely don't want to do both)?
Added an error log and removed the buffer resetting.
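A hedged sketch of the resulting shape of that branch (the exact log message and surrounding code are assumptions, not the PR's literal diff):

```go
l.writeBuf, err = reader.Peek(dSize, l.writeBuf)
if err != nil {
	// Should be unreachable: the reader is guaranteed to hold at least
	// dSize bytes. Log loudly so a violated invariant is visible.
	logger.Errorf("transport: unexpected error peeking %d bytes: %v", dSize, err)
	return err
}
```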
This PR removes two buffer copies while writing Data frames to the underlying net.Conn: one [within gRPC](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/controlbuf.go#L1009-L1022) and the other [in the framer](https://cs.opensource.google/go/x/net/+/master:http2/frame.go;l=743;drc=6e243da531559f8c99439dabc7647dec07191f9b). Care is taken to avoid any extra heap allocations, which can affect performance for smaller payloads.

A [CL](https://go-review.git.corp.google.com/c/net/+/711620) that allows using the framer to write frame headers is out for review. This PR duplicates the header-writing code as a temporary workaround and will be merged only after the CL is merged.

## Results

### Small payloads

Performance for small payloads increases slightly due to the removal of a `defer` statement.

```
$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
  -compression=off -maxConcurrentCalls=120 -trace=off \
  -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"
$ go run benchmark/benchresult/main.go unary-before unary-after
       Title         Before        After        Percentage
     TotalOps      7600878       7653522        0.69%
      SendOps            0             0         NaN%
      RecvOps            0             0         NaN%
     Bytes/op     10007.07      10000.89       -0.07%
    Allocs/op       146.93        146.91        0.00%
      ReqT/op 101345040.00  102046960.00        0.69%
     RespT/op 101345040.00  102046960.00        0.69%
     50th-Lat    833.724µs     830.041µs       -0.44%
     90th-Lat   1.281969ms    1.275336ms       -0.52%
     99th-Lat   2.403961ms    2.360606ms       -1.80%
      Avg-Lat    946.123µs     939.734µs       -0.68%
    GoVersion     go1.24.8      go1.24.8
  GrpcVersion   1.77.0-dev    1.77.0-dev
```

### Large payloads

Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev machine. The profiles show increased time spent in the copy operation [inside the buffered writer](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/http_util.go#L334). Counterintuitively, copying the gRPC header and message data into a larger buffer increased performance by 4% compared to master.

To validate this behaviour (an extra copy increasing performance), I ran [the k8s benchmark for 1 MB payloads](https://github.com/grpc/grpc/blob/65c9be86830b0e423dd970c066c69a06a9240298/tools/run_tests/performance/scenario_config.py#L291-L305) with 100 concurrent streams. Across multiple runs it showed a ~5% increase in QPS without the copies; adding a copy reduced performance.

Load test config file: [loadtest.yaml](https://github.com/user-attachments/files/23055312/loadtest.yaml)

```
# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time: 142.533
Client system time: 97.2688
Client user time: 144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time: 147.668
Client system time: 101.973
Client user time: 150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time: 87.1783
Client system time: 60.1945
Client user time: 103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time: 102.549
Client system time: 80.9804
Client user time: 107.103
```

RELEASE NOTES:

* transport: Avoid two buffer copies when writing Data frames.
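As background for the header-writing workaround mentioned above, here is a hedged sketch of writing an HTTP/2 DATA frame header by hand and then sending the payload slices directly, avoiding the framer's internal copy. All names are illustrative; this is not the PR's actual code.

```go
import (
	"encoding/binary"
	"io"
)

// writeDataFrame sketches the no-copy write path: a hand-built 9-byte
// HTTP/2 frame header followed by the payload slices themselves.
func writeDataFrame(w io.Writer, streamID uint32, endStream bool, bufs [][]byte) error {
	var size int
	for _, b := range bufs {
		size += len(b)
	}
	var hdr [9]byte
	hdr[0], hdr[1], hdr[2] = byte(size>>16), byte(size>>8), byte(size) // 24-bit length
	hdr[3] = 0x0                                                      // frame type: DATA
	if endStream {
		hdr[4] = 0x1 // END_STREAM flag
	}
	binary.BigEndian.PutUint32(hdr[5:], streamID&0x7fffffff) // reserved bit + 31-bit stream ID
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	// Each payload slice goes straight to the (buffered) writer; there is
	// no copy into a contiguous frame buffer.
	for _, b := range bufs {
		if _, err := w.Write(b); err != nil {
			return err
		}
	}
	return nil
}
```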