transport: Remove buffer copies while writing HTTP/2 Data frames #8667
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #8667      +/-   ##
==========================================
- Coverage   83.43%   82.88%   -0.56%
==========================================
  Files         415      415
  Lines       32195    32261      +66
==========================================
- Hits        26863    26739     -124
- Misses       3980     4029      +49
- Partials     1352     1493     +141
```
@dfawley: Again moving to your plate if you feel like having a second look.
In internal/transport/controlbuf.go:

```go
}
if dSize > 0 {
	var err error
	l.writeBuf, err = reader.Peek(dSize, l.writeBuf)
```
It seems like this buffer can only grow and never shrinks.

1. What happens if a slice holds a pointer to a huge amount of data? I believe it isn't possible to free it, but am not certain. E.g.

   ```go
   l.writeBuf = [][]byte{nil, nil, nil, nil, nil, nil, make([]byte, 10<<30)} // ~10 GiB
   l.writeBuf = l.writeBuf[:0]
   ```

2. What happens if `cap(l.writeBuf)` grows to a large value and then we never need it to be that large ever again?

I think we need to have some way to scale this buffer back down.
For point 1, I've updated the code to clear the buffer after calling `Write`. This releases the references to all the slices and allows them to be GCed.

For point 2, I've set a limit of 64 on the buffer's length. If a buffer is longer than that, it's freed immediately after use instead of being cached.

Background on the 64-element limit: the `BufferSlice` from the proto codec is a single element, so with a potential gRPC header the length is almost always 2. Custom codecs might produce larger slices, but 64 is a generous limit that covers common cases without caching excessive memory.

This change also mitigates a worst-case memory scenario. Since `Peek()` filters out empty slices, a 16KB HTTP/2 Data frame (the max size) could theoretically be split into 16K (16,384) distinct 1-byte slices. In that case, the memory overhead for the slice headers alone would be 24 bytes * 16,384 ≈ 393KB; with the 64-element limit, the maximum held memory is about 1.5KB (64 * 24 bytes). Also note that the framer already has a data buffer that grows up to 16KB, and after this change that buffer should no longer be used for Data frames.
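A minimal sketch of that caching policy, assuming `l.writeBuf` is a `[][]byte` field reused across Data frames (the name `maxCachedWriteBufLen` and the helper are illustrative, not the actual PR code):

```go
const maxCachedWriteBufLen = 64 // illustrative name for the 64-element limit

// releaseWriteBuf sketches the cleanup step after Write returns.
func (l *loopyWriter) releaseWriteBuf() {
	// Point 1: drop the references so the underlying byte slices can be
	// GCed even while the backing array of writeBuf itself is cached.
	clear(l.writeBuf)
	// Point 2: only cache small backing arrays; unusually large ones are
	// freed so a one-off huge frame doesn't pin memory forever.
	if cap(l.writeBuf) > maxCachedWriteBufLen {
		l.writeBuf = nil
		return
	}
	l.writeBuf = l.writeBuf[:0]
}
```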
In internal/transport/http_util.go (outdated diff):

```go
if len(d) == 0 {
	continue
}
```
Would this be a bug if it were zero? I would have expected it to be.

If it is, then we should delete this check. `Write` should already handle a zero-length buffer as a no-op anyway.
Removed. There should not be any empty buffers in the list, since `Peek()` filters them out. This was an artifact from the time I spent root-causing unexpected behavior in the local benchmarks with large payloads.
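To illustrate the invariant being relied on, here is a hedged sketch of a `Peek`-style helper that skips empty slices (the real `Peek` lives in gRPC's `mem` package; this standalone version only mirrors the filtering behavior):

```go
// peek appends up to n bytes' worth of slices from src to dst without
// consuming them, skipping zero-length slices so callers never see
// empty buffers in the returned list.
func peek(src [][]byte, n int, dst [][]byte) [][]byte {
	for _, b := range src {
		if n <= 0 {
			break
		}
		if len(b) == 0 {
			continue // empty slices are filtered out here
		}
		if len(b) > n {
			b = b[:n] // take only the bytes still needed
		}
		dst = append(dst, b)
		n -= len(b)
	}
	return dst
}
```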
In internal/transport/controlbuf.go (outdated diff):

```go
// This must never happen since the reader must have at least dSize
// bytes.
clear(l.writeBuf)
l.writeBuf = nil
```
If this is impossible then:

- `logger.Error` seems like a good idea, unless the caller already does that with what we return.
- We probably don't need to bother with the clear/nil (and surely don't want to do both)?
Added an error log and removed the buffer resetting.
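A hedged sketch of the resulting shape of that branch (the exact log message and surrounding code are assumptions, not the PR's literal diff):

```go
l.writeBuf, err = reader.Peek(dSize, l.writeBuf)
if err != nil {
	// Should be unreachable: the reader is guaranteed to hold at least
	// dSize bytes. Log loudly so a violated invariant is visible.
	logger.Errorf("transport: unexpected error peeking %d bytes: %v", dSize, err)
	return err
}
```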
This PR removes two buffer copies while writing Data frames to the underlying net.Conn: one [within gRPC](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/controlbuf.go#L1009-L1022) and the other [in the framer](https://cs.opensource.google/go/x/net/+/master:http2/frame.go;l=743;drc=6e243da531559f8c99439dabc7647dec07191f9b). Care is taken to avoid any extra heap allocations, which can affect performance for smaller payloads.

A [CL](https://go-review.git.corp.google.com/c/net/+/711620) that allows using the framer to write frame headers is out for review. This PR duplicates the header-writing code as a temporary workaround and will be merged only after the CL is merged.

## Results

### Small payloads

Performance for small payloads increases slightly due to the removal of a `defer` statement.

```
$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
  -compression=off -maxConcurrentCalls=120 -trace=off \
  -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"
$ go run benchmark/benchresult/main.go unary-before unary-after
       Title         Before        After        Percentage
     TotalOps      7600878       7653522        0.69%
      SendOps            0             0         NaN%
      RecvOps            0             0         NaN%
     Bytes/op     10007.07      10000.89       -0.07%
    Allocs/op       146.93        146.91        0.00%
      ReqT/op 101345040.00  102046960.00        0.69%
     RespT/op 101345040.00  102046960.00        0.69%
     50th-Lat    833.724µs     830.041µs       -0.44%
     90th-Lat   1.281969ms    1.275336ms       -0.52%
     99th-Lat   2.403961ms    2.360606ms       -1.80%
      Avg-Lat    946.123µs     939.734µs       -0.68%
    GoVersion     go1.24.8      go1.24.8
  GrpcVersion   1.77.0-dev    1.77.0-dev
```

### Large payloads

Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev machine. The profiles show increased time spent in the copy operation [inside the buffered writer](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/http_util.go#L334). Counterintuitively, copying the gRPC header and message data into a larger buffer increased performance by 4% compared to master.

To validate this behaviour (an extra copy increasing performance), I ran [the k8s benchmark for 1 MB payloads](https://github.com/grpc/grpc/blob/65c9be86830b0e423dd970c066c69a06a9240298/tools/run_tests/performance/scenario_config.py#L291-L305) with 100 concurrent streams. Across multiple runs it showed a ~5% increase in QPS without the copies; adding a copy reduced performance.

Load test config file: [loadtest.yaml](https://github.com/user-attachments/files/23055312/loadtest.yaml)

```
# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time: 142.533
Client system time: 97.2688
Client user time: 144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time: 147.668
Client system time: 101.973
Client user time: 150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time: 87.1783
Client system time: 60.1945
Client user time: 103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time: 102.549
Client system time: 80.9804
Client user time: 107.103
```

RELEASE NOTES:

* transport: Avoid two buffer copies when writing Data frames.
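As background for the header-writing workaround mentioned above, here is a hedged sketch of writing an HTTP/2 DATA frame header by hand and then sending the payload slices directly, avoiding the framer's internal copy. All names are illustrative; this is not the PR's actual code.

```go
import (
	"encoding/binary"
	"io"
)

// writeDataFrame sketches the no-copy write path: a hand-built 9-byte
// HTTP/2 frame header followed by the payload slices themselves.
func writeDataFrame(w io.Writer, streamID uint32, endStream bool, bufs [][]byte) error {
	var size int
	for _, b := range bufs {
		size += len(b)
	}
	var hdr [9]byte
	hdr[0], hdr[1], hdr[2] = byte(size>>16), byte(size>>8), byte(size) // 24-bit length
	hdr[3] = 0x0                                                      // frame type: DATA
	if endStream {
		hdr[4] = 0x1 // END_STREAM flag
	}
	binary.BigEndian.PutUint32(hdr[5:], streamID&0x7fffffff) // reserved bit + 31-bit stream ID
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	// Each payload slice goes straight to the (buffered) writer; there is
	// no copy into a contiguous frame buffer.
	for _, b := range bufs {
		if _, err := w.Write(b); err != nil {
			return err
		}
	}
	return nil
}
```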