@arjan-bal

Original PRs: #8657, #8667

RELEASE NOTES:

  • transport: Avoid copies when reading and writing Data frames.

This change incorporates changes from
golang/go#73560 to split reading HTTP/2 frame
headers and payloads. If the frame is not a Data frame, it's read
through the standard library framer as before. For Data frames, the
payload is read directly into a buffer from the buffer pool to avoid
copying it from the framer's buffer.
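
The split between header and payload reads described above can be sketched as follows. This is a minimal illustration using only the Go standard library, not the code from this PR: `frameHeader`, `readFrameHeader`, `readDataPayload`, and `bufPool` are hypothetical names, a plain `sync.Pool` stands in for gRPC's buffer pool, and DATA frame padding is ignored.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"sync"
)

const (
	frameHeaderLen = 9   // fixed HTTP/2 frame header size
	frameTypeData  = 0x0 // DATA frame type
)

// frameHeader is a hypothetical, minimal view of an HTTP/2 frame header.
type frameHeader struct {
	Length   uint32 // 24-bit payload length
	Type     uint8
	Flags    uint8
	StreamID uint32 // 31 bits; the reserved top bit is masked off
}

// bufPool stands in for a real buffer pool (assumption for this sketch).
var bufPool = sync.Pool{New: func() any { b := make([]byte, 0, 16*1024); return &b }}

// readFrameHeader reads only the 9-byte header and leaves the payload unread.
func readFrameHeader(r io.Reader) (frameHeader, error) {
	var raw [frameHeaderLen]byte
	if _, err := io.ReadFull(r, raw[:]); err != nil {
		return frameHeader{}, err
	}
	return frameHeader{
		Length:   uint32(raw[0])<<16 | uint32(raw[1])<<8 | uint32(raw[2]),
		Type:     raw[3],
		Flags:    raw[4],
		StreamID: binary.BigEndian.Uint32(raw[5:9]) & 0x7fffffff,
	}, nil
}

// readDataPayload reads the payload straight into a pooled buffer, so there is
// no intermediate framer buffer to copy from. The caller returns the buffer to
// the pool when done.
func readDataPayload(r io.Reader, h frameHeader) (*[]byte, error) {
	bp := bufPool.Get().(*[]byte)
	if cap(*bp) < int(h.Length) {
		*bp = make([]byte, h.Length)
	}
	*bp = (*bp)[:h.Length]
	if _, err := io.ReadFull(r, *bp); err != nil {
		bufPool.Put(bp)
		return nil, err
	}
	return bp, nil
}

// demo parses one hand-built DATA frame (stream 1, END_STREAM, payload "hello").
func demo() (frameHeader, string, error) {
	wire := []byte{0, 0, 5, frameTypeData, 0x1, 0, 0, 0, 1, 'h', 'e', 'l', 'l', 'o'}
	r := bytes.NewReader(wire)
	h, err := readFrameHeader(r)
	if err != nil {
		return frameHeader{}, "", err
	}
	if h.Type != frameTypeData {
		// In the PR, non-DATA frames go through the standard library framer.
		return h, "", fmt.Errorf("non-DATA frame type %d", h.Type)
	}
	buf, err := readDataPayload(r, h)
	if err != nil {
		return h, "", err
	}
	defer bufPool.Put(buf)
	return h, string(*buf), nil
}

func main() {
	h, payload, err := demo()
	fmt.Println(h, payload, err)
}
```

The point of the sketch is that the payload bytes land in their final buffer on the first read; only the fixed-size header passes through a temporary array.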

## Testing
For 1 MB payloads, this results in ~4% improvement in throughput.

```sh
# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=streaming \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=1000000 -respSizeBytes=1000000 -networkMode=Local -resultFile="${RUN_NAME}"

# comparison
go run benchmark/benchresult/main.go streaming-before streaming-after  
               Title       Before        After Percentage
            TotalOps        87536        91120     4.09%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4074102.92   4070489.30    -0.09%
           Allocs/op        83.60        76.55    -8.37%
             ReqT/op 11671466666.67 12149333333.33     4.09%
            RespT/op 11671466666.67 12149333333.33     4.09%
            50th-Lat  78.209875ms  75.159943ms    -3.90%
            90th-Lat 117.764228ms   107.8697ms    -8.40%
            99th-Lat 146.935704ms 139.069685ms    -5.35%
             Avg-Lat  82.310691ms  79.073282ms    -3.93%
           GoVersion     go1.24.7     go1.24.7
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

For smaller payloads, the difference is minor.
```sh
go run benchmark/benchmain/main.go -benchtime=60s -workloads=streaming \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

go run benchmark/benchresult/main.go streaming-before streaming-after 
               Title       Before        After Percentage
            TotalOps     21490752     21477822    -0.06%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      1902.92      1902.94     0.00%
           Allocs/op        29.21        29.21     0.00%
             ReqT/op 286543360.00 286370960.00    -0.06%
            RespT/op 286543360.00 286370960.00    -0.06%
            50th-Lat    352.505µs    352.247µs    -0.07%
            90th-Lat    433.446µs    434.907µs     0.34%
            99th-Lat    536.445µs    539.759µs     0.62%
             Avg-Lat    333.403µs    333.457µs     0.02%
           GoVersion     go1.24.7     go1.24.7
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

RELEASE NOTES:
* transport: Avoid a buffer copy when reading data.
…c#8667)

This PR removes 2 buffer copies while writing data frames to the
underlying net.Conn: one [within
gRPC](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/controlbuf.go#L1009-L1022)
and the other [in the
framer](https://cs.opensource.google/go/x/net/+/master:http2/frame.go;l=743;drc=6e243da531559f8c99439dabc7647dec07191f9b).
Care is taken to avoid any extra heap allocations which can affect
performance for smaller payloads.

A [CL](https://go-review.git.corp.google.com/c/net/+/711620) is out for
review which allows using the framer to write frame headers. This PR
duplicates the header writing code as a temporary workaround. This PR
will be merged only after the CL is merged.
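
To illustrate the idea of writing the frame header separately and then passing the payload slices straight to the writer, here is a rough sketch using only the Go standard library. It is not the code from this PR: `writeDataFrame` and `encodeDemo` are hypothetical names, the `bufio.Writer` only stands in for gRPC's buffered writer, and flow control and maximum frame size negotiation are left out.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// maxFrameLen is the largest length the 24-bit length field can encode.
// Real peers negotiate a smaller limit; that is ignored in this sketch.
const maxFrameLen = 1<<24 - 1

// writeDataFrame writes a 9-byte DATA frame header followed by each chunk.
// The chunks are never concatenated into a temporary buffer first, which is
// the kind of copy the PR description says is being removed.
func writeDataFrame(w io.Writer, streamID uint32, endStream bool, chunks ...[]byte) error {
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	if total > maxFrameLen {
		return fmt.Errorf("payload of %d bytes does not fit in one frame", total)
	}
	var flags byte
	if endStream {
		flags = 0x1 // END_STREAM
	}
	hdr := [9]byte{
		byte(total >> 16), byte(total >> 8), byte(total), // 24-bit length
		0x0,   // frame type: DATA
		flags, // frame flags
		byte(streamID>>24) & 0x7f, byte(streamID >> 16), byte(streamID >> 8), byte(streamID),
	}
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	for _, c := range chunks {
		if _, err := w.Write(c); err != nil {
			return err
		}
	}
	return nil
}

// encodeDemo writes a 5-byte gRPC message prefix and a 2-byte message as one
// DATA frame on stream 3 and returns the wire bytes as hex.
func encodeDemo() string {
	var conn bytes.Buffer // stands in for the net.Conn
	bw := bufio.NewWriter(&conn)
	grpcPrefix := []byte{0, 0, 0, 0, 2} // uncompressed, message length 2
	if err := writeDataFrame(bw, 3, false, grpcPrefix, []byte("hi")); err != nil {
		return err.Error()
	}
	if err := bw.Flush(); err != nil {
		return err.Error()
	}
	return fmt.Sprintf("%x", conn.Bytes())
}

func main() {
	fmt.Println(encodeDemo())
}
```

Keeping the header in a fixed-size array is one way to avoid allocating a separate header slice per frame, in the spirit of the note above about extra heap allocations hurting small payloads; whether a given value stays off the heap depends on the compiler's escape analysis.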

## Results

### Small payloads
Performance for small payloads increases slightly due to the removal
of a `defer` statement.
```
$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

$ go run benchmark/benchresult/main.go unary-before unary-after
               Title       Before        After Percentage
            TotalOps      7600878      7653522     0.69%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10007.07     10000.89    -0.07%
           Allocs/op       146.93       146.91     0.00%
             ReqT/op 101345040.00 102046960.00     0.69%
            RespT/op 101345040.00 102046960.00     0.69%
            50th-Lat    833.724µs    830.041µs    -0.44%
            90th-Lat   1.281969ms   1.275336ms    -0.52%
            99th-Lat   2.403961ms   2.360606ms    -1.80%
             Avg-Lat    946.123µs    939.734µs    -0.68%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

### Large payloads
Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev
machine. The profiles show increased time spent in the copy operation
[inside the buffered
writer](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/http_util.go#L334).
Counterintuitively, copying the gRPC header and message data into a
larger buffer increased the performance by 4% (compared to master).

To check whether the extra copy really improves performance, I ran
[the k8s benchmark for 1MB
payloads](https://github.com/grpc/grpc/blob/65c9be86830b0e423dd970c066c69a06a9240298/tools/run_tests/performance/scenario_config.py#L291-L305)
with 100 concurrent streams. Across multiple runs it showed a ~5%
increase in QPS without the copies, and adding a copy back reduced the
performance.

Load test config file:
[loadtest.yaml](https://github.com/user-attachments/files/23055312/loadtest.yaml)

```
# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time:   142.533
Client system time: 97.2688
Client user time:   144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time:   147.668
Client system time: 101.973
Client user time:   150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time:   87.1783
Client system time: 60.1945
Client user time:   103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time:   102.549
Client system time: 80.9804
Client user time:   107.103
```
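
As a quick check on the relative changes, the QPS figures above can be run through the usual percentage-change formula, `(after - before) / before * 100`. The helper name `pctChange` is made up for this sketch, and the assumption that the benchmark tool's Percentage column uses the same formula is mine.

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// QPS values copied from the load test output above.
	fmt.Printf("30 core QPS: %+.2f%%\n", pctChange(498.284, 526.776))
	fmt.Printf("8 core QPS:  %+.2f%%\n", pctChange(291.049, 334.119))
	// TotalOps from the 1 MB streaming benchmark in the first PR description.
	fmt.Printf("TotalOps:    %+.2f%%\n", pctChange(87536, 91120))
}
```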

RELEASE NOTES:
* transport: Avoid two buffer copies when writing Data frames.
@arjan-bal arjan-bal added this to the 1.77 Release milestone Nov 3, 2025
@arjan-bal arjan-bal added Type: Performance Performance improvements (CPU, network, memory, etc) Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. labels Nov 3, 2025

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 89.10256% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.23%. Comparing base (f959da6) to head (13bb904).
⚠️ Report is 1 commits behind head on v1.77.x.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/transport/controlbuf.go | 63.15% | 3 Missing and 4 partials ⚠️ |
| internal/transport/http_util.go | 93.02% | 3 Missing and 3 partials ⚠️ |
| mem/buffer_slice.go | 93.33% | 1 Missing and 1 partial ⚠️ |
| internal/transport/http2_client.go | 90.90% | 1 Missing ⚠️ |
| internal/transport/http2_server.go | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           v1.77.x    #8690      +/-   ##
===========================================
+ Coverage    82.21%   83.23%   +1.01%     
===========================================
  Files          417      417              
  Lines        32198    32296      +98     
===========================================
+ Hits         26472    26880     +408     
- Misses        4021     4037      +16     
+ Partials      1705     1379     -326     

| Files with missing lines | Coverage Δ |
|---|---|
| mem/buffer_pool.go | 100.00% <ø> (ø) |
| internal/transport/http2_client.go | 92.71% <90.90%> (+15.78%) ⬆️ |
| internal/transport/http2_server.go | 91.30% <90.00%> (ø) |
| mem/buffer_slice.go | 96.45% <93.33%> (-0.85%) ⬇️ |
| internal/transport/http_util.go | 94.53% <93.02%> (-0.68%) ⬇️ |
| internal/transport/controlbuf.go | 89.50% <63.15%> (-0.75%) ⬇️ |

... and 23 files with indirect coverage changes

@arjan-bal arjan-bal merged commit 4288cfc into grpc:v1.77.x Nov 3, 2025
17 checks passed
@arjan-bal arjan-bal deleted the cherrypick-copyless-framer branch November 3, 2025 10:49