
http: optimize header map implementation for improved performance #39252


Draft: wants to merge 1 commit into main

@agrawroh (Contributor) commented Apr 28, 2025

Description:

This PR introduces an optimized implementation of the HeaderMap interface for common header operations. It stores headers in a vector and indexes them with a hash map for lookup. In the microbenchmarks below (baseline Time divided by optimized Time), it is:

  • about 1.2x to 1.5x faster for header insertion
  • about 2.6x to 3.7x faster for header lookup
  • about 2.8x to 4.2x faster for header iteration

This optimization should benefit Envoy deployments that handle significant HTTP traffic, particularly those with many headers per request or high request rates. Faster lookup and iteration should reduce CPU usage in header-processing code paths.

Benchmark

Benchmarks were run on both macOS and Linux using the checked-in benchmark suite:

bazel run -c opt //test/benchmark:header_map_benchmark_test

Benchmark Results

Linux

Run on (32 X 3499.53 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1280 KiB (x16)
  L3 Unified 55296 KiB (x1)
Load Average: 10.41, 17.06, 16.77
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
BM_HeaderMapImplInsert/10/10                 793 ns          793 ns       881801 items_per_second=12.6078M/s
BM_HeaderMapImplInsert/50/20                4327 ns         4327 ns       162107 items_per_second=11.556M/s
BM_HeaderMapImplInsert/100/50               9559 ns         9559 ns        73359 items_per_second=10.4614M/s
BM_HeaderMapOptimizedInsert/10/10            678 ns          678 ns      1037087 items_per_second=14.7545M/s
BM_HeaderMapOptimizedInsert/50/20           3528 ns         3528 ns       194380 items_per_second=14.1736M/s
BM_HeaderMapOptimizedInsert/100/50          8098 ns         8098 ns        85731 items_per_second=12.3495M/s
BM_HeaderMapImplLookup/10/10                 355 ns          355 ns      1973543 items_per_second=28.194M/s
BM_HeaderMapImplLookup/50/20                1909 ns         1909 ns       366985 items_per_second=26.1928M/s
BM_HeaderMapImplLookup/100/50               3914 ns         3914 ns       178833 items_per_second=25.5496M/s
BM_HeaderMapOptimizedLookup/10/10            136 ns          136 ns      5100593 items_per_second=73.6141M/s
BM_HeaderMapOptimizedLookup/50/20            719 ns          719 ns       939043 items_per_second=69.5108M/s
BM_HeaderMapOptimizedLookup/100/50          1423 ns         1423 ns       494690 items_per_second=70.2956M/s
BM_HeaderMapImplIteration/10/10             24.2 ns         24.2 ns     29162575 items_per_second=413.807M/s
BM_HeaderMapImplIteration/50/20             87.6 ns         87.6 ns      7993415 items_per_second=570.628M/s
BM_HeaderMapImplIteration/100/50             160 ns          160 ns      4395914 items_per_second=623.138M/s
BM_HeaderMapOptimizedIteration/10/10        5.76 ns         5.76 ns    122007983 items_per_second=1.73746G/s
BM_HeaderMapOptimizedIteration/50/20        28.3 ns         28.3 ns     24778009 items_per_second=1.76795G/s
BM_HeaderMapOptimizedIteration/100/50       56.2 ns         56.2 ns     12446980 items_per_second=1.77809G/s

NOTE: For Time and CPU, lower is better; for items_per_second, higher is better.

[Image: Benchmark Graph Request]

macOS

Run on (12 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x12)
Load Average: 13.55, 9.45, 7.33
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
BM_HeaderMapImplInsert/10/10                 934 ns          933 ns       741706 items_per_second=10.7156M/s
BM_HeaderMapImplInsert/50/20                4058 ns         4057 ns       168786 items_per_second=12.3242M/s
BM_HeaderMapImplInsert/100/50               7996 ns         7996 ns        87223 items_per_second=12.5057M/s
BM_HeaderMapOptimizedInsert/10/10            634 ns          634 ns      1091669 items_per_second=15.7718M/s
BM_HeaderMapOptimizedInsert/50/20           3272 ns         3261 ns       215173 items_per_second=15.334M/s
BM_HeaderMapOptimizedInsert/100/50          6628 ns         6624 ns       104356 items_per_second=15.0962M/s
BM_HeaderMapImplLookup/10/10                 256 ns          256 ns      2726993 items_per_second=39.1321M/s
BM_HeaderMapImplLookup/50/20                1315 ns         1315 ns       539204 items_per_second=38.015M/s
BM_HeaderMapImplLookup/100/50               2749 ns         2748 ns       259331 items_per_second=36.392M/s
BM_HeaderMapOptimizedLookup/10/10           68.7 ns         68.6 ns     10155672 items_per_second=145.786M/s
BM_HeaderMapOptimizedLookup/50/20            355 ns          355 ns      1815089 items_per_second=140.964M/s
BM_HeaderMapOptimizedLookup/100/50           748 ns          748 ns       927435 items_per_second=133.734M/s
BM_HeaderMapImplIteration/10/10             12.3 ns         12.3 ns     55999552 items_per_second=812.345M/s
BM_HeaderMapImplIteration/50/20             53.9 ns         53.9 ns     12935893 items_per_second=928.35M/s
BM_HeaderMapImplIteration/100/50             101 ns          101 ns      6858711 items_per_second=987.991M/s
BM_HeaderMapOptimizedIteration/10/10        3.55 ns         3.55 ns    196841537 items_per_second=2.81886G/s
BM_HeaderMapOptimizedIteration/50/20        15.5 ns         15.5 ns     44665646 items_per_second=3.23397G/s
BM_HeaderMapOptimizedIteration/100/50       30.4 ns         30.4 ns     22931722 items_per_second=3.28603G/s

NOTE: For Time and CPU, lower is better; for items_per_second, higher is better.

[Image: Benchmark Graph Request (1)]


Commit Message: http: optimize header map implementation for improved performance
Risk Level: Low
Testing: Unit tests and benchmarks added
Docs Changes: N/A
Release Notes: Added
Platform Specific: No

As a reminder, PRs marked as draft will not be automatically assigned reviewers or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #39252 was opened by agrawroh.


@agrawroh force-pushed the perf-hdr branch 2 times, most recently from 77ef6ec to 0bdfe7b on April 28, 2025 17:18
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
@ggreenway (Member)

I'd like to see a performance comparison that mimics real-world access patterns. Specifically, compare workloads that make heavy use of core headers such as :path and :authority, and the other headers that Envoy must access for routing/processing. In the current implementation there are two classes of headers, and all of the core-protocol headers are in the "fast" path, so I'd like to see a comparison of that.

Also, I'd like to see results in a full Envoy end-to-end test to see how much of a difference this change makes in overall performance.
