quic: refactor QUICHE flag implementation + eliminate scalability bottleneck #18600

Open
2 tasks
lkpdn opened this issue Oct 13, 2021 · 0 comments
Labels
area/perf area/quic no stalebot Disables stalebot from closing an issue

Comments

lkpdn commented Oct 13, 2021

Title

Refactor QUICHE flag implementation + eliminate scalability bottleneck.

Description

QUICHE QUIC_FLAG implementation should be backed by Envoy runtime features not only because it's the canonical way (*1) but also for performance reasons.

We have observed that some quiche::TypedFlag accesses on the QUICHE datapath become a major scaling bottleneck when downstream QUIC traffic is heavy.

Simply replacing absl::MutexLock in the TypedFlag implementation with absl::ReaderMutexLock or absl::WriterMutexLock where possible is not sufficient. The bottleneck would just move from contention on the mutex to the atomic operation on the refcount that each Envoy worker thread performs when taking the ReaderMutexLock. We need truly scalable flag access. Unfortunately, Envoy::Runtime::runtimeFeatureEnabled is not contention-free either: the snapshot created by createNewSnapshot() is shared by every worker thread, which again leads to refcount contention.
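To make the three read paths discussed above concrete, here is a minimal sketch using only the C++17 standard library. All class and function names are hypothetical stand-ins, not the real QUICHE or Envoy types, and std::shared_mutex is used in place of absl::Mutex purely for illustration. The point is that every variant performs a write to shared state on each read.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-ins; none of these are the real QUICHE or Envoy types.

// (a) Exclusive lock per read: concurrent readers serialize on the mutex.
class MutexFlag {
public:
  explicit MutexFlag(bool v) : value_(v) {}
  bool value() const {
    std::lock_guard<std::mutex> lock(mu_);
    return value_;
  }
  void setValue(bool v) {
    std::lock_guard<std::mutex> lock(mu_);
    value_ = v;
  }

private:
  mutable std::mutex mu_;
  bool value_;
};

// (b) Reader lock per read: readers no longer exclude each other, but
// acquiring the shared lock typically still updates shared lock state
// atomically, so the cache line keeps bouncing between cores.
class SharedLockFlag {
public:
  explicit SharedLockFlag(bool v) : value_(v) {}
  bool value() const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    return value_;
  }
  void setValue(bool v) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    value_ = v;
  }

private:
  mutable std::shared_mutex mu_;
  bool value_;
};

// (c) Shared snapshot: each read copies a shared_ptr, which increments and
// later decrements the atomic refcount in the shared control block.
struct Snapshot {
  bool feature_enabled;
};

bool readViaSharedSnapshot(const std::shared_ptr<const Snapshot>& shared) {
  assert(shared != nullptr);
  std::shared_ptr<const Snapshot> local = shared; // atomic refcount increment
  return local->feature_enabled;                  // decrement when local dies
}
```

In each case the value being read never changes on the hot path, yet every worker thread still writes to a memory location that all other workers also write to.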

So, we need two steps:

  • 1). First, make the QUICHE QUIC_FLAG implementation backed by Envoy runtime features.
    • From a performance perspective, this does not make much difference by itself. Without the second step, this would just move the bottleneck from the absl::MutexLocks inside quiche::TypedFlag to the shared_ptr to SnapshotImpl, assuming that flags are accessed through runtimeFeatureEnabled() (*2).
  • 2). Then, eliminate the contention that occurs when multiple worker threads frequently call runtimeFeatureEnabled().
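One possible shape for the second step is sketched below, under stated assumptions: each worker keeps its own cached reference to the current snapshot and only refreshes it when a generation counter changes. SnapshotPublisher, WorkerFlagCache, and FeatureSnapshot are hypothetical names for illustration, not Envoy's Runtime API. In Envoy itself, the existing thread-local slot mechanism may be a more natural home for the per-worker state.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical names throughout; this is not Envoy's Runtime API.
struct FeatureSnapshot {
  bool quic_flag_enabled;
};

// Owned by the main thread; publishes a new snapshot on config changes.
class SnapshotPublisher {
public:
  explicit SnapshotPublisher(bool enabled)
      : current_(std::make_shared<const FeatureSnapshot>(FeatureSnapshot{enabled})) {}

  // Rare path: replace the snapshot and bump the generation under the lock.
  void publish(bool enabled) {
    auto next = std::make_shared<const FeatureSnapshot>(FeatureSnapshot{enabled});
    std::lock_guard<std::mutex> lock(mu_);
    current_ = std::move(next);
    generation_.fetch_add(1, std::memory_order_release);
  }

  // Hot path helper: a plain atomic load. The counter is written only on
  // publish, so its cache line can stay shared among all workers.
  std::uint64_t generation() const {
    return generation_.load(std::memory_order_acquire);
  }

  // Slow path: returns a consistent (snapshot, generation) pair.
  std::pair<std::shared_ptr<const FeatureSnapshot>, std::uint64_t> latest() const {
    std::lock_guard<std::mutex> lock(mu_);
    return {current_, generation_.load(std::memory_order_relaxed)};
  }

private:
  mutable std::mutex mu_;
  std::shared_ptr<const FeatureSnapshot> current_;
  std::atomic<std::uint64_t> generation_{1};
};

// One instance per worker thread; never shared between threads.
class WorkerFlagCache {
public:
  explicit WorkerFlagCache(const SnapshotPublisher& publisher) : publisher_(publisher) {}

  bool quicFlagEnabled() {
    if (publisher_.generation() != generation_) {
      // Refcount traffic happens only here, once per published snapshot.
      auto latest = publisher_.latest();
      snapshot_ = std::move(latest.first);
      generation_ = latest.second;
    }
    assert(snapshot_ != nullptr);
    return snapshot_->quic_flag_enabled;
  }

private:
  const SnapshotPublisher& publisher_;
  std::shared_ptr<const FeatureSnapshot> snapshot_;
  std::uint64_t generation_ = 0; // 0 means nothing is cached yet
};
```

With this layout the steady-state read touches only worker-local memory plus one load of a rarely written counter, so the shared_ptr refcount is modified once per worker per configuration change rather than once per flag access.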

(*1) https://github.com/envoyproxy/envoy/blob/1b9c688997f5/source/common/quic/platform/quiche_flags_impl.h#L22-L24
(*2) Here, the kind of flag I am talking about is QUIC_FLAG, not QUIC_PROTOCOL_FLAG. I don't think QUIC_PROTOCOL_FLAG should be backed by Envoy runtime features, as it indicates neither newly introduced features nor deprecated ones. Protocol flags also do not seem to fit "%"-based Envoy features. Rather than runtime configuration, static/dynamic resources might be a better choice for the protocol flags.

Repro Steps

To verify the issue, we need to launch and stress Envoy's QUIC server in a multi-core environment.
The following is just a sample setup:

  • launch two AWS EC2 c6gn.16xlarge instances, one for a client (stresser), the other for an Envoy server.
  • on a server instance
  • on a client instance
    • install h2load (https://github.com/nghttp2/nghttp2/tree/quic).
    • stress the server
      # Assuming the server already started running with the config shown above,
      # for example,
      
      $ SERVER_IP=xxxxxxxxx
      $ for i in $(seq 1 100); do
             sudo docker run --rm -t --network=host localhost/h2load-quic \
                         -c "$i" -t "$i" -m 800 --warm-up-time=40s -D 40s \
                         --tls13-ciphers=TLS_AES_128_GCM_SHA256 \
                         --npn-list h3 --connect-to "$SERVER_IP:443" \
                         https://127.0.0.1/f1m.dat
        done
      
      # From i=1 to about i=15, the total throughput increases almost linearly.
      # From about i=20~30 up to i=80, the total throughput degrades drastically.
      

To-do lists

@lkpdn lkpdn added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Oct 13, 2021
@davinci26 davinci26 added area/perf area/quic no stalebot Disables stalebot from closing an issue and removed enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Oct 18, 2021