Refactor QUICHE flag implementation + eliminate scalability bottleneck.
Description
The QUICHE QUIC_FLAG implementation should be backed by Envoy runtime features, not only because that is the canonical way (*1) but also for performance reasons.
Some quiche::TypedFlag accesses on the QUICHE datapath have been observed to become a major scaling bottleneck when downstream QUIC traffic is heavy.
Simply replacing absl::MutexLock in the TypedFlag implementation with absl::(Reader|Writer)MutexLock where possible is not satisfactory. The bottleneck would just move from contention on the MutexLock to the atomic operation on the refcount that each Envoy worker thread performs in the new ReaderMutexLock. We need truly scalable flag access. Unfortunately, Envoy::Runtime::runtimeFeatureEnabled is not contention-free either, since the snapshot created by createNewSnapshot() is shared by every worker thread, which again leads to refcount contention.
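To make the refcount contention concrete, here is a minimal sketch, not Envoy or QUICHE code. `Snapshot`, `readViaSharedSnapshot`, `AtomicFlag`, and `countEnabledReads` are hypothetical names introduced only for illustration. It contrasts a read path that copies a shared `std::shared_ptr` on every access (an atomic read-modify-write on one shared counter) with a read path that only loads an atomic value.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a runtime snapshot; this is not Envoy's SnapshotImpl.
struct Snapshot {
  bool flag_enabled;
};

// Contended path: every reader copies the same shared_ptr, so each read does an
// atomic increment and a later decrement on one shared reference count.
inline bool readViaSharedSnapshot(const std::shared_ptr<const Snapshot>& shared) {
  std::shared_ptr<const Snapshot> local = shared; // atomic refcount increment
  return local->flag_enabled;                     // decrement when `local` is destroyed
}

// Read-mostly path: the flag lives in an atomic that readers only load.
// A load does not write to the cache line, so readers do not invalidate each other.
class AtomicFlag {
public:
  explicit AtomicFlag(bool v) : value_(v) {}
  bool get() const { return value_.load(std::memory_order_relaxed); }
  void set(bool v) { value_.store(v, std::memory_order_relaxed); }

private:
  std::atomic<bool> value_;
};

// Performs `reads` flag reads on each of `threads` threads through both paths
// and returns how many reads observed `true` in total.
inline long countEnabledReads(int threads, int reads, bool enabled) {
  auto snapshot = std::make_shared<const Snapshot>(Snapshot{enabled});
  AtomicFlag flag(enabled);
  std::atomic<long> total{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&] {
      long seen = 0;
      for (int i = 0; i < reads; ++i) {
        const bool a = readViaSharedSnapshot(snapshot);
        const bool b = flag.get();
        assert(a == b); // both paths must agree on the flag value
        seen += (a ? 1 : 0) + (b ? 1 : 0);
      }
      total.fetch_add(seen, std::memory_order_relaxed);
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return total.load();
}
```

Both paths return the same value; the difference is only in how many cores write to the same cache line per read. Timing the two paths separately on a many-core machine is left out of the sketch on purpose, since the numbers depend heavily on the hardware.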
So, we need two steps:
1) First, make the QUICHE QUIC_FLAG implementation backed by Envoy runtime features.
From a performance perspective, this does not make much difference by itself. Without the second step, it would just move the bottleneck from the absl::MutexLocks inside quiche::TypedFlag to the shared_ptr to SnapshotImpl, assuming that flag access goes through runtimeFeatureEnabled() (*2).
2) Then, eliminate the contention that occurs when multiple worker threads frequently call runtimeFeatureEnabled().
(*1) https://github.com/envoyproxy/envoy/blob/1b9c688997f5/source/common/quic/platform/quiche_flags_impl.h#L22-L24
(*2) The flag kind I am talking about here is QUIC_FLAG, not QUIC_PROTOCOL_FLAG. I don't think QUIC_PROTOCOL_FLAG should be backed by Envoy runtime features, as it indicates neither newly introduced features nor deprecated ones. Protocol flags also do not seem to fit in with "%"-based Envoy features. Rather than runtime configuration, static/dynamic resources might be a choice for the protocol flags.
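One possible shape for step 2, sketched under assumptions and not taken from Envoy: each worker keeps its own reference to the current snapshot and refreshes it only when a generation counter changes, so the hot path is a single atomic load plus a plain read. `FlagSnapshot`, `SnapshotPublisher`, and `ThreadLocalFlagReader` are hypothetical names; Envoy's actual runtime and thread-local machinery may look quite different.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical snapshot type for illustration; not Envoy's SnapshotImpl.
struct FlagSnapshot {
  bool quic_flag_enabled;
};

// Publishes immutable snapshots and bumps a generation counter on each update.
class SnapshotPublisher {
public:
  explicit SnapshotPublisher(bool enabled)
      : current_(std::make_shared<const FlagSnapshot>(FlagSnapshot{enabled})) {}

  // Rare path: install a new snapshot, then advertise it via the generation.
  void publish(bool enabled) {
    auto next = std::make_shared<const FlagSnapshot>(FlagSnapshot{enabled});
    {
      std::lock_guard<std::mutex> lock(mutex_);
      current_ = std::move(next);
    }
    generation_.fetch_add(1, std::memory_order_release);
  }

  std::uint64_t generation() const {
    return generation_.load(std::memory_order_acquire);
  }

  // Slow path for readers: takes the lock and copies the shared_ptr.
  std::shared_ptr<const FlagSnapshot> load() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return current_;
  }

private:
  mutable std::mutex mutex_;
  std::shared_ptr<const FlagSnapshot> current_;
  std::atomic<std::uint64_t> generation_{0};
};

// One instance per worker thread. The cached shared_ptr is touched only by its
// owning thread, so the fast path does no refcount traffic.
class ThreadLocalFlagReader {
public:
  explicit ThreadLocalFlagReader(const SnapshotPublisher& publisher)
      : publisher_(publisher) {
    // Read the generation before the snapshot: a concurrent publish can then
    // only make the cache look stale, which the next read repairs.
    seen_generation_ = publisher_.generation();
    cached_ = publisher_.load();
  }

  bool enabled() {
    const std::uint64_t g = publisher_.generation(); // load only, no write
    if (g != seen_generation_) {                     // refresh after a publish
      cached_ = publisher_.load();
      seen_generation_ = g;
    }
    return cached_->quic_flag_enabled;
  }

private:
  const SnapshotPublisher& publisher_;
  std::shared_ptr<const FlagSnapshot> cached_;
  std::uint64_t seen_generation_ = 0;
};
```

In use, each worker would construct its own `ThreadLocalFlagReader` against one shared `SnapshotPublisher` and call `enabled()` on the datapath. The trade-off is that a worker may observe an update slightly later than the publish, which is usually acceptable for feature flags.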
Repro Steps
To verify the issue, we need to launch and stress Envoy's QUIC server in a multi-core environment.
The following is just a sample setup:
launch two AWS EC2 c6gn.16xlarge instances, one for a client (stresser), the other for an Envoy server.
on the server instance
build the latest Envoy
put a minimal config that makes Envoy run as a QUIC server wherever you want
# Assuming the server already started running with the config shown above,
# for example,
$ SERVER_IP=xxxxxxxxx
$ for i in `seq 1 100`; do
sudo docker run --rm -t --network=host localhost/h2load-quic \
-c $i -t $i -m 800 --warm-up-time=40s -D 40s \
--tls13-ciphers=TLS_AES_128_GCM_SHA256 \
--npn-list h3 --connect-to $SERVER_IP:443 \
https://127.0.0.1/f1m.dat; done
# From i=1 to about i=15, the total throughput increases almost linearly.
# From about i=20~30 up to i=80, the total throughput degrades drastically.
To-do lists
make the QUICHE QUIC_FLAG implementation backed by Envoy runtime features
eliminate the contention that runtimeFeatureEnabled() calls from every thread will induce