
perf: sharding endpoints in the cluster across worker threads #15052

Open · lambdai opened this issue Feb 15, 2021 · 3 comments

Labels: area/cluster_manager, area/perf, design proposal, help wanted

Comments

lambdai (Contributor) commented Feb 15, 2021

Goal: reduce the number of connections to the upstream servers.

Background:
For each active cluster, Envoy maintains at least one upstream connection per worker, per endpoint in the cluster, per connection pool setting.

Usually the number of connection pool settings is determined by cluster attributes. It can be treated as a constant, so let's leave it aside.

For a large EDS cluster, it's wasteful for every worker to maintain a connection to each backend endpoint. In a 16-worker Envoy process, if each worker thread connected only to its dedicated 1/16 of a 1000-endpoint cluster, the number of established connections would drop from 16K to 1K in the scenarios below (the rough arithmetic is sketched right after the list):

  1. Normal- to high-load HTTP/2 upstreams, or even heavy-load streaming gRPC upstreams.
  2. Light-load HTTP/1 upstreams.
  3. Light-load TCP upstreams when preconnect is enabled on the cluster.
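
To make the arithmetic behind the 16K → 1K claim explicit, here is a trivial standalone sketch (not Envoy code; the numbers simply restate the hypothetical 16-worker, 1000-endpoint example above, with the pool count held at 1):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Illustrative numbers from the example above; the connection pool
  // setting count is treated as a constant of 1, as noted earlier.
  const uint64_t workers = 16;
  const uint64_t endpoints = 1000;
  const uint64_t pools_per_endpoint = 1;

  // Today: every worker keeps at least one connection to every endpoint.
  const uint64_t current_total = workers * endpoints * pools_per_endpoint; // 16000

  // Sharded: each worker connects only to its ~1/N slice, so the
  // process-wide total is roughly one connection per endpoint.
  const uint64_t sharded_total = endpoints * pools_per_endpoint; // 1000

  std::cout << "current: " << current_total
            << ", sharded: " << sharded_total << "\n";
  return 0;
}
```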

Note that an N-worker Envoy doesn't need to use exactly 1/N of the endpoints per worker:

  1. As a reverse proxy, Envoy usually runs with many replicas for high availability. In that situation, each worker thread could use fewer than 1/N of the endpoints and let the other Envoy replicas balance the connections.

  2. It's also fine to establish connections to more than 1/N of the endpoints. Envoy still benefits right up until it degrades to the current full worker-to-endpoint connection graph.

Pros: saves memory by keeping fewer idle connections.
Cons: load imbalance could be amplified.

Originally posted by @mattklein123 in #8702 (comment)

lambdai (Contributor, Author) commented Feb 16, 2021

Depends on #14663.
This issue can be resolved if we can replace the client_id with the worker_id in the hash seed.
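
One way to read this idea (a minimal sketch, not Envoy's actual connection pool or hashing code; the function name, the modulo scheme, and the string-keyed endpoints are all assumptions): each worker derives its own slice of the endpoint set deterministically from the worker_id rather than from a per-pool client_id, so the assignment is stable across workers without any coordination.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: pick the endpoints a given worker should connect to
// by hashing each endpoint address and comparing against the worker_id,
// instead of hashing a per-pool client_id. This is not Envoy's real API.
std::vector<std::string> endpointsForWorker(const std::vector<std::string>& all_endpoints,
                                            uint32_t worker_id, uint32_t num_workers) {
  std::vector<std::string> shard;
  for (const std::string& endpoint : all_endpoints) {
    // A stable per-process seed could also be mixed in here; what matters is
    // that the assignment depends on the worker, not on a per-connection id.
    const uint64_t hash = std::hash<std::string>{}(endpoint);
    if (hash % num_workers == worker_id) {
      shard.push_back(endpoint);
    }
  }
  return shard;
}
```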

mattklein123 added the area/cluster_manager, area/perf, design proposal, and help wanted labels on Feb 16, 2021
mattklein123 (Member) commented

I think this is an interesting idea and something we should think about, but it's not clear to me that we should invest in functionality like this in Envoy vs. building subsetting into the control plane (especially if we do it in xds-relay which can be shared by everyone).

Also, I think that to get the full memory benefits we would also want to consider actually sharding the hosts on each worker vs. just the connections, though I agree that hashing just the connections would be an interesting place to start (and per your last point perhaps we could do this with a new LB and keep it completely separate from core).
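
For the "new LB, completely separate from core" angle, a very rough shape of such a policy might look like the following (hypothetical types only; these are not Envoy's real host or LB interfaces). The wrapper restricts selection to the worker's precomputed shard and degrades to the full host set rather than failing, which matches the "benefit until it degrades to the current full connection graph" point above.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host type; Envoy's real host and LB abstractions are richer.
struct Host {
  std::string address;
};

// Sketch of a worker-sharded LB wrapper: round-robin within this worker's
// shard, falling back to the full host set if the shard is empty so behavior
// degrades to today's full worker-to-endpoint graph instead of failing.
class WorkerShardedLb {
public:
  WorkerShardedLb(std::vector<Host> shard, std::vector<Host> all_hosts)
      : shard_(std::move(shard)), all_hosts_(std::move(all_hosts)) {}

  const Host* chooseHost() {
    const std::vector<Host>& pool = shard_.empty() ? all_hosts_ : shard_;
    if (pool.empty()) {
      return nullptr;
    }
    return &pool[next_++ % pool.size()];
  }

private:
  std::vector<Host> shard_;
  std::vector<Host> all_hosts_;
  uint64_t next_{0};
};
```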

jnt0r commented Jul 29, 2024

Is there any update? Would love to see this.
