
fix https-> ConcurrentMaxStream bug #13985

Closed

Conversation

YANGJINJUE

When etcd enables HTTPS, many requests get blocked, because etcd's server.go uses the default MaxConcurrentStreams of 250.

Fix: I set MaxConcurrentStreams: math.MaxUint32 and it works well; we have deployed the fixed etcd in production.
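A minimal sketch of the kind of change described above (not the exact PR diff; the wiring into etcd's serve.go is simplified): when etcd terminates TLS, gRPC traffic is served through net/http's HTTP/2 server, so the stream limit has to be raised there via golang.org/x/net/http2.

```go
package sketch

import (
	"math"
	"net/http"

	"golang.org/x/net/http2"
)

// configureHTTP2 lifts net/http's default limit of 250 concurrent
// HTTP/2 streams per connection on the given server.
func configureHTTP2(srv *http.Server) error {
	return http2.ConfigureServer(srv, &http2.Server{
		MaxConcurrentStreams: math.MaxUint32,
	})
}
```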

Member

@serathius serathius left a comment


Have you tried other etcd versions to verify that the issue was introduced in v3.5.3 and not in some other version? On the surface I don't see any reason why v3.5.3 would have introduced this limitation.

Have you investigated why so many requests are blocked? I would first like to understand whether any change in the client/server causes us to require this many concurrent streams.

I don't think this is a proper fix, as it doesn't identify the root cause, but tries to fix a symptom.

@YANGJINJUE YANGJINJUE changed the title V3.5.3 fix https-> ConcurrentMaxStream bug fix https-> ConcurrentMaxStream bug Apr 25, 2022
@ptabor
Contributor

ptabor commented Apr 25, 2022

An unbounded number of connections is likely to cause:

  • OOM errors, as each connection comes with memory overhead.
  • depletion of networking ports (there are only 63k).

Please use the 'debug' interface (pprof) or SIGSEGV to check what code the connections are getting stuck on.
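For example (a hedged sketch; it assumes etcd was started with --enable-pprof and listens on 127.0.0.1:2379), the goroutine profile shows where each connection's handler is blocked:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// Dump the stack of every goroutine in the etcd process; blocked request
// handlers show up here with the exact line they are stuck on.
func main() {
	resp, err := http.Get("http://127.0.0.1:2379/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```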

@YANGJINJUE
Author

YANGJINJUE commented Apr 25, 2022

If you disable HTTPS, it works fine.
If you enable HTTPS and send many requests to the etcd cluster with the Go etcd client SDK, the requests take a long time. This problem exists in many etcd versions.

Example:
go-etcd-client-sdk:
go.etcd.io/etcd/api/v3 v3.5.2
go.etcd.io/etcd/client/v3 v3.5.2
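A rough reproduction sketch under those assumptions (endpoint and TLS settings are placeholders): clientv3 multiplexes requests over a single HTTP/2 connection, so keeping more than 250 RPCs in flight at once should make the excess requests queue behind the server's stream limit.

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
		TLS:         &tls.Config{InsecureSkipVerify: true}, // demo only
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Keep ~400 unary RPCs in flight; each in-flight RPC occupies one
	// HTTP/2 stream, and streams beyond the server's limit are queued.
	for i := 0; i < 400; i++ {
		go func() {
			for {
				start := time.Now()
				ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
				_, err := cli.Get(ctx, "foo")
				cancel()
				if err != nil || time.Since(start) > time.Second {
					fmt.Println("slow or failed request:", time.Since(start), err)
				}
			}
		}()
	}
	select {} // run until interrupted
}
```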

@codecov-commenter

Codecov Report

Merging #13985 (4ad859d) into main (b00bb53) will decrease coverage by 0.26%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main   #13985      +/-   ##
==========================================
- Coverage   74.99%   74.73%   -0.27%     
==========================================
  Files         447      447              
  Lines       37166    37170       +4     
==========================================
- Hits        27874    27779      -95     
- Misses       7523     7599      +76     
- Partials     1769     1792      +23     
| Flag | Coverage Δ |
|---|---|
| all | 74.73% <50.00%> (-0.27%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| server/embed/serve.go | 61.88% <50.00%> (-1.14%) ⬇️ |
| server/proxy/grpcproxy/register.go | 79.06% <0.00%> (-11.63%) ⬇️ |
| server/auth/simple_token.go | 80.00% <0.00%> (-8.47%) ⬇️ |
| client/pkg/v3/fileutil/purge.go | 66.03% <0.00%> (-7.55%) ⬇️ |
| client/v3/namespace/watch.go | 87.87% <0.00%> (-6.07%) ⬇️ |
| server/etcdserver/api/v3rpc/watch.go | 82.55% <0.00%> (-5.37%) ⬇️ |
| raft/rafttest/node.go | 95.00% <0.00%> (-5.00%) ⬇️ |
| server/etcdserver/api/v3rpc/util.go | 70.96% <0.00%> (-3.23%) ⬇️ |
| server/storage/mvcc/watchable_store.go | 89.85% <0.00%> (-2.90%) ⬇️ |
| client/v3/experimental/recipes/key.go | 75.34% <0.00%> (-2.74%) ⬇️ |
| ... and 16 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b00bb53...4ad859d.

@dojiong
Contributor

dojiong commented Apr 25, 2022

Unlimited gRPC streams were first introduced in #8238, but the setting doesn't take effect on HTTP/2 connections when TLS is enabled.
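For reference, the #8238 change amounts to roughly the following on the gRPC side (a sketch, not the literal diff): it lifts gRPC's own per-connection stream cap, but when TLS traffic is routed through net/http's HTTP/2 server, that server's separate 250-stream default still applies.

```go
package sketch

import (
	"math"

	"google.golang.org/grpc"
)

// newGRPCServer removes gRPC's own limit on concurrent streams per
// connection; this option never reaches net/http's HTTP/2 server.
func newGRPCServer() *grpc.Server {
	return grpc.NewServer(grpc.MaxConcurrentStreams(math.MaxUint32))
}
```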

@ahrtr
Member

ahrtr commented Apr 26, 2022

Comments:

  1. Please update the commit message using English so that all contributors can understand it;
  2. Please convince us or clarify why #8238 doesn't work when HTTP/2/TLS is enabled.

It would be helpful if grpc/grpc-go experts could provide comments here. @menghanl @iamqizhao @dfawley

@ptabor
Contributor

ptabor commented Apr 26, 2022

Comments:

  1. Please update the commit message using English so that all contributors can understand it;
  2. Please convince us or clarify why #8238 doesn't work when HTTP/2/TLS is enabled.

I assume it's because of our layout: we have httpserver -> cmux -> grpc (see the sketch after this comment).
grpc is configured to allow an unlimited number of connections, but the http server sticks to its defaults.

What I don't understand is:

  1. What makes TLS special? The code path for insecure connections also doesn't set the max concurrent connections.
    Do you mean that the plain http server is:
    a) not capping the connections at 100?
    b) capping, but not causing problems, as establishing a new http connection is lightweight vs. an expensive TLS handshake?
    c) [not really tested for the plain http scenario]?

  2. Does it mean that currently 101 clients cannot establish 101 watch streams (each client 1 watch stream)?
    I assume it's not a problem for k8s, as it has O(3) api-servers, each creating a lot of watch streams over the same connection... but it might be an issue for other etcd applications.

My recommendations, on top of Benjamin's:

  • let's expose this as an explicit flag that applies to http, https & grpc alike (with a default of 100).
  • let's have a test that shows that it works (e.g. when the flag is 5, a 6th independent watch connection cannot be established).
  • let's have some measurement of the memory impact of the number of connections, e.g. Prometheus graphs showing a load test with a growing number of connections versus RSS memory usage, to be able to document the expected memory cost for high values of the flag.
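For context, the layering at the start of this comment looks roughly like this (a simplified sketch, not etcd's actual serve.go): cmux splits one listener between the gRPC server and the HTTP server, and the HTTP side keeps net/http's defaults.

```go
package sketch

import (
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

// serve splits a single listener between gRPC and HTTP traffic.
func serve(l net.Listener, gs *grpc.Server, hs *http.Server) error {
	m := cmux.New(l)
	// gRPC traffic is recognized by its HTTP/2 content-type header.
	grpcL := m.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
	// Everything else goes to the HTTP server, which keeps net/http's
	// defaults, including the 250-stream HTTP/2 cap.
	httpL := m.Match(cmux.Any())
	go gs.Serve(grpcL)
	go hs.Serve(httpL)
	return m.Serve()
}
```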

@ahrtr
Member

ahrtr commented Apr 26, 2022

  1. What makes TLS special? The code path for insecure connections also doesn't set the max concurrent connections.

The reason is that etcd creates a new goroutine for the gRPC server in insecure mode. See serve.go#L129

@ptabor
Contributor

ptabor commented Apr 28, 2022

  1. What makes TLS special? The code path for insecure connections also doesn't set the max concurrent connections.

The reason is that etcd creates a new goroutine for the gRPC server in insecure mode. See serve.go#L129

The serve method is executed once per listening port, so it creates O(opened ports) goroutines in the code path you mentioned. I don't see (but I'm likely missing something) how these few additional goroutines could be a fundamental change for >200 connections.

@ahrtr
Member

ahrtr commented May 6, 2022

The serve method is executed once per listening port, so it creates O(opened ports) goroutines in the code path you mentioned. I don't see (but I'm likely missing something) how these few additional goroutines could be a fundamental change for >200 connections.

The point is that the gRPC server can receive client requests directly when etcd creates the dedicated goroutines I pointed out at serve.go#L129; in other words, the traffic doesn't go through the http.Server any more, and accordingly it is not impacted by the MaxConcurrentStreams setting of http2.
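A hedged sketch of that contrast (simplified; grpcHandlerFunc is a reconstruction of the routing helper, not etcd's exact code):

```go
package sketch

import (
	"crypto/tls"
	"net"
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// Insecure path: gRPC serves its listener directly in a dedicated
// goroutine, so net/http's HTTP/2 stream limit never applies.
func serveInsecure(gs *grpc.Server, l net.Listener) error {
	return gs.Serve(l)
}

// TLS path: gRPC requests are dispatched as an http.Handler, so they pass
// through net/http's HTTP/2 server and its MaxConcurrentStreams default.
func serveTLS(gs *grpc.Server, mux http.Handler, l net.Listener, cfg *tls.Config) error {
	srv := &http.Server{Handler: grpcHandlerFunc(gs, mux), TLSConfig: cfg}
	return srv.Serve(tls.NewListener(l, cfg))
}

func grpcHandlerFunc(gs *grpc.Server, other http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			gs.ServeHTTP(w, r) // handled inside net/http's HTTP/2 server
			return
		}
		other.ServeHTTP(w, r)
	})
}
```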

@ahrtr
Member

ahrtr commented May 6, 2022

@YANGJINJUE could you update this PR per above comments?

@ahrtr
Member

ahrtr commented May 15, 2022

@YANGJINJUE Please update this PR per comments issuecomment-1109346863 and issuecomment-1109512194

@ahrtr
Member

ahrtr commented May 27, 2022

@YANGJINJUE Please resolve the following comments:

  1. Squash the commits (refer to git-squash) and change the commit message to English;
  2. Add an integration or E2E test to verify the change. We can address it in a separate PR.

@Battlefield-Pony

Sorry, but I don't have a clue about the 250 connections. Where is the 250 coming from?

@ahrtr
Member

ahrtr commented May 28, 2022

FYI.
server.go;l=170
server.go;l=58
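Paraphrasing what those two links point at in golang.org/x/net/http2/server.go (exact line numbers drift between versions): when MaxConcurrentStreams is left unset, the HTTP/2 server falls back to a hard-coded default.

```go
package sketch

// From golang.org/x/net/http2 (paraphrased):
const defaultMaxStreams = 250

// If Server.MaxConcurrentStreams is zero, the server advertises
// defaultMaxStreams to its peer in the HTTP/2 SETTINGS frame.
func maxStreams(configured uint32) uint32 {
	if configured > 0 {
		return configured
	}
	return defaultMaxStreams
}
```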

@leslie-tsang

@YANGJINJUE @ahrtr Hello there. We have tested this PR and confirmed that it solves our problem.

It seems issue/14065 can be solved as well. xD

It would be nice if this PR could be merged as soon as possible. :)

@YANGJINJUE
Author

YANGJINJUE commented May 30, 2022

How do I reopen the pull request? See the new pull request: fix https server MaxConcurrentStream
