"Connection refused" when using keep-alive with gthread #1698

Closed · ddzialak opened this issue Feb 6, 2018 · 13 comments

ddzialak (Contributor) commented Feb 6, 2018

Gunicorn version: 19.7.1 (also tried master branch)

This bug can be reproduced with the attached files:

  • test_http_gunicorn_raw_keep-alive-requests.py (the test runner)
  • myapp.py (a Gunicorn app that should be in the same directory)

The test case starts this command:

gunicorn --worker-class gthread --workers 1 --threads 20 --keep-alive 9999 \
           --log-level DEBUG --bind localhost:7777 myapp:app

with logs redirected to /tmp/_test_gunicorn.out,
and then creates separate threads, each opening an HTTP session and sending 1000 requests within it.

In my case the Gunicorn server received the request /request/1096 and then reset the connection (see tcp.stream eq 10 in the attached gunicorn_reset_keep_alived_connection.pcapng).

NOTE: this is a race condition, so it may happen that all requests finish successfully (see myapp.py: without the time.sleep it almost never fails), but with that time.sleep it usually fails, and then the script outputs a line similar to:
[1] Failed request id=1096 with ('Connection aborted.', error(104, 'Connection reset by peer'))
This means that the thread sent the request GET /request/1096 and received "Connection reset by peer" (so the last request from that session that succeeded was GET /request/1095).
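
For illustration, the attached files have roughly this shape (a hypothetical sketch, not the attached code: the delay, thread count, and error handling here are all illustrative):

    # myapp.py (sketch): a minimal WSGI app; the short sleep widens the race window
    import time

    def app(environ, start_response):
        time.sleep(0.01)  # illustrative delay; the attached app uses a similar sleep
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]

    # test runner (sketch): several threads, each reusing one keep-alive session
    import threading
    import requests

    def run(thread_id, n=1000):
        session = requests.Session()  # one TCP connection, reused via keep-alive
        for i in range(n):
            try:
                session.get("http://localhost:7777/request/%d" % i)
            except requests.exceptions.ConnectionError as exc:
                print("[%d] Failed request id=%d with %s" % (thread_id, i, exc))
                return

    if __name__ == "__main__":
        threads = [threading.Thread(target=run, args=(t,)) for t in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()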

ddzialak (Contributor, Author) commented Feb 6, 2018

gunicorn_bug.zip

ddzialak (Contributor, Author) commented Feb 6, 2018

My guess: [NOT VALID - see comment below] (the relevant worker code is sketched after this list)

  • a request comes in, and a new TConn object is stored in self._keep
  • the request finishes, and partial(self.reuse_connection, conn) is registered in the poller
  • at the same time, another thread (e.g. the worker MainThread) starts murder_keepalived and removes the TConn from self._keep
  • data arrives, so gthread starts the reuse_connection function
  • the TConn object was never enqueued, so it becomes unreachable; Python releases it and the connection is closed.
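
For reference, the keep-alive path in the gthread worker looks roughly like this (a simplified paraphrase of the 19.7-era gunicorn/workers/gthread.py, not a verbatim excerpt):

    import time
    from collections import deque
    from threading import RLock

    class WorkerSketch(object):
        # illustrative stand-in for the gthread worker's relevant state
        def __init__(self):
            self._lock = RLock()
            self._keep = deque()  # kept-alive TConn objects, oldest first

        def reuse_connection(self, conn):
            # called when a kept-alive socket becomes readable again
            with self._lock:
                try:
                    self._keep.remove(conn)  # deque.remove() relies on ==
                except ValueError:
                    # the reaper already dropped it: give up, connection closes
                    return
            self.enqueue_req(conn)  # hand the pending request to a worker thread

        def murder_keepalived(self):
            # runs periodically in the main thread, reaping expired connections
            while True:
                with self._lock:  # the lock is taken per iteration, not once
                    try:
                        conn = self._keep.popleft()
                    except IndexError:
                        break
                if conn.timeout - time.time() > 0:
                    with self._lock:
                        self._keep.appendleft(conn)  # not expired yet, put it back
                    break
                conn.close()  # expired: close the client socket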

ddzialak (Contributor, Author) commented Feb 6, 2018

NOTE: I couldn't reproduce it after applying the patch in #1699.

tilgovi (Collaborator) commented Feb 12, 2018

I'm reading this and trying to reproduce it, and I wonder whether there is actually a bug here. The code for reuse_connection returns if the connection is not in the keep-alive list (_keep), terminating the connection. While Gunicorn could try to avoid this case, that does not eliminate the possibility that a keep-alive client starts sending a new request just as Gunicorn disconnects it. There will always be a window for error where the client begins to send a new request but Gunicorn has not received any bytes yet. Even if Gunicorn could guarantee that any socket with received bytes will never be closed by the keep-alive code, a connection could still be reset after a client starts to send.

I'm okay with making a change, but I want to check that our expectations are aligned. I don't think anything we do can guarantee that connections will not close with this error.

Finally, I have not yet reproduced the issue locally, but I think your analysis is correct.

ddzialak (Contributor, Author) commented Feb 12, 2018

@tilgovi you are right, Gunicorn cannot eliminate the possibility that a client sends a request just before the keep-alive timeout expires, but that would be a client-side error, and it's not the case here. Please see my testing script, which enlarges the keep-alive timeout.
If you can't reproduce it, your machine may have different characteristics; try tweaking the number of workers/threads or the number of requests (last line of test_http_gunicorn_raw_keep-alive_requests.py from the attached zip file).

This is a very rare error. In production we want to replace CherryPy with Gunicorn; we have around ten services, we run stress tests every night, and it happens only a few times per month. Our tests are configured to handle each error (in production the request would be retried, so the error wouldn't be visible). Still, this is a bug in gthread and should be fixed. Maybe gthread is not expected to be used in production(?) We can't disable keep-alive, because the many short HTTPS requests would consume too much CPU.
If you want, I can try changing murder_keepalived to take the lock only once and test it again, shall I?

benoitc (Owner) commented Feb 12, 2018

I doubt that the keepalive setting will have any impact there unless the OS is instructed to do it. With HTTP keep-alive, the next request is expected to come really quickly. Gunicorn doesn't do any TCP keep-alive (by design, as it makes it hard to detect client disconnections/resets).

I'm not quite sure the timing of events you describe is right, though. A connection is added to or removed from self._keep using a lock. So either it has been re-enqueued and won't be affected by the murder_keepalived function, or it has been removed by that function. If the connection dies anyway, shouldn't it be handled by the finish_request function?

ddzialak (Contributor, Author) commented Feb 12, 2018

I wasn't talking about TCP keep-alive; unfortunately both are named the same way, and I'm always referring to HTTP connection keep-alive. My test executes a series of queries without waiting for anything, so it should not receive 'Connection refused' in less than 30 seconds...

I've analyzed it a bit deeper, and I probably have to confirm your doubts, as the only function called from a thread other than the MainThread is finish_request. It turned out the problem can be reproduced only with Python 2. That's because TConn defines its __cmp__ function as __lt__, and that's a bug: depending on Python 2 vs. Python 3, a TConn object can compare equal to a different object just because of its timeout.
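
A minimal standalone demo of that comparison bug (mirroring the relevant part of TConn; run it under Python 2 to see the failure, since Python 3 ignores __cmp__ entirely):

    class TConnDemo(object):
        # mirrors the buggy comparison setup of gunicorn's TConn
        def __init__(self, timeout):
            self.timeout = timeout

        def __lt__(self, other):
            return self.timeout < other.timeout

        __cmp__ = __lt__  # the bug: __cmp__ must return -1/0/+1, not a bool

    a = TConnDemo(100)
    b = TConnDemo(100)

    # Python 2 falls back to __cmp__ for ==; __lt__ returns False here,
    # and False == 0 means "equal", so two distinct objects compare equal:
    print(a == b)        # True on Python 2, False on Python 3

    keep = [a, b]
    keep.remove(b)       # removes a, the first element comparing equal to b
    print(keep[0] is b)  # True: b is still in the list, a was removed instead

Since the worker's _keep.remove(conn) relies on ==, under Python 2 it can remove a different connection than the intended one, so a still-active connection can stay in _keep and be reaped later, which is consistent with the observed reset.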

tilgovi (Collaborator) commented Feb 12, 2018

I am using your script to test, and I saw the large keep-alive timeout. Again, I don't disagree with your analysis at all; I just want to set expectations for anyone reading this thread that such errors cannot be completely eliminated.

I will look at the code some more and figure out whether your patch or some other change makes sense.

ddzialak (Contributor, Author) commented Feb 12, 2018

Is this __cmp__ operator used anywhere?

tilgovi (Collaborator) commented Feb 12, 2018

Good find. I was testing on Python 3. I will look later today.

ddzialak (Contributor, Author) commented Feb 12, 2018

@tilgovi this test takes ~30 seconds; I can make it use 60 seconds if you wish. I've found the root cause; please read my previous comment. If you want to reproduce it, you must use Python 2 (this bug happens only on Python 2).

ddzialak pushed a commit to ddzialak/gunicorn that referenced this issue Feb 12, 2018
benoitc (Owner) commented Feb 12, 2018

Re "That's because TConn has defined __cmp__ function as __lt__ and that's a bug": I will check. This was done on purpose, if I remember correctly, probably due to the way the iterators work. I need to recheck, but good find, thanks!

benoitc (Owner) commented Feb 12, 2018

Note: I will also run a test using a client other than requests, such as curl. For me, ab with the -k setting just works, which is weird.

benoitc pushed a commit that referenced this issue Mar 1, 2018
mjjbell pushed a commit to mjjbell/gunicorn that referenced this issue Mar 16, 2018