"Connection refused" when using keep-alive with gthread #1698

Closed · ddzialak opened this issue Feb 6, 2018 · 13 comments

ddzialak (Contributor) commented Feb 6, 2018

Gunicorn version: 19.7.1 (also tried master branch)

This bug can be reproduced with the attached files:

  • test_http_gunicorn_raw_keep-alive-requests.py (the test runner)
  • myapp.py (a Gunicorn app that should be in the same directory)

The test case starts this command:

gunicorn --worker-class gthread --workers 1 --threads 20 --keep-alive 9999 \
           --log-level DEBUG --bind localhost:7777 myapp:app

with logs redirected to /tmp/_test_gunicorn.out,
and then creates separate threads, each opening an HTTP session and sending 1000 requests within it.

In my case the Gunicorn server received the request /request/1096 and then reset the connection (see tcp.stream eq 10 in the attached gunicorn_reset_keep_alived_connection.pcapng).

NOTE: this is a race condition, so it may happen that all requests finish successfully (see myapp.py: without the time.sleep it almost never fails), but with that time.sleep it usually fails, and then the script outputs a line similar to:
[1] Failed request id=1096 with ('Connection aborted.', error(104, 'Connection reset by peer'))
This means that the thread sent the request GET /request/1096 and received "Connection reset by peer" (so the last request from that session that succeeded was GET /request/1095).
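
For illustration, the attached files have roughly this shape (a hypothetical sketch, not the attached code: the delay, thread count, and error handling here are all illustrative):

    # myapp.py (sketch): a minimal WSGI app; the short sleep widens the race window
    import time

    def app(environ, start_response):
        time.sleep(0.01)  # illustrative delay; the attached app uses a similar sleep
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]

    # test runner (sketch): several threads, each reusing one keep-alive session
    import threading
    import requests

    def run(thread_id, n=1000):
        session = requests.Session()  # one TCP connection, reused via keep-alive
        for i in range(n):
            try:
                session.get("http://localhost:7777/request/%d" % i)
            except requests.exceptions.ConnectionError as exc:
                print("[%d] Failed request id=%d with %s" % (thread_id, i, exc))
                return

    if __name__ == "__main__":
        threads = [threading.Thread(target=run, args=(t,)) for t in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()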

ddzialak (Contributor, Author) commented Feb 6, 2018

gunicorn_bug.zip

ddzialak (Contributor, Author) commented Feb 6, 2018

My guess: [NOT VALID - see comment below] (the relevant worker code is sketched after this list)

  • a request comes in, and a new TConn object is stored in self._keep
  • the request finishes, and partial(self.reuse_connection, conn) is registered in the poller
  • at the same time, another thread (e.g. the worker MainThread) starts murder_keepalived and removes the TConn from self._keep
  • data arrives, so gthread starts the reuse_connection function
  • the TConn object was never enqueued, so it becomes unreachable; Python releases it and the connection is closed.
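
For reference, the keep-alive path in the gthread worker looks roughly like this (a simplified paraphrase of the 19.7-era gunicorn/workers/gthread.py, not a verbatim excerpt):

    import time
    from collections import deque
    from threading import RLock

    class WorkerSketch(object):
        # illustrative stand-in for the gthread worker's relevant state
        def __init__(self):
            self._lock = RLock()
            self._keep = deque()  # kept-alive TConn objects, oldest first

        def reuse_connection(self, conn):
            # called when a kept-alive socket becomes readable again
            with self._lock:
                try:
                    self._keep.remove(conn)  # deque.remove() relies on ==
                except ValueError:
                    # the reaper already dropped it: give up, connection closes
                    return
            self.enqueue_req(conn)  # hand the pending request to a worker thread

        def murder_keepalived(self):
            # runs periodically in the main thread, reaping expired connections
            while True:
                with self._lock:  # the lock is taken per iteration, not once
                    try:
                        conn = self._keep.popleft()
                    except IndexError:
                        break
                if conn.timeout - time.time() > 0:
                    with self._lock:
                        self._keep.appendleft(conn)  # not expired yet, put it back
                    break
                conn.close()  # expired: close the client socket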

ddzialak (Contributor, Author) commented Feb 6, 2018

NOTE: I couldn't reproduce it after applying the patch in #1699.

tilgovi (Collaborator) commented Feb 12, 2018

I'm reading this and trying to reproduce it, and I wonder whether there is actually a bug here. The code for reuse_connection returns if the connection is not in the keep-alive list (_keep), terminating the connection. While Gunicorn could try to avoid this case, that does not eliminate the possibility that a keep-alive client starts sending a new request just as Gunicorn disconnects it. There will always be a window for error where the client begins to send a new request but Gunicorn has not received any bytes yet. Even if Gunicorn could guarantee that any socket with received bytes will never be closed by the keep-alive code, a connection could still be reset after a client starts to send.

I'm okay with making a change, but I want to check that our expectations are aligned. I don't think anything we do can guarantee that connections will not close with this error.

Finally, I have not yet reproduced the issue locally, but I think your analysis is correct.

ddzialak (Contributor, Author) commented Feb 12, 2018

@tilgovi you are right, Gunicorn cannot eliminate the possibility that a client sends a request just before the keep-alive timeout expires, but that would be a client-side error, and it's not the case here. Please see my testing script, which enlarges the keep-alive timeout.
If you can't reproduce it, your machine may have different characteristics; try tweaking the number of workers/threads or the number of requests (last line of test_http_gunicorn_raw_keep-alive_requests.py from the attached zip file).

This is a very rare error. In production we want to replace CherryPy with Gunicorn; we have around ten services, we run stress tests every night, and it happens only a few times per month. Our tests are configured to handle each error (in production the request would be retried, so the error wouldn't be visible). Still, this is a bug in gthread and should be fixed. Maybe gthread is not expected to be used in production(?) We can't disable keep-alive, because the many short HTTPS requests would consume too much CPU.
If you want, I can try changing murder_keepalived to take the lock only once and test it again, shall I?

benoitc (Owner) commented Feb 12, 2018

I doubt that the keepalive setting will have any impact there unless the OS is instructed to do it. With HTTP keep-alive, the next request is expected to come really quickly. Gunicorn doesn't do any TCP keep-alive (by design, as it makes it hard to detect client disconnections/resets).

I'm not quite sure the timing of events you describe is right, though. A connection is added to or removed from self._keep using a lock. So either it has been re-enqueued and won't be affected by the murder_keepalived function, or it has been removed by that function. If the connection dies anyway, shouldn't it be handled by the finish_request function?

ddzialak (Contributor, Author) commented Feb 12, 2018

I wasn't talking about TCP keep-alive; unfortunately both are named the same way, and I'm always referring to HTTP connection keep-alive. My test executes a series of queries without waiting for anything, so it should not receive 'Connection refused' in less than 30 seconds...

I've analyzed it a bit deeper, and I probably have to confirm your doubts, as the only function called from a thread other than the MainThread is finish_request. It turned out the problem can be reproduced only with Python 2. That's because TConn defines its __cmp__ function as __lt__, and that's a bug: depending on Python 2 vs. Python 3, a TConn object can compare equal to a different object just because of its timeout.
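
A minimal standalone demo of that comparison bug (mirroring the relevant part of TConn; run it under Python 2 to see the failure, since Python 3 ignores __cmp__ entirely):

    class TConnDemo(object):
        # mirrors the buggy comparison setup of gunicorn's TConn
        def __init__(self, timeout):
            self.timeout = timeout

        def __lt__(self, other):
            return self.timeout < other.timeout

        __cmp__ = __lt__  # the bug: __cmp__ must return -1/0/+1, not a bool

    a = TConnDemo(100)
    b = TConnDemo(100)

    # Python 2 falls back to __cmp__ for ==; __lt__ returns False here,
    # and False == 0 means "equal", so two distinct objects compare equal:
    print(a == b)        # True on Python 2, False on Python 3

    keep = [a, b]
    keep.remove(b)       # removes a, the first element comparing equal to b
    print(keep[0] is b)  # True: b is still in the list, a was removed instead

Since the worker's _keep.remove(conn) relies on ==, under Python 2 it can remove a different connection than the intended one, so a still-active connection can stay in _keep and be reaped later, which is consistent with the observed reset.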

tilgovi (Collaborator) commented Feb 12, 2018

I am using your script to test, and I saw the large keep-alive timeout. Again, I don't disagree with your analysis at all; I just want to set expectations for anyone reading this thread that such errors cannot be completely eliminated.

I will look at the code some more and figure out whether your patch or some other change makes sense.

ddzialak (Contributor, Author) commented Feb 12, 2018

Is this __cmp__ operator used anywhere?

tilgovi (Collaborator) commented Feb 12, 2018

Good find. I was testing on Python 3. I will look later today.

ddzialak (Contributor, Author) commented Feb 12, 2018

@tilgovi this test takes ~30 seconds; I can make it use 60 seconds if you wish. I've found the root cause; please read my previous comment. If you want to reproduce it, you must use Python 2 (this bug happens only on Python 2).

ddzialak pushed a commit to ddzialak/gunicorn that referenced this issue Feb 12, 2018
benoitc (Owner) commented Feb 12, 2018

Re "That's because TConn has defined __cmp__ function as __lt__ and that's a bug": I will check. This was done on purpose, if I remember correctly, probably due to the way the iterators work. I need to recheck, but good find, thanks!

benoitc (Owner) commented Feb 12, 2018

Note: I will also run a test using a client other than requests, such as curl. For me, ab with the -k setting just works, which is weird.

benoitc pushed a commit that referenced this issue Mar 1, 2018
mjjbell pushed a commit to mjjbell/gunicorn that referenced this issue Mar 16, 2018