-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix possible endless wait in stop() after AUTH_FAILED error #688
Conversation
Refs: python-zk/kazoo#688 Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Hey! This makes sense in theory. Could you rebase in master so the tests rerun. Would appreciate a review from someone else, as I'm not super familiar with the lifetime cycle of the client |
0b09da2
to
8b660fb
Compare
Rebased, but now it requires workflow approval |
Codecov ReportBase: 94.65% // Head: 94.70% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #688 +/- ##
==========================================
+ Coverage 94.65% 94.70% +0.04%
==========================================
Files 57 57
Lines 8338 8340 +2
==========================================
+ Hits 7892 7898 +6
+ Misses 446 442 -4
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Looks like all failures are unrelated, since the last error in the log is:
|
Yes you are right, we have some flaky tests... |
So, any news/feedback on this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like Alex, this seems reasonable to me, but I'm not super familiar with this section of code.
I'm okay merging, but @ceache if you do get a chance I'd appreciate a sanity review on this since IIRC you run kazoo in prod scenarios that include auth... so may be more familiar with this.
@azat can you check the "allow edits from maintainers" box? I can't rebase to merge w/o that... or you can simply rebase it yourself, up to you. |
In case of AUTH_FAILED in the zk-loop thread it will call client._session_callback which will reset the queue. However another thread can add to this queue CloseInstance event, and if the _session_callback() will be called after CloseInstance was added to the queue, then stop() will never return (and zk-loop will endlessly spin). Here is how it looks like with addititional logging: 39: [ Thread-3 (zk_loop) ] INFO: client.py:568: _session_callback: Zookeeper session closed, state: AUTH_FAILED 39: [ MainThread ] Level 5: client.py:721: stop: Sending CloseInstance 39: [ Thread-3 (zk_loop) ] Level 5: client.py:403: _reset: Reseting the client 39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting 39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting You can find details in this gist [1]. [1]: https://gist.github.com/azat/bc7aaea1c32a4f1ea75ad646d26280e9
8b660fb
to
cd4a2b2
Compare
@jeffwidman I don't have Thanks! |
Gotcha, I didn't even realize that was a functionality gap as I always contribute from my personal GitHub... makes sense. |
Run testauth.py to test the behavior described in python-zk#688
…ython-zk#688)" This reverts commit 5225b3e.
…ython-zk#688)" This reverts commit 5225b3e.
…LED error (python-zk#688)" This reverts commit 5225b3e. The commit being reverted here caused kazoo not to empty the send queue before disconnecting. This means that if a client submitted asynchronous requests and then called client.stop(), the connection would be closed immediately, usually after only one (but possibly more) of the submitted requests were sent. Prior to this, Kazoo would empty the queue of submitted requests all the way up to and including the Close request when client.stop() was called. Another area where this caused problems is in a busy multi-threaded system. One thread might decide to gracefully close the connection, but if there is any traffic generated by another thread, then the connection would end up terminating without ever sending the Close request. Failure to gracefully shutdown a ZooKeeper connection can mean that other system components need to wait for ephemeral node timeouts to detect that a component has shutdown.
In case of AUTH_FAILED in the zk-loop thread it will call client._session_callback which will reset the queue.
However another thread can add to this queue CloseInstance event, and if the _session_callback() will be called after CloseInstance was added to the queue, then stop() will never return (and zk-loop will endlessly spin).
Here is how it looks like with addititional logging:
You can find details in this gist 1.