Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible endless wait in stop() after AUTH_FAILED error #688

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

azat
Copy link
Contributor

@azat azat commented Dec 28, 2022

In case of AUTH_FAILED in the zk-loop thread it will call client._session_callback which will reset the queue.

However another thread can add to this queue CloseInstance event, and if the _session_callback() will be called after CloseInstance was added to the queue, then stop() will never return (and zk-loop will endlessly spin).

Here is how it looks like with addititional logging:

39: [ Thread-3 (zk_loop) ] INFO: client.py:568: _session_callback: Zookeeper session closed, state: AUTH_FAILED
39: [ MainThread ] Level 5: client.py:721: stop: Sending CloseInstance
39: [ Thread-3 (zk_loop) ] Level 5: client.py:403: _reset: Reseting the client
39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting
39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting

You can find details in this gist 1.

@azat azat mentioned this pull request Dec 28, 2022
azat added a commit to azat/ClickHouse that referenced this pull request Dec 28, 2022
Refs: python-zk/kazoo#688
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@a-ungurianu
Copy link
Member

Hey!

This makes sense in theory. Could you rebase in master so the tests rerun. Would appreciate a review from someone else, as I'm not super familiar with the lifetime cycle of the client

@azat
Copy link
Contributor Author

azat commented Feb 1, 2023

Rebased, but now it requires workflow approval

@codecov
Copy link

codecov bot commented Feb 2, 2023

Codecov Report

Base: 94.65% // Head: 94.70% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (8b660fb) compared to base (92b071d).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head 8b660fb differs from pull request most recent head cd4a2b2. Consider uploading reports for the commit cd4a2b2 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #688      +/-   ##
==========================================
+ Coverage   94.65%   94.70%   +0.04%     
==========================================
  Files          57       57              
  Lines        8338     8340       +2     
==========================================
+ Hits         7892     7898       +6     
+ Misses        446      442       -4     
Impacted Files Coverage Δ
kazoo/protocol/connection.py 96.07% <100.00%> (-0.62%) ⬇️
kazoo/tests/test_client.py 98.80% <100.00%> (+<0.01%) ⬆️
kazoo/tests/test_gevent_handler.py 77.46% <0.00%> (-2.12%) ⬇️
kazoo/handlers/eventlet.py 99.01% <0.00%> (-0.99%) ⬇️
kazoo/testing/harness.py 97.52% <0.00%> (-0.83%) ⬇️
kazoo/handlers/utils.py 94.06% <0.00%> (-0.46%) ⬇️
kazoo/tests/test_utils.py 93.33% <0.00%> (ø)
kazoo/tests/test__connection.py 98.14% <0.00%> (ø)
kazoo/tests/test_partitioner.py 98.69% <0.00%> (ø)
... and 6 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@azat
Copy link
Contributor Author

azat commented Feb 2, 2023

Looks like all failures are unrelated, since the last error in the log is:

==> Uploading
    .url https://codecov.io/
    .query commit=d9a20a766ba942b5722ee36a1210e62fbb0ae9f8&branch=HEAD&package=py2.1.12
    Gzipping contents..
    Compressed contents to 18443 bytes
    Pinging Codecov...
Error: Could not determine repo and owner
Tip: See all example repositories: https://github.com/codecov?query=example
.pkg: _exit> python /opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
  py310-gevent-eventlet-sasl: FAIL code 1 (171.97=setup[19.90]+cmd[152.06] seconds)
  codecov: OK (14.40=setup[13.73]+cmd[0.66] seconds)
  evaluation failed :( (187.83 seconds)
Error: Process completed with exit code 255.

@StephenSorriaux
Copy link
Member

Yes you are right, we have some flaky tests...

@azat
Copy link
Contributor Author

azat commented Feb 7, 2023

So, any news/feedback on this one?

Copy link
Member

@jeffwidman jeffwidman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like Alex, this seems reasonable to me, but I'm not super familiar with this section of code.

I'm okay merging, but @ceache if you do get a chance I'd appreciate a sanity review on this since IIRC you run kazoo in prod scenarios that include auth... so may be more familiar with this.

@jeffwidman
Copy link
Member

@azat can you check the "allow edits from maintainers" box? I can't rebase to merge w/o that... or you can simply rebase it yourself, up to you.

In case of AUTH_FAILED in the zk-loop thread it will call
client._session_callback which will reset the queue.

However another thread can add to this queue CloseInstance event, and
if the _session_callback() will be called after CloseInstance was added
to the queue, then stop() will never return (and zk-loop will endlessly
spin).

Here is how it looks like with addititional logging:

    39: [ Thread-3 (zk_loop) ] INFO: client.py:568: _session_callback: Zookeeper session closed, state: AUTH_FAILED
    39: [ MainThread ] Level 5: client.py:721: stop: Sending CloseInstance
    39: [ Thread-3 (zk_loop) ] Level 5: client.py:403: _reset: Reseting the client
    39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting
    39: [ Thread-3 (zk_loop) ] Level 5: connection.py:625: _connect_attempt: Connecting

You can find details in this gist [1].

  [1]: https://gist.github.com/azat/bc7aaea1c32a4f1ea75ad646d26280e9
@jeffwidman jeffwidman merged commit 5225b3e into python-zk:master Feb 8, 2023
@azat
Copy link
Contributor Author

azat commented Feb 8, 2023

@jeffwidman I don't have Allow edits from mainters box due to the github issues for pull requests from organizations, and I completely forgot about this, next time will keep it in mind (maybe will do this from my main account).

Thanks!

@jeffwidman
Copy link
Member

Gotcha, I didn't even realize that was a functionality gap as I always contribute from my personal GitHub... makes sense.

jeblair added a commit to jeblair/kazoo that referenced this pull request Feb 22, 2024
Run testauth.py to test the behavior described in
python-zk#688
jeblair added a commit to jeblair/kazoo that referenced this pull request Feb 22, 2024
ceache pushed a commit to jeblair/kazoo that referenced this pull request Feb 27, 2024
jeblair added a commit to jeblair/kazoo that referenced this pull request Mar 1, 2024
…LED error (python-zk#688)"

This reverts commit 5225b3e.

The commit being reverted here caused kazoo not to empty the send
queue before disconnecting. This means that if a client submitted
asynchronous requests and then called client.stop(), the connection
would be closed immediately, usually after only one (but possibly
more) of the submitted requests were sent. Prior to this, Kazoo
would empty the queue of submitted requests all the way up to and
including the Close request when client.stop() was called.

Another area where this caused problems is in a busy multi-threaded
system. One thread might decide to gracefully close the connection,
but if there is any traffic generated by another thread, then the
connection would end up terminating without ever sending the Close
request.

Failure to gracefully shutdown a ZooKeeper connection can mean that
other system components need to wait for ephemeral node timeouts to
detect that a component has shutdown.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants