Refactor channel destruction logic #253
Conversation
@bdraco Please take a look. I think I can no longer make the test script fail on macOS, but I don't have a Linux machine handy right now to give it a full test.
```python
# https://github.com/c-ares/c-ares/blob/4f42928848e8b73d322b15ecbe3e8d753bf8734e/src/lib/ares_process.c#L1422
time.sleep(1.0)
# Wait for all queries to finish
_lib.ares_queue_wait_empty(channel[0], -1)
```
If we are here, nothing can be holding on to the channel, so we should be able to safely wait for the cancelled queries to end. I don't see anything other than a c-ares internal bug that could break this.
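For reference, a minimal sketch of the destruction order being described here, not the exact PR code; it assumes pycares's cffi handle `_lib` and the `channel` pointer-to-pointer from the snippet above:

```python
# Sketch only: assumes `_lib` (pycares's cffi binding to c-ares) and
# `channel` (a pointer-to-pointer) as used in the snippet above.
def _safe_destroy(channel):
    _lib.ares_cancel(channel[0])                # pending queries complete with ARES_ECANCELLED
    _lib.ares_queue_wait_empty(channel[0], -1)  # block until no queries are in flight
    _lib.ares_destroy(channel[0])               # nothing can race with destruction now
```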
```python
if local_dev:
    self.set_local_dev(local_dev)

# Ensure the shutdown thread is started
```
Rather than swallow the error in case we try to clean up during interpreter shutdown, I chose to start the thread as soon as the channel is ready, since we are going to need it anyway.
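As a hypothetical illustration of that ordering (a simplified sketch; the class name mirrors the `_ChannelShutdownManager` in the diff further down, but the PR's exact wiring may differ):

```python
import threading
from queue import SimpleQueue
from typing import Optional

class _ChannelShutdownManager:  # simplified stand-in, not the PR's exact code
    def __init__(self) -> None:
        self._queue: SimpleQueue = SimpleQueue()
        self._thread: Optional[threading.Thread] = None
        self._start_lock = threading.Lock()

    def start(self) -> None:
        # Idempotent: only the first caller actually creates the thread.
        with self._start_lock:
            if self._thread is None:
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()

    def _run(self) -> None:
        while True:
            channel = self._queue.get()  # block until a channel needs destroying
            # ... safely destroy `channel` here ...

_manager = _ChannelShutdownManager()

class Channel:
    def __init__(self) -> None:
        # Start the worker while the interpreter is fully alive; starting it
        # lazily from __del__ can raise RuntimeError during interpreter shutdown.
        _manager.start()
```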
It's late here. I'll try to find some time to test tomorrow. For Linux, this should work on macOS:
Yeah, I thought of Docker but wanted to avoid the extra layer of uncertainty :-P
Seems there was a little hiccup on windows-11-arm 3.13.3, but otherwise I think this looks pretty good.
Tested on Mac, no issues. Tested on Linux, I get a double free.
It's also reproducible in Docker on macOS.
Thanks for giving it a try, I'll test on Docker.
Surprisingly, I'm not getting a double-free but some form of heap corruption, and it always seems to happen when initializing a new channel, not when destroying a dying one. With my test script I usually need > 1k iterations to trigger it.
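The script itself isn't included in the thread; below is a hypothetical sketch of the kind of create/query/destroy loop being described (all names and parameters here are assumptions, not the author's actual script):

```python
import pycares

def cb(result, errorno):
    pass  # only the create/destroy churn matters here, not the answers

for i in range(5000):  # the author reports needing > 1k iterations
    channel = pycares.Channel(timeout=1.0, tries=1)
    channel.query("example.com", pycares.QUERY_TYPE_A, cb)
    del channel  # drop the channel with the query still in flight
```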
Update! I pushed a new commit and with it I can no longer make it crash! I suspect what might be happening is that the channel gets closed by refcounting/close while inside a query callback, and something inside c-ares doesn't quite like that. Delaying cancellation until destruction seems like a reasonable compromise: we no longer need to sleep, so timing is not a problem, and we are guaranteed that no Python code is touching the channel anymore.
Tested. Cannot get it to crash now. Small suggestion to avoid the lock churn, since it's pretty clear that users create/destroy the objects frequently:

```diff
diff --git a/src/pycares/__init__.py b/src/pycares/__init__.py
index 3255a85..f0476a1 100644
--- a/src/pycares/__init__.py
+++ b/src/pycares/__init__.py
@@ -340,7 +340,6 @@ class _ChannelShutdownManager:
         self._queue: SimpleQueue = SimpleQueue()
         self._thread: Optional[threading.Thread] = None
         self._start_lock = threading.Lock()
-        self._thread_started = False
 
     def _run_safe_shutdown_loop(self) -> None:
         """Process channel destruction requests from the queue."""
@@ -360,11 +359,13 @@ class _ChannelShutdownManager:
 
     def start(self) -> None:
         """Start the background thread if not already started."""
+        if self._thread is not None:
+            return  # Already started
         with self._start_lock:
-            if not self._thread_started:
-                self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
-                self._thread.start()
-                self._thread_started = True
+            if self._thread is not None:
+                return  # Started by another thread while waiting for the lock
+            self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
+            self._thread.start()
 
     def destroy_channel(self, channel) -> None:
         """
```
That makes it thread-unsafe though. The lock is not contended, so it's essentially free performance-wise.
The actual thread creation and assignment happen inside the lock, preventing multiple threads from creating the thread. Even if Thread A doesn't see the updated `self._thread` in the unlocked fast-path check, the worst-case scenario is that multiple threads unnecessarily acquire the lock, but only one will create the thread. Python's threading module ensures proper memory barriers when acquiring/releasing locks.

Consider this scenario with two threads:

1. Thread A checks `self._thread` outside the lock and sees `None`.
2. Thread B also sees `None`, wins the race for the lock, creates the thread, and assigns it.
3. Thread A then acquires the lock; the second check under the lock ensures Thread A sees the updated value and returns without creating a duplicate thread.
Even uncontended locks aren't free: they require syscalls (futex on Linux, pthread_mutex on macOS) which involve kernel transitions.
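The cost being debated is easy to measure directly; here is a self-contained timing sketch for the uncontended case (results vary by platform and Python version):

```python
# Rough measurement of the uncontended-lock overhead discussed above.
import threading
import timeit

lock = threading.Lock()

def with_lock():
    with lock:
        pass

def bare():
    pass

print("with lock:", timeit.timeit(with_lock, number=1_000_000))
print("bare call:", timeit.timeit(bare, number=1_000_000))
```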
Ah, I hadn't seen that you check twice. Sounds good, I'll apply the fix and create a new release.
- Use ares_queue_wait_empty to wait for queries to be complete before destruction
- Make sure NO queries are cancelled as a side effect of __del__
- Start the destruction thread early, as soon as a channel is created

Cancelling pending queries while in a query callback seemingly causes heap corruption or double-free bugs, so delay the operation until no Python code is using the channel anymore, that is, run it on the destructor thread.

Fixes: aio-libs/aiodns#175
Fixes: #248
Thanks! |