Conversation

@saghul
Owner

@saghul saghul commented Aug 4, 2025

  • Use ares_queue_wait_empty to wait for queries to be complete before destruction
  • Make sure all queries are also cancelled on __del__
  • Start the destruction thread early, as soon as a channel is created

Fixes: aio-libs/aiodns#175
Fixes: #248

@saghul
Owner Author

saghul commented Aug 4, 2025

@bdraco Please take a look. I think ares_queue_wait_empty is what we wanted; I wasn't aware of it before.

I can no longer make the test script fail on macOS, but I don't have a Linux machine handy right now to give it a full test.

# https://github.com/c-ares/c-ares/blob/4f42928848e8b73d322b15ecbe3e8d753bf8734e/src/lib/ares_process.c#L1422
time.sleep(1.0)
# Wait for all queries to finish
_lib.ares_queue_wait_empty(channel[0], -1)
Owner Author


If we are here nothing can be holding on to a channel, so we should be able to safely wait for the cancelled queries to end. I don't see any option other than a c-ares internal bug breaking this.
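The drain-before-destroy idea is generic and worth seeing outside of cffi. Below is a minimal pure-Python sketch of the same pattern (hypothetical names, not the pycares internals): track in-flight queries and let teardown block until every one has completed or been cancelled, the way ares_queue_wait_empty(channel, -1) does.

```python
import threading

class QueryTracker:
    """Toy analogue of c-ares' query queue: count in-flight queries and
    let a destructor wait until all of them have finished."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight = 0

    def start_query(self):
        with self._cond:
            self._inflight += 1

    def finish_query(self):
        # Called from the completion (or cancellation) callback.
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def wait_empty(self, timeout=None):
        # Analogue of ares_queue_wait_empty(channel, -1): block until
        # no queries remain in flight.
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout)

tracker = QueryTracker()
tracker.start_query()
# Simulate a query completing a little later on another thread.
threading.Timer(0.05, tracker.finish_query).start()
assert tracker.wait_empty(timeout=2.0)
```

Because wait_for re-checks the predicate under the lock, there is no lost-wakeup race even if the query finishes before the waiter arrives.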

if local_dev:
    self.set_local_dev(local_dev)

# Ensure the shutdown thread is started
Owner Author


Rather than swallow the error in case we try to cleanup on interpreter shutdown, I chose to start the thread as soon as the channel is ready, since we are going to need it anyway.

@bdraco
Collaborator

bdraco commented Aug 4, 2025

I can no longer make the test script fail on macOS, but I don't have a Linux machine handy right now to give it a full test.

It's late here. I'll try to find some time to test tomorrow.

For Linux, this should work from macOS:

docker run --rm -it --entrypoint /bin/bash -v `pwd`:/tmp/pycares python:3.13.3

@saghul
Owner Author

saghul commented Aug 4, 2025

Yeah I thought of docker but wanted to avoid the extra layer of uncertainty :-P

@Vizonex
Contributor

Vizonex commented Aug 4, 2025

Seems there was a little hiccup on windows-11-arm 3.13.3; otherwise this looks pretty good. I think ares_queue_wait_empty is a good idea for channel cleanup. I wonder if we could also expose something like this so users can control whether an event thread is in use?

@bdraco
Collaborator

bdraco commented Aug 4, 2025

Tested on Mac, no issues.

Tested on Linux, I get a double free:

# gdb python3
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb) set args aiodns_bug.py
(gdb) run
Starting program: /usr/local/bin/python3 aiodns_bug.py
....
__pthread_kill_implementation (threadid=281472982380928, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) back
#0  __pthread_kill_implementation (threadid=281472982380928, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000ffff8aed0ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000ffff8ae8a72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000ffff8ae7747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000ffff8aec4aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff8afa6d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000ffff8aedaebc in malloc_printerr (str=str@entry=0xffff8afa1fb8 "free(): double free detected in tcache 2") at ./malloc/malloc.c:5660
#6  0x0000ffff8aedd0a8 in _int_free (av=0xffff8aff0af0 <main_arena>, p=p@entry=0xaaaac4225980, have_lock=have_lock@entry=0) at ./malloc/malloc.c:4469
#7  0x0000ffff8aedf77c in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3385
#8  0x0000ffff89e81f44 in ares_dnsrec_convert_cb (arg=0xaaaac4225990, status=ARES_ENOTFOUND, timeouts=0, dnsrec=<optimized out>) at deps/c-ares/src/lib/ares_search.c:425
#9  0x0000ffff89e81c40 in ares_query_dnsrec_cb (arg=0xaaaac42259d0, status=<optimized out>, timeouts=0, dnsrec=0xffff68021050) at deps/c-ares/src/lib/ares_query.c:56
#10 0x0000ffff89e80f60 in end_query (dnsrec=0xffff68021050, status=ARES_SUCCESS, query=0xaaaac4041370, server=0xaaaac4224b00, channel=0xaaaac4222c10)
    at deps/c-ares/src/lib/ares_process.c:1421
#11 process_answer (requeue=<optimized out>, now=<optimized out>, conn=<optimized out>, alen=<optimized out>, abuf=<optimized out>, channel=<optimized out>)
    at deps/c-ares/src/lib/ares_process.c:890
#12 read_answers (now=0xffff8920e7a0, conn=<optimized out>) at deps/c-ares/src/lib/ares_process.c:582
#13 process_read (now=0xffff8920e7a0, read_fd=<optimized out>, channel=0xaaaac4222c10) at deps/c-ares/src/lib/ares_process.c:644
#14 ares_process_fds_nolock (channel=channel@entry=0xaaaac4222c10, events=events@entry=0xffff8920e818, nevents=nevents@entry=1, flags=flags@entry=1)
    at deps/c-ares/src/lib/ares_process.c:226
#15 0x0000ffff89e810e0 in ares_process_fds_nolock (flags=1, nevents=1, events=0xffff8920e818, channel=0xaaaac4222c10) at deps/c-ares/src/lib/ares_process.c:200
#16 ares_process_fds (channel=0xaaaac4222c10, events=events@entry=0xffff8920e818, nevents=nevents@entry=1, flags=flags@entry=1) at deps/c-ares/src/lib/ares_process.c:258
#17 0x0000ffff89e8b5b0 in ares_event_thread_process_fd (e=<optimized out>, fd=<optimized out>, data=<optimized out>, flags=<optimized out>)
    at deps/c-ares/src/lib/event/ares_event_thread.c:204
#18 0x0000ffff89e8adfc in ares_evsys_epoll_wait (e=0xaaaac41fe2c0, timeout_ms=<optimized out>) at deps/c-ares/src/lib/event/ares_event_epoll.c:179
#19 0x0000ffff89e8b8c8 in ares_event_thread (arg=0xaaaac41fe2c0) at deps/c-ares/src/lib/event/ares_event_thread.c:347
#20 0x0000ffff8aeceea0 in start_thread (arg=0xffffdb7802a7) at ./nptl/pthread_create.c:442
#21 0x0000ffff8af37b1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

@bdraco
Collaborator

bdraco commented Aug 4, 2025

It's also reproducible in Docker on macOS.

@saghul
Owner Author

saghul commented Aug 4, 2025

Thanks for giving it a try; I'll test in Docker.

@saghul
Owner Author

saghul commented Aug 5, 2025

Surprisingly I'm not getting a double-free, but some form of heap corruption:

Thread 1 "python" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=277925466865696, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=277925466865696, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000fcc590a40ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000fcc5909fa72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000fcc5909e747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000fcc590a34aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xfcc590b16d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000fcc590a4aebc in malloc_printerr (str=str@entry=0xfcc590b11dc0 "corrupted double-linked list") at ./malloc/malloc.c:5660
#6  0x0000fcc590a4b6fc in unlink_chunk (p=p@entry=0xc023ffae6de0, av=0xfcc590b60af0 <main_arena>) at ./malloc/malloc.c:1629
#7  0x0000fcc590a4e240 in _int_malloc (av=av@entry=0xfcc590b60af0 <main_arena>, bytes=bytes@entry=48) at ./malloc/malloc.c:4303
#8  0x0000fcc590a4efd0 in __GI___libc_malloc (bytes=48) at ./malloc/malloc.c:3323
#9  0x0000fcc58f34e8a4 in ares_malloc_zero (size=size@entry=48) at deps/c-ares/src/lib/ares_library_init.c:86
#10 0x0000fcc58f369060 in ares_buf_create () at deps/c-ares/src/lib/str/ares_buf.c:48
#11 0x0000fcc58f369098 in ares_buf_create_const (
    data=data@entry=0xc023ffb97874 "files dns\nnetworks:       files\n\nprotocols:      db files\nservices:       db files\nethers:         db files\nrpc:", ' ' <repeats 12 times>, "db files\n\nnetgroup:       nis\n", data_len=data_len@entry=5) at deps/c-ares/src/lib/str/ares_buf.c:65
#12 0x0000fcc58f36a38c in ares_buf_split (arr=0xffffc81b4180, max_sections=0, flags=<optimized out>, delims_len=2, delims=0xfcc58f370970 " \t", buf=0xc023ffafa0b0)
    at deps/c-ares/src/lib/str/ares_buf.c:950
#13 ares_buf_split (buf=0xc023ffafa0b0, delims=0xfcc58f370970 " \t", delims_len=2, flags=<optimized out>, max_sections=0, arr=0xffffc81b4180)
    at deps/c-ares/src/lib/str/ares_buf.c:875
#14 0x0000fcc58f36a648 in ares_buf_split_str_array (buf=buf@entry=0xc023ffafa0b0, delims=delims@entry=0xfcc58f370970 " \t", delims_len=<optimized out>,
    flags=flags@entry=ARES_BUF_SPLIT_TRIM, max_sections=max_sections@entry=0, arr=arr@entry=0xffffc81b41c8) at deps/c-ares/src/lib/str/ares_buf.c:1003
#15 0x0000fcc58f36a79c in ares_buf_split_str (buf=buf@entry=0xc023ffafa0b0, delims=delims@entry=0xfcc58f370970 " \t", delims_len=<optimized out>,
    flags=flags@entry=ARES_BUF_SPLIT_TRIM, max_sections=max_sections@entry=0, strs=strs@entry=0xffffc81b4220, nstrs=nstrs@entry=0xffffc81b4228)
    at deps/c-ares/src/lib/str/ares_buf.c:1056
#16 0x0000fcc58f3550cc in config_lookup (sysconfig=sysconfig@entry=0xffffc81b4368, buf=0xc023ffafa0b0, separators=separators@entry=0xfcc58f370970 " \t")
    at deps/c-ares/src/lib/ares_sysconfig_files.c:333
#17 0x0000fcc58f355394 in parse_nsswitch_line (channel=<optimized out>, sysconfig=0xffffc81b4368, line=<optimized out>)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:666
#18 0x0000fcc58f356010 in ares_sysconfig_process_buf (channel=channel@entry=0xc023ffb9b410, sysconfig=sysconfig@entry=0xffffc81b4368, buf=buf@entry=0xc023ffa93930,
    cb=cb@entry=0xfcc58f355250 <parse_nsswitch_line>) at deps/c-ares/src/lib/ares_sysconfig_files.c:751
#19 0x0000fcc58f356184 in process_config_lines (cb=<optimized out>, sysconfig=0xffffc81b4368, filename=0xfcc58f370a80 "/etc/nsswitch.conf", channel=0xc023ffb9b410)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:788
#20 ares_init_sysconfig_files (channel=channel@entry=0xc023ffb9b410, sysconfig=sysconfig@entry=0xffffc81b4368, process_resolvconf=process_resolvconf@entry=ARES_TRUE)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:815
#21 0x0000fcc58f354ee0 in ares_init_by_sysconfig (channel=channel@entry=0xc023ffb9b410) at deps/c-ares/src/lib/ares_sysconfig.c:610
#22 0x0000fcc58f34e1fc in ares_init_options (optmask=4194304, options=0xc023ffb4dd70, channelptr=0xc023ffaa2d80) at deps/c-ares/src/lib/ares_init.c:331
#23 ares_init_options (channelptr=0xc023ffaa2d80, options=0xc023ffb4dd70, optmask=optmask@entry=4194304) at deps/c-ares/src/lib/ares_init.c:238

and

(gdb) thread 1
[Switching to thread 1 (Thread 0xf2b7d8b44020 (LWP 3858))]
#0  __pthread_kill_implementation (threadid=266871428628512, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=266871428628512, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000f2b7d82d0ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000f2b7d828a72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000f2b7d827747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000f2b7d82c4aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xf2b7d83a6d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000f2b7d82daebc in malloc_printerr (str=str@entry=0xf2b7d83a2668 "malloc(): unaligned tcache chunk detected") at ./malloc/malloc.c:5660
#6  0x0000f2b7d82df314 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3189
#7  __GI___libc_malloc (bytes=16) at ./malloc/malloc.c:3307
#8  0x0000f2b7d72f928c in ares_htable_strvp_create (val_free=val_free@entry=0x0) at deps/c-ares/src/lib/dsa/ares_htable_strvp.c:83
#9  0x0000f2b7d72f17c4 in ares_qcache_create (rand_state=0xc6aa07ff8130, max_ttl=3600, cache_out=cache_out@entry=0xc6aa0800a2e0)
    at deps/c-ares/src/lib/ares_qcache.c:219
#10 0x0000f2b7d72ee1f0 in ares_init_options (optmask=4194304, options=0xc6aa07ff2050, channelptr=0xc6aa07f0b430) at deps/c-ares/src/lib/ares_init.c:324
#11 ares_init_options (channelptr=0xc6aa07f0b430, options=0xc6aa07ff2050, optmask=optmask@entry=4194304) at deps/c-ares/src/lib/ares_init.c:238
#12 0x0000f2b7d72e5690 in _cffi_f_ares_init_options (self=<optimized out>, args=<optimized out>) at build/temp.linux-aarch64-cpython-313/_cares.c:2619

It always seems to happen when initializing a new channel, not when destroying a dying one.

My test script:

import sys
import asyncio
import ipaddress

import aiodns

# TOR exit relay without a PTR record. This POC uses a single IP in a loop, but the
# crash can also be caused by issuing multiple PTR requests that do not have a record.
ip = '38.135.24.32'

# IPs with valid PTR records do not reproduce the error in my testing
# ip = '8.8.8.8'

# The nameserver of choice does not seem to impact the output
nameservers = ['8.8.8.8']
#nameservers = []

async def lookup(valu):
    res = aiodns.DNSResolver(nameservers=nameservers)
    answ = None
    try:
        answ = await res.query(valu, 'PTR')
    except aiodns.error.DNSError as e:
        if e.args[0] != 4:  # Domain name not found
            raise
    return answ

async def main():
    queries = 0
    valu = ipaddress.ip_address(ip).reverse_pointer
    while True:
        queries += 1
        print(f"Sending query {queries}")
        task = asyncio.create_task(lookup(valu))
        await task

if __name__ == '__main__':
    sys.exit(asyncio.run(main()))

I usually need > 1k iterations to trigger it.

@saghul
Owner Author

saghul commented Aug 5, 2025

Update! I pushed a new commit and with it I can no longer make it crash! I suspect what might happen is that the channel gets closed by refcounting / close while in a query callback, and something inside c-ares doesn't quite like that.

Delaying cancellation until destruction seems like a reasonable compromise: we no longer need to sleep, so timing is not a problem, and we are guaranteed that no Python code is touching the channel anymore.

@bdraco
Collaborator

bdraco commented Aug 5, 2025

Tested. Cannot get it to crash now.

Small suggestion to avoid the lock churn, since it's pretty clear that users create/destroy these objects frequently.

diff --git a/src/pycares/__init__.py b/src/pycares/__init__.py
index 3255a85..f0476a1 100644
--- a/src/pycares/__init__.py
+++ b/src/pycares/__init__.py
@@ -340,7 +340,6 @@ class _ChannelShutdownManager:
         self._queue: SimpleQueue = SimpleQueue()
         self._thread: Optional[threading.Thread] = None
         self._start_lock = threading.Lock()
-        self._thread_started = False
 
     def _run_safe_shutdown_loop(self) -> None:
         """Process channel destruction requests from the queue."""
@@ -360,11 +359,13 @@ class _ChannelShutdownManager:
 
     def start(self) -> None:
         """Start the background thread if not already started."""
+        if self._thread is not None:
+            return   # Already started
         with self._start_lock:
-            if not self._thread_started:
-                self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
-                self._thread.start()
-                self._thread_started = True
+            if self._thread is not None:
+                return  # Started by another thread while waiting for the lock
+            self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
+            self._thread.start()
 
     def destroy_channel(self, channel) -> None:
         """

@saghul
Owner Author

saghul commented Aug 5, 2025

That makes it thread unsafe though.

The lock is not contended so it's essentially free performance-wise.

@bdraco
Collaborator

bdraco commented Aug 5, 2025

The actual thread creation and assignment happen inside the lock, preventing multiple threads from creating the thread.

Even if Thread A doesn't see the updated self._thread value in the first check due to memory visibility issues, it will definitely see it in the second check after acquiring the lock.

The worst-case scenario is that multiple threads unnecessarily acquire the lock, but only one will create the thread.

Python's threading module ensures proper memory barriers when acquiring/releasing locks.

Consider this scenario with two threads:

Thread A                        Thread B
--------                        --------
Check self._thread (None)       
                               Check self._thread (None)
                               Acquire lock
                               Check self._thread (None)
                               Create & start thread
                               self._thread = <thread>
                               Release lock
Acquire lock
Check self._thread (NOT None)
Return

The second check under the lock ensures Thread A sees the updated value and returns without creating a duplicate thread.
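The double-checked pattern described above is easy to exercise directly. Here is a self-contained sketch (simplified names, not the pycares source) that hammers start() from many threads and verifies that exactly one worker thread gets created:

```python
import threading

class LazyThreadStarter:
    def __init__(self):
        self._thread = None
        self._start_lock = threading.Lock()
        self.creations = 0  # instrumentation: how many threads we created

    def start(self):
        if self._thread is not None:
            return  # fast path: already started, no lock needed
        with self._start_lock:
            if self._thread is not None:
                return  # another thread won the race while we waited
            self.creations += 1
            self._thread = threading.Thread(target=lambda: None, daemon=True)
            self._thread.start()

starter = LazyThreadStarter()
racers = [threading.Thread(target=starter.start) for _ in range(32)]
for t in racers:
    t.start()
for t in racers:
    t.join()
assert starter.creations == 1
```

The second check is what makes the lock-free first check safe: any thread that raced past it still has to re-verify under the lock before creating anything.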

@bdraco
Collaborator

bdraco commented Aug 5, 2025

Even uncontended locks aren't free - they require syscalls (futex on Linux, pthread_mutex on macOS) which involve kernel transitions.
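For the curious, the relative cost is easy to measure rather than argue about; a quick timeit sketch comparing an uncontended lock acquisition against a plain attribute check (numbers vary by platform and interpreter; note that on Linux an uncontended lock typically stays on the futex user-space fast path without a syscall):

```python
import threading
import timeit

lock = threading.Lock()
flag = None

def with_lock():
    # Acquire and release an uncontended lock.
    with lock:
        pass

def plain_check():
    # The lock-free fast path: a single attribute/None comparison.
    if flag is not None:
        return

t_lock = timeit.timeit(with_lock, number=100_000)
t_check = timeit.timeit(plain_check, number=100_000)
print(f"lock: {t_lock:.4f}s  check: {t_check:.4f}s")
```

Either way, the early return only pays the lock cost on the very first call per process, so both variants are cheap in practice.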

@saghul
Copy link
Owner Author

saghul commented Aug 5, 2025

Ah, I hadn't seen you check twice. Sounds good, I'll apply the fix and create a new release.

- Use ares_queue_wait_empty to wait for queries to be complete before
  destruction
- Make sure NO queries are cancelled as side effects on __del__
- Start the destruction thread early, as soon as a channel is created

Cancelling pending queries while in a query callback seemingly causes
heap corruption or double-free bugs, so delay the operation until no
Python code is using the channel anymore, that is, in the destructor
thread.

Fixes: aio-libs/aiodns#175
Fixes: #248
@saghul saghul merged commit c3c931f into master Aug 5, 2025
29 checks passed
@saghul saghul deleted the fix-fd-leaks branch August 5, 2025 19:56
@bdraco
Collaborator

bdraco commented Aug 5, 2025

Thanks!


Development

Successfully merging this pull request may close these issues.

File Descriptor Leak in Pycares v4.9.0 During Asynchronous HTTP Requests
inotify leaks
