Conversation

@saghul
Owner

@saghul saghul commented Aug 4, 2025

  • Use ares_queue_wait_empty to wait for queries to be complete before destruction
  • Make sure all queries are also cancelled on __del__
  • Start the destruction thread early, as soon as a channel is created

Fixes: aio-libs/aiodns#175
Fixes: #248

@saghul
Owner Author

saghul commented Aug 4, 2025

@bdraco Please take a look. I think ares_queue_wait_empty is what we wanted; I wasn't aware of it before.

I can no longer make the test script fail on macOS, but I don't have a Linux machine handy right now to give it a full test.

# https://github.com/c-ares/c-ares/blob/4f42928848e8b73d322b15ecbe3e8d753bf8734e/src/lib/ares_process.c#L1422
time.sleep(1.0)
# Wait for all queries to finish
_lib.ares_queue_wait_empty(channel[0], -1)
Owner Author


If we are here nothing can be holding on to a channel, so we should be able to safely wait for the cancelled queries to end. I don't see any option other than a c-ares internal bug breaking this.
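The drain-before-destroy idea is generic and worth seeing outside of cffi. Below is a minimal pure-Python sketch of the same pattern (hypothetical names, not the pycares internals): track in-flight queries and let teardown block until every one has completed or been cancelled, the way ares_queue_wait_empty(channel, -1) does.

```python
import threading

class QueryTracker:
    """Toy analogue of c-ares' query queue: count in-flight queries and
    let a destructor wait until all of them have finished."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inflight = 0

    def start_query(self):
        with self._cond:
            self._inflight += 1

    def finish_query(self):
        # Called from the completion (or cancellation) callback.
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def wait_empty(self, timeout=None):
        # Analogue of ares_queue_wait_empty(channel, -1): block until
        # no queries remain in flight.
        with self._cond:
            return self._cond.wait_for(lambda: self._inflight == 0, timeout)

tracker = QueryTracker()
tracker.start_query()
# Simulate a query completing a little later on another thread.
threading.Timer(0.05, tracker.finish_query).start()
assert tracker.wait_empty(timeout=2.0)
```

Because wait_for re-checks the predicate under the lock, there is no lost-wakeup race even if the query finishes before the waiter arrives.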

if local_dev:
    self.set_local_dev(local_dev)

# Ensure the shutdown thread is started
Owner Author


Rather than swallow the error in case we try to cleanup on interpreter shutdown, I chose to start the thread as soon as the channel is ready, since we are going to need it anyway.

@bdraco
Collaborator

bdraco commented Aug 4, 2025

I can no longer make the test script fail on macOS, but I don't have a Linux machine handy right now to give it a full test.

It's late here. I'll try to find some time to test tomorrow.

For Linux, this should work from macOS:

docker run --rm -it --entrypoint /bin/bash -v `pwd`:/tmp/pycares python:3.13.3

@saghul
Owner Author

saghul commented Aug 4, 2025

Yeah I thought of docker but wanted to avoid the extra layer of uncertainty :-P

@Vizonex
Contributor

Vizonex commented Aug 4, 2025

Seems there was a little hiccup on windows-11-arm 3.13.3; otherwise this looks pretty good. I think ares_queue_wait_empty is a good idea for channel cleanup. I wonder if we could also expose something like this so users can control whether an event thread is in use?

@bdraco
Collaborator

bdraco commented Aug 4, 2025

Tested on Mac, no issues.

Tested on Linux, I get a double free:

# gdb python3
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(gdb) set args aiodns_bug.py
(gdb) run
Starting program: /usr/local/bin/python3 aiodns_bug.py
....
__pthread_kill_implementation (threadid=281472982380928, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) back
#0  __pthread_kill_implementation (threadid=281472982380928, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000ffff8aed0ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000ffff8ae8a72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000ffff8ae7747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000ffff8aec4aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xffff8afa6d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000ffff8aedaebc in malloc_printerr (str=str@entry=0xffff8afa1fb8 "free(): double free detected in tcache 2") at ./malloc/malloc.c:5660
#6  0x0000ffff8aedd0a8 in _int_free (av=0xffff8aff0af0 <main_arena>, p=p@entry=0xaaaac4225980, have_lock=have_lock@entry=0) at ./malloc/malloc.c:4469
#7  0x0000ffff8aedf77c in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3385
#8  0x0000ffff89e81f44 in ares_dnsrec_convert_cb (arg=0xaaaac4225990, status=ARES_ENOTFOUND, timeouts=0, dnsrec=<optimized out>) at deps/c-ares/src/lib/ares_search.c:425
#9  0x0000ffff89e81c40 in ares_query_dnsrec_cb (arg=0xaaaac42259d0, status=<optimized out>, timeouts=0, dnsrec=0xffff68021050) at deps/c-ares/src/lib/ares_query.c:56
#10 0x0000ffff89e80f60 in end_query (dnsrec=0xffff68021050, status=ARES_SUCCESS, query=0xaaaac4041370, server=0xaaaac4224b00, channel=0xaaaac4222c10)
    at deps/c-ares/src/lib/ares_process.c:1421
#11 process_answer (requeue=<optimized out>, now=<optimized out>, conn=<optimized out>, alen=<optimized out>, abuf=<optimized out>, channel=<optimized out>)
    at deps/c-ares/src/lib/ares_process.c:890
#12 read_answers (now=0xffff8920e7a0, conn=<optimized out>) at deps/c-ares/src/lib/ares_process.c:582
#13 process_read (now=0xffff8920e7a0, read_fd=<optimized out>, channel=0xaaaac4222c10) at deps/c-ares/src/lib/ares_process.c:644
#14 ares_process_fds_nolock (channel=channel@entry=0xaaaac4222c10, events=events@entry=0xffff8920e818, nevents=nevents@entry=1, flags=flags@entry=1)
    at deps/c-ares/src/lib/ares_process.c:226
#15 0x0000ffff89e810e0 in ares_process_fds_nolock (flags=1, nevents=1, events=0xffff8920e818, channel=0xaaaac4222c10) at deps/c-ares/src/lib/ares_process.c:200
#16 ares_process_fds (channel=0xaaaac4222c10, events=events@entry=0xffff8920e818, nevents=nevents@entry=1, flags=flags@entry=1) at deps/c-ares/src/lib/ares_process.c:258
#17 0x0000ffff89e8b5b0 in ares_event_thread_process_fd (e=<optimized out>, fd=<optimized out>, data=<optimized out>, flags=<optimized out>)
    at deps/c-ares/src/lib/event/ares_event_thread.c:204
#18 0x0000ffff89e8adfc in ares_evsys_epoll_wait (e=0xaaaac41fe2c0, timeout_ms=<optimized out>) at deps/c-ares/src/lib/event/ares_event_epoll.c:179
#19 0x0000ffff89e8b8c8 in ares_event_thread (arg=0xaaaac41fe2c0) at deps/c-ares/src/lib/event/ares_event_thread.c:347
#20 0x0000ffff8aeceea0 in start_thread (arg=0xffffdb7802a7) at ./nptl/pthread_create.c:442
#21 0x0000ffff8af37b1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

@bdraco
Collaborator

bdraco commented Aug 4, 2025

It's also reproducible in Docker on macOS.

@saghul
Owner Author

saghul commented Aug 4, 2025

Thanks for giving it a try; I'll test in Docker.

@saghul
Owner Author

saghul commented Aug 5, 2025

Surprisingly I'm not getting a double-free, but some form of heap corruption:

Thread 1 "python" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=277925466865696, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=277925466865696, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000fcc590a40ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000fcc5909fa72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000fcc5909e747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000fcc590a34aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xfcc590b16d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000fcc590a4aebc in malloc_printerr (str=str@entry=0xfcc590b11dc0 "corrupted double-linked list") at ./malloc/malloc.c:5660
#6  0x0000fcc590a4b6fc in unlink_chunk (p=p@entry=0xc023ffae6de0, av=0xfcc590b60af0 <main_arena>) at ./malloc/malloc.c:1629
#7  0x0000fcc590a4e240 in _int_malloc (av=av@entry=0xfcc590b60af0 <main_arena>, bytes=bytes@entry=48) at ./malloc/malloc.c:4303
#8  0x0000fcc590a4efd0 in __GI___libc_malloc (bytes=48) at ./malloc/malloc.c:3323
#9  0x0000fcc58f34e8a4 in ares_malloc_zero (size=size@entry=48) at deps/c-ares/src/lib/ares_library_init.c:86
#10 0x0000fcc58f369060 in ares_buf_create () at deps/c-ares/src/lib/str/ares_buf.c:48
#11 0x0000fcc58f369098 in ares_buf_create_const (
    data=data@entry=0xc023ffb97874 "files dns\nnetworks:       files\n\nprotocols:      db files\nservices:       db files\nethers:         db files\nrpc:", ' ' <repeats 12 times>, "db files\n\nnetgroup:       nis\n", data_len=data_len@entry=5) at deps/c-ares/src/lib/str/ares_buf.c:65
#12 0x0000fcc58f36a38c in ares_buf_split (arr=0xffffc81b4180, max_sections=0, flags=<optimized out>, delims_len=2, delims=0xfcc58f370970 " \t", buf=0xc023ffafa0b0)
    at deps/c-ares/src/lib/str/ares_buf.c:950
#13 ares_buf_split (buf=0xc023ffafa0b0, delims=0xfcc58f370970 " \t", delims_len=2, flags=<optimized out>, max_sections=0, arr=0xffffc81b4180)
    at deps/c-ares/src/lib/str/ares_buf.c:875
#14 0x0000fcc58f36a648 in ares_buf_split_str_array (buf=buf@entry=0xc023ffafa0b0, delims=delims@entry=0xfcc58f370970 " \t", delims_len=<optimized out>,
    flags=flags@entry=ARES_BUF_SPLIT_TRIM, max_sections=max_sections@entry=0, arr=arr@entry=0xffffc81b41c8) at deps/c-ares/src/lib/str/ares_buf.c:1003
#15 0x0000fcc58f36a79c in ares_buf_split_str (buf=buf@entry=0xc023ffafa0b0, delims=delims@entry=0xfcc58f370970 " \t", delims_len=<optimized out>,
    flags=flags@entry=ARES_BUF_SPLIT_TRIM, max_sections=max_sections@entry=0, strs=strs@entry=0xffffc81b4220, nstrs=nstrs@entry=0xffffc81b4228)
    at deps/c-ares/src/lib/str/ares_buf.c:1056
#16 0x0000fcc58f3550cc in config_lookup (sysconfig=sysconfig@entry=0xffffc81b4368, buf=0xc023ffafa0b0, separators=separators@entry=0xfcc58f370970 " \t")
    at deps/c-ares/src/lib/ares_sysconfig_files.c:333
#17 0x0000fcc58f355394 in parse_nsswitch_line (channel=<optimized out>, sysconfig=0xffffc81b4368, line=<optimized out>)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:666
#18 0x0000fcc58f356010 in ares_sysconfig_process_buf (channel=channel@entry=0xc023ffb9b410, sysconfig=sysconfig@entry=0xffffc81b4368, buf=buf@entry=0xc023ffa93930,
    cb=cb@entry=0xfcc58f355250 <parse_nsswitch_line>) at deps/c-ares/src/lib/ares_sysconfig_files.c:751
#19 0x0000fcc58f356184 in process_config_lines (cb=<optimized out>, sysconfig=0xffffc81b4368, filename=0xfcc58f370a80 "/etc/nsswitch.conf", channel=0xc023ffb9b410)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:788
#20 ares_init_sysconfig_files (channel=channel@entry=0xc023ffb9b410, sysconfig=sysconfig@entry=0xffffc81b4368, process_resolvconf=process_resolvconf@entry=ARES_TRUE)
    at deps/c-ares/src/lib/ares_sysconfig_files.c:815
#21 0x0000fcc58f354ee0 in ares_init_by_sysconfig (channel=channel@entry=0xc023ffb9b410) at deps/c-ares/src/lib/ares_sysconfig.c:610
#22 0x0000fcc58f34e1fc in ares_init_options (optmask=4194304, options=0xc023ffb4dd70, channelptr=0xc023ffaa2d80) at deps/c-ares/src/lib/ares_init.c:331
#23 ares_init_options (channelptr=0xc023ffaa2d80, options=0xc023ffb4dd70, optmask=optmask@entry=4194304) at deps/c-ares/src/lib/ares_init.c:238

and

(gdb) thread 1
[Switching to thread 1 (Thread 0xf2b7d8b44020 (LWP 3858))]
#0  __pthread_kill_implementation (threadid=266871428628512, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=266871428628512, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000f2b7d82d0ab4 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000f2b7d828a72c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000f2b7d827747c in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000f2b7d82c4aac in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xf2b7d83a6d28 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x0000f2b7d82daebc in malloc_printerr (str=str@entry=0xf2b7d83a2668 "malloc(): unaligned tcache chunk detected") at ./malloc/malloc.c:5660
#6  0x0000f2b7d82df314 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3189
#7  __GI___libc_malloc (bytes=16) at ./malloc/malloc.c:3307
#8  0x0000f2b7d72f928c in ares_htable_strvp_create (val_free=val_free@entry=0x0) at deps/c-ares/src/lib/dsa/ares_htable_strvp.c:83
#9  0x0000f2b7d72f17c4 in ares_qcache_create (rand_state=0xc6aa07ff8130, max_ttl=3600, cache_out=cache_out@entry=0xc6aa0800a2e0)
    at deps/c-ares/src/lib/ares_qcache.c:219
#10 0x0000f2b7d72ee1f0 in ares_init_options (optmask=4194304, options=0xc6aa07ff2050, channelptr=0xc6aa07f0b430) at deps/c-ares/src/lib/ares_init.c:324
#11 ares_init_options (channelptr=0xc6aa07f0b430, options=0xc6aa07ff2050, optmask=optmask@entry=4194304) at deps/c-ares/src/lib/ares_init.c:238
#12 0x0000f2b7d72e5690 in _cffi_f_ares_init_options (self=<optimized out>, args=<optimized out>) at build/temp.linux-aarch64-cpython-313/_cares.c:2619

It always seems to happen when initializing a new channel, not when destroying a dying one.

My test script:

import sys
import asyncio
import ipaddress

import aiodns

# TOR exit relay without a PTR record. This POC uses a single IP in a loop, but the
# crash can also be caused by issuing multiple PTR requests that do not have a record.
ip = '38.135.24.32'

# IPs with valid PTR records do not reproduce the error in my testing
# ip = '8.8.8.8'

# The nameserver of choice does not seem to impact the output
nameservers = ['8.8.8.8']
#nameservers = []

async def lookup(valu):
    res = aiodns.DNSResolver(nameservers=nameservers)
    answ = None
    try:
        answ = await res.query(valu, 'PTR')
    except aiodns.error.DNSError as e:
        if e.args[0] != 4:  # Domain name not found
            raise
    return answ

async def main():
    queries = 0
    valu = ipaddress.ip_address(ip).reverse_pointer
    while True:
        queries += 1
        print(f"Sending query {queries}")
        task = asyncio.create_task(lookup(valu))
        await task

if __name__ == '__main__':
    sys.exit(asyncio.run(main()))

I usually need > 1k iterations to trigger it.

@saghul
Owner Author

saghul commented Aug 5, 2025

Update! I pushed a new commit and with it I can no longer make it crash! I suspect what might happen is that the channel gets closed by refcounting / close while in a query callback, and something inside c-ares doesn't quite like that.

Delaying cancellation until destruction seems like a reasonable compromise: we no longer need to sleep, so timing is not a problem, and we are guaranteed that no Python code is touching the channel anymore.

@bdraco
Collaborator

bdraco commented Aug 5, 2025

Tested. Cannot get it to crash now.

Small suggestion to avoid the lock churn, since it's pretty clear that users create/destroy these objects frequently.

diff --git a/src/pycares/__init__.py b/src/pycares/__init__.py
index 3255a85..f0476a1 100644
--- a/src/pycares/__init__.py
+++ b/src/pycares/__init__.py
@@ -340,7 +340,6 @@ class _ChannelShutdownManager:
         self._queue: SimpleQueue = SimpleQueue()
         self._thread: Optional[threading.Thread] = None
         self._start_lock = threading.Lock()
-        self._thread_started = False
 
     def _run_safe_shutdown_loop(self) -> None:
         """Process channel destruction requests from the queue."""
@@ -360,11 +359,13 @@ class _ChannelShutdownManager:
 
     def start(self) -> None:
         """Start the background thread if not already started."""
+        if self._thread is not None:
+            return   # Already started
         with self._start_lock:
-            if not self._thread_started:
-                self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
-                self._thread.start()
-                self._thread_started = True
+            if self._thread is not None:
+                return  # Started by another thread while waiting for the lock
+            self._thread = threading.Thread(target=self._run_safe_shutdown_loop, daemon=True)
+            self._thread.start()
 
     def destroy_channel(self, channel) -> None:
         """

@saghul
Owner Author

saghul commented Aug 5, 2025

That makes it thread unsafe though.

The lock is not contended so it's essentially free performance-wise.

@bdraco
Collaborator

bdraco commented Aug 5, 2025

The actual thread creation and assignment happen inside the lock, preventing multiple threads from creating the thread.

Even if Thread A doesn't see the updated self._thread value in the first check due to memory visibility issues, it will definitely see it in the second check after acquiring the lock.

The worst-case scenario is that multiple threads unnecessarily acquire the lock, but only one will create the thread.

Python's threading module ensures proper memory barriers when acquiring/releasing locks.

Consider this scenario with two threads:

Thread A                        Thread B
--------                        --------
Check self._thread (None)       
                               Check self._thread (None)
                               Acquire lock
                               Check self._thread (None)
                               Create & start thread
                               self._thread = <thread>
                               Release lock
Acquire lock
Check self._thread (NOT None)
Return

The second check under the lock ensures Thread A sees the updated value and returns without creating a duplicate thread.
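The double-checked pattern described above is easy to exercise directly. Here is a self-contained sketch (simplified names, not the pycares source) that hammers start() from many threads and verifies that exactly one worker thread gets created:

```python
import threading

class LazyThreadStarter:
    def __init__(self):
        self._thread = None
        self._start_lock = threading.Lock()
        self.creations = 0  # instrumentation: how many threads we created

    def start(self):
        if self._thread is not None:
            return  # fast path: already started, no lock needed
        with self._start_lock:
            if self._thread is not None:
                return  # another thread won the race while we waited
            self.creations += 1
            self._thread = threading.Thread(target=lambda: None, daemon=True)
            self._thread.start()

starter = LazyThreadStarter()
racers = [threading.Thread(target=starter.start) for _ in range(32)]
for t in racers:
    t.start()
for t in racers:
    t.join()
assert starter.creations == 1
```

The second check is what makes the lock-free first check safe: any thread that raced past it still has to re-verify under the lock before creating anything.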

@bdraco
Collaborator

bdraco commented Aug 5, 2025

Even uncontended locks aren't free - they require syscalls (futex on Linux, pthread_mutex on macOS) which involve kernel transitions.
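For the curious, the relative cost is easy to measure rather than argue about; a quick timeit sketch comparing an uncontended lock acquisition against a plain attribute check (numbers vary by platform and interpreter; note that on Linux an uncontended lock typically stays on the futex user-space fast path without a syscall):

```python
import threading
import timeit

lock = threading.Lock()
flag = None

def with_lock():
    # Acquire and release an uncontended lock.
    with lock:
        pass

def plain_check():
    # The lock-free fast path: a single attribute/None comparison.
    if flag is not None:
        return

t_lock = timeit.timeit(with_lock, number=100_000)
t_check = timeit.timeit(plain_check, number=100_000)
print(f"lock: {t_lock:.4f}s  check: {t_check:.4f}s")
```

Either way, the early return only pays the lock cost on the very first call per process, so both variants are cheap in practice.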

@saghul
Copy link
Owner Author

saghul commented Aug 5, 2025

Ah, I hadn't seen you check twice. Sounds good, I'll apply the fix and create a new release.

- Use ares_queue_wait_empty to wait for queries to be complete before
  destruction
- Make sure NO queries are cancelled as side effects on __del__
- Start the destruction thread early, as soon as a channel is created

Cancelling pending queries while in a query callback seemingly causes
heap corruption or double-free bugs, so delay the operation until no
Python code is using the channel anymore, that is, in the destructor
thread.

Fixes: aio-libs/aiodns#175
Fixes: #248
@saghul saghul merged commit c3c931f into master Aug 5, 2025
29 checks passed
@saghul saghul deleted the fix-fd-leaks branch August 5, 2025 19:56
@bdraco
Collaborator

bdraco commented Aug 5, 2025

Thanks!


Development

Successfully merging this pull request may close these issues.

File Descriptor Leak in Pycares v4.9.0 During Asynchronous HTTP Requests
inotify leaks
