-
Notifications
You must be signed in to change notification settings - Fork 81
Fix channel destruction race conditions with c-ares queue synchronization and safety delay #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
I can still get a crash with this |
|
test script |
|
here is the trace on macos |
|
Actually its a different issue. The callback was freed but it was already freed by python destruction |
|
even with the check and a 5s delay I still get a crash |
|
The original PR fixed the crash on 1422 but this crash happens on 1421 |
|
|
|
no matter what I do I can't solve this. I think we need to dump the event_thread as its fundamentally unsafe. I'm going to give up on this as I think changes have to be made upstream, and trying to solve it here is just an exercise in frustration. |
|
So the original PR did fix the crash online 1422, but we still have a race crash from the double free on 1421 At this point, I'm convinced that the event thread implementation cannot be made safe unless c-ares makes changes |
|
I still plan on giving this another shot before giving up on the event thread but I haven't had free cycles to work on it between my other commitments |
|
Oh, looks like a made a PR following roughly the same approach but I hadn't seen this one from you, sorry! |
no worries. happy to have it solved by any means possible 👍 |
|
replaced by #253 |
PR Summary
This PR fixes critical race conditions in channel destruction by combining c-ares's
ares_queue_wait_empty()with a minimal safety delay. This approach eliminates crashes while maintaining high throughput for channel destruction.Problem
The current implementation has fundamental safety issues:
ares_queue_wait_empty()alone isn't sufficient because the event thread may be in a critical section after removing queries from the queueRoot Cause
When using
event_thread=True, there's a race condition where:ares_queue_wait_empty()return success)end_query()processing callbacks or freeing memoryares_destroy()is called at this moment, it causes use-after-free crashesSolution
Implement a two-phase approach:
ares_queue_wait_empty()to detect when queries are completeares_destroy()This combines deterministic waiting with a safety margin for the critical section.
Key Changes
ares_queue_wait_empty()binding - Exposes c-ares function to check when queries are donethreading.LockBenefits
Why This Approach
ares_queue_wait_empty()alone is insufficient - The event thread can still be in critical sectionsTesting
This change has been designed to handle the specific race conditions in c-ares's event thread implementation. The combination of queue state checking and minimal delay provides the safety needed while maintaining good performance.