-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix segfault when database connections timeout #218
Conversation
@akash-akya can you point the branch at Also thanks for fixing the CI goof up. |
c_src/sqlite3_nif.c
Outdated
conn->mutex = enif_mutex_create("exqlite:connection"); | ||
if (conn->mutex == NULL) { | ||
enif_release_resource(conn); | ||
return make_error_tuple(env, "failed to create mutex"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this gets turned into an atom. {:error, :failed_to_create_mutex}
return make_error_tuple(env, "failed to create mutex"); | |
return make_error_tuple(env, "failed_to_create_mutex"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
b6d4915
to
1fc1de3
Compare
1fc1de3
to
6cacbde
Compare
Hey @warmwaffles, made changes as suggested. Disabling the segfault test by default since it is very slow. Let me know if you want to enable it. Regarding the current approach -- I am not sure if this is the right approach to solve the issue. Since to properly fix it I think we have to wrap every callback inside the connection mutex, not just prepare. Alternatively, how about this? When connection timeout, we just roll back the trx, but we don't close the connection at that point in time. Since a timed out connection will be eventually garbage collected and since at-max there can only be one parallel command, we can just let |
Let's leave it disabled. I'll move it to a regression / integration level.
That seems like a good solution versus the mutex manipulations. But what I fear is that you can still segfault this manually using Exqlite directly. Is that okay if it segfaults if the library is used incorrectly? |
Can you put a couple flags in the connection, something like |
Yes, it won't fix those use cases. Imo it is okay to just warn the user, make unsafe APIs private and leave it at that. Just curious, is there a case where one would want to make raw calls concurrently using a single connection? AFAIK db_connection is already lightweight from process footprint point of view. Also, It's very hard to cover all such cases right? for example, parallel |
@greg-rychlewski are you suggesting that on the elixir side or the NIF side? On the NIF side, we'd have to use a mutex to ensure that a connection isn't double closed.
I think someone who wants to skip using db_connection should be able to. But how realistic is it that someone would use the raw
Definitely hard to cover, but a mutex is probably the right solution to always check before calling @akash-akya another thing is, could that mutex checking be adding enough latency to not surface the race condition? When I was throwing |
Yes, adding a delay in concurrent calls can make it hard to reach the exact race condition. I tried adding sleep before prepare and sure enough, it makes the issue disappear. The question now is if mutex is truly fixing the issue or just hiding the race condition because of latency it is adding right? I think if we can reliably prove the exact race condition sequence, then we can be sure about the fix. Can you share your changes with |
@akash-akya I shimmed it in as the first call in |
Trying to reproduce the issue deterministically for a single connection. But it is hard to get the order of thread execution correct without any additional tooling. Fortunately, since SQLite source code itself is embedded, we can just add sleep in-between execution of prepare & close, and get the deterministic order we want.
Without mutex fix
With mutex fix
The fix does fixes the segfault in the test, and adding fprintf does not fix the above segfault. Of course, among the set of all possible code path that can cause segfault, this is just one, there might be more. |
Great work! I say let's get this merged, cut a patch release and I can explore adding more mutex lock checking in all of the critical sections. There is definitely still a chance where binding values to a statement could result in a segfault because it does not utilize the mutex locking part. |
Yes, sounds good to me. Let me know if any more changes are pending for this to get merged. |
Released under |
Hypothesis
the issue seems to be that there is a race condition between
exqlite_close
andexqlite_prepare
. The issue happens when we executesqlite3_prepare_v3
on a connection after we close the connection withsqlite3_close_v2
.sqlite3_close_v2
does handle previously prepared statements, but here we are preparing the statement in-parallel while we are closing the connection. Which is causing occasional SQLITE_MISUSE.How can we reach this state?
Given
exqlite_close
is executed by connection processhandle_prepare
and other callbacks are called in the client processScenario
exqlite_prepare
NIF call but fail to complete within connection timeout duration. It 1ms in the test case. This NIF call is still running or scheduled to run even after the timeout.disconnect
callback)exqlite_close
NIF2 & 4 are run in-parallel leading to race-condition mentioned previously causing segfault.
Fix
I tried to fix the race-condition using mutex, and sofar haven't seen a single failure.
This is PoC, I guess we should handle multiple other cases as well, not just the one between close & prepare if this is the issue.
Should fix #190