Fix segfault when database connections timeout #218

akash-akya · 2022-10-25T09:27:37Z

Hypothesis

the issue seems to be that there is a race condition between exqlite_close and exqlite_prepare. The issue happens when we execute sqlite3_prepare_v3 on a connection after we close the connection with sqlite3_close_v2.

sqlite3_close_v2 does handle previously prepared statements, but here we are preparing the statement in-parallel while we are closing the connection. Which is causing occasional SQLITE_MISUSE.

How can we reach this state?

Given

exqlite_close is executed by connection process
while handle_prepare and other callbacks are called in the client process
a NIF call can not be interrupted/cancelled once started, irrespective of client side process timeout, or process kill.

Scenario

client checkout a connection and makes a query
client makes exqlite_prepare NIF call but fail to complete within connection timeout duration. It 1ms in the test case. This NIF call is still running or scheduled to run even after the timeout.
due to connection timeout, connection-process receives timeout and initiate connection close (disconnect callback)
connection-process calls exqlite_close NIF

2 & 4 are run in-parallel leading to race-condition mentioned previously causing segfault.

Fix

I tried to fix the race-condition using mutex, and sofar haven't seen a single failure.

This is PoC, I guess we should handle multiple other cases as well, not just the one between close & prepare if this is the issue.

Should fix #190

warmwaffles · 2022-10-25T13:32:32Z

@akash-akya can you point the branch at main and cherry-pick in the test I made? I'd rather you take the credit for solving this. We utilize a squash merge in this project.

Also thanks for fixing the CI goof up.

warmwaffles · 2022-10-25T13:23:20Z

c_src/sqlite3_nif.c

+    conn->mutex = enif_mutex_create("exqlite:connection");
+    if (conn->mutex == NULL) {
+        enif_release_resource(conn);
+        return make_error_tuple(env, "failed to create mutex");


Since this gets turned into an atom. {:error, :failed_to_create_mutex}

Suggested change

return make_error_tuple(env, "failed to create mutex");

return make_error_tuple(env, "failed_to_create_mutex");

akash-akya · 2022-10-25T19:44:57Z

Hey @warmwaffles, made changes as suggested.

Disabling the segfault test by default since it is very slow. Let me know if you want to enable it.

Regarding the current approach -- I am not sure if this is the right approach to solve the issue. Since to properly fix it I think we have to wrap every callback inside the connection mutex, not just prepare.

Alternatively, how about this? When connection timeout, we just roll back the trx, but we don't close the connection at that point in time.

Since a timed out connection will be eventually garbage collected and since at-max there can only be one parallel command, we can just let connection_type_destructor function close the connection at the end after the parallel command finishes. That way we don't need any mutex & wrapping, wdys?

warmwaffles · 2022-10-25T20:03:31Z

Disabling the segfault test by default since it is very slow. Let me know if you want to enable it.

Let's leave it disabled. I'll move it to a regression / integration level.

Alternatively, how about this? When connection timeout, we just roll back the trx, but we don't close the connection at that point in time.

That seems like a good solution versus the mutex manipulations. But what I fear is that you can still segfault this manually using Exqlite directly. Is that okay if it segfaults if the library is used incorrectly?

greg-rychlewski · 2022-10-25T20:22:31Z

Can you put a couple flags in the connection, something like in_use and marked_for_closure. If not in use then the timeout closes the connection. If in use then set marked_for_closure to true and the command will close the connection after it's done.

akash-akya · 2022-10-25T20:27:19Z

That seems like a good solution versus the mutex manipulations. But what I fear is that you can still segfault this manually using Exqlite directly. Is that okay if it segfaults if the library is used incorrectly?

Yes, it won't fix those use cases. Imo it is okay to just warn the user, make unsafe APIs private and leave it at that.

Just curious, is there a case where one would want to make raw calls concurrently using a single connection? AFAIK db_connection is already lightweight from process footprint point of view.

Also, It's very hard to cover all such cases right? for example, parallel sqlite3_close_v2 calls on the same connection. Currently we are not handling that I think even with mutex.

warmwaffles · 2022-10-25T20:33:24Z

@greg-rychlewski are you suggesting that on the elixir side or the NIF side?

On the NIF side, we'd have to use a mutex to ensure that a connection isn't double closed.

Just curious, is there a case where one would want to make raw calls concurrently using a single connection? AFAIK db_connection is already lightweight from process footprint point of view.

I think someone who wants to skip using db_connection should be able to. But how realistic is it that someone would use the raw Exqlite.Sqlite interface outside of ecto? Probably low.

Also, It's very hard to cover all such cases right? for example, parallel sqlite3_close_v2 calls on the same connection. Currently we are not handling that I think even with mutex.

Definitely hard to cover, but a mutex is probably the right solution to always check before calling sqlite3_close_v2

@akash-akya another thing is, could that mutex checking be adding enough latency to not surface the race condition? When I was throwing fprintf debug statements in, it slowed the calls down enough to make the issue not present itself.

akash-akya · 2022-10-26T19:57:41Z

Yes, adding a delay in concurrent calls can make it hard to reach the exact race condition. I tried adding sleep before prepare and sure enough, it makes the issue disappear.

The question now is if mutex is truly fixing the issue or just hiding the race condition because of latency it is adding right? I think if we can reliably prove the exact race condition sequence, then we can be sure about the fix.

Can you share your changes with fprintf? Since where we are placing it does matter

warmwaffles · 2022-10-26T20:17:30Z

@akash-akya I shimmed it in as the first call in exqlite_close and one on exqlite_open.

akash-akya · 2022-10-27T11:43:33Z

Trying to reproduce the issue deterministically for a single connection. But it is hard to get the order of thread execution correct without any additional tooling.

Fortunately, since SQLite source code itself is embedded, we can just add sleep in-between execution of prepare & close, and get the deterministic order we want.

branch with the deterministic test: https://github.com/akash-akya/exqlite/tree/deterministic-segfault
branch with the deterministic test and fix: https://github.com/akash-akya/exqlite/tree/deterministic-segfault-with-fix

Without mutex fix

$ mix test test/exqlite/timeout_segfault_test.exs:50
Excluding tags: [:test]
Including tags: [line: "50"]

sqlite3LockAndPrepare(): do not acquire the mutex, wait for sqlite3LeaveMutexAndCloseZombie() execute
sqlite3LeaveMutexAndCloseZombie(): critical-section execution is done, the mutex is released. Wait for sqlite3LockAndPrepare() to proceeds before we free db
sqlite3LockAndPrepare(): done waiting, proceeding with the execution
[1]    97570 segmentation fault  mix test test/exqlite/timeout_segfault_test.exs:50

With mutex fix

$ mix test test/exqlite/timeout_segfault_test.exs:50
Excluding tags: [:test]
Including tags: [line: "50"]

sqlite3LockAndPrepare(): do not acquire the mutex, wait for sqlite3LeaveMutexAndCloseZombie() execute
sqlite3LockAndPrepare(): done waiting, proceeding with the execution
sqlite3LeaveMutexAndCloseZombie(): critical-section execution is done, the mutex is released. Wait for sqlite3LockAndPrepare() to proceeds before we free db
.
Finished in 2.6 seconds (0.00s async, 2.6s sync)
2 tests, 0 failures, 1 excluded

The fix does fixes the segfault in the test, and adding fprintf does not fix the above segfault.

Of course, among the set of all possible code path that can cause segfault, this is just one, there might be more.

warmwaffles · 2022-10-27T13:21:11Z

Great work! I say let's get this merged, cut a patch release and I can explore adding more mutex lock checking in all of the critical sections.

There is definitely still a chance where binding values to a statement could result in a segfault because it does not utilize the mutex locking part.

akash-akya · 2022-10-27T14:48:28Z

Yes, sounds good to me.

Let me know if any more changes are pending for this to get merged.

warmwaffles · 2022-10-27T15:21:28Z

Released under v0.11.7, thanks a ton @akash-akya

Reproduce segfault

4fc3a55

akash-akya changed the title ~~DRAFT: potential fix for segfault~~ DRAFT: potential explanation for segfault & PoC fix Oct 25, 2022

warmwaffles reviewed Oct 25, 2022

View reviewed changes

akash-akya force-pushed the segfault-hunting branch from b6d4915 to 1fc1de3 Compare October 25, 2022 19:29

akash-akya added 5 commits October 26, 2022 01:04

Close connection in critical section to avoid race conditions

01cc040

Increase test timeout

d20810a

fix ci

219b63a

Skip segfault test by default

c980733

Fix lint

6cacbde

akash-akya force-pushed the segfault-hunting branch from 1fc1de3 to 6cacbde Compare October 25, 2022 19:35

akash-akya changed the base branch from segfault-hunting to main October 25, 2022 19:35

warmwaffles changed the title ~~DRAFT: potential explanation for segfault & PoC fix~~ Fix segfault when database connections timeout Oct 27, 2022

warmwaffles merged commit 4b97a95 into elixir-sqlite:main Oct 27, 2022

akash-akya deleted the segfault-hunting branch October 27, 2022 18:31

alanj853 mentioned this pull request Apr 28, 2023

Upstream merger se-apc/exqlite#3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix segfault when database connections timeout #218

Fix segfault when database connections timeout #218

Uh oh!

akash-akya commented Oct 25, 2022 •

edited

Loading

Uh oh!

warmwaffles commented Oct 25, 2022 •

edited

Loading

Uh oh!

warmwaffles Oct 25, 2022

Uh oh!

akash-akya Oct 25, 2022

Uh oh!

akash-akya commented Oct 25, 2022 •

edited

Loading

Uh oh!

warmwaffles commented Oct 25, 2022

Uh oh!

greg-rychlewski commented Oct 25, 2022

Uh oh!

akash-akya commented Oct 25, 2022 •

edited

Loading

Uh oh!

warmwaffles commented Oct 25, 2022

Uh oh!

akash-akya commented Oct 26, 2022 •

edited

Loading

Uh oh!

warmwaffles commented Oct 26, 2022

Uh oh!

akash-akya commented Oct 27, 2022 •

edited

Loading

Uh oh!

warmwaffles commented Oct 27, 2022

Uh oh!

akash-akya commented Oct 27, 2022

Uh oh!

warmwaffles commented Oct 27, 2022

Uh oh!

Uh oh!

	return make_error_tuple(env, "failed to create mutex");
	return make_error_tuple(env, "failed_to_create_mutex");

Fix segfault when database connections timeout #218

Fix segfault when database connections timeout #218

Uh oh!

Conversation

akash-akya commented Oct 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hypothesis

Scenario

Fix

Uh oh!

warmwaffles commented Oct 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

warmwaffles Oct 25, 2022

Choose a reason for hiding this comment

Uh oh!

akash-akya Oct 25, 2022

Choose a reason for hiding this comment

Uh oh!

akash-akya commented Oct 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

warmwaffles commented Oct 25, 2022

Uh oh!

greg-rychlewski commented Oct 25, 2022

Uh oh!

akash-akya commented Oct 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

warmwaffles commented Oct 25, 2022

Uh oh!

akash-akya commented Oct 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

warmwaffles commented Oct 26, 2022

Uh oh!

akash-akya commented Oct 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Without mutex fix

With mutex fix

Uh oh!

warmwaffles commented Oct 27, 2022

Uh oh!

akash-akya commented Oct 27, 2022

Uh oh!

warmwaffles commented Oct 27, 2022

Uh oh!

Uh oh!

akash-akya commented Oct 25, 2022 •

edited

Loading

warmwaffles commented Oct 25, 2022 •

edited

Loading

akash-akya commented Oct 25, 2022 •

edited

Loading

akash-akya commented Oct 25, 2022 •

edited

Loading

akash-akya commented Oct 26, 2022 •

edited

Loading

akash-akya commented Oct 27, 2022 •

edited

Loading