Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault when database connections timeout #218

Merged
merged 6 commits into from
Oct 27, 2022

Conversation

akash-akya
Copy link
Contributor

@akash-akya akash-akya commented Oct 25, 2022

Hypothesis

the issue seems to be that there is a race condition between exqlite_close and exqlite_prepare. The issue happens when we execute sqlite3_prepare_v3 on a connection after we close the connection with sqlite3_close_v2.

sqlite3_close_v2 does handle previously prepared statements, but here we are preparing the statement in-parallel while we are closing the connection. Which is causing occasional SQLITE_MISUSE.

How can we reach this state?

Given

  • exqlite_close is executed by connection process
  • while handle_prepare and other callbacks are called in the client process
  • a NIF call can not be interrupted/cancelled once started, irrespective of client side process timeout, or process kill.

Scenario

  1. client checkout a connection and makes a query
  2. client makes exqlite_prepare NIF call but fail to complete within connection timeout duration. It 1ms in the test case. This NIF call is still running or scheduled to run even after the timeout.
  3. due to connection timeout, connection-process receives timeout and initiate connection close (disconnect callback)
  4. connection-process calls exqlite_close NIF

2 & 4 are run in-parallel leading to race-condition mentioned previously causing segfault.

Fix

I tried to fix the race-condition using mutex, and sofar haven't seen a single failure.

This is PoC, I guess we should handle multiple other cases as well, not just the one between close & prepare if this is the issue.

Should fix #190

@akash-akya akash-akya changed the title DRAFT: potential fix for segfault DRAFT: potential explanation for segfault & PoC fix Oct 25, 2022
@warmwaffles
Copy link
Member

warmwaffles commented Oct 25, 2022

@akash-akya can you point the branch at main and cherry-pick in the test I made? I'd rather you take the credit for solving this. We utilize a squash merge in this project.

Also thanks for fixing the CI goof up.

conn->mutex = enif_mutex_create("exqlite:connection");
if (conn->mutex == NULL) {
enif_release_resource(conn);
return make_error_tuple(env, "failed to create mutex");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this gets turned into an atom. {:error, :failed_to_create_mutex}

Suggested change
return make_error_tuple(env, "failed to create mutex");
return make_error_tuple(env, "failed_to_create_mutex");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@akash-akya akash-akya changed the base branch from segfault-hunting to main October 25, 2022 19:35
@akash-akya
Copy link
Contributor Author

akash-akya commented Oct 25, 2022

Hey @warmwaffles, made changes as suggested.

Disabling the segfault test by default since it is very slow. Let me know if you want to enable it.

Regarding the current approach -- I am not sure if this is the right approach to solve the issue. Since to properly fix it I think we have to wrap every callback inside the connection mutex, not just prepare.

Alternatively, how about this? When connection timeout, we just roll back the trx, but we don't close the connection at that point in time.

Since a timed out connection will be eventually garbage collected and since at-max there can only be one parallel command, we can just let connection_type_destructor function close the connection at the end after the parallel command finishes. That way we don't need any mutex & wrapping, wdys?

@warmwaffles
Copy link
Member

Disabling the segfault test by default since it is very slow. Let me know if you want to enable it.

Let's leave it disabled. I'll move it to a regression / integration level.

Alternatively, how about this? When connection timeout, we just roll back the trx, but we don't close the connection at that point in time.

That seems like a good solution versus the mutex manipulations. But what I fear is that you can still segfault this manually using Exqlite directly. Is that okay if it segfaults if the library is used incorrectly?

@greg-rychlewski
Copy link
Contributor

Can you put a couple flags in the connection, something like in_use and marked_for_closure. If not in use then the timeout closes the connection. If in use then set marked_for_closure to true and the command will close the connection after it's done.

@akash-akya
Copy link
Contributor Author

akash-akya commented Oct 25, 2022

That seems like a good solution versus the mutex manipulations. But what I fear is that you can still segfault this manually using Exqlite directly. Is that okay if it segfaults if the library is used incorrectly?

Yes, it won't fix those use cases. Imo it is okay to just warn the user, make unsafe APIs private and leave it at that.

Just curious, is there a case where one would want to make raw calls concurrently using a single connection? AFAIK db_connection is already lightweight from process footprint point of view.

Also, It's very hard to cover all such cases right? for example, parallel sqlite3_close_v2 calls on the same connection. Currently we are not handling that I think even with mutex.

@warmwaffles
Copy link
Member

@greg-rychlewski are you suggesting that on the elixir side or the NIF side?

On the NIF side, we'd have to use a mutex to ensure that a connection isn't double closed.

Just curious, is there a case where one would want to make raw calls concurrently using a single connection? AFAIK db_connection is already lightweight from process footprint point of view.

I think someone who wants to skip using db_connection should be able to. But how realistic is it that someone would use the raw Exqlite.Sqlite interface outside of ecto? Probably low.

Also, It's very hard to cover all such cases right? for example, parallel sqlite3_close_v2 calls on the same connection. Currently we are not handling that I think even with mutex.

Definitely hard to cover, but a mutex is probably the right solution to always check before calling sqlite3_close_v2

@akash-akya another thing is, could that mutex checking be adding enough latency to not surface the race condition? When I was throwing fprintf debug statements in, it slowed the calls down enough to make the issue not present itself.

@akash-akya
Copy link
Contributor Author

akash-akya commented Oct 26, 2022

Yes, adding a delay in concurrent calls can make it hard to reach the exact race condition. I tried adding sleep before prepare and sure enough, it makes the issue disappear.

The question now is if mutex is truly fixing the issue or just hiding the race condition because of latency it is adding right? I think if we can reliably prove the exact race condition sequence, then we can be sure about the fix.

Can you share your changes with fprintf? Since where we are placing it does matter

@warmwaffles
Copy link
Member

@akash-akya I shimmed it in as the first call in exqlite_close and one on exqlite_open.

@akash-akya
Copy link
Contributor Author

akash-akya commented Oct 27, 2022

Trying to reproduce the issue deterministically for a single connection. But it is hard to get the order of thread execution correct without any additional tooling.

Fortunately, since SQLite source code itself is embedded, we can just add sleep in-between execution of prepare & close, and get the deterministic order we want.

Without mutex fix

$ mix test test/exqlite/timeout_segfault_test.exs:50
Excluding tags: [:test]
Including tags: [line: "50"]

sqlite3LockAndPrepare(): do not acquire the mutex, wait for sqlite3LeaveMutexAndCloseZombie() execute
sqlite3LeaveMutexAndCloseZombie(): critical-section execution is done, the mutex is released. Wait for sqlite3LockAndPrepare() to proceeds before we free db
sqlite3LockAndPrepare(): done waiting, proceeding with the execution
[1]    97570 segmentation fault  mix test test/exqlite/timeout_segfault_test.exs:50

With mutex fix

$ mix test test/exqlite/timeout_segfault_test.exs:50
Excluding tags: [:test]
Including tags: [line: "50"]

sqlite3LockAndPrepare(): do not acquire the mutex, wait for sqlite3LeaveMutexAndCloseZombie() execute
sqlite3LockAndPrepare(): done waiting, proceeding with the execution
sqlite3LeaveMutexAndCloseZombie(): critical-section execution is done, the mutex is released. Wait for sqlite3LockAndPrepare() to proceeds before we free db
.
Finished in 2.6 seconds (0.00s async, 2.6s sync)
2 tests, 0 failures, 1 excluded

The fix does fixes the segfault in the test, and adding fprintf does not fix the above segfault.

Of course, among the set of all possible code path that can cause segfault, this is just one, there might be more.

@warmwaffles
Copy link
Member

Great work! I say let's get this merged, cut a patch release and I can explore adding more mutex lock checking in all of the critical sections.

There is definitely still a chance where binding values to a statement could result in a segfault because it does not utilize the mutex locking part.

@warmwaffles warmwaffles changed the title DRAFT: potential explanation for segfault & PoC fix Fix segfault when database connections timeout Oct 27, 2022
@akash-akya
Copy link
Contributor Author

Yes, sounds good to me.

Let me know if any more changes are pending for this to get merged.

@warmwaffles warmwaffles merged commit 4b97a95 into elixir-sqlite:main Oct 27, 2022
@warmwaffles
Copy link
Member

Released under v0.11.7, thanks a ton @akash-akya

@akash-akya akash-akya deleted the segfault-hunting branch October 27, 2022 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault when exceeding DBConnection timeout
3 participants