I think there is a race condition with limit_concurrency #224
Hey @hms, thanks for writing this up. I don't have a lot of experience with PostgreSQL, but I think the whole transaction wrapping the 3 jobs being enqueued might be the problem.
This is what is being done, and it's the basis of the concurrency limits. The unique key exists precisely to avoid race conditions when multiple jobs that share a concurrency key are enqueued at the same time.
I imagine that in this case, the expected
Have you tried just running
@rosa, I always appreciate the amount of support here!
I understand the point of a unique key constraint, but as I had mentioned, it really depends on how you are using it. I believe there are 2 distinct possibilities here:
You are correct that my wrapping 3 jobs in a transaction is not making things easier. That was simply a response to this randomly losing one of my job requests. A better way to handle these random failures to enqueue is to change my code to support a retry per job request. As for my ability to get a working test environment up: bin/setup has always blown up on my laptop with the following, and since I've never worked in a Docker environment, this is befuddling to me:
I spent a little time running 'by-hand' tests to confirm my suspicions. Here is the TL;DR version of things:
Unfortunately, as stated above, there is something in my environment that prevents me from building the Solid test environment. But I was able to test 'by hand' via:
Without a savepoint, one of the two sessions would become 'dead/unstable' and generate the same error I'm getting in my logs from SolidQueue (i.e., both inserts would fail and I would get the PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block error message). With the savepoint as suggested below, both transactions would run to completion -- one transaction writing the conflict record and both transactions writing the validation records. I believe this change will fix this issue (maybe Postgres only?):

    def attempt_creation
      Semaphore.transaction(requires_new: true) do # requires_new: true creates the required savepoint
        Semaphore.create!(key: key, value: limit - 1, expires_at: expires_at)
        true
      end
    rescue ActiveRecord::RecordNotUnique # Rails rolls back to the savepoint automatically
      # Any database interactions from here on are valid and will work
      limit == 1 ? false : attempt_decrement
    end
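For anyone wanting to reproduce the by-hand test, it went roughly like this (a sketch to run in a Rails console attached to a Postgres database with the solid_queue schema; the column list is my best guess at the semaphores table):

    # Two connections, same key, no savepoint. In the real race, session B's
    # insert would block until A commits, then fail the same way.
    insert_sql = <<~SQL
      INSERT INTO solid_queue_semaphores (key, value, expires_at, created_at, updated_at)
      VALUES ('demo-key', 0, now() + interval '1 minute', now(), now())
    SQL

    a = ActiveRecord::Base.connection_pool.checkout
    b = ActiveRecord::Base.connection_pool.checkout

    a.begin_db_transaction
    a.execute(insert_sql)
    a.commit_db_transaction                  # session A wins the key

    b.begin_db_transaction
    begin
      b.execute(insert_sql)                  # duplicate key: RecordNotUnique
    rescue ActiveRecord::RecordNotUnique
      begin
        b.execute("SELECT 1")                # still fails: the transaction is aborted
      rescue ActiveRecord::StatementInvalid => e
        puts e.cause.class                   # PG::InFailedSqlTransaction
      end
    end
    b.rollback_db_transaction                # only a rollback makes session B usable again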
@rosa I managed to get my bin/setup working. But I could use some help designing a test that works within the Solid environment.
We have the same problem. We are experiencing this issue with Twilio webhooks: onConversationAdded and onMessageAdded trigger at the same time. When onConversationAdded is enqueued as a job and then onMessageAdded is enqueued, it throws the same error when both try to go to the db at the same time.
@MuhammetDilmac I don't know if you are triggering the problem in the same way as I am, but the code above solves it for me. I have that code in a PR, without any tests, if you would like me to push it up.
When you have a minute: I'm stuck on how to write a test that demonstrates this issue, given the code change doesn't lend itself to flags, easy access, or any of the other standard tricks for testing two implementations of the same functionality. I have a fix that I've been running for about 2 weeks and it seems to resolve the issue. But I'd like to have tests and not trust the "it was just your lucky two weeks" fairies. Also, when attempting to run the tests on main, I am unable to run them to completion successfully. So, clearly I have to start there, and it would help to know if these errors are to be expected or if I'm "being special again". Hal
Hey @hms, sorry for the delay! I'm on-call this week and haven't had time to dedicate to Solid Queue 😳 I'll try on Friday. I'm not sure it'll be possible to write a test for this scenario, TBH. What error do you get when you run the tests on main? You can also try running them for Postgres only, with the environment variable
Per all of my earlier discussion, Postgres transactions become invalid when any database constraint rejects an operation, and all follow-on database interactions stay invalid until a rollback is executed. Semaphore#attempt_creation uses a rather common database pattern: rely on database locks to reliably synchronize based on uniqueness of values. The catch is that Postgres requires a little more handholding to use this pattern.

By wrapping just the Semaphore.create statement in a SAVEPOINT (a pretend nested transaction), we get what appears to be a transparent operation that works across all of the other supported databases without any known consequences AND allows Postgres to continue to work as expected per the existing code. This change works across the Solid-supported databases (at least based on the tests), and I believe the extra overhead for the non-Postgres databases is small enough that special-casing the code or resorting to the complexity of dependency injection is just not worth it.

I added two tests to concurrency_controls_test.rb: one to show the error is real and easy to reproduce, and one to show the fix fixes it.
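The reproduction test could take roughly this shape (a hedged sketch: DummyJob and the structure are hypothetical, not the PR's actual test code):

    # Assumes a job declared with limits_concurrency and a Postgres test database.
    test "conflicting enqueues inside transactions do not abort the transaction" do
      errors = []
      threads = 2.times.map do
        Thread.new do
          ActiveRecord::Base.connection_pool.with_connection do
            ActiveRecord::Base.transaction do
              DummyJob.perform_later(42)   # both threads race on the same concurrency key
              DummyJob.perform_later(42)   # a later statement inside the same transaction
            end
          rescue ActiveRecord::StatementInvalid => e
            errors << e                    # PG::InFailedSqlTransaction surfaces here
          end
        end
      end
      threads.each(&:join)
      assert_empty errors
    end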
@rosa if my response came across as demanding or expecting, I'm sorry. I know everyone here is working on an all-volunteer basis with day jobs.
Hey @hms, no, not at all! Don't worry at all, sorry if I implied I was bothered or something! 😅 I'm fortunate that I can work on Solid Queue as part of my day job, but lately I've been busy with other priorities there, including on-call, but hopefully I'll manage to find time soon 🤞 Great job on this issue and the associated PR.
Postgres, unlike MySQL and SQLite, invalidates the current transaction on a database constraint violation, such as an attempt to insert a duplicate key into a unique index. SolidQueue uses a pretty standard design pattern of application-level conflict detection and resolution via error handling as part of its concurrency management, in Semaphore::Proxy#attempt_creation (via create! and rescue). For Postgres-based implementations, when:

- enqueuing a job inside a transaction
- under load such that the race condition triggers the simultaneous conflicting insert

the transaction would become unrecoverable at the application level. There are two simple paths to address this issue in Postgres:

- Nested transactions
- Insert using 'ON CONFLICT DO NOTHING'

Nested transactions are the easiest to implement and are portable across all 3 of the Rails standard databases. Unfortunately, nested transactions have a known and very significant performance problem, especially for databases under load and/or replicated databases with long-running queries. Each of the big three databases implements the ANSI-standard INSERT ... ON CONFLICT DO NOTHING in a very different way. Because of this, this PR only replaces the current implementation of application-level conflict handling for Postgres. It takes advantage of the Rails-standard ActiveRecord::Persistence insert methods supporting ON CONFLICT semantics.
Postgres-only issue:

Jobs that meet all of the following conditions will result in the database connection entering an invalid and unrecoverable state, requiring a rollback before any further database requests (reads or writes) can be made:

1. Use SolidQueue's limits_concurrency macro
2. Are enqueued inside an application-level transaction
3. The limits_concurrency key already exists in the Semaphore table (i.e., this is job 2 - N for a given key)

SolidQueue uses the following design pattern to implement the conflict detection of the limits_concurrency macro, which works 100% correctly, 100% of the time, with MySQL and SQLite:

    begin
      Semaphore.create!(...)
      no_conflict_path()
    rescue ActiveRecord::RecordNotUnique
      handle_concurrency_conflict()
    end

The problem is Postgres: it's not possible to rescue and recover from an insert failing due to a conflict on a unique index (or any other database constraint). Until a rollback is executed, the database connection is in an invalid state and unusable.

Possible solutions:

1. Nested transactions
   + Easiest to implement
   + Portable across all three standard Rails databases
   - Has significant performance issues, especially for databases under load or replicated databases with long-running queries (Postgres specific)
   - Requires using raise and rescue for flow of control (performance, less-than-ideal coding practice)
   - Requires issuing a rollback (performance: an additional command that must round-trip from the client to the database and back)

2. Insert into the Semaphore table using 'INSERT INTO table (...) VALUES (...) ON CONFLICT DO NOTHING' syntax
   + ANSI-standard syntax (not a Postgres one-off solution)
   + Rails natively supports identifying database adapters that support this syntax, making the implementation portable and maintainable
   + Supported by Postgres and allows this issue to be addressed
   + Does not require raise and rescue for flow of control
   + Performant (native, fast-path database functionality)

Solution: Leverage the Rails connection adapters, which allow code to easily identify supported database features/functionality, to implement strategy #2 (INSERT ON CONFLICT) for those databases that support it (i.e., Postgres) and leave the original implementation of rescuing RecordNotUnique for those that do not.

Note: Although I'd love to take credit for the "quality" of this implementation, all that credit belongs to @rosa, whose excellent feedback on an earlier implementation resulted in this significantly better implementation.
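A hedged sketch of that adapter dispatch (the method names here are illustrative, not necessarily the PR's exact code):

    # Illustrative only; assumes the Semaphore proxy's key, limit, expires_at,
    # and attempt_decrement from the discussion above.
    def attempt_creation
      if Semaphore.connection.supports_insert_conflict_target?
        attempt_creation_with_insert_on_conflict
      else
        attempt_creation_with_create_and_rescue    # the original create!/rescue path
      end
    end

    def attempt_creation_with_insert_on_conflict
      # Model.insert emits INSERT ... ON CONFLICT DO NOTHING on Postgres, so a
      # duplicate key no longer aborts the surrounding transaction.
      result = Semaphore.insert({ key: key, value: limit - 1, expires_at: expires_at })
      if result.any?
        true                                       # row inserted: semaphore claimed
      else
        limit == 1 ? false : attempt_decrement     # conflict: try to decrement instead
      end
    end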
Fixes #224. I believe this is a Postgres-only issue.
Environment:
My workflow that seems to reliably cause the issue:
One of the jobs fairly consistently generates an error on 'INSERT INTO "solid_queue_semaphores"'.
For each block of 10 or so creates, the offending job's solid_queue_semaphore "key" value is JobName/User.id, and all 10 of the commits will have the same key.
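For context, a job declaration along these lines would produce that key shape (a hypothetical sketch, not my actual application code):

    # ~10 of these are enqueued per user in one commit, all sharing the
    # semaphore key "JobName/<user id>".
    class JobName < ApplicationJob
      limits_concurrency to: 1, key: ->(user) { user.id }, duration: 5.minutes

      def perform(user)
        # work scoped to this user
      end
    end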
What I believe I'm seeing via the logs is:
I haven't dug down into any of the SolidQueue code around enqueuing a job with limits_concurrency, but, at least with Postgres and its version of MVCC, there is no way to preemptively detect a potential duplicate key prior to insert via a query. It either has to be rescued or handled with an 'ON CONFLICT' clause (or the MySQL / SQLite equivalent).
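To illustrate (a sketch using the names from this thread, not SolidQueue's code):

    # Check-then-insert cannot close the race: under MVCC both enqueuing
    # sessions can pass the existence check before either insert commits.
    unless Semaphore.exists?(key: key)    # both sessions may see "no row"
      Semaphore.create!(key: key, value: limit - 1, expires_at: expires_at)
      # the slower session still hits the unique index and raises
    end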
If my somewhat rushed conclusion is correct, then, given that limits_concurrency is a SolidQueue feature that enhances an enqueue request, it would be highly desirable for SolidQueue to detect and retry internally. Less desirable, but absolutely functional, would be to raise an error that's specific to this issue, because the Job.perform_later is not the root cause: there isn't anything wrong with the database, you just got stepped on by a combination of a database constraint and unfortunate timing, so the right answer is to resubmit the job request and it will likely work.
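In the meantime, the caller-side retry I mentioned earlier could look roughly like this (a hypothetical helper, not a SolidQueue API; it has to run outside any wrapping transaction, since on Postgres a failed attempt inside one leaves the transaction aborted):

    # Hypothetical application-level helper.
    def enqueue_with_retry(job_class, *args, attempts: 2)
      attempts.times do |attempt|
        return job_class.perform_later(*args)
      rescue ActiveRecord::StatementInvalid   # wraps PG::InFailedSqlTransaction
        raise if attempt == attempts - 1      # give up after the final attempt
      end
    end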
When I figure out how to get the SolidQueue test environment working (I'm on a Mac, and Homebrew seems to require some kind of weird incantation to build a version of MySQL that works with Docker), I'll see if I can develop a test for this specific case.