Codecov Report
@@             Coverage Diff              @@
##                dev     #1940      +/-  ##
=============================================
- Coverage     65.57%    65.45%   -0.12%
- Complexity    3928      3941      +13
=============================================
  Files          301       301
  Lines        18527     18557      +30
  Branches      1685      1694       +9
=============================================
- Hits         12149     12147       -2
- Misses        5537      5568      +31
- Partials       841       842       +1
Continue to review full report at Codecov.
Force-pushed from a6d0f31 to a839013
Force-pushed from a839013 to c15acf2
@rickfish Can you please review this?
@marosmars, this looks good and I am glad you are using the lock/share attributes. I wanted to do that, but it is in the guts of the queue DAO. We have so much volume and so many polling operations that I am a little scared to introduce it into high-volume environments where we might miss race conditions and get killed in production. In any case, it looks good on the surface, but I will have to dig into it heavily and actually deploy it to one of our lower environments that has significant volume. I can't do that until after next week because we are all working on a production deployment on the 13th of the month. One question: why the double query for LOCK_TASKS in the processUnacks methods?
@rickfish You can use the PerformanceTest (https://github.com/Netflix/conductor/pull/1940/files#diff-8a372fc5d838b1a09cac3814bb9cb8df1819e64b726eb2cd2988743b199b35f8) in this PR to stress test the queue DAO. It is not a production environment, but it can simulate a lot of load with multiple producers/consumers and thus approximate your production environment. I was using it while developing this PR. By default the test spawns 4 producer threads and 8 consumer threads (workers) and expects a set number of messages to be pushed into the queue and successfully processed. In addition, there are a few threads invoking queueDetail() and a few more invoking processUnacks() to put extra load on the DAO. The test is a unit test, but it expects a real PostgreSQL instance to be deployed manually on the host.

In addition to the attached performance test (which exercises just the queue DAO), I was running a performance test executing workflows against a full conductor + psql + 8 worker nodes setup. That external test also behaved well, and performance was higher with this PR applied. So I totally understand your concerns, but the testing mentioned above should increase confidence in this PR a bit.

Note: I was actually running into problems in the queue DAO without this PR when executing that performance test, e.g. https://github.com/Netflix/conductor/pull/1940/files#diff-fab744a4a59813bc19b455dbc4f34e34e3657655dde025e0177f9621dfca72f4
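For readers who want a feel for the shape of such a harness, below is a minimal, self-contained sketch in the same spirit; it is not the PerformanceTest from this PR. A few producer threads push a fixed budget of messages while more consumer threads poll and count them, and the run fails if the budget is not processed before a deadline. An in-memory BlockingQueue stands in for the PostgreSQL-backed queue DAO, and all class and method names here are hypothetical.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Minimal load-test sketch: N producers push messages, M consumers poll them.
// A BlockingQueue stands in for the real queue DAO; in an actual test the
// producers/consumers would call the PostgreSQL-backed DAO instead.
public class QueueLoadSketch {
    private static final int PRODUCERS = 4;
    private static final int CONSUMERS = 8;
    private static final int TOTAL_MESSAGES = 10_000;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicLong produced = new AtomicLong();
        AtomicLong consumed = new AtomicLong();
        CountDownLatch done = new CountDownLatch(TOTAL_MESSAGES);
        ExecutorService pool = Executors.newCachedThreadPool();

        // Producers: push messages until the global budget is exhausted.
        for (int i = 0; i < PRODUCERS; i++) {
            pool.submit(() -> {
                long id;
                while ((id = produced.getAndIncrement()) < TOTAL_MESSAGES) {
                    queue.offer("msg-" + id);
                }
            });
        }

        // Consumers (workers): poll with a timeout and "process" each message.
        for (int i = 0; i < CONSUMERS; i++) {
            pool.submit(() -> {
                try {
                    while (consumed.get() < TOTAL_MESSAGES) {
                        String msg = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (msg != null) {
                            consumed.incrementAndGet();
                            done.countDown();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The run passes only if every produced message is eventually processed.
        if (!done.await(5, TimeUnit.MINUTES)) {
            throw new AssertionError("not all messages were processed in time");
        }
        pool.shutdownNow();
        System.out.println("processed " + consumed.get() + " messages");
    }
}
```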
That's a mistake, thanks for pointing it out. Will remove the double query.
@rickfish I'd be happy to wait until you test it in your production environment. No rush, and thanks for the review.
That was a mistake in the original commit, no need to do the same thing twice
@marosmars these code changes have been in our 'test' environment for a week with no issues. We are now putting them in our high-volume 'stage' environment, which processes approximately 100,000 workflows and 250,000 tasks per day. We have 4 REST service pods and 2 sweeper pods in that environment, so it will exercise the race conditions around row locking. If there are any issues with the new code, they should surface in that environment in the next week. If there are no issues by 11/23, I will give the thumbs up from my point of view. Thanks so much for the code - it should help with some of the unexplainable issues that happen from time to time due to no row locking being in place.
@marosmars No issues in our stage environment. I think it is safe to go ahead with this. Thanks so much for this.
@rickfish Thank you for the review and testing. @apanicker-nflx can we get this in?
@marosmars Thanks for the contribution and @rickfish thanks for reviewing and testing this out.
* Improve postgresql DAO performance by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries where it makes sense, in order to reduce conflicts/deadlocks in the DB. The most important case is when workers poll for work: they are not interested in tasks that are locked (currently being updated), so they can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks. This increases the performance of the postgres DAO.
  Additional improvements:
  + move Thread.sleep out of the DB transaction when tasks are being polled with the timeout parameter
  + add serialization_error to the list of causes triggering TX retry (this happens under heavy load)
  + fix the processUnack condition: it used to do the opposite. This has also been fixed in the meantime by u447 <rick.fishman@bcbsfl.com>
  + add a performance test. This test can be executed manually, but should not be automated
  + add retries to the containsMessage method
  Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>
* Remove double query from processUnacks
  That was a mistake in the original commit, no need to do the same thing twice
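One of the improvements listed above is adding serialization_error to the causes that trigger a transaction retry. Purely as an illustration of that idea (this is not the code from the PR, and the class and method names below are made up), a retry wrapper for PostgreSQL typically inspects the SQLSTATE and retries only on 40001 (serialization_failure):

```java
import java.sql.SQLException;
import java.util.function.Supplier;

// Illustrative retry helper: re-run a transactional operation when PostgreSQL
// reports a serialization failure (SQLSTATE 40001), which can occur under heavy
// concurrent load. Names are hypothetical, not Conductor's actual classes.
public final class SerializationRetry {

    private static final String SERIALIZATION_FAILURE = "40001";

    public static <T> T withRetries(Supplier<T> txOperation, int maxRetries) {
        SQLException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return txOperation.get();
            } catch (RuntimeException e) {
                SQLException sql = findSqlException(e);
                if (sql == null || !SERIALIZATION_FAILURE.equals(sql.getSQLState())) {
                    throw e; // not a serialization conflict: do not retry
                }
                last = sql;
            }
        }
        throw new IllegalStateException("transaction failed after retries", last);
    }

    // Walk the cause chain looking for the underlying SQLException.
    private static SQLException findSqlException(Throwable t) {
        while (t != null) {
            if (t instanceof SQLException) {
                return (SQLException) t;
            }
            t = t.getCause();
        }
        return null;
    }
}
```

Under REPEATABLE READ or SERIALIZABLE isolation, PostgreSQL aborts conflicting transactions with SQLSTATE 40001, and the documented remedy is to retry the whole transaction, which is why this condition is worth treating as retryable rather than fatal.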
Improve postgresql DAO performance
by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense, in order to reduce conflicts/deadlocks in the DB.
The most important case is when workers poll for work: they are not
interested in tasks that are locked (currently being updated), so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.
This increases the performance of the postgres DAO.
Additional improvements:
+ move Thread.sleep out of the DB transaction when tasks are being polled
  with the timeout parameter
+ add serialization_error to the list of causes triggering TX retry
  (this happens under heavy load)
+ fix the processUnack condition: it used to do the opposite. This has
  also been fixed in the meantime by u447 <rick.fishman@bcbsfl.com>
+ add a performance test. This test can be executed manually, but should
  not be automated
+ add retries to the containsMessage method
Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>
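To make the main change and the Thread.sleep point concrete, here is a minimal JDBC sketch of a worker poll in the same spirit: select unlocked rows with FOR UPDATE SKIP LOCKED inside a short transaction, and keep the poll-timeout sleep outside the transaction. The table and column names (queue_message, queue_name, message_id, popped, created_on) are hypothetical stand-ins rather than Conductor's actual schema, and the class is not part of this PR.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Illustrative polling sketch (hypothetical schema, not Conductor's actual DAO):
// rows already locked by another worker are skipped instead of waited on, which
// avoids lock waits and transaction deadlocks when many workers poll at once.
public class SkipLockedPollSketch {

    private final DataSource dataSource;

    public SkipLockedPollSketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<String> poll(String queueName, int count, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> messages = pollOnce(queueName, count);
        // The wait happens OUTSIDE the DB transaction, so no locks are held
        // while the worker sleeps waiting for work to appear.
        while (messages.isEmpty() && System.currentTimeMillis() < deadline) {
            Thread.sleep(100);
            messages = pollOnce(queueName, count);
        }
        return messages;
    }

    private List<String> pollOnce(String queueName, int count) throws SQLException {
        String sql =
            "SELECT message_id FROM queue_message "
                + "WHERE queue_name = ? AND popped = false "
                + "ORDER BY created_on LIMIT ? "
                + "FOR UPDATE SKIP LOCKED";
        List<String> ids = new ArrayList<>();
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, queueName);
                ps.setInt(2, count);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString("message_id"));
                    }
                }
                // Mark the selected rows as popped within the same transaction.
                try (PreparedStatement upd = conn.prepareStatement(
                        "UPDATE queue_message SET popped = true WHERE message_id = ?")) {
                    for (String id : ids) {
                        upd.setString(1, id);
                        upd.addBatch();
                    }
                    upd.executeBatch();
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
        return ids;
    }
}
```

Because locked rows are skipped rather than waited on, concurrent workers never block each other on the same messages; each simply picks up the next unlocked rows, which is the behavior the commit message describes for polling workers.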