Skip to content
This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Improve postgresql DAO performance #1940

Merged
merged 2 commits into from
Dec 1, 2020

Conversation

marosmars
Copy link
Contributor

by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense in order to reduce conflicts/deadlocks in DB

Most important case is when workers poll for work, they are not
interested in tasks that are locked (currently being updated) so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.

This increases the performance of postgres dao

Additional improvements:

  • move thread.sleep out of DB transaction when tasks are being polled
    with timeout parameter

  • add serialization_error to the list of causes triggerring TX retry
    (this is happening under heavy load)

  • fix processUnack condition: it used to do the opposite. This has been
    also fixed in the meantime by u447 rick.fishman@bcbsfl.com

  • add a performance test. This test can be executed manually, but should
    not be automated

  • add retries to containsMessage method

Signed-off-by: Maros Marsalek mmarsalek@frinx.io

@codecov
Copy link

codecov bot commented Oct 28, 2020

Codecov Report

Merging #1940 into dev will decrease coverage by 0.11%.
The diff coverage is 64.93%.

Impacted file tree graph

@@             Coverage Diff              @@
##                dev    #1940      +/-   ##
============================================
- Coverage     65.57%   65.45%   -0.12%     
- Complexity     3928     3941      +13     
============================================
  Files           301      301              
  Lines         18527    18557      +30     
  Branches       1685     1694       +9     
============================================
- Hits          12149    12147       -2     
- Misses         5537     5568      +31     
- Partials        841      842       +1     
Impacted Files Coverage Δ Complexity Δ
...etflix/conductor/dao/postgres/PostgresBaseDAO.java 66.29% <0.00%> (-0.76%) 17.00 <0.00> (ø)
...tflix/conductor/dao/postgres/PostgresQueueDAO.java 79.35% <51.92%> (-11.78%) 60.00 <15.00> (+6.00) ⬇️
...x/conductor/dao/postgres/PostgresExecutionDAO.java 71.30% <100.00%> (ø) 115.00 <8.00> (ø)
...ductor/dao/cassandra/CassandraEventHandlerDAO.java 65.78% <0.00%> (-10.53%) 15.00% <0.00%> (ø%)
.../conductor/dao/cassandra/CassandraMetadataDAO.java 53.88% <0.00%> (-4.15%) 24.00% <0.00%> (ø%)
...lix/conductor/core/execution/WorkflowExecutor.java 80.19% <0.00%> (+0.51%) 182.00% <0.00%> (+1.00%)
...onductor/core/execution/WorkflowRepairService.java 70.27% <0.00%> (+1.69%) 13.00% <0.00%> (+5.00%)
...e/execution/tasks/SystemTaskWorkerCoordinator.java 83.33% <0.00%> (+2.08%) 18.00% <0.00%> (+1.00%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b51ce0a...41cc8db. Read the comment docs.

@marosmars marosmars force-pushed the postgres-improvements branch 3 times, most recently from a6d0f31 to a839013 Compare October 28, 2020 14:20
by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense in order to reduce conflicts/deadlocks in DB.

Most important case is when workers poll for work, they are not
interested in tasks that are locked (currently being updated) so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.

This increases the performance of postgres dao

Additional improvements:

+ move thread.sleep out of DB transaction when tasks are being polled
with timeout parameter

+ add serialization_error to the list of causes triggerring TX retry
(this is happening under heavy load)

+ fix processUnack condition: it used to do the opposite. This has been
also fixed in the meantime by u447 <rick.fishman@bcbsfl.com>

+ add a performance test. This test can be executed manually, but should
not be automated

+ add retries to containsMessage method

Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>
@aravindanr
Copy link
Collaborator

@rickfish Can you please review this?

@rickfish
Copy link
Contributor

rickfish commented Nov 5, 2020

@marosmars, this looks good and I am glad you are using the lock/share attributes. I wanted to do that but it is in the guts of the queue dao. we have so much volume and so many polling operations that I am a little scared to introduce it into high-volume environments where we might miss race conditions and get killed in production.

In any case, it looks good on the surface but I will have to dig into it heavily and actually deploy to one of our lower environments that has significant volume. I can't do that until after next week because we are all working on a production deployment on the 13th of the month.

One question: why the double query for LOCK_TASKS in the processUnacks methods?

@marosmars
Copy link
Contributor Author

@rickfish You can use the PerformanceTest (https://github.com/Netflix/conductor/pull/1940/files#diff-8a372fc5d838b1a09cac3814bb9cb8df1819e64b726eb2cd2988743b199b35f8) in this PR to stress test the Queue Dao. It is not production env, but it can simulate a lot of load with multiple producers/consumers and thus approximate your production env. I was using that while developing this PR.

The test by default spawns 4 producer threads, 8 consumer threads (workers) and expects a set number of messages to be pushed into the queue and successfully processed. In addition, there's a few threads invoking queueDetail() and a few additional threads invoking processUnacks() to put additional load on the dao. The test is a unit test, but expects a real PSQL to be deployed manually on the host.

In addition to the attached performance test (testing just the Queue Dao) I was running a performance test executing workflows against a full conductor + psql + 8 worker nodes setup. This external test also behaved well and the performance was higher with this PR applied.

So I totally understand your concerns, but the testing mentioned above should increase confidence in this PR a bit

Note: I was actually running into problems in the queue dao without this PR when executing that performance test ... e.g. https://github.com/Netflix/conductor/pull/1940/files#diff-fab744a4a59813bc19b455dbc4f34e34e3657655dde025e0177f9621dfca72f4

One question: why the double query for LOCK_TASKS in the processUnacks methods?

That's a mistake, thanks for pointing it out. Will remove the double query.

@marosmars
Copy link
Contributor Author

@rickfish I'd be happy to wait until you test it in your production env. no rush and thx for review

That was a mistake in the original commit, no need to do the same thing twice
@rickfish
Copy link
Contributor

@marosmars these code changes have been in our 'test' environment for a week with no issues, now putting it in our high-volume 'stage' environment which processes approximately 100,000 workflows and 250,000 tasks per day. We have 4 REST service pods and 2 sweeper pods in that environment so it will exercise race conditions due to row locking. If there are any issues with the new code they should surface in that environment in the next week. If no issues by 11/23 I will give the thumbs up from my point of view. Thanks so much for the code - it should help with some of the unexplainable issues that happen from time to time due to no row locking being in place.

@rickfish
Copy link
Contributor

@marosmars no issues in our stage environment. I think it is safe to go ahead with this. Thanks so much for this.

@marosmars
Copy link
Contributor Author

@rickfish thank you for review and testing. @apanicker-nflx can we get this in ?

@apanicker-nflx
Copy link
Collaborator

@marosmars Thanks for the contribution and @rickfish thanks for reviewing and testing this out.

@apanicker-nflx apanicker-nflx merged commit ea9ad82 into Netflix:dev Dec 1, 2020
@marosmars marosmars deleted the postgres-improvements branch December 1, 2020 14:53
TwoUnderscorez pushed a commit to TwoUnderscorez/conductor that referenced this pull request Jul 23, 2021
* Improve postgresql DAO performance

by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense in order to reduce conflicts/deadlocks in DB.

Most important case is when workers poll for work, they are not
interested in tasks that are locked (currently being updated) so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.

This increases the performance of postgres dao

Additional improvements:

+ move thread.sleep out of DB transaction when tasks are being polled
with timeout parameter

+ add serialization_error to the list of causes triggerring TX retry
(this is happening under heavy load)

+ fix processUnack condition: it used to do the opposite. This has been
also fixed in the meantime by u447 <rick.fishman@bcbsfl.com>

+ add a performance test. This test can be executed manually, but should
not be automated

+ add retries to containsMessage method

Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>

* Remove double query from processUnacks

That was a mistake in the original commit, no need to do the same thing twice
pmchung pushed a commit to routific/conductor that referenced this pull request Sep 6, 2023
* Improve postgresql DAO performance

by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense in order to reduce conflicts/deadlocks in DB.

Most important case is when workers poll for work, they are not
interested in tasks that are locked (currently being updated) so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.

This increases the performance of postgres dao

Additional improvements:

+ move thread.sleep out of DB transaction when tasks are being polled
with timeout parameter

+ add serialization_error to the list of causes triggerring TX retry
(this is happening under heavy load)

+ fix processUnack condition: it used to do the opposite. This has been
also fixed in the meantime by u447 <rick.fishman@bcbsfl.com>

+ add a performance test. This test can be executed manually, but should
not be automated

+ add retries to containsMessage method

Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>

* Remove double query from processUnacks

That was a mistake in the original commit, no need to do the same thing twice
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants