Codecov Report
@@             Coverage Diff              @@
##                dev     #1940      +/-  ##
=============================================
- Coverage     65.57%    65.45%   -0.12%
- Complexity    3928      3941      +13
=============================================
  Files          301       301
  Lines        18527     18557      +30
  Branches      1685      1694       +9
=============================================
- Hits         12149     12147       -2
- Misses        5537      5568      +31
- Partials       841       842       +1
Continue to review full report at Codecov.
Force-pushed from a6d0f31 to a839013
Force-pushed from a839013 to c15acf2
@rickfish Can you please review this?
@marosmars, this looks good and I am glad you are using the lock/share attributes. I wanted to do that, but it is in the guts of the queue DAO. We have so much volume and so many polling operations that I am a little scared to introduce it into high-volume environments where we might miss race conditions and get killed in production. In any case, it looks good on the surface, but I will have to dig into it heavily and actually deploy it to one of our lower environments that has significant volume. I can't do that until after next week because we are all working on a production deployment on the 13th of the month. One question: why the double query for LOCK_TASKS in the processUnacks methods?
@rickfish You can use the PerformanceTest (https://github.com/Netflix/conductor/pull/1940/files#diff-8a372fc5d838b1a09cac3814bb9cb8df1819e64b726eb2cd2988743b199b35f8) in this PR to stress test the queue DAO. It is not a production environment, but it can simulate a lot of load with multiple producers/consumers and thus approximate your production environment. I was using it while developing this PR. By default the test spawns 4 producer threads and 8 consumer threads (workers) and expects a set number of messages to be pushed into the queue and successfully processed. In addition, there are a few threads invoking queueDetail() and a few more invoking processUnacks() to put extra load on the DAO. The test is a unit test, but it expects a real PostgreSQL instance to be deployed manually on the host.

In addition to the attached performance test (which exercises just the queue DAO), I was running a performance test executing workflows against a full conductor + psql + 8 worker nodes setup. That external test also behaved well, and performance was higher with this PR applied. So I totally understand your concerns, but the testing mentioned above should increase confidence in this PR a bit.

Note: I was actually running into problems in the queue DAO without this PR when executing that performance test, e.g. https://github.com/Netflix/conductor/pull/1940/files#diff-fab744a4a59813bc19b455dbc4f34e34e3657655dde025e0177f9621dfca72f4
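For readers who want a feel for the shape of such a harness, below is a minimal, self-contained sketch in the same spirit; it is not the PerformanceTest from this PR. A few producer threads push a fixed budget of messages while more consumer threads poll and count them, and the run fails if the budget is not processed before a deadline. An in-memory BlockingQueue stands in for the PostgreSQL-backed queue DAO, and all class and method names here are hypothetical.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Minimal load-test sketch: N producers push messages, M consumers poll them.
// A BlockingQueue stands in for the real queue DAO; in an actual test the
// producers/consumers would call the PostgreSQL-backed DAO instead.
public class QueueLoadSketch {
    private static final int PRODUCERS = 4;
    private static final int CONSUMERS = 8;
    private static final int TOTAL_MESSAGES = 10_000;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicLong produced = new AtomicLong();
        AtomicLong consumed = new AtomicLong();
        CountDownLatch done = new CountDownLatch(TOTAL_MESSAGES);
        ExecutorService pool = Executors.newCachedThreadPool();

        // Producers: push messages until the global budget is exhausted.
        for (int i = 0; i < PRODUCERS; i++) {
            pool.submit(() -> {
                long id;
                while ((id = produced.getAndIncrement()) < TOTAL_MESSAGES) {
                    queue.offer("msg-" + id);
                }
            });
        }

        // Consumers (workers): poll with a timeout and "process" each message.
        for (int i = 0; i < CONSUMERS; i++) {
            pool.submit(() -> {
                try {
                    while (consumed.get() < TOTAL_MESSAGES) {
                        String msg = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (msg != null) {
                            consumed.incrementAndGet();
                            done.countDown();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The run passes only if every produced message is eventually processed.
        if (!done.await(5, TimeUnit.MINUTES)) {
            throw new AssertionError("not all messages were processed in time");
        }
        pool.shutdownNow();
        System.out.println("processed " + consumed.get() + " messages");
    }
}
```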
That's a mistake, thanks for pointing it out. Will remove the double query.
@rickfish I'd be happy to wait until you test it in your production environment. No rush, and thanks for the review.
That was a mistake in the original commit, no need to do the same thing twice
@marosmars these code changes have been in our 'test' environment for a week with no issues. We are now putting them in our high-volume 'stage' environment, which processes approximately 100,000 workflows and 250,000 tasks per day. We have 4 REST service pods and 2 sweeper pods in that environment, so it will exercise the race conditions around row locking. If there are any issues with the new code, they should surface in that environment in the next week. If there are no issues by 11/23, I will give the thumbs up from my point of view. Thanks so much for the code - it should help with some of the unexplainable issues that happen from time to time due to no row locking being in place.
@marosmars No issues in our stage environment. I think it is safe to go ahead with this. Thanks so much for this.
@rickfish Thank you for the review and testing. @apanicker-nflx can we get this in?
@marosmars Thanks for the contribution and @rickfish thanks for reviewing and testing this out.
* Improve postgresql DAO performance by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries where it makes sense, in order to reduce conflicts/deadlocks in the DB. The most important case is when workers poll for work: they are not interested in tasks that are locked (currently being updated), so they can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks. This increases the performance of the postgres DAO.
  Additional improvements:
  + move Thread.sleep out of the DB transaction when tasks are being polled with the timeout parameter
  + add serialization_error to the list of causes triggering TX retry (this happens under heavy load)
  + fix the processUnack condition: it used to do the opposite. This has also been fixed in the meantime by u447 <rick.fishman@bcbsfl.com>
  + add a performance test. This test can be executed manually, but should not be automated
  + add retries to the containsMessage method
  Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>
* Remove double query from processUnacks
  That was a mistake in the original commit, no need to do the same thing twice
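One of the improvements listed above is adding serialization_error to the causes that trigger a transaction retry. Purely as an illustration of that idea (this is not the code from the PR, and the class and method names below are made up), a retry wrapper for PostgreSQL typically inspects the SQLSTATE and retries only on 40001 (serialization_failure):

```java
import java.sql.SQLException;
import java.util.function.Supplier;

// Illustrative retry helper: re-run a transactional operation when PostgreSQL
// reports a serialization failure (SQLSTATE 40001), which can occur under heavy
// concurrent load. Names are hypothetical, not Conductor's actual classes.
public final class SerializationRetry {

    private static final String SERIALIZATION_FAILURE = "40001";

    public static <T> T withRetries(Supplier<T> txOperation, int maxRetries) {
        SQLException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return txOperation.get();
            } catch (RuntimeException e) {
                SQLException sql = findSqlException(e);
                if (sql == null || !SERIALIZATION_FAILURE.equals(sql.getSQLState())) {
                    throw e; // not a serialization conflict: do not retry
                }
                last = sql;
            }
        }
        throw new IllegalStateException("transaction failed after retries", last);
    }

    // Walk the cause chain looking for the underlying SQLException.
    private static SQLException findSqlException(Throwable t) {
        while (t != null) {
            if (t instanceof SQLException) {
                return (SQLException) t;
            }
            t = t.getCause();
        }
        return null;
    }
}
```

Under REPEATABLE READ or SERIALIZABLE isolation, PostgreSQL aborts conflicting transactions with SQLSTATE 40001, and the documented remedy is to retry the whole transaction, which is why this condition is worth treating as retryable rather than fatal.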
Improve postgresql DAO performance
by adding proper FOR SHARE / FOR UPDATE / SKIP LOCKED locks to DB queries
where it makes sense, in order to reduce conflicts/deadlocks in the DB.
The most important case is when workers poll for work: they are not
interested in tasks that are locked (currently being updated), so they
can leverage 'SKIP LOCKED' to prevent DB locks and tx deadlocks.
This increases the performance of the postgres DAO.
Additional improvements:
+ move Thread.sleep out of the DB transaction when tasks are being polled
  with the timeout parameter
+ add serialization_error to the list of causes triggering TX retry
  (this happens under heavy load)
+ fix the processUnack condition: it used to do the opposite. This has
  also been fixed in the meantime by u447 <rick.fishman@bcbsfl.com>
+ add a performance test. This test can be executed manually, but should
  not be automated
+ add retries to the containsMessage method
Signed-off-by: Maros Marsalek <mmarsalek@frinx.io>
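To make the main change and the Thread.sleep point concrete, here is a minimal JDBC sketch of a worker poll in the same spirit: select unlocked rows with FOR UPDATE SKIP LOCKED inside a short transaction, and keep the poll-timeout sleep outside the transaction. The table and column names (queue_message, queue_name, message_id, popped, created_on) are hypothetical stand-ins rather than Conductor's actual schema, and the class is not part of this PR.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Illustrative polling sketch (hypothetical schema, not Conductor's actual DAO):
// rows already locked by another worker are skipped instead of waited on, which
// avoids lock waits and transaction deadlocks when many workers poll at once.
public class SkipLockedPollSketch {

    private final DataSource dataSource;

    public SkipLockedPollSketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<String> poll(String queueName, int count, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> messages = pollOnce(queueName, count);
        // The wait happens OUTSIDE the DB transaction, so no locks are held
        // while the worker sleeps waiting for work to appear.
        while (messages.isEmpty() && System.currentTimeMillis() < deadline) {
            Thread.sleep(100);
            messages = pollOnce(queueName, count);
        }
        return messages;
    }

    private List<String> pollOnce(String queueName, int count) throws SQLException {
        String sql =
            "SELECT message_id FROM queue_message "
                + "WHERE queue_name = ? AND popped = false "
                + "ORDER BY created_on LIMIT ? "
                + "FOR UPDATE SKIP LOCKED";
        List<String> ids = new ArrayList<>();
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, queueName);
                ps.setInt(2, count);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString("message_id"));
                    }
                }
                // Mark the selected rows as popped within the same transaction.
                try (PreparedStatement upd = conn.prepareStatement(
                        "UPDATE queue_message SET popped = true WHERE message_id = ?")) {
                    for (String id : ids) {
                        upd.setString(1, id);
                        upd.addBatch();
                    }
                    upd.executeBatch();
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
        return ids;
    }
}
```

Because locked rows are skipped rather than waited on, concurrent workers never block each other on the same messages; each simply picks up the next unlocked rows, which is the behavior the commit message describes for polling workers.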