This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

MySQL queues: reduce the impact of racing by popping the queue 1 by 1 #1355

Merged · 1 commit · Nov 17, 2020

Conversation

@ggrekhov (Contributor) commented Oct 15, 2019

As I understand it, the current implementation is all-or-nothing only because there is no way to get the updated row IDs from a MySQL response. The only consistent behavior is therefore to cancel the partial update entirely when it conflicts with another worker.

But this hurts badly when there is a huge queue of tasks and many workers polling at once: they get stuck in deadlocks trying to `SET popped = true` for whole batches of messages.

With this change, messages are marked as popped one by one, row by row, and only the successfully popped ones are returned.

EDIT
Closes #577
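The one-by-one approach can be sketched with a small in-memory model. This is not the actual MySQLQueueDAO code and all names here are hypothetical; the key idea it illustrates is the per-row conditional update (`UPDATE queue_message SET popped = true WHERE message_id = ? AND popped = false`), which each worker either wins or silently skips, so no batch ever conflicts.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class OneByOnePopSketch {
    // Hypothetical in-memory stand-in for the queue_message table:
    // each message id maps to a "popped" flag.
    static final Map<String, AtomicBoolean> queue = new LinkedHashMap<>();

    // Mirrors: UPDATE queue_message SET popped = true
    //          WHERE message_id = ? AND popped = false
    static boolean tryPop(String messageId) {
        AtomicBoolean popped = queue.get(messageId);
        return popped != null && popped.compareAndSet(false, true);
    }

    // Pop peeked messages one by one, keeping only the rows we won.
    static List<String> popOneByOne(List<String> peeked, int count) {
        List<String> claimed = new ArrayList<>();
        for (String id : peeked) {
            if (claimed.size() == count) break;
            if (tryPop(id)) claimed.add(id); // lost races are simply skipped
        }
        return claimed;
    }

    public static void main(String[] args) {
        for (String id : List.of("m1", "m2", "m3"))
            queue.put(id, new AtomicBoolean(false));
        // Two workers peeked the same three messages.
        List<String> a = popOneByOne(List.of("m1", "m2", "m3"), 3);
        List<String> b = popOneByOne(List.of("m1", "m2", "m3"), 3);
        System.out.println(a); // worker A wins every row
        System.out.println(b); // worker B gets none, but no exception, no deadlock
    }
}
```

The trade-off, discussed further down in this thread, is that a caller may receive fewer messages than requested.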

@codecov bot commented Oct 15, 2019

Codecov Report

Merging #1355 (be29a58) into dev (9d409e4) will increase coverage by 0.01%.
The diff coverage is 85.71%.


@@             Coverage Diff              @@
##                dev    #1355      +/-   ##
============================================
+ Coverage     64.30%   64.31%   +0.01%     
- Complexity     2846     2847       +1     
============================================
  Files           241      241              
  Lines         14317    14316       -1     
  Branches       1409     1410       +1     
============================================
+ Hits           9206     9208       +2     
+ Misses         4330     4327       -3     
  Partials        781      781              
Impacted Files                                          Coverage Δ                Complexity Δ
...com/netflix/conductor/dao/mysql/MySQLQueueDAO.java   84.21% <85.71%> (+2.47%)  46.00 <1.00> (+1.00)



@coveralls commented

Pull Request Test Coverage Report for Build 3229

  • 7 of 7 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 70.195%

Totals
  • Change from base Build 3221: +0.02%
  • Covered Lines: 10266
  • Relevant Lines: 14625


@apanicker-nflx apanicker-nflx changed the base branch from master to dev October 29, 2019 00:19
@apanicker-nflx (Collaborator)

@s50600822 @jvemugunta @mashurex As primary contributors to the MySQL module, we would appreciate it if you could help us out with reviewing this pull request.

@mashurex (Contributor) commented Nov 7, 2019

@ggrekhov @s50600822 @kishorebanala

Here's what happens and why it's trying to do them all at once:

  1. When pop gets called, it uses peekMessages to read messages from the queue, and it keeps peeking until count has been fulfilled or the timeout has been reached.
  2. Once count has been fulfilled, it wants to pop exactly what it peeked so there is no mismatch between the two sets.
  3. The ApplicationException is thrown when we can't pop exactly the messages we collected during the peek loop.
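The three steps above can be sketched with an in-memory model (hypothetical names; the real DAO issues SQL against the queue table). A second worker that peeks the same rows before we pop them makes the batch update change fewer rows than expected, which is the failure path described in step 3:

```java
import java.util.*;

public class AllOrNothingPopSketch {
    // Hypothetical in-memory stand-in for the queue table: id -> popped flag.
    static final Map<String, Boolean> table = new LinkedHashMap<>();

    // Step 1: read up to `count` not-yet-popped messages.
    static List<String> peekMessages(int count) {
        List<String> out = new ArrayList<>();
        for (var e : table.entrySet())
            if (!e.getValue() && out.size() < count) out.add(e.getKey());
        return out;
    }

    // Batch UPDATE ... WHERE id IN (...) AND popped = false; returns rows changed.
    static int batchPop(List<String> ids) {
        int updated = 0;
        for (String id : ids)
            if (Boolean.FALSE.equals(table.get(id))) {
                table.put(id, true);
                updated++;
            }
        return updated;
    }

    // Steps 2-3: insist on popping exactly the peeked set, or fail.
    static List<String> pop(int count) {
        List<String> peeked = peekMessages(count);
        if (batchPop(peeked) != peeked.size())
            throw new RuntimeException("could not pop all peeked messages");
        return peeked;
    }

    public static void main(String[] args) {
        table.put("m1", false);
        table.put("m2", false);
        List<String> stale = peekMessages(2); // worker B peeks m1, m2 ...
        pop(2);                               // ... but worker A pops them first
        int updated = batchPop(stale);        // worker B's batch now changes 0 rows
        System.out.println(updated);          // 0 != 2 -> the exception path above
    }
}
```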

In a highly concurrent environment I can see why this leads to issues. The most satisfactory solution is probably not to pop each message individually after collecting them, as that could lead to data consistency issues and (probably) the same task being handled twice.

I think a better solution would be to implement a time-based visibility column that would let us collect a task and safely mark it as 'invisible' in the queue for some window of time (like the specified timeout parameter).

Then we could implement a more efficient batch pop group of statements that won't introduce the deadlocks.
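As a rough illustration of the visibility idea (this is only a sketch under assumed names, not a proposal for the actual schema or DAO): each row would carry an `invisible_until` timestamp, claiming a message pushes that timestamp into the future, and an unacknowledged message automatically reappears once the window expires.

```java
import java.util.*;

public class VisibilityTimeoutSketch {
    // Hypothetical schema addition: message id -> invisible_until timestamp (ms).
    static final Map<String, Long> invisibleUntil = new LinkedHashMap<>();

    // Claim messages by pushing invisible_until into the future, roughly:
    // UPDATE queue_message SET invisible_until = NOW() + ?
    //  WHERE invisible_until <= NOW() LIMIT ?   (sketch only)
    static List<String> claim(int count, long now, long timeoutMs) {
        List<String> claimed = new ArrayList<>();
        for (var e : invisibleUntil.entrySet()) {
            if (claimed.size() == count) break;
            if (e.getValue() <= now) {
                e.setValue(now + timeoutMs); // hidden for the timeout window
                claimed.add(e.getKey());
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        invisibleUntil.put("m1", 0L);
        invisibleUntil.put("m2", 0L);
        long now = 1_000L;
        System.out.println(claim(2, now, 5_000));         // both become invisible
        System.out.println(claim(2, now + 1, 5_000));     // nothing while hidden
        System.out.println(claim(2, now + 6_000, 5_000)); // reappear if never acked
    }
}
```

A batch claim like this never throws on contention: losers of the race simply see fewer visible rows.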

EDIT
Alternatively, we could actually pop while collecting messages and then put them back if necessary. While this approach would probably be easier to implement efficiently than the visibility approach, it could lead to transient message loss if the popped status is never reverted due to an exception or failure of some sort.
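That alternative can be sketched as follows (hypothetical names, in-memory model): pop eagerly while collecting, and revert the claims if the caller cannot use them. The comment in `putBack` marks exactly where the transient-loss risk lives.

```java
import java.util.*;

public class PopThenPutBackSketch {
    // Hypothetical in-memory queue: message id -> popped flag.
    static final Map<String, Boolean> popped = new LinkedHashMap<>();

    // Pop (popped = false -> true) while collecting, instead of peeking first.
    static List<String> popWhileCollecting(int count) {
        List<String> claimed = new ArrayList<>();
        for (var e : popped.entrySet()) {
            if (claimed.size() == count) break;
            if (!e.getValue()) {
                e.setValue(true);
                claimed.add(e.getKey());
            }
        }
        return claimed;
    }

    // Revert claims the caller cannot use. If the worker crashes before this
    // runs, the messages stay popped - the transient-loss risk mentioned above.
    static void putBack(List<String> ids) {
        for (String id : ids) popped.put(id, false);
    }

    public static void main(String[] args) {
        popped.put("m1", false);
        List<String> got = popWhileCollecting(5); // only one message available
        System.out.println(got);
        putBack(got); // e.g. the worker decided it collected too few
        System.out.println(popped.get("m1"));
    }
}
```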

@apanicker-nflx (Collaborator)

@ggrekhov Please make the changes suggested above.

@kishorebanala (Contributor)

Hey @ggrekhov, are you still planning to work on this feature?

@ggrekhov (Contributor, Author)

Hi @kishorebanala. To be honest, I don't understand why my solution is wrong.

@mashurex, you say: this is going to lead to data consistency issues and (probably) the same task being handled twice. Looking at the code now, I cannot see how a message could be handled twice when it can only be popped by one worker. What consistency issues do you mean?

I admit that my implementation introduces slightly new behavior for pop() - it can sometimes return fewer messages than the client asked for. I'm not sure how critical that is for the overall design - you know it much better than I do. But it doesn't seem critical to me: with my knowledge of the system I couldn't find any negative impact, and I do see the positive impact of the change in general, as messages no longer get stuck in the queue forever.

By the way, this PR has been deployed in our production instance for several months already - we haven't experienced any consistency issues in our workflows.

@james-deee (Contributor)

@apanicker-nflx @kishorebanala @ggrekhov @mashurex

Could y'all follow up on @ggrekhov's comment? It looks like the Postgres equivalent was approved and merged: #1741

I'm not sure what difference between the Postgres and MySQL implementations would make this approach acceptable in one but not the other.

@apanicker-nflx (Collaborator)

Merging this because the corresponding change in Postgres was merged and the OSS community seems to agree. Please open an issue if this causes any problems with the MySQL persistence implementation.

@apanicker-nflx apanicker-nflx merged commit 1e3e906 into Netflix:dev Nov 17, 2020