This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

MySQL queues: reduce the impact of racing by popping the queue 1 by 1 #1355

Merged · 1 commit · Nov 17, 2020

Conversation

@ggrekhov (Contributor) commented Oct 15, 2019

As I understand it, the current implementation is all-or-nothing only because there is no way to get the updated row IDs from a MySQL response. The only consistent behavior is therefore to cancel the partial update entirely when it conflicts with another worker.

But this hurts badly when there is a huge queue of tasks and many workers polling at once: they get stuck in deadlocks trying to `SET popped = true` for whole batches of messages.

With this change, messages are marked as popped one by one, row by row, and only the successfully popped ones are returned.

EDIT
Closes #577
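The one-by-one approach can be sketched with a small in-memory model. This is not the actual MySQLQueueDAO code and all names here are hypothetical; the key idea it illustrates is the per-row conditional update (`UPDATE queue_message SET popped = true WHERE message_id = ? AND popped = false`), which each worker either wins or silently skips, so no batch ever conflicts.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class OneByOnePopSketch {
    // Hypothetical in-memory stand-in for the queue_message table:
    // each message id maps to a "popped" flag.
    static final Map<String, AtomicBoolean> queue = new LinkedHashMap<>();

    // Mirrors: UPDATE queue_message SET popped = true
    //          WHERE message_id = ? AND popped = false
    static boolean tryPop(String messageId) {
        AtomicBoolean popped = queue.get(messageId);
        return popped != null && popped.compareAndSet(false, true);
    }

    // Pop peeked messages one by one, keeping only the rows we won.
    static List<String> popOneByOne(List<String> peeked, int count) {
        List<String> claimed = new ArrayList<>();
        for (String id : peeked) {
            if (claimed.size() == count) break;
            if (tryPop(id)) claimed.add(id); // lost races are simply skipped
        }
        return claimed;
    }

    public static void main(String[] args) {
        for (String id : List.of("m1", "m2", "m3"))
            queue.put(id, new AtomicBoolean(false));
        // Two workers peeked the same three messages.
        List<String> a = popOneByOne(List.of("m1", "m2", "m3"), 3);
        List<String> b = popOneByOne(List.of("m1", "m2", "m3"), 3);
        System.out.println(a); // worker A wins every row
        System.out.println(b); // worker B gets none, but no exception, no deadlock
    }
}
```

The trade-off, discussed further down in this thread, is that a caller may receive fewer messages than requested.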

@codecov bot commented Oct 15, 2019

Codecov Report

Merging #1355 (be29a58) into dev (9d409e4) will increase coverage by 0.01%.
The diff coverage is 85.71%.


@@             Coverage Diff              @@
##                dev    #1355      +/-   ##
============================================
+ Coverage     64.30%   64.31%   +0.01%     
- Complexity     2846     2847       +1     
============================================
  Files           241      241              
  Lines         14317    14316       -1     
  Branches       1409     1410       +1     
============================================
+ Hits           9206     9208       +2     
+ Misses         4330     4327       -3     
  Partials        781      781              
Impacted Files                                          Coverage Δ                Complexity Δ
...com/netflix/conductor/dao/mysql/MySQLQueueDAO.java   84.21% <85.71%> (+2.47%)  46.00 <1.00> (+1.00)



@coveralls commented

Pull Request Test Coverage Report for Build 3229

  • 7 of 7 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 70.195%

Totals
  • Change from base Build 3221: +0.02%
  • Covered Lines: 10266
  • Relevant Lines: 14625


@apanicker-nflx apanicker-nflx changed the base branch from master to dev October 29, 2019 00:19
@apanicker-nflx (Collaborator)

@s50600822 @jvemugunta @mashurex As primary contributors to the MySQL module, we would appreciate it if you could help us out with reviewing this pull request.

@mashurex (Contributor) commented Nov 7, 2019

@ggrekhov @s50600822 @kishorebanala

Here's what happens and why it's trying to do them all at once:

  1. When pop gets called, it uses peekMessages to read messages from the queue, and it keeps peeking until count has been fulfilled or the timeout has been reached.
  2. Once count has been fulfilled, it wants to pop exactly what it peeked so there is no mismatch between the two sets.
  3. The ApplicationException is thrown when we can't pop exactly the messages we collected during the peek loop.
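The three steps above can be sketched with an in-memory model (hypothetical names; the real DAO issues SQL against the queue table). A second worker that peeks the same rows before we pop them makes the batch update change fewer rows than expected, which is the failure path described in step 3:

```java
import java.util.*;

public class AllOrNothingPopSketch {
    // Hypothetical in-memory stand-in for the queue table: id -> popped flag.
    static final Map<String, Boolean> table = new LinkedHashMap<>();

    // Step 1: read up to `count` not-yet-popped messages.
    static List<String> peekMessages(int count) {
        List<String> out = new ArrayList<>();
        for (var e : table.entrySet())
            if (!e.getValue() && out.size() < count) out.add(e.getKey());
        return out;
    }

    // Batch UPDATE ... WHERE id IN (...) AND popped = false; returns rows changed.
    static int batchPop(List<String> ids) {
        int updated = 0;
        for (String id : ids)
            if (Boolean.FALSE.equals(table.get(id))) {
                table.put(id, true);
                updated++;
            }
        return updated;
    }

    // Steps 2-3: insist on popping exactly the peeked set, or fail.
    static List<String> pop(int count) {
        List<String> peeked = peekMessages(count);
        if (batchPop(peeked) != peeked.size())
            throw new RuntimeException("could not pop all peeked messages");
        return peeked;
    }

    public static void main(String[] args) {
        table.put("m1", false);
        table.put("m2", false);
        List<String> stale = peekMessages(2); // worker B peeks m1, m2 ...
        pop(2);                               // ... but worker A pops them first
        int updated = batchPop(stale);        // worker B's batch now changes 0 rows
        System.out.println(updated);          // 0 != 2 -> the exception path above
    }
}
```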

In a highly concurrent environment I can see why this leads to issues. The most satisfactory solution is probably not to pop each message individually after collecting them, as that could lead to data consistency issues and (probably) the same task being handled twice.

I think a better solution would be to implement a time-based visibility column that would let us collect a task and safely mark it as 'invisible' in the queue for some window of time (like the specified timeout parameter).

Then we could implement a more efficient batch pop group of statements that won't introduce the deadlocks.
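As a rough illustration of the visibility idea (this is only a sketch under assumed names, not a proposal for the actual schema or DAO): each row would carry an `invisible_until` timestamp, claiming a message pushes that timestamp into the future, and an unacknowledged message automatically reappears once the window expires.

```java
import java.util.*;

public class VisibilityTimeoutSketch {
    // Hypothetical schema addition: message id -> invisible_until timestamp (ms).
    static final Map<String, Long> invisibleUntil = new LinkedHashMap<>();

    // Claim messages by pushing invisible_until into the future, roughly:
    // UPDATE queue_message SET invisible_until = NOW() + ?
    //  WHERE invisible_until <= NOW() LIMIT ?   (sketch only)
    static List<String> claim(int count, long now, long timeoutMs) {
        List<String> claimed = new ArrayList<>();
        for (var e : invisibleUntil.entrySet()) {
            if (claimed.size() == count) break;
            if (e.getValue() <= now) {
                e.setValue(now + timeoutMs); // hidden for the timeout window
                claimed.add(e.getKey());
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        invisibleUntil.put("m1", 0L);
        invisibleUntil.put("m2", 0L);
        long now = 1_000L;
        System.out.println(claim(2, now, 5_000));         // both become invisible
        System.out.println(claim(2, now + 1, 5_000));     // nothing while hidden
        System.out.println(claim(2, now + 6_000, 5_000)); // reappear if never acked
    }
}
```

A batch claim like this never throws on contention: losers of the race simply see fewer visible rows.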

EDIT
Alternatively, we could actually pop while collecting messages and then put them back if necessary. While this approach would probably be easier to implement efficiently than the visibility approach, it could lead to transient message loss if the popped status is never reverted due to an exception or failure of some sort.
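That alternative can be sketched as follows (hypothetical names, in-memory model): pop eagerly while collecting, and revert the claims if the caller cannot use them. The comment in `putBack` marks exactly where the transient-loss risk lives.

```java
import java.util.*;

public class PopThenPutBackSketch {
    // Hypothetical in-memory queue: message id -> popped flag.
    static final Map<String, Boolean> popped = new LinkedHashMap<>();

    // Pop (popped = false -> true) while collecting, instead of peeking first.
    static List<String> popWhileCollecting(int count) {
        List<String> claimed = new ArrayList<>();
        for (var e : popped.entrySet()) {
            if (claimed.size() == count) break;
            if (!e.getValue()) {
                e.setValue(true);
                claimed.add(e.getKey());
            }
        }
        return claimed;
    }

    // Revert claims the caller cannot use. If the worker crashes before this
    // runs, the messages stay popped - the transient-loss risk mentioned above.
    static void putBack(List<String> ids) {
        for (String id : ids) popped.put(id, false);
    }

    public static void main(String[] args) {
        popped.put("m1", false);
        List<String> got = popWhileCollecting(5); // only one message available
        System.out.println(got);
        putBack(got); // e.g. the worker decided it collected too few
        System.out.println(popped.get("m1"));
    }
}
```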

@apanicker-nflx (Collaborator)

@ggrekhov Please make the changes suggested above.

@kishorebanala (Contributor)

Hey @ggrekhov, are you still planning to work on this feature?

@ggrekhov (Contributor, Author)

Hi @kishorebanala. To be honest, I don't understand why my solution is wrong.

@mashurex, you say: this is going to lead to data consistency issues and (probably) the same task being handled twice. Looking at the code now, I cannot see how a message could be handled twice when it can only be popped by one worker. What consistency issues do you mean?

I admit that my implementation introduces slightly new behavior for pop() - it can sometimes return fewer messages than the client asked for. I'm not sure how critical that is for the overall design - you know it much better than I do. But it doesn't seem critical to me: with my knowledge of the system I couldn't find any negative impact, and I do see the positive impact of the change in general, as messages no longer get stuck in the queue forever.

By the way, this PR has been deployed in our production instance for several months already - we haven't experienced any consistency issues in our workflows.

@james-deee (Contributor)

@apanicker-nflx @kishorebanala @ggrekhov @mashurex

Could y'all follow up on @ggrekhov's comment? It looks like the Postgres equivalent was approved and merged: #1741

I'm not sure what difference between the Postgres and MySQL implementations would make this approach acceptable in one but not the other.

@apanicker-nflx (Collaborator)

Merging this because the corresponding change in Postgres was merged and the OSS community seems to agree. Please open an issue if this causes any problems with the MySQL persistence implementation.

@apanicker-nflx apanicker-nflx merged commit 1e3e906 into Netflix:dev Nov 17, 2020