-
Notifications
You must be signed in to change notification settings - Fork 802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
advance ack-level to avoid querying the same (empty) tasks next time #6258
advance ack-level to avoid querying the same (empty) tasks next time #6258
Conversation
Every time pollers are reaching out matching we acuire a new lease in the DB extending maxReadLevel. Unfortunately, if there is no writes happening to the task-list, we are pushing maxReadLevel further and further away from the previous ackLevel (there is nothing to ack!). At some moment after 1000 restarts of matching service we could have 1000 (empty) gettask requests and drain the whole matching.PersistenceMaxQPS which will reject writes to other [active] task-lists with "Max QPS reached" error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent find
As per your description, it' be good to verify via a deploy check and to ensure the tests are passing. But on the face of it nothing jumps out at me as a dangerous as far as I can immediately see. |
tr.taskAckManager.SetReadLevel(readLevel) | ||
// even though we didn't handle any tasks, we want to advance the ack-level | ||
// to avoid needless querying database the next time | ||
tr.taskAckManager.SetAckLevel(readLevel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when a new task is created at this point, is it guaranteed it is going to have an id > readLevel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. That's how task-ID allocation works in taskWriter.
But I also changed the diff for this to be very explicit, and added a test (and the end of it I also check for this for sake of sanity).
The previous fix did not work for cases when we already had something un-acked, but read no tasks in the last batch. Tests found this, that's cool!
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
... and 12 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
Also fix comments which apparently were copy-pasted
Every time pollers are reaching out matching we acuire a new lease in
the DB extending maxReadLevel.
Unfortunately, if there is no writes
happening to the task-list, we are pushing maxReadLevel further and
further away from the previous ackLevel (there is nothing to ack!).
At some moment after 1000 restarts of matching service we could have
1000 (empty) gettask requests and drain the whole
matching.PersistenceMaxQPS which will reject writes to other [active]
task-lists with "Max QPS reached" error.
We are advancing ack-level for task-lists even when read zero tasks.
This is required to prevent spiky load to DB / hitting rate-limit after cadence-matching restart.
Will do on staging environment, it is Draft by now.
We could skip some tasks making workflows to stuck
Release notes
Documentation Changes