Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

advance ack-level to avoid querying the same (empty) tasks next time #6258

Merged

Conversation

dkrotx
Copy link
Member

@dkrotx dkrotx commented Aug 29, 2024

Every time pollers are reaching out matching we acuire a new lease in
the DB extending maxReadLevel.
Unfortunately, if there is no writes
happening to the task-list, we are pushing maxReadLevel further and
further away from the previous ackLevel (there is nothing to ack!).
At some moment after 1000 restarts of matching service we could have
1000 (empty) gettask requests and drain the whole
matching.PersistenceMaxQPS which will reject writes to other [active]
task-lists with "Max QPS reached" error.

We are advancing ack-level for task-lists even when read zero tasks.

This is required to prevent spiky load to DB / hitting rate-limit after cadence-matching restart.

Will do on staging environment, it is Draft by now.

We could skip some tasks making workflows to stuck

Release notes

Documentation Changes

Every time pollers are reaching out matching we acuire a new lease in
the DB extending maxReadLevel.
Unfortunately, if there is no writes
happening to the task-list, we are pushing maxReadLevel further and
further away from the previous ackLevel (there is nothing to ack!).
At some moment after 1000 restarts of matching service we could have
1000 (empty) gettask requests and drain the whole
matching.PersistenceMaxQPS which will reject writes to other [active]
task-lists with "Max QPS reached" error.
Copy link
Member

@davidporter-id-au davidporter-id-au left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent find

@davidporter-id-au
Copy link
Member

As per your description, it' be good to verify via a deploy check and to ensure the tests are passing. But on the face of it nothing jumps out at me as a dangerous as far as I can immediately see.

tr.taskAckManager.SetReadLevel(readLevel)
// even though we didn't handle any tasks, we want to advance the ack-level
// to avoid needless querying database the next time
tr.taskAckManager.SetAckLevel(readLevel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when a new task is created at this point, is it guaranteed it is going to have an id > readLevel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. That's how task-ID allocation works in taskWriter.
But I also changed the diff for this to be very explicit, and added a test (and the end of it I also check for this for sake of sanity).

The previous fix did not work for cases when we already had something
un-acked, but read no tasks in the last batch.
Tests found this, that's cool!
Copy link

codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.08%. Comparing base (1b02d78) to head (799f951).
Report is 7 commits behind head on master.

Additional details and impacted files
Files with missing lines Coverage Δ
service/matching/tasklist/task_reader.go 73.60% <100.00%> (+0.74%) ⬆️

... and 12 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b02d78...799f951. Read the comment docs.

Also fix comments which apparently were copy-pasted
@dkrotx dkrotx merged commit 63a13f5 into cadence-workflow:master Sep 6, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants