advance ack-level to avoid querying the same (empty) tasks next time #6258

dkrotx · 2024-08-29T15:59:33Z

Every time pollers reach out to matching, we acquire a new lease in
the DB, extending maxReadLevel.
Unfortunately, if no writes are
happening to the task-list, we push maxReadLevel further and
further away from the previous ackLevel (there is nothing to ack!).
At some point, after 1000 restarts of the matching service, we could have
1000 (empty) gettask requests and drain the whole
matching.PersistenceMaxQPS, which causes writes to other [active]
task-lists to be rejected with a "Max QPS reached" error.

We advance the ack-level for task-lists even when zero tasks were read.

This is required to prevent spiky load to DB / hitting rate-limit after cadence-matching restart.

Will test on the staging environment; this is a Draft for now.

We could skip some tasks, causing workflows to get stuck.

Release notes

Documentation Changes

davidporter-id-au

Excellent find

davidporter-id-au · 2024-08-29T22:28:31Z

As per your description, it'd be good to verify via a deploy check and to ensure the tests are passing. But on the face of it, nothing jumps out at me as dangerous as far as I can immediately see.

taylanisikdemir · 2024-09-03T03:23:54Z

service/matching/tasklist/task_reader.go

-					tr.taskAckManager.SetReadLevel(readLevel)
+					// even though we didn't handle any tasks, we want to advance the ack-level
+					// to avoid needless querying database the next time
+					tr.taskAckManager.SetAckLevel(readLevel)


when a new task is created at this point, is it guaranteed it is going to have an id > readLevel?

It is. That's how task-ID allocation works in taskWriter.
But I also changed the diff to make this very explicit, and added a test (at the end of it I also check for this, for the sake of sanity).

The previous fix did not work for cases when we already had something un-acked, but read no tasks in the last batch. Tests found this, that's cool!

codecov · 2024-09-04T14:17:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.08%. Comparing base (1b02d78) to head (799f951).
Report is 7 commits behind head on master.

Additional details and impacted files

| Files with missing lines | Coverage Δ | |
|---|---|---|
| service/matching/tasklist/task_reader.go | `73.60% <100.00%> (+0.74%)` | ⬆️ |

... and 12 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
davidporter-id-au approved these changes Aug 29, 2024

taylanisikdemir reviewed Sep 3, 2024

More correct fix of when we can advance ackLevel

7a049f0

The previous fix did not work for cases when we already had something un-acked, but read no tasks in the last batch. Tests found this, that's cool!

Added a test which targets the change

3fef206

dkrotx marked this pull request as ready for review September 5, 2024 15:48

dkrotx requested review from Shaddoll, neil-xie, Groxx, shijiesheng, agautam478, jakobht, 3vilhamster, sankari165 and demirkayaender as code owners September 5, 2024 15:48

taylanisikdemir approved these changes Sep 5, 2024

Extend awaitCondition's time 1s->10s to prevent flakiness

799f951

Also fix comments which apparently were copy-pasted

dkrotx merged commit 63a13f5 into cadence-workflow:master Sep 6, 2024
20 checks passed
