Database transaction get_unread_event_push_actions_by_room churns database CPU #11887
Comments
We did some investigation into this today and it was due to a room with a high rate of events but with users with very old read receipts, which caused their count queries to cover a very large range of rows. We pretty much did a:

`SELECT count(stream_ordering), user_id FROM event_push_actions WHERE room_id = '...' GROUP BY user_id;`

This showed a few users with ~100k pending entries while most users had far fewer. This is because those few users never read that room (and thus never updated their read receipt). A short term fix was to delete the rows for those users, which will cause incorrect unread message counts for those rooms. A long term fix is probably to make the notification count calculation incremental instead of counting over all pending messages, which should make it robust against users not reading rooms.
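As a rough illustration of the incremental approach suggested above (a minimal sketch with hypothetical names, not Synapse's actual schema or API): keep a running per-(room, user) count that is bumped as push actions are created and reset when a receipt arrives, so reads never have to count a huge backlog of rows.

```python
from collections import defaultdict


class UnreadCounter:
    """Minimal sketch of an incremental unread-notification count."""

    def __init__(self):
        # (room_id, user_id) -> number of unread notifying events
        self._counts = defaultdict(int)

    def on_push_action(self, room_id, user_id):
        # Called when a new event generates a push action for this user:
        # bump the counter instead of leaving a row to be counted later.
        self._counts[(room_id, user_id)] += 1

    def on_read_receipt(self, room_id, user_id):
        # Called when the user reads the room. A real implementation would
        # need to handle receipts that are not at the head of the room, but
        # for the sketch we assume the receipt covers everything so far.
        self._counts[(room_id, user_id)] = 0

    def get_unread_count(self, room_id, user_id):
        return self._counts[(room_id, user_id)]
```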
#11893 was also found while investigating this, but we do not believe it to be directly causing this issue.
Vaguely related: #5569
My understanding was that we moved push actions to a summary table after 24 hours? Though that doesn't happen for "highlight" push actions, IIRC. Next time this goes off we should figure out why the query takes ages to run: are there lots of push actions in the last 24 hours, or are they highlights? Sounds like last time the table was filled with push actions for the affected user and so it got table scanned.
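A hedged sketch of the kind of diagnostic that would answer that question, assuming the `event_push_actions` columns (`room_id`, `user_id`, `highlight`, `stream_ordering`) and a plain DB-API cursor with `?` placeholders; `rotation_cutoff` stands for whatever stream ordering the rotation job has already summarised up to:

```python
def breakdown_push_actions(cur, room_id, rotation_cutoff):
    """Per-user breakdown of pending push actions for one room.

    Shows how many rows each user has, how many are highlights, and how
    many are newer than the point the rotation job has summarised up to.
    """
    cur.execute(
        """
        SELECT
            user_id,
            COUNT(*) AS total,
            SUM(CASE WHEN highlight = 1 THEN 1 ELSE 0 END) AS highlights,
            SUM(CASE WHEN stream_ordering > ? THEN 1 ELSE 0 END) AS newer_than_cutoff
        FROM event_push_actions
        WHERE room_id = ?
        GROUP BY user_id
        ORDER BY total DESC
        """,
        (rotation_cutoff, room_id),
    )
    return cur.fetchall()
```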
Maybe a solution is to aggregate more aggressively if there are lots of push actions, e.g. move rows to the push actions summary table if there are more than 100 of them. The reason not to do that is that the counts won't be as accurate, but that hardly matters when the count is in the hundreds.
Also, it looks like we will rotate at a maximum of ~3,333 events/s (a limit of 10,000 rows followed by a delay of 3s, i.e. 10,000 / 3 ≈ 3,333).
Just lots of regular unread messages.
Sounds like we need to aggregate more aggressively then.
This is where we'd need to add the additional rotation logic: `synapse/storage/databases/main/event_push_actions.py`, lines 749 to 756 (at 0d17357).
Right now, it works by rotating all push actions that have a stream ordering older than one day, doing a max of 10,000 rows per iteration. I think we can add a second loop where we check if there are lots of push actions for a given room, and rotate old rows until there is a max of e.g. 100 rows left in the table. The counting logic would then instead use the old stream ordering that we store in the summary table.
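A rough sketch of that second loop (illustrative only; the function names, the `?` placeholders and the plain DB-API cursor are assumptions, not actual Synapse code). It only plans which rows to rotate; the existing rotation machinery would then fold everything up to each cutoff into `event_push_summary`. Note that, as the next comment points out, the per-room `GROUP BY` here is exactly the sort of thing that ends up table scanning.

```python
MAX_UNSUMMARISED_ROWS_PER_ROOM = 100


def plan_extra_rotation(cur, max_rows=MAX_UNSUMMARISED_ROWS_PER_ROOM):
    """Return (room_id, cutoff_stream_ordering) pairs for noisy rooms.

    Rotating all rows with stream_ordering <= cutoff for each room would
    leave at most `max_rows` recent rows behind, regardless of their age.
    """
    cur.execute(
        """
        SELECT room_id
        FROM event_push_actions
        GROUP BY room_id
        HAVING COUNT(*) > ?
        """,
        (max_rows,),
    )
    noisy_rooms = [row[0] for row in cur.fetchall()]

    plan = []
    for room_id in noisy_rooms:
        # Find the stream ordering of the (max_rows + 1)-th newest row;
        # rotating everything at or below it leaves max_rows newer rows.
        cur.execute(
            """
            SELECT stream_ordering
            FROM event_push_actions
            WHERE room_id = ?
            ORDER BY stream_ordering DESC
            LIMIT 1 OFFSET ?
            """,
            (room_id, max_rows),
        )
        row = cur.fetchone()
        if row is not None:
            plan.append((room_id, row[0]))
    return plan
```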
OK, so my suggestion in the previous comment doesn't work, as it results in table scanning.
New plan: have a new table called […]. To keep it updated: […]
Fixes #11887, hopefully. The core change here is that `event_push_summary` now holds a summary of counts up until a much more recent point, meaning that the range of rows we need to count in `event_push_actions` is much smaller. This needs two major changes:

1. When we get a receipt we need to recalculate `event_push_summary` rather than just delete it.
2. The logic for deleting `event_push_actions` is now divorced from calculating `event_push_summary`.

In future it would be good to calculate `event_push_summary` while we persist a new event (it should just be a case of adding one to the relevant rows in `event_push_summary`), as that will further simplify the get-counts logic and remove the need for us to periodically update `event_push_summary` in a background job.
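A hedged sketch of change 1 (recalculating rather than deleting the summary on receipt). This is illustrative only, not the PR's code; it assumes `event_push_summary(room_id, user_id, notif_count, stream_ordering)` holds the unread count up to `stream_ordering`, that the live query adds a count of `event_push_actions` rows newer than that, and a DB-API cursor with `?` placeholders:

```python
def recalc_summary_on_receipt(cur, room_id, user_id, receipt_stream_ordering):
    """Update event_push_summary for a new read receipt instead of deleting it."""
    cur.execute(
        "SELECT notif_count, stream_ordering FROM event_push_summary"
        " WHERE room_id = ? AND user_id = ?",
        (room_id, user_id),
    )
    row = cur.fetchone()
    if row is None:
        return
    _old_count, summary_pos = row

    if receipt_stream_ordering >= summary_pos:
        # Everything the summary covered has now been read; future live
        # counts start from the receipt position.
        new_count, new_pos = 0, receipt_stream_ordering
    else:
        # The receipt is older than the summary position, so re-count the
        # actions between the receipt and the summary position. This relies
        # on those rows still existing, which is why deleting
        # event_push_actions is divorced from summarisation.
        cur.execute(
            "SELECT COUNT(*) FROM event_push_actions"
            " WHERE room_id = ? AND user_id = ?"
            " AND stream_ordering > ? AND stream_ordering <= ?",
            (room_id, user_id, receipt_stream_ordering, summary_pos),
        )
        (new_count,) = cur.fetchone()
        new_pos = summary_pos

    cur.execute(
        "UPDATE event_push_summary SET notif_count = ?, stream_ordering = ?"
        " WHERE room_id = ? AND user_id = ?",
        (new_count, new_pos, room_id, user_id),
    )
```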
Description
We have a host in EMS with a room that is getting approximately 2 events per second into it. A few users sitting in the room are churning the database CPU due to constant sync, which is causing the synchrotron process to spend a large amount of transaction time in `get_unread_event_push_actions_by_room`. The database query is:
Two of the users in the room are causing an even 50/50 split of the database load.
Database metrics showing the query being mostly CPU time.
More context in https://matrix.to/#/!ixSKkmjfDhyFWsKSEY:federator.dev/$ZFJ7IDlk0gt_l8ht9hEK3DZVg5mhucbaEtM4qCKPiQk?via=matrix.org&via=vector.modular.im
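The exact query isn't reproduced above, but as a rough illustration of what this code path has to do (a simplified sketch assuming the `event_push_actions` schema and a DB-API cursor with `?` placeholders, not the SQL Synapse actually runs): it counts the push action rows for one user in one room that are newer than their last read receipt, so a very old receipt means a very large range to count on every sync.

```python
def count_unread_sketch(cur, room_id, user_id, receipt_stream_ordering):
    """Count push actions newer than the user's last read receipt."""
    cur.execute(
        """
        SELECT COUNT(*) FROM event_push_actions
        WHERE room_id = ? AND user_id = ? AND stream_ordering > ?
        """,
        (room_id, user_id, receipt_stream_ordering),
    )
    (count,) = cur.fetchone()
    return count
```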
Version information
Homeserver: EMS, customer in context link
Version: v1.51.0
Install method: Docker, official images
Platform: EMS