Some Synapse instances have been hammering their database after v1.66.0 -> v1.68.0 update #13942
Description
Some EMS-hosted Synapse instances are hammering their database after upgrading from v1.66.0 to v1.68.0. The host I'm concentrating on here is ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live (EMS internal host ID; please check with the EMS team for real hostnames).
The offending query is:
SELECT c.state_key FROM current_state_events as c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE c.type = ? AND c.room_id = ? AND membership = ?
/* Sorted by lowest depth first */
ORDER BY e.depth ASC
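To make it easier to see what this query returns, here is a minimal sketch that runs it against an in-memory SQLite database. The two-table schema is a simplified, hypothetical stand-in for Synapse's real tables (which have more columns and indexes, and run on PostgreSQL on these hosts), and the helper names `build_demo_db` and `joined_members_by_depth` are made up for illustration. It says nothing about the performance seen in production; it only shows the shape of the result.

```python
import sqlite3

# Simplified, hypothetical stand-ins for Synapse's tables.
# The real schema has more columns and indexes.
SCHEMA = """
CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    depth INTEGER NOT NULL
);
CREATE TABLE current_state_events (
    event_id TEXT PRIMARY KEY,
    room_id TEXT NOT NULL,
    type TEXT NOT NULL,
    state_key TEXT NOT NULL,
    membership TEXT
);
"""

# The query from this issue, unchanged apart from whitespace.
QUERY = """
SELECT c.state_key FROM current_state_events AS c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE c.type = ? AND c.room_id = ? AND membership = ?
/* Sorted by lowest depth first */
ORDER BY e.depth ASC
"""


def build_demo_db():
    """Create an in-memory database with a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("$a", 5), ("$b", 2), ("$c", 9), ("$d", 1)],
    )
    room = "!room:example.org"
    conn.executemany(
        "INSERT INTO current_state_events VALUES (?, ?, ?, ?, ?)",
        [
            ("$a", room, "m.room.member", "@alice:example.org", "join"),
            ("$b", room, "m.room.member", "@bob:example.org", "join"),
            ("$c", room, "m.room.member", "@carol:example.org", "leave"),
            ("$d", room, "m.room.name", "", None),
        ],
    )
    return conn


def joined_members_by_depth(conn, room_id):
    """Return user IDs joined to room_id, lowest membership-event depth first."""
    rows = conn.execute(QUERY, ("m.room.member", room_id, "join")).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(joined_members_by_depth(build_demo_db(), "!room:example.org"))
```

Each call joins every matching membership row in the room against `events` and sorts the result, so on a real server the cost would presumably grow with room size and call frequency.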
The background update running at the time was event_push_backfill_thread_id, if relevant.
Graphs:
IOPS increase at upgrade. The initial plateau at 4K was due to the database being capped at 4K IOPS. It now has 10K and is still hammering the database ~7 hours after the upgrade.
Event send times are degraded, especially when constrained to 4K IOPS, a limit the host had run with fine for a long time.
Stateres worst-case seems to reflect the database usage; is this just a side effect of a busy db?
DB usage for background jobs had a rather massive spike for notify_interested_appservices_ephemeral right after upgrade.
Taking that away from the graph, we see that db usage for background jobs has been higher across the board since the upgrade.
DB transactions:
Cache eviction seems to indicate we should raise the size of the get_local_users_in_room cache, as it is being evicted a lot by size. However, this was the case pre-upgrade as well.
Appservice transactions have not changed during this time by a large factor (3 bridges):
A few other hosts manually found:
- 01bbd800-4670-11e9-8324-b54a9efc8abc-live
- db0718c0-2480-11e9-83c4-ad579ecfcc33-live
Time-of-day changes in traffic have been ruled out; all these issues started on upgrade with no other changes to the hosting or deployment stack. There are probably more hosts affected by the db usage increase.
Also discussed in backend internal.
Steps to reproduce
Upgrade from v1.66.0 to v1.68.0.
Homeserver
ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, 01bbd800-4670-11e9-8324-b54a9efc8abc-live, db0718c0-2480-11e9-83c4-ad579ecfcc33-live
Synapse Version
v1.68.0
Installation Method
Other (please mention below)
Platform
EMS flavour Docker images built from upstream images. Kubernetes cluster.
Relevant log output
-
Anything else that would be useful to know?
No response