-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
SerializationFailure in notify_interested_services_ephemeral under heavy load #11195
Description
Description
We are seeing periodic SerializationFailure errors in Sentry in the notify_interested_services_ephemeral background process. This is running in a dedicated appservice-pusher worker instance. At peak we see around 30 ephemeral events per second being sent to the appservice (presence is disabled so these should all be read receipts), the error appears much more regularly during these peaks. Some initial investigation:
The query in question is invoked in set_type_stream_id_for_appservice_txn, updating the stream position for the appservice:
synapse/synapse/storage/databases/main/appservice.py
Lines 423 to 429 in c7a5e49
| def set_type_stream_id_for_appservice_txn(txn): | |
| stream_id_type = "%s_stream_id" % type | |
| txn.execute( | |
| "UPDATE application_services_state SET %s = ? WHERE as_id=?" | |
| % stream_id_type, | |
| (pos, service.id), | |
| ) |
This itself is called in the appservice handler:
synapse/synapse/handlers/appservice.py
Lines 252 to 260 in c7a5e49
| elif stream_key == "receipt_key": | |
| events = await self._handle_receipts(service) | |
| if events: | |
| self.scheduler.submit_ephemeral_events_for_as(service, events) | |
| # Persist the latest handled stream token for this appservice | |
| await self.store.set_type_stream_id_for_appservice( | |
| service, "read_receipt", new_token | |
| ) |
I'm new to most of the synapse codebase so may have this wrong but it appears this is a race condition between two parallel executions of the notify_interested_services_ephemeral process for the same appservice/stream. Am I correct in thinking this means some events are getting sent twice to the appservice as the position is not always updated? (Could events also be missed somehow?).
We'd like to figure out a way to fix this issue, currently only two possible solutions have come to mind:
- Implement some kind of locking on
(appservice, stream_id), would probably need to be on the entire handler though which may be a significant performance impact. Not a fan of this! - Assuming transaction IDs are sequential, a global object tracking the highest value across multiple processes could be used to prevent updating the table if it would update to a lower number - this becomes harder assuming a future where multiple appservice push workers may exist
Keen to hear thoughts, and can most likely find time to work on any potential soltuion.