This repository was archived by the owner on Apr 26, 2024. It is now read-only.

SerializationFailure in notify_interested_services_ephemeral under heavy load #11195

@Fizzadar

Description


We are seeing periodic SerializationFailure errors in Sentry from the notify_interested_services_ephemeral background process, which runs on a dedicated appservice-pusher worker instance. At peak we see around 30 ephemeral events per second being sent to the appservice (presence is disabled, so these should all be read receipts), and the error appears much more regularly during these peaks. Some initial investigation:

The query in question is invoked in set_type_stream_id_for_appservice_txn, updating the stream position for the appservice:

def set_type_stream_id_for_appservice_txn(txn):
    stream_id_type = "%s_stream_id" % type
    txn.execute(
        "UPDATE application_services_state SET %s = ? WHERE as_id=?"
        % stream_id_type,
        (pos, service.id),
    )
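
For context on why this UPDATE can fail at all: I believe Synapse runs its PostgreSQL transactions at REPEATABLE READ isolation, and at that level the second of two transactions that modify the same row concurrently aborts with "could not serialize access due to concurrent update". Below is a minimal standalone sketch of that conflict, outside Synapse (the database name, as_id and token values are invented for illustration):

# Standalone reproduction sketch, not Synapse code: two transactions at
# REPEATABLE READ isolation race to update the same application_services_state
# row. The database name, as_id and token values are made up.
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

conn1 = psycopg2.connect("dbname=synapse_test")
conn2 = psycopg2.connect("dbname=synapse_test")
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

cur1, cur2 = conn1.cursor(), conn2.cursor()

# Both transactions take their snapshot before either one commits.
cur1.execute(
    "SELECT read_receipt_stream_id FROM application_services_state WHERE as_id = %s",
    ("example_as",),
)
cur2.execute(
    "SELECT read_receipt_stream_id FROM application_services_state WHERE as_id = %s",
    ("example_as",),
)

# The first writer wins and commits.
cur1.execute(
    "UPDATE application_services_state SET read_receipt_stream_id = %s WHERE as_id = %s",
    (1000, "example_as"),
)
conn1.commit()

try:
    # The second writer now targets a row changed since its snapshot was taken.
    cur2.execute(
        "UPDATE application_services_state SET read_receipt_stream_id = %s WHERE as_id = %s",
        (1001, "example_as"),
    )
    conn2.commit()
except errors.SerializationFailure:
    # "could not serialize access due to concurrent update"
    conn2.rollback()

If that is what's happening here, it would also explain why the failures cluster at peak traffic, when two runs of the handler are most likely to overlap.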

The surrounding set_type_stream_id_for_appservice method is itself called from the appservice handler:

elif stream_key == "receipt_key":
    events = await self._handle_receipts(service)
    if events:
        self.scheduler.submit_ephemeral_events_for_as(service, events)
    # Persist the latest handled stream token for this appservice
    await self.store.set_type_stream_id_for_appservice(
        service, "read_receipt", new_token
    )

I'm new to most of the Synapse codebase, so I may have this wrong, but this looks like a race condition between two parallel executions of the notify_interested_services_ephemeral process for the same appservice/stream. Am I correct in thinking this means some events are getting sent to the appservice twice, since the position is not always updated? (Could events also be missed somehow?)
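
To spell out the interleaving I have in mind (a toy model only, not Synapse code): if two runs of the handler both read the last-persisted position before either one writes the new token, they both select and submit the same batch of receipts, and the losing UPDATE is then the one that raises SerializationFailure.

# Toy model of the suspected interleaving, not Synapse code: two concurrent
# handler runs both read the old persisted position before either writes the
# new one, so they send an overlapping batch of receipts.
import asyncio

stored_position = 100                     # last persisted read_receipt stream id
pending_receipts = list(range(101, 111))  # receipts 101-110 waiting to be sent
sent = []


async def handle_receipts(run_name: str) -> None:
    from_token = stored_position          # both runs read 100
    await asyncio.sleep(0)                # yield, as the real awaits would
    batch = [r for r in pending_receipts if r > from_token]
    sent.extend((run_name, r) for r in batch)
    # In Synapse, the losing UPDATE of the stream position would now fail
    # with SerializationFailure, so the position may not advance at all.


async def main() -> None:
    await asyncio.gather(handle_receipts("run-1"), handle_receipts("run-2"))


asyncio.run(main())
print(sorted(sent))  # every receipt appears twice, once per run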

We'd like to figure out a way to fix this issue; so far only two possible solutions have come to mind:

  • Implement some kind of locking on (appservice, stream_id); it would probably need to cover the entire handler, though, which could have a significant performance impact. Not a fan of this!
  • Assuming transaction IDs are sequential, a global object tracking the highest value seen across processes could be used to skip the UPDATE whenever it would move the stream position to a lower number (see the rough sketch after this list) - this becomes harder in a future where multiple appservice push workers may exist.
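
To make the second option concrete, here is a very rough sketch of an in-process guard (all names below are hypothetical; nothing here exists in Synapse today): the handler would only persist a stream position if it is higher than anything already seen for that (appservice, stream) pair.

# Hypothetical sketch of option 2: an in-process guard that only allows the
# stored stream position to move forwards. StreamPositionGuard and advance()
# are illustrative names, not existing Synapse APIs.
from collections import defaultdict
from threading import Lock


class StreamPositionGuard:
    """Tracks the highest stream token seen per (as_id, stream) and refuses
    to go backwards."""

    def __init__(self):
        self._lock = Lock()
        self._max_tokens = defaultdict(int)

    def advance(self, as_id: str, stream: str, new_token: int) -> bool:
        """Return True if new_token is higher than anything seen so far,
        in which case the caller should persist it; False otherwise."""
        key = (as_id, stream)
        with self._lock:
            if new_token <= self._max_tokens[key]:
                return False
            self._max_tokens[key] = new_token
            return True


# Hypothetical usage inside the handler:
#
#     if guard.advance(service.id, "read_receipt", new_token):
#         await self.store.set_type_stream_id_for_appservice(
#             service, "read_receipt", new_token
#         )

As noted in the second bullet, a plain in-process object like this only covers the single-worker case; with multiple appservice pusher workers the highest-seen value would need to live somewhere shared.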

Keen to hear thoughts, and I can most likely find time to work on any potential solution.

Metadata


    Labels

    A-Application-Service: Related to AS support
    S-Minor: Blocks non-critical functionality, workarounds exist.
    T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
