
Sync race with get rooms for user cache invalidation over replication #14154

Open
Fizzadar opened this issue Oct 12, 2022 · 4 comments
Labels
A-Sync: defects related to /sync
A-Workers: Problems related to running Synapse in Worker Mode (or replication)
O-Occasional: Affects or can be seen by some users regularly or most users rarely
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Fizzadar commented Oct 12, 2022

Over the last few weeks we have started seeing syncs that are missing just-joined rooms. This led me to dive deep into how sync works, and I ended up identifying a few cache invalidation race conditions. My understanding of things is as follows:

I then confirmed my suspicions by adding a log line (beeper/synapse@1346af1), which successfully identified occurrences of this. I will now submit two different PRs to address this specific issue:

@DMRobertson added the A-Workers, A-Sync, T-Defect, S-Major and O-Occasional labels on Oct 14, 2022
@erikjohnston
Member

Aaaaaaaargh.

Thanks for a) looking into this and b) PRing some band-aids.

I think the longer-term solution will be to somehow ensure that we process the state changes and the events stream "at the same time".

@Fizzadar
Contributor Author

Totally agree these are quick band-aids! My proposal in #14158 offers one approach to ensuring things get processed after caches have been updated.

@erikjohnston
Member

I'm actually really confused. My reading of the code is:

  1. We persist the current_state_delta changes with a stream ordering at most that of the event we're persisting:
     self._update_current_state_txn(txn, state_delta_for_room, min_stream_order)
  2. This then gets sent out over replication in the same batch:
     updates = list(heapq.merge(event_updates, state_updates, ex_outliers_updates))
  3. And so should be processed at the same time:
     if stream_name == EventsStream.NAME:
         for row in rows:
             self._process_event_stream_row(token, row)

So I'm a bit confused about exactly where we're introducing the inconsistency (which we obviously are somewhere). Thoughts? Might be useful to add some logging to see where the above assumptions break down?
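
For illustration, a minimal, self-contained sketch (not Synapse code; the stream positions and row contents are made up) of why the merge in step 2 should keep event rows and state rows for the same stream ordering together in one batch:

    import heapq

    # (stream_position, row) pairs, each list already sorted by position,
    # standing in for the per-stream update lists merged in step 2.
    event_updates = [(5, "event: membership event persisted"), (7, "event: message")]
    state_updates = [(5, "state delta: user joined room")]
    ex_outliers_updates = []

    # heapq.merge interleaves the sorted inputs by stream position, so both
    # rows for position 5 land next to each other in the same batch.
    updates = list(heapq.merge(event_updates, state_updates, ex_outliers_updates))
    print(updates)
    # [(5, 'event: ...'), (5, 'state delta: ...'), (7, 'event: ...')]

A consumer that handles this batch in one pass should therefore see the membership change alongside the event, which is what makes the observed inconsistency surprising.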

@Fizzadar
Contributor Author

That looks right; I believe the issue is when sync is triggered by the other, non-event streams. Roughly something like this (sketched below):

  • sync is pending for a user
  • a to-device stream update arrives which is relevant for the user, so sync starts
  • at the same time, an event has been persisted that is relevant to the user, but the sync worker hasn't yet processed the stream update
  • sync fetches the current token from the DB, which is after the persisted event
  • the stream update is processed after this, so part of the sync handling will have used event caches that weren't up to date
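
To make the race concrete, a rough illustrative sketch (hypothetical names and numbers, not Synapse APIs):

    # The event writer has persisted up to stream position 105 in the DB,
    # but this sync worker has only processed replication rows up to 100.
    db_max_stream_ordering = 105
    worker_processed_position = 100

    # Sync builds its "current" token from the database value...
    sync_upto_token = db_max_stream_ordering

    # ...while caches populated via replication (e.g. get_rooms_for_user)
    # still reflect position 100, so a room joined at position 103 falls
    # inside the sync window but is missing from the cached lookup.
    print(f"positions {worker_processed_position + 1}..{sync_upto_token} "
          "may be served from stale caches")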

This is the same race condition described in #14158. I actually think fixing that would resolve these issues properly without any of the band-aids PR'd. Two options I can think of:

  • have sync wait for the stream to reach the current token if it is behind (as described in #14158)
  • have sync generate the current token from the stream cache's max position, rather than from the database

Both of these would ensure that the worker's view of the world is up to date with the current token being used (a rough sketch of both follows).
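
A hedged sketch of what the two options could look like, using hypothetical helpers rather than real Synapse APIs:

    import asyncio

    async def wait_for_position(get_processed_position, target, poll_interval=0.05):
        # Option 1: before handling the sync, wait until this worker's
        # replication position has caught up with the token taken from the DB.
        while get_processed_position() < target:
            await asyncio.sleep(poll_interval)

    def current_token_from_local_position(get_processed_position):
        # Option 2: derive the "current" token from the position this worker
        # has actually processed, so the token can never run ahead of the
        # worker's own caches.
        return get_processed_position()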
