
Sync race with get rooms for user cache invalidation over replication #14154

Open
Fizzadar opened this issue Oct 12, 2022 · 4 comments
Labels
A-Sync: defects related to /sync
A-Workers: Problems related to running Synapse in Worker Mode (or replication)
O-Occasional: Affects or can be seen by some users regularly or most users rarely
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Fizzadar commented Oct 12, 2022

Over the last few weeks we have started seeing syncs that are missing just-joined rooms. This led me to dive deep into how sync works, and I ended up identifying a few cache invalidation race conditions. My understanding of things is as follows:

I then confirmed my suspicions by adding a log line (beeper/synapse@1346af1), which successfully identified occurrences of this. I will now submit two different PRs to address this specific issue:

@DMRobertson added the A-Workers, A-Sync, T-Defect, S-Major and O-Occasional labels on Oct 14, 2022
@erikjohnston
Member

Aaaaaaaargh.

Thanks for a) looking into this and b) PRing some band-aids.

I think the longer-term solution will be to somehow ensure that we process the state changes and the events stream "at the same time".

@Fizzadar
Contributor Author

Totally agree these are quick band-aids! My proposal in #14158 offers one approach to ensuring things get processed after caches have been updated.

@erikjohnston
Member

I'm actually really confused. My reading of the code is:

  1. We persist the current_state_delta changes with a stream ordering at most that of the event we're persisting:
     self._update_current_state_txn(txn, state_delta_for_room, min_stream_order)
  2. This then gets sent out over replication in the same batch:
     updates = list(heapq.merge(event_updates, state_updates, ex_outliers_updates))
  3. And so should be processed at the same time:
     if stream_name == EventsStream.NAME:
         for row in rows:
             self._process_event_stream_row(token, row)

So I'm a bit confused about exactly where we're introducing the inconsistency (which we obviously are somewhere). Thoughts? Might be useful to add some logging to see where the above assumptions break down?
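
For illustration, a minimal, self-contained sketch (not Synapse code; the stream positions and row contents are made up) of why the merge in step 2 should keep event rows and state rows for the same stream ordering together in one batch:

    import heapq

    # (stream_position, row) pairs, each list already sorted by position,
    # standing in for the per-stream update lists merged in step 2.
    event_updates = [(5, "event: membership event persisted"), (7, "event: message")]
    state_updates = [(5, "state delta: user joined room")]
    ex_outliers_updates = []

    # heapq.merge interleaves the sorted inputs by stream position, so both
    # rows for position 5 land next to each other in the same batch.
    updates = list(heapq.merge(event_updates, state_updates, ex_outliers_updates))
    print(updates)
    # [(5, 'event: ...'), (5, 'state delta: ...'), (7, 'event: ...')]

A consumer that handles this batch in one pass should therefore see the membership change alongside the event, which is what makes the observed inconsistency surprising.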

@Fizzadar
Contributor Author

That looks right; I believe the issue is when sync is triggered by the other, non-event streams. Roughly something like this (sketched below):

  • sync is pending for a user
  • a to-device stream update arrives which is relevant for the user, so sync starts
  • at the same time, an event has been persisted that is relevant to the user, but the sync worker hasn't yet processed the stream update
  • sync fetches the current token from the DB, which is after the persisted event
  • the stream update is processed after this, so part of the sync handling will have used event caches that weren't up to date
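
To make the race concrete, a rough illustrative sketch (hypothetical names and numbers, not Synapse APIs):

    # The event writer has persisted up to stream position 105 in the DB,
    # but this sync worker has only processed replication rows up to 100.
    db_max_stream_ordering = 105
    worker_processed_position = 100

    # Sync builds its "current" token from the database value...
    sync_upto_token = db_max_stream_ordering

    # ...while caches populated via replication (e.g. get_rooms_for_user)
    # still reflect position 100, so a room joined at position 103 falls
    # inside the sync window but is missing from the cached lookup.
    print(f"positions {worker_processed_position + 1}..{sync_upto_token} "
          "may be served from stale caches")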

This is the same race condition described in #14158. I actually think fixing that would resolve these issues properly without any of the band-aids PR'd. Two options I can think of:

  • have sync wait for the stream to reach the current token if it is behind (as described in #14158)
  • have sync generate the current token from the stream cache's max position, rather than from the database

Both of these would ensure that the worker's view of the world is up to date with the current token being used (a rough sketch of both follows).
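
A hedged sketch of what the two options could look like, using hypothetical helpers rather than real Synapse APIs:

    import asyncio

    async def wait_for_position(get_processed_position, target, poll_interval=0.05):
        # Option 1: before handling the sync, wait until this worker's
        # replication position has caught up with the token taken from the DB.
        while get_processed_position() < target:
            await asyncio.sleep(poll_interval)

    def current_token_from_local_position(get_processed_position):
        # Option 2: derive the "current" token from the position this worker
        # has actually processed, so the token can never run ahead of the
        # worker's own caches.
        return get_processed_position()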
