This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

completely broken room after purge_room/rejoin #11521

Closed
richvdh opened this issue Dec 7, 2021 · 9 comments · Fixed by #14161
Labels
A-Corruption Things that have led to unexpected state in Synapse or the database T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments


richvdh commented Dec 7, 2021

when you purge a room (via the delete room API), we do not clear the in-memory caches. Then, when you rejoin, many of the events are still in the cache, so we do not bother to re-persist them.

This leads to very bad brokenness, like state_groups_state referring to events which do not exist.

(A workaround is of course to restart Synapse after purging and before rejoining.)


richvdh commented Dec 7, 2021

Of course, if we had foreign key constraints, the failure mode would be a lot less insidious :/


richvdh commented Dec 7, 2021

Cache invalidation in this case is a bit tricky, because we don't actually know which events we're purging, and most of the caches in question aren't structured in a way that makes it easy to purge them by room id.
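To illustrate the difficulty: a minimal sketch (hypothetical names, not Synapse's actual cache classes) of an LRU-style event cache keyed only by event ID. Because there is no index from room ID to event IDs, purging a room means scanning every entry.

```python
from collections import OrderedDict


class EventCache:
    """Toy LRU cache keyed by event_id, with no room_id index."""

    def __init__(self, max_entries: int = 1000):
        self._cache: "OrderedDict[str, dict]" = OrderedDict()
        self._max_entries = max_entries

    def set(self, event_id: str, event: dict) -> None:
        self._cache[event_id] = event
        self._cache.move_to_end(event_id)
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least-recently used entry

    def get(self, event_id: str):
        return self._cache.get(event_id)

    def invalidate_room(self, room_id: str) -> None:
        # O(n) over the whole cache: scanning is the only way to find
        # which cached events belong to the purged room.
        stale = [eid for eid, ev in self._cache.items() if ev["room_id"] == room_id]
        for eid in stale:
            del self._cache[eid]
```

The scan is tolerable for a rare admin operation, but it has to hold up if room purges become frequent.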


richvdh commented Dec 7, 2021

oh, and thanks to #6547 you can't even re-purge your way out of the problem :( (Edit: now fixed)

@squahtx added the T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues) label on Dec 8, 2021

anoadragon453 commented Sep 14, 2022

I see that one (the only?) affected cache is EventsWorkerStore._get_event_cache, which is an instance of AsyncLruCache/LruCache with a cache type of dict. Could we switch this to a DictionaryCache instead? That cache type allows you to efficiently delete "trees" of cache entries. Such a tree could be constructed with the room ID as the root node and all of the room's events as children; we could then efficiently invalidate the cache for all events in a room while maintaining quick lookup times. We currently use DictionaryCache to back the state_group and state_group_members caches.

I'm not sure how this change will affect external cache support, though I don't think that's merged yet, so a migration would not yet be needed.

Edit: the downside is that you'd need to know the room ID when checking the cache to get an efficient lookup.
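A sketch of the tree idea (plain nested dicts, not Synapse's DictionaryCache API): grouping entries under their room ID makes whole-room invalidation a single operation, but it also demonstrates the downside above, since lookups now require the room ID up front.

```python
class RoomTreeCache:
    """Toy tree cache: room_id -> {event_id -> event}."""

    def __init__(self):
        self._rooms: "dict[str, dict[str, dict]]" = {}

    def set(self, room_id: str, event_id: str, event: dict) -> None:
        self._rooms.setdefault(room_id, {})[event_id] = event

    def get(self, room_id: str, event_id: str):
        # The downside: without the room_id there is no way to reach the event.
        return self._rooms.get(room_id, {}).get(event_id)

    def invalidate_room(self, room_id: str) -> None:
        self._rooms.pop(room_id, None)  # drops the whole subtree in one step
```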

@anoadragon453

An alternative could be to maintain a separate dict cache of room_id->event_id which is populated upon pulling items from the database, and could be used to quickly invalidate a room's events from _get_event_cache.
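The alternative can be sketched as follows (hypothetical names): the main cache stays keyed by event ID alone, so lookups don't need the room ID, while a side map from room ID to event IDs is kept in sync and consulted only at invalidation time.

```python
class IndexedEventCache:
    """Toy event cache with a secondary room_id -> event_ids index."""

    def __init__(self):
        self._events: "dict[str, dict]" = {}      # event_id -> event
        self._room_index: "dict[str, set]" = {}   # room_id -> {event_id, ...}

    def set(self, event_id: str, event: dict) -> None:
        self._events[event_id] = event
        self._room_index.setdefault(event["room_id"], set()).add(event_id)

    def get(self, event_id: str):
        return self._events.get(event_id)  # lookup needs no room_id

    def invalidate_room(self, room_id: str) -> None:
        # The index tells us exactly which entries to drop: no full scan.
        for eid in self._room_index.pop(room_id, set()):
            self._events.pop(eid, None)
```

The cost is the bookkeeping: every insertion and eviction in the main cache must also update the index, or the two drift apart.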


anoadragon453 commented Sep 16, 2022

I've made good headway converting EventsWorkerStore._get_event_cache into a new type of cache following the methodology in the above comment (running a second room_id -> event_id dict and keeping it in sync) as a backing store for LruCache. It works, in fact.

However, it is not the only event-related cache we need to deal with here. Both EventsWorkerStore._event_ref (weak in-memory references to events) and EventsWorkerStore._current_event_fetches (currently ongoing event fetches) also hold references to events that may have been removed from the database.

I understand EventsWorkerStore._current_event_fetches, and we could possibly just iterate over this list and remove events with a matching room ID (if it never grows too large?), but EventsWorkerStore._event_ref confuses me. Why have this dict if we already have an in-memory cache for events? It has entries inserted and invalidated at the same points as EventsWorkerStore._get_event_cache. Does it show noticeable performance gains?

Having a second layer which keeps entries even after they've been removed from the cache just seems like a recipe for races.

@anoadragon453

The PRs that introduced EventsWorkerStore._event_ref and EventsWorkerStore._current_event_fetches help in understanding why these caches were implemented. _current_event_fetches helps with de-duplicating concurrent requests for the same event ID, while _event_ref ensures that there is only one copy of an event in memory at a given time.

We need to invalidate these by room ID as well, though this is tricky and raises some questions. Unfortunately, neither of these caches is a strict subset of the contents of EventsWorkerStore._get_event_cache, so we can't simply use the results of invalidating that cache to know which event IDs to invalidate in the others. ...although it does raise the question: if _event_ref is a map of all events kept in memory, does that mean _get_event_cache is a strict subset of _event_ref?

If not, and thus we can't re-use the work from invalidating one cache, that leaves us either maintaining yet more room ID -> event ID maps (one for each cache) or simply iterating over all entries, grabbing the room ID and invalidating those that match. That may not be very performant, though I haven't done any performance testing to verify. I'd like to say that purging a room doesn't happen very often, but it will if we introduce purge-room-on-forget (#4720, the feature driving this investigation).

I'm also not sure whether invalidating _current_event_fetches will introduce any races. We pop the deferred, which I presume allows it to complete but not be re-used from that point on. I assume this is mostly fine: the request came in before the purge was initiated, so it should theoretically still return the event, even if that event is returned after it has been purged from the database.

@erikjohnston I'm interested in your thoughts specifically, since you authored much of this code.


MadLittleMods commented Sep 21, 2022

@anoadragon453 I haven't read this thread but maybe you're interested in what the have_seen_event cache does using the TreeCache.

```python
@cached(max_entries=100000, tree=True)
async def have_seen_event(self, room_id: str, event_id: str) -> bool:
    res = await self._have_seen_events_dict(((room_id, event_id),))
    return res[(room_id, event_id)]
```

You can see that we invalidate via (room_id, event_id) and via (room_id,):

```python
self._invalidate_cache_and_stream(
    txn, self.have_seen_event, (room_id, event_id)
)

self._invalidate_cache_and_stream(txn, self.have_seen_event, (room_id,))
```


Found while tooling around on #13856 and thought you might be interested.

@anoadragon453

Hey @MadLittleMods, thanks for the suggestion! Unfortunately TreeCache won't work for this particular use case as we often don't know the room ID when querying the get_event cache.
