-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Faster joins: fix race where we can fail to persist an incoming event with partial state after _sync_partial_state_room
clears the partial state flag for a room
#12988
Comments
From discussing with Rich, our options are:
|
This issue also relates to the TODOs in synapse/synapse/storage/databases/main/room.py Lines 1111 to 1129 in 7c6b220
|
The span of code that would need to be locked or retried is: synapse/synapse/handlers/federation_event.py Lines 1094 to 1116 in 7c6b220
compute_event_context sets the partial state flag in the event context and possibly creates a new state group for the event.
synapse/synapse/handlers/federation_event.py Lines 1929 to 1971 in 7c6b220
Note that we need to await in the except block if we retry persisting the event, otherwise we might end up deleting the push actions from the next attempt. This'll also fix #12987.
synapse/synapse/handlers/federation_event.py Lines 1973 to 2019 in 7c6b220
If we want to retry things, there is an annoying amount of code that we have to thread the we-lost-the-partial-state-race exception through, including a replication call. |
Actually it's not just incoming events that are racy. I think it's every code path which can compute event context with partial state, before persisting the event? |
We can probably do a fairly similar trick in |
…#13100) Whenever we want to persist an event, we first compute an event context, which includes the state at the event and a flag indicating whether the state is partial. After a lot of processing, we finally try to store the event in the database, which can fail for partial state events when the containing room has been un-partial stated in the meantime. We detect the race as a foreign key constraint failure in the data store layer and turn it into a special `PartialStateConflictError` exception, which makes its way up to the method in which we computed the event context. To make things difficult, the exception needs to cross a replication request: `/fed_send_events` for events coming over federation and `/send_event` for events from clients. We transport the `PartialStateConflictError` as a `409 Conflict` over replication and turn `409`s back into `PartialStateConflictError`s on the worker making the request. All client events go through `EventCreationHandler.handle_new_client_event`, which is called in *a lot* of places. Instead of trying to update all the code which creates client events, we turn the `PartialStateConflictError` into a `429 Too Many Requests` in `EventCreationHandler.handle_new_client_event` and hope that clients take it as a hint to retry their request. On the federation event side, there are 7 places which compute event contexts. 4 of them use outlier event contexts: `FederationEventHandler._auth_and_persist_outliers_inner`, `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and `FederationHandler.do_remotely_reject_invite`. These events won't have the partial state flag, so we do not need to do anything for then. The remaining 3 paths which create events are `FederationEventHandler.process_remote_join`, `FederationEventHandler.on_send_membership_event` and `FederationEventHandler._process_received_pdu`. We can't experience the race in `process_remote_join`, unless we're handling an additional join into a partial state room, which currently blocks, so we make no attempt to handle it correctly. `on_send_membership_event` is only called by `FederationServer._on_send_membership_event`, so we catch the `PartialStateConflictError` there and retry just once. `_process_received_pdu` is called by `on_receive_pdu` for incoming events and `_process_pulled_event` for backfill. The latter should never try to persist partial state events, so we ignore it. We catch the `PartialStateConflictError` in `on_receive_pdu` and retry just once. Refering to the graph of code paths in #12988 (comment) may make the above make more sense. Signed-off-by: Sean Quah <seanq@matrix.org>
synapse/synapse/handlers/federation.py
Lines 1553 to 1569 in e3163e2
When we are processing an incoming event while
_sync_partial_state_room
is running,_sync_partial_state_room
may clear the partial state flag for the room before we try to persist the event with a partial state flag. This leads to a foreign key constraint failure because there's no longer apartial_state_room
entry for the room.See #12394 (comment) for an example.
The text was updated successfully, but these errors were encountered: