Race condition with replication means `/messages` backfill lacks read-after-write consistency between workers

We have have a problem where because events are persisted in a queue in a `client_reader` worker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from `/messages`, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).

CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from [discussion](https://github.com/matrix-org/synapse/pull/14028#discussion_r987156197)). This specific CI flake was addressed in https://github.com/matrix-org/complement/pull/492
```
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
```


Here is what happens:

 1. `serverB` has `event1` stored as an `outlier` from previous requests (specifically from MSC3030 jump to date pulling in a missing `prev_event` after backfilling)
 1. Client on `serverB` calls `/messages?dir=b`
 1. `serverB:client_reader1` accepts the request and drives things
 1. `serverB:client_reader1` has some backward extremities in range and requests `/backfill` from `serverA`
 1. `serverB:client_reader1` processes the events from backfill including `event1` and puts them in the `_event_persist_queue`
 1. `serverB:master` picks up the events from the `_event_persist_queue` and persists them to the database, de-outliers `event1` and invalidates its own cache and sends them over replication
 1. `serverB:client_reader1` starts assembling the `/messages` response and gets `event1` out of the stale cache still as an `outlier`
 1. `serverB:client_reader1` responds to the `/messages` request without `event1` because `outliers` are filtered out
 1. `serverB:client_reader1` finally gets the replication data and invalidates its own cache for `event1` (too late, we already got the events from the stale cache and responded)
 
 
> In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
>
> *-- https://github.com/matrix-org/synapse/issues/13185#issuecomment-1177930853*

It's exactly this but it really sucks that calling `/messages` doesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case, it's all within the same `/messages` request so it's a little more insidious.

Having this be possible makes it even more of a reason that we should indicate gaps in `/messages`, [MSC3871](https://github.com/matrix-org/matrix-spec-proposals/pull/3871)





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition with replication means `/messages` backfill lacks read-after-write consistency between workers #14211

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Race condition with replication means /messages backfill lacks read-after-write consistency between workers #14211

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Race condition with replication means `/messages` backfill lacks read-after-write consistency between workers #14211