Sync current token may be ahead of event stream cache position using workers #14158
Confirmed via the log change above, this is actually not that rare (😱). We have large bursts of these in sync workers:
There are even more in federation sender instances.
Finally, the background worker has a constant (1/s) flow of such errors on the
Implemented wait-for fix option: #14269
err, interesting. That doesn't sound good.
This sounds very much like it could be the cause of #14251 and other device-list-tracking bugs we've been tearing our hair out over. For those of us struggling to keep up at the back: are you able to explain the mechanics of this race? How does the "current position" come to be ahead of the "max stream cache position"? I wasn't able to find where exactly you introduced these log lines - can you point me to a relevant diff?
Diff for the logging code: #14239

Essentially it seems to boil down to:
Definitely seems likely this would trigger the device list issues as well.
I think part of the confusion here is that there is not a one-to-one mapping between stream caches and stream ID generators. For example, the device inbox ID generator feeds into both of the stream change caches set up in `synapse/storage/databases/main/deviceinbox.py`, lines 688 to 698 (at 8c94dd3).
So it's not unexpected if one xor the other stream change cache lags behind the current token. Now, that's not to say there isn't a problem here. I'd be tempted to pass the relevant stream ID generator into some of these stream caches, and then log if we are inserting a change with a lower stream ID than the value returned by the code at `synapse/storage/databases/main/deviceinbox.py`, lines 141 to 146 (at 8c94dd3).
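For illustration, here's a minimal sketch of that suggestion. It is a simplified stand-in, not Synapse's actual `StreamChangeCache`; `FakeIdGenerator`, `LoggingStreamChangeCache`, and the `id_gen` argument are made up for the example.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


class FakeIdGenerator:
    """Hypothetical stand-in for a stream ID generator."""

    def __init__(self, current: int = 0) -> None:
        self._current = current

    def get_current_token(self) -> int:
        return self._current


class LoggingStreamChangeCache:
    """Heavily simplified stream change cache which, when given an ID
    generator, logs any change recorded at or below the current token."""

    def __init__(self, name: str, id_gen: Optional[FakeIdGenerator] = None) -> None:
        self._name = name
        self._id_gen = id_gen
        self._entity_to_pos: Dict[str, int] = {}

    def entity_has_changed(self, entity: str, stream_pos: int) -> None:
        if self._id_gen is not None:
            current = self._id_gen.get_current_token()
            if stream_pos <= current:
                # Readers may already have observed `current` without this
                # change being in the cache - the race discussed here.
                logger.warning(
                    "%s: change for %s at %d is not ahead of current token %d",
                    self._name, entity, stream_pos, current,
                )
        self._entity_to_pos[entity] = max(
            self._entity_to_pos.get(entity, 0), stream_pos
        )


# The generator has already advanced to 10 when the cache hears about a
# change at 10 - exactly the situation we would want logged.
cache = LoggingStreamChangeCache("ExampleStreamChangeCache", FakeIdGenerator(10))
cache.entity_has_changed("@user:example.org", 10)
```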
I think I've been misunderstanding this, then. Re: device inbox - I discovered this also, where the two stream caches share one ID generator. My WIP PR #14286 addresses this by making sure both stream caches are always updated (to enable the waiting logic).
Digging into the events stream a bit, the only place the stream ID is advanced is in `synapse/storage/databases/main/events_worker.py`, lines 285 to 295 (at 2d0ba3f).
And the change should be propagated at the same time here: `synapse/storage/databases/main/cache.py`, lines 185 to 210 (at 2d0ba3f).
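To make the suspected interleaving concrete, here is a toy illustration of the shape of the race - plain module globals instead of Synapse's stores, and asyncio instead of Twisted, purely for brevity; none of this is the real code.

```python
import asyncio

# Toy shared state standing in for the stream ID generator position and the
# highest position recorded in the stream change cache.
current_token = 0
cache_position = 0


async def handle_replication_row(new_token: int) -> None:
    """Advance the token first, then update the cache, with an await in
    between - a point where the event loop can switch to other tasks."""
    global current_token, cache_position
    current_token = new_token   # stream ID advanced...
    await asyncio.sleep(0)      # ...control can pass to a sync request here...
    cache_position = new_token  # ...and the cache only catches up afterwards


async def sync_reader() -> None:
    """Read the current token, then consult the cache, as a sync worker would."""
    token_seen = current_token
    if cache_position < token_seen:
        print(f"race: token {token_seen} is ahead of cache position {cache_position}")


async def main() -> None:
    await asyncio.gather(handle_replication_row(1), sync_reader())


asyncio.run(main())
```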
Is it possible that we see a
Should the
The code that decides what gets sent down the event stream is at `synapse/replication/tcp/streams/events.py`, lines 135 to 229 (at d902181).
Anything that function returns with the same token will (well, should) get batched up.
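As a toy illustration of that batching behaviour (not the real replication code; the row shapes here are made up):

```python
from itertools import groupby
from operator import itemgetter

# Pretend rows from the event stream as (token, payload) pairs; real rows
# carry event/state information rather than strings.
rows = [(5, "event A"), (6, "event B"), (6, "state change for B"), (7, "event C")]

# Consecutive rows sharing a token are grouped so a consumer handles them as
# a single batch and never observes a token "half applied".
for token, batch in groupby(rows, key=itemgetter(0)):
    print(token, [payload for _, payload in batch])
```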
Yes, that sounds like a good idea. I wonder if we should have a class that stores both the ID gen and the stream change cache, to make it more obvious that they are interlinked.
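Something along these lines, perhaps - a purely hypothetical wrapper (none of these names exist in Synapse) that keeps the ID generator and its stream change cache together so they are always updated in the right order; `id_gen` is assumed to be duck-typed.

```python
from typing import Dict


class TrackedStream:
    """Hypothetical pairing of a stream ID generator and its change cache.

    `id_gen` can be anything exposing get_current_token() and advance(),
    loosely mirroring Synapse's ID generators.
    """

    def __init__(self, name: str, id_gen) -> None:
        self._name = name
        self._id_gen = id_gen
        self._changes: Dict[str, int] = {}  # entity -> last change position

    def get_current_token(self) -> int:
        return self._id_gen.get_current_token()

    def on_change(self, entity: str, stream_id: int) -> None:
        # Record the change in the cache *before* advancing the token, so any
        # reader who observes the new token also observes the change.
        self._changes[entity] = max(self._changes.get(entity, 0), stream_id)
        self._id_gen.advance(stream_id)

    def has_entity_changed(self, entity: str, stream_pos: int) -> bool:
        # True if the entity may have changed after `stream_pos`.
        return self._changes.get(entity, 0) > stream_pos
```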
I’m keen to work on this plan and should have some time over the next few weeks, feel free to assign it to me :) There’s some precursor work I have identified as well involving cleanup of the slaved stores, so I will start there.
Small update - the new logging is also triggering for receipts; this is expected. UPDATE: this probably makes sense - it seems to happen on worker startup when fetching rows, and because receipts replace each other, gaps would be expected. The above idea of combining stream change cache + ID generators would address this problem.
I think I may have figured out why this happens with the
This explains why we're not seeing this particular log causing major issues and why it's so rare; I have pushed some logging to our fork to confirm the above is the case. Fixing this is relatively easy by making sure the stream cache is updated before the ID generator: #14648.
Should we add an assertion in the stream change cache where we assert that when we mark something as changed the stream ID is strictly greater than the current token?
Also, 😱
Yes! That’d be ideal, will probably require passing the relevant id generator into the stream change cache but that seems sensible anyway…
I have added that to my PR, enabled only for the events stream, in 5ee0425. I think enabling for other streams is best left to a new PR?
@Fizzadar which PR?
This creates a new store method, `process_replication_position` that is called after `process_replication_rows`. By moving stream ID advances here this guarantees any relevant cache invalidations will have been applied before the stream is advanced. This avoids race conditions where Python switches between threads mid way through processing the `process_replication_rows` method where stream IDs may be advanced before caches are invalidated due to class resolution ordering. See this comment/issue for further discussion: #14158 (comment)
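A rough sketch of the shape of that change: the method names `process_replication_rows` and `process_replication_position` come from the commit message above, but the attribute names, row shape, and call signatures here are illustrative rather than Synapse's exact code.

```python
class ExampleWorkerStore:
    """Illustrative worker store showing the two-phase replication handling
    described above: cache updates first, stream position second."""

    def __init__(self, id_gen, stream_change_cache) -> None:
        # `id_gen` and `stream_change_cache` are assumed to look roughly like
        # a Synapse ID generator and StreamChangeCache respectively.
        self._id_gen = id_gen
        self._stream_change_cache = stream_change_cache

    def process_replication_rows(self, stream_name, instance_name, token, rows):
        # Phase 1: apply the per-row cache updates/invalidations only.
        for row in rows:
            self._stream_change_cache.entity_has_changed(row.entity, token)

    def process_replication_position(self, stream_name, instance_name, token):
        # Phase 2: advance the stream position only once every cache already
        # reflects the rows, so readers can never observe the token early.
        self._id_gen.advance(instance_name, token)
```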
As I understand it, this has been partially addressed by #14723? Is anyone able to provide an update on what remains to be done here?
This should indeed resolve the issue once and for all - I'm waiting on rolling 1.75 to our deployment to confirm (as we have the logs indicating the issue) but I'm pretty sure that PR solves it. Will update here/close issue once that's live.
So this is still happening on our deployment, but it did occur to me this morning - could this be a side effect of multiple event persisters, where one is running behind the others? If I'm following the

Note: this probably means the assertion that changes are incremental-only on cache update would blow up in these situations... Does it make sense to force an order here and prevent out-of-order event persisters? That would mean any slow persister holds back sync/replication everywhere, but they really shouldn't be that far behind each other...
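As a toy illustration of that concern, with made-up stream IDs and assuming each persister records its changes in the cache as it persists them:

```python
# Persister A claims stream IDs 101 and 103 and persists them promptly;
# persister B claimed 102 but is running behind.
changes_seen_by_cache = [101, 103]

# When B finally catches up, its change arrives with a *lower* stream ID than
# changes already recorded, so the cache observes a non-monotonic sequence.
changes_seen_by_cache.append(102)
print(changes_seen_by_cache)  # [101, 103, 102]

try:
    # A strict "stream IDs only ever increase" check would fire here.
    assert changes_seen_by_cache == sorted(changes_seen_by_cache)
except AssertionError:
    print("an incremental-only assertion would blow up on the late persister")
```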
Based on my initial work investigating sync cache races in #14154:
Sync makes use of the event stream cache to determine whether a room has changed between the since & current tokens. This is then used to limit the number of rooms events are queried for in `get_room_events_stream_for_rooms`. After discovering the cache invalidation races above I added a quick log line for this: beeper/synapse@62497db (after beeper/synapse@5297155). And it logs! Only a very small handful of occurrences over the last ~5 days and the position difference has so far been 1 every time. I suspect this may also occur against other stream caches but have not confirmed.
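For context, the filtering looks roughly like this - a sketch only, not the actual `get_room_events_stream_for_rooms` implementation; `change_cache` is assumed to behave like a Synapse `StreamChangeCache` with a `has_entity_changed()` method.

```python
from typing import Iterable, List


def rooms_to_query(change_cache, room_ids: Iterable[str], since_token: int) -> List[str]:
    """Return only the rooms that may have changed since `since_token`.

    If a change was missed because of the race above, a room is wrongly
    filtered out here and its events never make it into the incremental
    sync response.
    """
    return [
        room_id
        for room_id in room_ids
        # Assumed to return True if the room changed after `since_token`,
        # or if the cache cannot be sure.
        if change_cache.has_entity_changed(room_id, since_token)
    ]
```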
The worry here is that if an event was sent within the gap it may be missed from an incremental sync, which is especially bad because the user will never see or know about this event unless they re-init sync (or the client backfills it?).
One solution to this is to implement a waiting mechanism on `StreamCache` so a worker can wait for the cache to catch up with the current token for a given sync before fetching data. Because this is super rare, and even when it happens it's a tiny position difference, this would probably have negligible impact on sync performance and provide a shield against cache invalidation races over replication.
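A minimal sketch of that waiting idea, using asyncio instead of Twisted for brevity; `get_cache_position` is an assumed callable returning the cache's highest known position, and the eventual fixes (#14269, #14723) are structured differently.

```python
import asyncio
from typing import Callable


async def wait_for_cache_to_catch_up(
    get_cache_position: Callable[[], int],
    target_token: int,
    timeout: float = 5.0,
) -> bool:
    """Wait until the stream change cache reaches `target_token`.

    Returns True if the cache caught up within `timeout` seconds, False if we
    gave up; the caller can then serve the sync with the data it has or retry.
    """

    async def _poll() -> None:
        # Poll with a short sleep; given the observed gaps were only ~1
        # position, this should almost always return immediately.
        while get_cache_position() < target_token:
            await asyncio.sleep(0.01)

    try:
        await asyncio.wait_for(_poll(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```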