Improve/fix application service transaction batching. #8604

erikjohnston · 2020-10-20T10:33:35Z

The following two things come up semi regularly:

Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.
If the AS is marked down then for every event persisted we create a new transaction for that AS with a single event in it. When the AS comes back online Synapse will then send these transactions one at a time, which can mean that an AS can take a very very long time to catch up, even if its processing the events quickly. We should send transactions with multiple events instead (though not too many).

Half-Shot · 2020-10-20T13:43:42Z

Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.

I'm wondering if this actually helps though? Presumably sending batches of N size is also going to delay the appservice as soon as it hits some events it needs to do a lot of work on. To be honest, I'd rather appservices didn't block transaction processing, as we don't need to send the result of the processing back to the homeserver. The AS should really keep accepting new transactions in quick succession regardless of the event contents.

We should formally spec a max size of events though, because most appservices do limit the byte size of the payload.

If the AS is marked down then for every event persisted we create a new transaction for that AS with a single event in it. When the AS comes back online Synapse will then send these transactions one at a time, which can mean that an AS can take a very very long time to catch up, even if its processing the events quickly. We should send transactions with multiple events instead (though not too many).

This seems a bit silly yep, but feels like a Synapse problem (not a spec, the spec doesn't have opinions about retry behavior) where it could be more efficient about it's batching. You can't change the size of a transaction once you've attempted to send it, but subsequent transactions that are pending are mutable so you could keep extending the next transaction to be sent up to MAX_SIZE.

turt2live · 2020-10-20T14:02:44Z

These are implementation details and unrelated to the spec - moving to the synapse repo.

erikjohnston · 2020-10-20T14:03:40Z

These are implementation details and unrelated to the spec - moving to the synapse repo.

Woops! I really meant it to be in the Synapse repo :/

erikjohnston · 2020-10-20T14:07:25Z

Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.

I'm wondering if this actually helps though? Presumably sending batches of N size is also going to delay the appservice as soon as it hits some events it needs to do a lot of work on. To be honest, I'd rather appservices didn't block transaction processing, as we don't need to send the result of the processing back to the homeserver. The AS should really keep accepting new transactions in quick succession regardless of the event contents.

The problem is that you end up with a delay added on top of however long the AS takes to process an event. This delay is then never recovered from as the AS keeps struggling with large transactions (and can end up in a situation where it gets further and further behind because of larger and larger transactions). We've seen this reported in the wild.

If the AS returns a 200 OK immediately then a) you lose reliability if the AS dies/is restarted and b) there is no longer any back pressure if the AS starts falling behind, and Synapse will just continue spamming the AS with events until it falls over in a heap.

We should formally spec a max size of events though, because most appservices do limit the byte size of the payload.

There is a max size of events, its just that doesn't help if we don't bound the max number of events in a transaction

Half-Shot · 2020-10-20T15:30:17Z

These are implementation details and unrelated to the spec - moving to the synapse repo.

Ah, I did review this with a spec hat on so my comments may have been tainted with a spec side view - sorry :s.

The problem is that you end up with a delay added on top of however long the AS takes to process an event. This delay is then never recovered from as the AS keeps struggling with large transactions (and can end up in a situation where it gets further and further behind because of larger and larger transactions). We've seen this reported in the wild.

Hmm, I guess smaller transactions on recovery allows the AS to at least report to the HS that it's making progress, rather than flopping over because it didn't complete the entire transaction. That's a fair point. I would like to see what the AS could do to limit work done per event. There's no point blocking transactions while the bridge is waiting for the HS to finish a join or a leave, for example.

There is a max size of events, its just that doesn't help if we don't bound the max number of events in a transaction

Misspoke, I agree with this :).

richvdh · 2020-10-20T15:32:02Z

at least half of this is a duplicate of #3536

Half-Shot · 2020-10-21T14:38:31Z

#3536 is now fixed (we decided on a fixed 100 events per transaction) which solves concern 1.

richvdh · 2020-10-29T14:47:30Z

@erikjohnston to clarify the remaining work here

erikjohnston · 2020-11-02T10:46:24Z

Split out remaining work to #8704

turt2live transferred this issue from matrix-org/matrix-spec-proposals Oct 20, 2020

richvdh added the A-Application-Service Related to AS support label Oct 20, 2020

erikjohnston mentioned this issue Nov 2, 2020

Fix AS sender catch up handling, after an AS comes back online by batching up events in transactions #8704

Open

erikjohnston closed this as completed Nov 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve/fix application service transaction batching. #8604

Improve/fix application service transaction batching. #8604

erikjohnston commented Oct 20, 2020

Half-Shot commented Oct 20, 2020 •

edited

Loading

turt2live commented Oct 20, 2020

erikjohnston commented Oct 20, 2020

erikjohnston commented Oct 20, 2020

Half-Shot commented Oct 20, 2020

richvdh commented Oct 20, 2020

Half-Shot commented Oct 21, 2020

richvdh commented Oct 29, 2020

erikjohnston commented Nov 2, 2020

Improve/fix application service transaction batching. #8604

Improve/fix application service transaction batching. #8604

Comments

erikjohnston commented Oct 20, 2020

Half-Shot commented Oct 20, 2020 • edited Loading

turt2live commented Oct 20, 2020

erikjohnston commented Oct 20, 2020

erikjohnston commented Oct 20, 2020

Half-Shot commented Oct 20, 2020

richvdh commented Oct 20, 2020

Half-Shot commented Oct 21, 2020

richvdh commented Oct 29, 2020

erikjohnston commented Nov 2, 2020

Half-Shot commented Oct 20, 2020 •

edited

Loading