Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve/fix application service transaction batching. #8604

Closed
erikjohnston opened this issue Oct 20, 2020 · 9 comments
Closed

Improve/fix application service transaction batching. #8604

erikjohnston opened this issue Oct 20, 2020 · 9 comments
Labels
A-Application-Service Related to AS support

Comments

@erikjohnston
Copy link
Member

The following two things come up semi regularly:

  1. Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.
  2. If the AS is marked down then for every event persisted we create a new transaction for that AS with a single event in it. When the AS comes back online Synapse will then send these transactions one at a time, which can mean that an AS can take a very very long time to catch up, even if its processing the events quickly. We should send transactions with multiple events instead (though not too many).
@Half-Shot
Copy link
Collaborator

Half-Shot commented Oct 20, 2020

Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.

I'm wondering if this actually helps though? Presumably sending batches of N size is also going to delay the appservice as soon as it hits some events it needs to do a lot of work on. To be honest, I'd rather appservices didn't block transaction processing, as we don't need to send the result of the processing back to the homeserver. The AS should really keep accepting new transactions in quick succession regardless of the event contents.

We should formally spec a max size of events though, because most appservices do limit the byte size of the payload.

If the AS is marked down then for every event persisted we create a new transaction for that AS with a single event in it. When the AS comes back online Synapse will then send these transactions one at a time, which can mean that an AS can take a very very long time to catch up, even if its processing the events quickly. We should send transactions with multiple events instead (though not too many).

This seems a bit silly yep, but feels like a Synapse problem (not a spec, the spec doesn't have opinions about retry behavior) where it could be more efficient about it's batching. You can't change the size of a transaction once you've attempted to send it, but subsequent transactions that are pending are mutable so you could keep extending the next transaction to be sent up to MAX_SIZE.

@turt2live
Copy link
Member

These are implementation details and unrelated to the spec - moving to the synapse repo.

@turt2live turt2live transferred this issue from matrix-org/matrix-spec-proposals Oct 20, 2020
@erikjohnston
Copy link
Member Author

These are implementation details and unrelated to the spec - moving to the synapse repo.

Woops! I really meant it to be in the Synapse repo :/

@erikjohnston
Copy link
Member Author

Synapse will include all events in the next transaction that have happened since the last transaction. This can lead to a scenario where we end up with large numbers of events in a single transaction, which the AS then takes a long time to process, causing the next transaction to also be very large, introducing large amounts of lag that the AS doesn't recover from. We should limit the number of events we bundle into a transaction.

I'm wondering if this actually helps though? Presumably sending batches of N size is also going to delay the appservice as soon as it hits some events it needs to do a lot of work on. To be honest, I'd rather appservices didn't block transaction processing, as we don't need to send the result of the processing back to the homeserver. The AS should really keep accepting new transactions in quick succession regardless of the event contents.

The problem is that you end up with a delay added on top of however long the AS takes to process an event. This delay is then never recovered from as the AS keeps struggling with large transactions (and can end up in a situation where it gets further and further behind because of larger and larger transactions). We've seen this reported in the wild.

If the AS returns a 200 OK immediately then a) you lose reliability if the AS dies/is restarted and b) there is no longer any back pressure if the AS starts falling behind, and Synapse will just continue spamming the AS with events until it falls over in a heap.

We should formally spec a max size of events though, because most appservices do limit the byte size of the payload.

There is a max size of events, its just that doesn't help if we don't bound the max number of events in a transaction

@Half-Shot
Copy link
Collaborator

These are implementation details and unrelated to the spec - moving to the synapse repo.

Ah, I did review this with a spec hat on so my comments may have been tainted with a spec side view - sorry :s.

The problem is that you end up with a delay added on top of however long the AS takes to process an event. This delay is then never recovered from as the AS keeps struggling with large transactions (and can end up in a situation where it gets further and further behind because of larger and larger transactions). We've seen this reported in the wild.

Hmm, I guess smaller transactions on recovery allows the AS to at least report to the HS that it's making progress, rather than flopping over because it didn't complete the entire transaction. That's a fair point. I would like to see what the AS could do to limit work done per event. There's no point blocking transactions while the bridge is waiting for the HS to finish a join or a leave, for example.

There is a max size of events, its just that doesn't help if we don't bound the max number of events in a transaction

Misspoke, I agree with this :).

@richvdh richvdh added the A-Application-Service Related to AS support label Oct 20, 2020
@richvdh
Copy link
Member

richvdh commented Oct 20, 2020

at least half of this is a duplicate of #3536

@Half-Shot
Copy link
Collaborator

#3536 is now fixed (we decided on a fixed 100 events per transaction) which solves concern 1.

@richvdh
Copy link
Member

richvdh commented Oct 29, 2020

@erikjohnston to clarify the remaining work here

@erikjohnston
Copy link
Member Author

Split out remaining work to #8704

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Application-Service Related to AS support
Projects
None yet
Development

No branches or pull requests

4 participants