Proposal: mitigate extremities accumulation by intelligently pruning extremities which do not contribute to state resolution results #5438
Description
While discussing #5319 with @erikjohnston, he proposed an alternative algorithm to solve #1760:
- When forward extremities start to accumulate for a room (e.g. more than 5 dangling extremities), we could query the remote servers who sent these events to find out what /their/ current extremities are.
- This tells us which of our extremities are 'current' (if any) and which have already been overtaken by events (hah)
- We then do state resolution locally across the subset of our extremities which are 'current', and compare it with the the result we get when we do state resolution across all the extremities.
- If the result is the same, we know we can safely drop the 'non-current' extremities for that room, and so try to keep the extremity count low.
Things which are nice about this is:
- Being able to query the extremities on other servers in order to rapidly resync a room after an outage would be a very nice feature, and would help solve Homeservers don't catch up with missed traffic until someone sends another event #2528.
The concerns I have with this are:
- We could end up doing a lot of state resolution calculations as we check the various combination of extremities to work out which ones can be pruned.
- If there are lots of servers actively sending msgs in a room, then my server could easily end up with lots of extremities from different servers, especially if our local server has been yoyoing. This algorithm never lets us prune extremities which originate from a range of servers.
- There are several scenarios where the pruning operation fails - e.g. if the remote server's extremities is disjunct with your local server's extremities, or if your non-current local extremities do turn out to be significant in calculating state resolution.
So while this might well help to heuristically keep the extremity count from building up, it's not guaranteed to work - and an attacker might be able to find ways of gaming the pruning algorithm to deliberately cause unprunable extremities to accumulate and so DoS us back to square 1. Whereas #5319 guarantees us to stop the extremities ever building up beyond a given threshold, and gives us a way to reliably reduce the extremity count to 2 (momentarily at least).
It really boils down to a trade-off between risk of overcomplex DAG and dummy events flying around (#5319) versus an incremental measure to prune irrelevant extremities which may be tough to calculate and may well not work (this).
Suggestion is to try implementing both as options per-server, and see how in practice they perform.