
MSC3944: Dropping stale send-to-device messages #3944

Draft · wants to merge 4 commits into main
Conversation

@uhoreg (Member) commented Dec 6, 2022

@uhoreg added the e2e, proposal, client-server, kind:maintenance and needs-implementation labels Dec 6, 2022
@uhoreg changed the title from "MSCxxxx: Dropping stale send-to-device messages" to "MSC3944: Dropping stale send-to-device messages" Dec 6, 2022
@bradtgmurray commented:

This looks great, and I think it would help us out over at Beeper. We've been seeing clients (especially Element Web) generate a fair amount of room key chatter, and mobile apps that aren't actively syncing at the time end up with a large backlog of requests and cancellations when they do sync again. Since Synapse also limits the number of to_device events to 100 per response, this means we have to do several back-to-back syncs before we can consider the app "caught up".

Comment on lines 24 to 41
- `m.room_key_request` with `action: request`, if there is a corresponding
`m.room_key_request` message with `action: request_cancellation` and the same
`request_id` and `requesting_device_id` fields, sent by the same user. If
the request message is dropped, the cancellation message is dropped as well.
- `m.secret.request` with `action: request`, if there is a corresponding
`m.secret.request` with `action: request_cancellation` and the same
`request_id` and `requesting_device_id` fields, sent by the same user. If
the request message is dropped, the cancellation message is dropped as well.
- `m.key.verification.request`, if there is a corresponding
`m.key.verification.cancel` with the same `transaction_id` field, sent by the
same user. If the request message is dropped, the cancellation message is
dropped as well.
- `m.key.verification.request` when the `timestamp` is more than 10 minutes in
the past. Since clients are to ignore verification requests that are older
than this, they can be safely dropped. The server may wish to remember the
`transaction_id` and sender for some time, so that it can also drop the
`m.key.verification.cancel` message with the same `transaction_id` and
sender, if such a message gets sent.
Member commented:

It's a bit frustrating that we need to set out special rules for each type of to-device message, rather than laying out a generic framework for cancellable and expiring to-device messages. It feels like something of a layering violation, and feels like we're always going to be extending this list (with consequent extra work for server implementations).
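
To make the per-type rules quoted above concrete, here is a minimal sketch of how a server might prune a device's pending queue under them. The message shape (dicts with `sender`, `type` and `content`) and the `prune_to_device_queue` helper are illustrative assumptions, not an existing server API or an algorithm mandated by the MSC.

```python
import time

VERIFICATION_MAX_AGE_MS = 10 * 60 * 1000  # clients ignore verification requests older than this

def prune_to_device_queue(messages, now_ms=None):
    """Drop pending to-device messages that the rules above make redundant.

    `messages` is assumed to be a list of dicts with `sender`, `type` and
    `content` keys, oldest first; a real server's storage model will differ.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    def req_key(msg):
        c = msg["content"]
        return (msg["sender"], c.get("request_id"), c.get("requesting_device_id"))

    def verif_key(msg):
        return (msg["sender"], msg["content"].get("transaction_id"))

    # First pass: note which requests are present in the queue.
    pending_requests = {"m.room_key_request": set(), "m.secret.request": set()}
    pending_verifications = set()
    for msg in messages:
        action = msg["content"].get("action")
        if msg["type"] in pending_requests and action == "request":
            pending_requests[msg["type"]].add(req_key(msg))
        elif msg["type"] == "m.key.verification.request":
            pending_verifications.add(verif_key(msg))

    # Second pass: a cancellation that matches a pending request means both
    # the request and the cancellation can be dropped.
    dropped = set()
    for msg in messages:
        action = msg["content"].get("action")
        if msg["type"] in pending_requests and action == "request_cancellation":
            if req_key(msg) in pending_requests[msg["type"]]:
                dropped.add((msg["type"], req_key(msg)))
        elif msg["type"] == "m.key.verification.cancel":
            if verif_key(msg) in pending_verifications:
                dropped.add(("m.key.verification.request", verif_key(msg)))

    # Third pass: filter the queue.
    kept = []
    for msg in messages:
        t, c = msg["type"], msg["content"]
        if t in pending_requests and (t, req_key(msg)) in dropped:
            continue  # request/cancellation pair: drop both halves
        if t in ("m.key.verification.request", "m.key.verification.cancel") and \
                ("m.key.verification.request", verif_key(msg)) in dropped:
            continue  # cancelled verification request and its cancellation
        if t == "m.key.verification.request" and \
                now_ms - c.get("timestamp", now_ms) > VERIFICATION_MAX_AGE_MS:
            continue  # stale verification request; clients would ignore it anyway
        kept.append(msg)
    return kept
```
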

the `requesting_device_id` (in the case of `m.room_key_request` and
`m.secret.request`), or `device_id` (in the case of
`m.key.verification.request`) refers to a device that no longer exists.

Member commented:

I would extend this to include semantically equivalent to-device messages. There's real-world evidence that some clients will send request cancellations with the exact same request ID / requesting device ID many (>40) times.
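
A possible shape for that extension, as a hedged sketch rather than anything in the MSC text: collapse semantically identical request/cancellation messages so that repeated copies of the same cancellation are delivered at most once. The `dedupe_equivalent` helper and the message shape are assumptions, matching the sketch earlier in this thread.

```python
def dedupe_equivalent(messages):
    """Keep only the first of semantically identical request/cancellation messages."""
    seen = set()
    kept = []
    for msg in messages:
        c = msg["content"]
        if msg["type"] in ("m.room_key_request", "m.secret.request"):
            # Identity is the tuple the MSC already matches on, plus the action.
            key = (msg["type"], msg["sender"], c.get("action"),
                   c.get("request_id"), c.get("requesting_device_id"))
            if key in seen:
                continue  # an identical copy was already queued
            seen.add(key)
        kept.append(msg)
    return kept
```
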

@@ -0,0 +1,70 @@
# MSC3944: Dropping stale send-to-device messages
@kegsay (Member) commented Dec 12, 2022

I'm going to implement this in the sliding sync proxy because it is aligned with the core bandwidth/speed goals of MSC3575. Before implementing it, though, I did some analysis on the size and kinds of to-device events that stack up on the proxy for clients:

  • The largest stack had 237852 to-device events pending. This is uncommon, and the counts drop off rapidly; the median is somewhere in the 100s.
  • Breaking down by event type, the vast majority, by several orders of magnitude, are m.room_key_request events:
         event_type         | count  
----------------------------+--------
 m.room_key_request         | 121067
 m.room.encrypted           |    256
 m.secret.request           |     51
 m.key.verification.cancel  |     16
 m.key.verification.request |     13
  • By following the semantics in this MSC, I found that:
      28089 to-device messages
      14495 unique device|req_id tuples
      1564 outstanding requests
      12931 cancelled requests
      260 multi-cancelled requests
      potential savings: removing 26525 messages or 94.431984050696 %
  • "multi-cancelled" requests are identical request cancellation events for the same request ID / requesting device ID.
  • The 94% figure is consistent on large (100,000s events) accounts as well as mid-range (10,000s events) accounts. The smaller the stack, the less the benefit. On a device with 204 pending events, this MSC would remove 99 of them (48%).

There is real and immediate benefit to be had by implementing this MSC everywhere, particularly for m.room_key_request events, which is the event type this analysis focused on.
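
For context on how figures like these could be derived, here is a rough, hypothetical reconstruction of the counting over pending m.room_key_request events. The event shape is assumed, and the actual analysis ran against the sliding sync proxy's own store rather than code like this.

```python
from collections import Counter

def summarise_room_key_requests(events):
    """Tally pending m.room_key_request traffic along the lines reported above."""
    requests = set()
    cancels = Counter()
    total = 0
    for ev in events:
        if ev["type"] != "m.room_key_request":
            continue
        total += 1
        c = ev["content"]
        key = (ev["sender"], c.get("requesting_device_id"), c.get("request_id"))
        if c.get("action") == "request":
            requests.add(key)
        elif c.get("action") == "request_cancellation":
            cancels[key] += 1

    cancelled = {k for k in cancels if k in requests}
    multi_cancelled = {k for k, n in cancels.items() if n > 1}
    outstanding = requests - cancelled
    # A cancelled request is removable along with every copy of its cancellation.
    removable = sum(1 + cancels[k] for k in cancelled)
    return {
        "to-device messages": total,
        "unique device|req_id tuples": len(requests | set(cancels.keys())),
        "outstanding requests": len(outstanding),
        "cancelled requests": len(cancelled),
        "multi-cancelled requests": len(multi_cancelled),
        "potential savings %": 100.0 * removable / total if total else 0.0,
    }
```
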
