-
Notifications
You must be signed in to change notification settings - Fork 379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC2638: Ability for clients to request homeservers to resync device lists #2638
Changes from 1 commit
bc02e25
1c45e80
0f55227
b60e426
5b174e0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
…lists
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# MSC2638: Ability for clients to request homeservers to resync device lists | ||
|
||
With the recent roll out of cross-signing, | ||
[some](https://github.com/matrix-org/synapse/issues/7418) | ||
[bugs](https://github.com/matrix-org/synapse/issues/7504) in Synapse were | ||
reported around the way it was handling failures when processing device list | ||
updates started hitting users, causing local copies of remote device lists to | ||
grow stale. | ||
|
||
Remote device lists growing stale can happen as soon as a homeserver | ||
implementation implements processing of device list updates improperly, or | ||
without the proper retry mechanisms. It can also happen if a homeserver was down | ||
for enough time to miss out on device list updates. | ||
|
||
The main consequence of this situation is rendering it impossible for impacted | ||
users to use cross-signing/end-to-end encryption for some remote devices, or | ||
even to use it at all. | ||
|
||
When fixing this issue, it is currently quite difficult for a homeserver to | ||
figure out which users it should resync the device lists of. However, it is much | ||
easier for clients to see if a device list is likely out of date (e.g. if it has | ||
cross-signing keys but no signatures), and almost trivial for users to do so | ||
(e.g. if someone tells me "I've enabled cross-signing" but I can't see their | ||
cross-signing keys, then I know something has gone wrong). | ||
|
||
In the current situation, the best answer we can give to a user seeing stale | ||
device lists is to ask their server admin to get their instance to somehow (the | ||
details on how to do it depend on the implementation) resync the device list(s) | ||
for the impacted user(s), which isn't an acceptable solution. | ||
|
||
For clarity, "resync" here means for a homeserver to ask the latest version of a | ||
user's device list to their homeserver, using the [`GET | ||
/_matrix/federation/v1/user/devices/{userId}`](https://matrix.org/docs/spec/server_server/latest#get-matrix-federation-v1-user-devices-userid) | ||
federation endpoint. "device list update" here describes both a | ||
`m.device_list_update` EDU and a `m.signing_key_update` one. | ||
|
||
The proposal descibed here fixes this issue by adding a new endpoint to the | ||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
client-server API, allowing clients to request homeservers to resync device | ||
lists. Clients could then either expose a button a user can use when necessary, | ||
or call this endpoint on specific actions, or both. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Counter proposal (that I'm not convinced entirely on): implementations could be smarter about how they cache/detect failure. If a client hits the device list endpoint several times (where several is subjective), the server could realize that something may be wrong and re-sync. Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures, and request it automatically. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't mention anywhere that servers detect that kind of things. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't think that would solve the issue. If we miss a bunch of updates because we went down, when we come back up there's no way to know what we've missed, and I wouldn't be surprised if there were other situations where it would be tricky for the homeserver to figure out when to resync. This proposal is meant as a way for clients to tell the server "you didn't detect this failure correctly, but it failed, please resync". fwiw this counter proposal is already described in the "Alternatives" section of this MSC, with a summary of the explanation above. |
||
|
||
## Proposal | ||
|
||
This proposal adds the following endpoint to the client-server API: | ||
|
||
### `GET /_matrix/client/r0/user/{userId}/devices` | ||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
In this request, `{userId}` is the full Matrix ID of the user to retrieve the | ||
device list for. | ||
|
||
The homeserver responds to a request to this endpoint with a 202 Accepted | ||
response code and an empty body (`{}`). | ||
Comment on lines
+51
to
+52
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should also support rate limiting (strongly) and error codes/error ranges for typical failures, like the homeserver being invalid, dead, etc or the user being invalid, local, etc. If this is meant to work for local users too, that should be specified. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @babolivier this is where the rate limiting needs to be mentioned. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, thanks. That sounded like a security thing to me so I wrote it in the security section. I'll move it here. |
||
|
||
Upon request to this client-side endpoint, the homeserver would send a request | ||
to [`GET | ||
/_matrix/federation/v1/user/devices/{userId}`](https://matrix.org/docs/spec/server_server/latest#get-matrix-federation-v1-user-devices-userid) | ||
on the target user's homeserver to retrieve the device list. | ||
|
||
If the request to the aforementioned federation endpoint succeeds (i.e. the | ||
destination responds with a 200 response code and a device list for the target | ||
user), then the homeserver would include the updated device list in sync | ||
responses in order to relay it to clients. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The fact that you have to hit the endpoint and then wait for something to come down |
||
|
||
If the request to the aforementioned federation endpoint fails (i.e. the | ||
destination responds with a non-200 response code or times out), then the | ||
homeserver should relay this error to clients, but should also schedule retries | ||
for this request in case the destination starts working again. In case one of | ||
these retries succeeds (i.e. the destination responds with a 200 response code | ||
and a device list for the target user), the homeserver would include the updated | ||
device list in sync responses. | ||
|
||
Clients can then make requests to this endpoint, either upon an explicit | ||
request from the user (e.g. clicking a button) or while performing other actions | ||
(e.g. creating an encrypted 1:1 chat, processing an invite in a group chat, | ||
sending a verification request, etc.). | ||
|
||
XXX: I'm unsure whether the spec should be mandating when a client uses this | ||
endpoint or whether this is left as an implementation detail. | ||
Comment on lines
+77
to
+78
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. implementation detail. The spec is a toolbag, and it's your responsibility to grab a hammer to put a nail in the wall. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the clarification! |
||
|
||
|
||
## Alternatives | ||
|
||
An alternative to this proposal is to let homeservers figure out when a good | ||
time to resync a device list is. However, it can be difficult for it to do so, | ||
especially as it can't see some of the actions a resync would be most useful for | ||
(e.g. sending a verification request, which are encrypted in encrypted rooms). | ||
It therefore makes more sense to do this on the client-side, and it also adds | ||
the ability for the user to figure out when a device list needs to be resynced. | ||
|
||
Another approach would be to have the homeserver directly relay the response to | ||
the client in the response to a `GET /_matrix/client/r0/user/{userId}/devices` | ||
request. This approach has been dropped in order to stay consistent in how we | ||
relay device list updates to users. | ||
|
||
|
||
## Security considerations | ||
|
||
This could provide a way for users to flood remote homeservers with federated | ||
device list requests. However, this can be easily mitigated with rate-limiting | ||
and a short-lived cache on the homeserver. | ||
|
||
|
||
## Unstable prefix | ||
|
||
Until this MSC is merged and included in a stable release of the spec, | ||
implementations should use the route | ||
`GET /_matrix/unstable/org.matrix.mscXXXX/user/{userId}/devices`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely this is an implementation bug if it doesn't send them once the server returns? The spec doesn't say to give up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementations are free to decide when they retry. Currently Synapse isn't too clever about that (related: matrix-org/synapse#2526) - afaik it retries with a backoff meaning that if you've been down for long enough you could have to wait a long time (I think we go up to 24h but I'm not 100% certain) without getting the transactions it's missed. Also I don't think the "retry as soon as the server has returned" behaviour is mandated in the spec so it doesn't entirely sounds like an implementation bug to me.
Well surely implementations will give up at some point. It sounds like a gigantic waste of resources to e.g. retry 100s of servers that have been down for weeks. And if the spec doesn't specifically mandate not to give up (which again would feel a bit meh to me) then imho it just means that the homeserver is free to decide whether to give up and when to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all pass through the transaction endpoints in the federation spec, which does recommend backing off but doesn't say that the sending server should ditch the request and start at a more convenient place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not seeing how that makes the issue this MSC tries to solve disappear TBH. Both backing off for hours/days or considering the server down and stopping sending it transactions entirely if it's been down for long enough don't sound unreasonable nor do they look like implementation bugs to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with @turt2live that it feels wrong that we have to bodge around devicelists being unreliable by exposing an endpoint for clients to manually refresh them. It'd be better for device lists to be reliable in the first place, which means fixing matrix-org/synapse#2526 - which feels like a much better use of time. Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug? (cc @kegsay)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that (retrying pending transactions as soon as a server comes back to life) something that's enforced in the spec? If so I agree. Otherwise it feels like we're fixing the issue only for Synapse and ignoring people using an implementation that's handling that in another way (either because it's less advanced or by design) which feels wrong to me.
I tend to agree; my concern is that it's a feature that's easy to get wrong (as shown by the multiple Synapse bugs related to it), and any mishap ruins a whole part of the E2EE UX for a user - my opinion was that it could be nice to have a fallback just in case an implementation, whether it's Synapse or another one, screws up. Implementing inbound device list updates over federation requires to implement some things (e.g. retries) the spec doesn't mention, and screwing it up will likely leave the homeserver with a bunch of stale device lists it can't trivially fix even if the code gets fixed.
The proposal also addresses another issue, which is how to fix a server screwing up on top of just fixing the bug in the implementation. For example, with the multiple Synapse bugs, it's fair to say we're likely to have quite a few stale device lists, with no other way to fix them other than having the remote user rename a device (which isn't intuitive at all). Homeservers can't really figure out which device lists are stale without resyncing the device list of every single remote user they knows of, whereas it's (imho) much more trivial for clients to figure out when a device list is likely to be out of date.
FWIW the main cause of user grumpiness following the issues we had in Synapse was that users that aren't operating their own homeserver basically can't fix the stale device lists they could see for remote users, making the E2EE experience quite disappointing. It seems likely to me that this happens with other implementations in the future (or even in Synapse if we introduce a regression in that code by mistake).
I guess my point is that implementing device list updates is quite prone to bugs and design flaws and either can be very annoying to both server admins and users. If this is an issue then we should probably explore alternative designs for this feature.
There is one I know of (though apparently I haven't taken the time to open an issue for it yet) which is that Synapse expects updates to arrive sorted on their
stream_id
, which the spec doesn't enforce - so if Synapse receives updates in the wrong order and the other homeserver fails to respond to the resync request the device list will become stale until the remote server starts to successfully resync again (or we're retrying again, which can be hampered due to the current backoff and to matrix-org/synapse#2526).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unaware of any other issues which can cause them to get out of sync. However, I would push for an explicit resync mechanism over the proposed "just make
/send
reliable".The core issue here is one of timing - much like matrix-org/synapse#2526 - but there's also issues around unbounded growth of messages to send to servers. Unlike PDUs, servers don't have to store all these device list updates in their DB unless they are forced to, so making them free to discard them is desirable.
With regards to timing, the options are:
prune_outbound_device_list_pokes
synapse#7173 - I regard this option as eagerly sending.device_lists.changed
in/sync
) but no such mechanism exists in Federation. I regard this option as lazy sending as the keys are only fetched when requested by a client, even though they've been eagerly notified that there is a change in the device list (though we could be smarter about when we notify servers about the new device as we know the metadata on events). This would have also averted Cross-signing signatures not being always federated correctly synapse#7418 because instead of being told "Alice has a new device and here is the key" you'd be told "Alice has a new device" and then have to fetch the list (which would pull in any missing keys).As for the proposed endpoint, I don't think it's neccessary. We already have https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-query so just use that? We may need to add a
refresh
flag to stop servers from serving up local stale copies though.