
MSC2638: Ability for clients to request homeservers to resync device lists #2638

Commit bc02e25345511127c30bae8f88c448eaca272863, committed by babolivier on Jun 15, 2020
107 changes: 107 additions & 0 deletions proposals/2638-client-request-device-list.md
@@ -0,0 +1,107 @@
# MSC2638: Ability for clients to request homeservers to resync device lists

With the recent roll out of cross-signing,
[some](https://github.com/matrix-org/synapse/issues/7418)
[bugs](https://github.com/matrix-org/synapse/issues/7504) in Synapse were
reported around the way it was handling failures when processing device list
updates started hitting users, causing local copies of remote device lists to
grow stale.

Remote device lists growing stale can happen as soon as a homeserver
implementation implements processing of device list updates improperly, or
without the proper retry mechanisms. It can also happen if a homeserver was down
for enough time to miss out on device list updates.
Comment on lines +12 to +13
Member

It can also happen if a homeserver was down for enough time to miss out on device list updates.

Surely this is an implementation bug if it doesn't send them once the server returns? The spec doesn't say to give up.

Contributor Author (@babolivier, Jul 1, 2020)

Implementations are free to decide when they retry. Currently Synapse isn't too clever about that (related: matrix-org/synapse#2526) - as far as I know it retries with a backoff, meaning that if you've been down for long enough you could have to wait a long time (I think we go up to 24h but I'm not 100% certain) without getting the transactions it missed. Also I don't think the "retry as soon as the server has returned" behaviour is mandated in the spec, so it doesn't entirely sound like an implementation bug to me.

The spec doesn't say to give up.

Well surely implementations will give up at some point. It sounds like a gigantic waste of resources to e.g. retry 100s of servers that have been down for weeks. And if the spec doesn't specifically mandate not to give up (which again would feel a bit meh to me) then imho it just means that the homeserver is free to decide whether to give up and when to do so.

Member

These all pass through the transaction endpoints in the federation spec, which does recommend backing off but doesn't say that the sending server should ditch the request and start at a more convenient place.

Contributor Author

I'm still not seeing how that makes the issue this MSC tries to solve disappear, TBH. Neither backing off for hours or days, nor considering the server down and stopping sending it transactions entirely once it's been down for long enough, sounds unreasonable or looks like an implementation bug to me.

Member

i agree with @turt2live that it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them. It'd be better for device lists to be reliable in the first place, which means fixing matrix-org/synapse#2526 - which feels like a much better use of time. Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug? (cc @kegsay)

Contributor Author

fixing matrix-org/synapse#2526 - which feels like a much better use of time

Is that (retrying pending transactions as soon as a server comes back to life) something that's enforced in the spec? If so I agree. Otherwise it feels like we're fixing the issue only for Synapse and ignoring people using an implementation that's handling that in another way (either because it's less advanced or by design) which feels wrong to me.

it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them

I tend to agree; my concern is that it's a feature that's easy to get wrong (as shown by the multiple Synapse bugs related to it), and any mishap ruins a whole part of the E2EE UX for a user - my opinion was that it could be nice to have a fallback just in case an implementation, whether it's Synapse or another one, screws up. Implementing inbound device list updates over federation requires implementing some things (e.g. retries) the spec doesn't mention, and screwing it up will likely leave the homeserver with a bunch of stale device lists it can't trivially fix even if the code gets fixed.

The proposal also addresses another issue, which is how to fix a server screwing up, on top of just fixing the bug in the implementation. For example, with the multiple Synapse bugs, it's fair to say we're likely to have quite a few stale device lists, with no way to fix them other than having the remote user rename a device (which isn't intuitive at all). Homeservers can't really figure out which device lists are stale without resyncing the device list of every single remote user they know of, whereas it's (imho) much more trivial for clients to figure out when a device list is likely to be out of date.
FWIW the main cause of user grumpiness following the issues we had in Synapse was that users that aren't operating their own homeserver basically can't fix the stale device lists they see for remote users, making the E2EE experience quite disappointing. It seems likely to me that this will happen with other implementations in the future (or even in Synapse again, if we introduce a regression in that code by mistake).

I guess my point is that implementing device list updates is quite prone to bugs and design flaws, either of which can be very annoying to both server admins and users. If this is an issue then we should probably explore alternative designs for this feature.

Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug?

There is one I know of (though apparently I haven't taken the time to open an issue for it yet): Synapse expects updates to arrive sorted by their stream_id, which the spec doesn't enforce. So if Synapse receives updates in the wrong order, and the other homeserver fails to respond to the resync request, the device list will become stale until the remote server starts successfully resyncing again (or until we retry, which can be hampered by the current backoff and by matrix-org/synapse#2526).
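For reference, the federation spec's gap-detection rule for `m.device_list_update` EDUs (each EDU lists the `prev_id`s it follows; an unrecognised `prev_id` means an update was lost and the receiver should resync) can be sketched as follows. The helper name and data shapes are illustrative, not from any real implementation:

```python
def handle_device_list_update(seen_ids, update):
    """Process one m.device_list_update EDU for a given remote user.

    `seen_ids` is the set of stream_ids already processed for this user.
    Returns (new_seen_ids, needs_resync): if any prev_id has not been
    seen, an EDU was lost and the receiver should resync the full list
    via GET /_matrix/federation/v1/user/devices/{userId}.
    """
    missing = [p for p in update.get("prev_id", []) if p not in seen_ids]
    if missing:
        # Gap detected: discard the partial history and resync instead.
        return seen_ids, True
    return seen_ids | {update["stream_id"]}, False
```

Note this handles out-of-order arrival only indirectly: an EDU whose predecessor hasn't arrived yet is treated as a gap, which is exactly the case where a missed resync leaves the list stale.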

Member

I'm unaware of any other issues which can cause them to get out of sync. However, I would push for an explicit resync mechanism over the proposed "just make /send reliable".

The core issue here is one of timing - much like matrix-org/synapse#2526 - but there are also issues around unbounded growth of messages to send to servers. Unlike PDUs, servers don't have to store all these device list updates in their DB unless they are forced to, so making them free to discard them is desirable.

With regards to timing, the options are:

  • Sending server immediately informs other servers of device list updates and backs off/retries if they are down. This causes flooding issues (synapse#7721, "Performance of device list updates could be improved (aka /login is unusably slow)"), and the retry mechanism has been a source of bugs and subtleties around the unbounded growth of messages (e.g. synapse#7173, "Synapse relies on unspecced behaviour for the remote server to resync device lists after prune_outbound_device_list_pokes"). I regard this option as eagerly sending.
  • Receiving client requests keys as and when it determines that it needs the device keys. This doesn't cause flooding issues when someone updates their devices, as people will gradually request keys. However, sometimes clients will not know that they are missing keys (e.g. signing in on a new device), so it would need to be augmented with some sort of ping/poke to inform clients that there are new devices. We already have this mechanism in the CS API (`device_lists.changed` in `/sync`) but no such mechanism exists in federation. I regard this option as lazy sending, as the keys are only fetched when requested by a client, even though clients are eagerly notified that there is a change in the device list (though we could be smarter about when we notify servers about the new device, as we know the metadata on events). This would also have averted synapse#7418 ("Cross-signing signatures not being always federated correctly"), because instead of being told "Alice has a new device and here is the key" you'd be told "Alice has a new device" and then have to fetch the list (which would pull in any missing keys).

As for the proposed endpoint, I don't think it's necessary. We already have https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-query so just use that? We may need to add a refresh flag to stop servers from serving up local stale copies, though.


The main consequence of this situation is rendering it impossible for impacted
users to use cross-signing/end-to-end encryption for some remote devices, or
even to use it at all.

When fixing this issue, it is currently quite difficult for a homeserver to
figure out which users it should resync the device lists of. However, it is much
easier for clients to see if a device list is likely out of date (e.g. if it has
cross-signing keys but no signatures), and almost trivial for users to do so
(e.g. if someone tells me "I've enabled cross-signing" but I can't see their
cross-signing keys, then I know something has gone wrong).
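The client-side heuristic described above (cross-signing keys present but no signatures) could look roughly like this. The function name and exact key shapes are hypothetical, loosely following the `/keys/query` response format:

```python
def looks_stale(master_key, self_signing_key_id, devices):
    """Heuristic: a cached remote device list is suspect if the user
    advertises a cross-signing master key but none of their devices
    carries a signature from their self-signing key.

    `master_key` is the user's master key dict (or None if the user
    doesn't use cross-signing); `devices` is a list of device-key dicts.
    """
    if master_key is None:
        return False  # no cross-signing: nothing to check against
    user_id = master_key["user_id"]
    for device in devices:
        sigs = device.get("signatures", {}).get(user_id, {})
        if self_signing_key_id in sigs:
            return False  # at least one device is cross-signed
    # Cross-signing keys are present but nothing is signed: likely stale.
    return bool(devices)
```

A client using this check could then surface a "refresh device list" action only when the heuristic fires, rather than unconditionally.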

In the current situation, the best answer we can give to a user seeing stale
device lists is to ask their server admin to get their instance to somehow (the
details on how to do it depend on the implementation) resync the device list(s)
for the impacted user(s), which isn't an acceptable solution.

For clarity, "resync" here means for a homeserver to ask a user's homeserver for the latest version of that user's device list, using the [`GET
/_matrix/federation/v1/user/devices/{userId}`](https://matrix.org/docs/spec/server_server/latest#get-matrix-federation-v1-user-devices-userid)
federation endpoint. "Device list update" here describes both an
`m.device_list_update` EDU and an `m.signing_key_update` one.

The proposal described here fixes this issue by adding a new endpoint to the
client-server API, allowing clients to request homeservers to resync device
lists. Clients could then either expose a button a user can use when necessary,
or call this endpoint on specific actions, or both.
Member (@turt2live, Jul 1, 2020)

Counter proposal (that I'm not convinced entirely on): implementations could be smarter about how they cache/detect failure. If a client hits the device list endpoint several times (where several is subjective), the server could realize that something may be wrong and re-sync. Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures, and request it automatically.

Contributor Author

Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures

I don't mention anywhere that servers detect that kind of thing.

Contributor Author

implementations could be smarter about how they cache/detect failure

I don't think that would solve the issue. If we miss a bunch of updates because we went down, when we come back up there's no way to know what we've missed, and I wouldn't be surprised if there were other situations where it would be tricky for the homeserver to figure out when to resync. This proposal is meant as a way for clients to tell the server "you didn't detect this failure correctly, but it failed, please resync".

fwiw this counter proposal is already described in the "Alternatives" section of this MSC, with a summary of the explanation above.


## Proposal

This proposal adds the following endpoint to the client-server API:

### `GET /_matrix/client/r0/user/{userId}/devices`

In this request, `{userId}` is the full Matrix ID of the user to retrieve the
device list for.

The homeserver responds to a request to this endpoint with a 202 Accepted
response code and an empty body (`{}`).
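As a sketch, a client would construct the request to the proposed endpoint like this (the URL shape follows the endpoint above; the helper name is illustrative):

```python
import urllib.parse


def resync_url(base_url, user_id):
    """Build the URL a client would GET to request a device-list resync
    for `user_id` under the proposed endpoint. The user ID is
    percent-encoded since it contains '@' and ':'.
    """
    return "%s/_matrix/client/r0/user/%s/devices" % (
        base_url,
        urllib.parse.quote(user_id, safe=""),
    )


# The homeserver would answer 202 Accepted with an empty JSON body ({}),
# and the refreshed device list would later arrive via /sync.
print(resync_url("https://hs.example", "@bob:remote.example"))
```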
Comment on lines +51 to +52
Member

This should also support rate limiting (strongly) and error codes/error ranges for typical failures, like the homeserver being invalid, dead, etc or the user being invalid, local, etc.

If this is meant to work for local users too, that should be specified.

Member

@babolivier this is where the rate limiting needs to be mentioned.

Contributor Author

Right, thanks. That sounded like a security thing to me so I wrote it in the security section. I'll move it here.


Upon request to this client-side endpoint, the homeserver would send a request
to [`GET
/_matrix/federation/v1/user/devices/{userId}`](https://matrix.org/docs/spec/server_server/latest#get-matrix-federation-v1-user-devices-userid)
on the target user's homeserver to retrieve the device list.

If the request to the aforementioned federation endpoint succeeds (i.e. the
destination responds with a 200 response code and a device list for the target
user), then the homeserver would include the updated device list in sync
responses in order to relay it to clients.
Contributor (@neilalexander, Jul 29, 2020)

The fact that you have to hit the endpoint and then wait for something to come down /sync is still quite potentially racy and feels like a bad smell. What is the client supposed to do in the meantime? Will a client be able to retry the sync and not lose the device keys? Is the response to /devices undefined in the meantime - how do you know if it has updated?


If the request to the aforementioned federation endpoint fails (i.e. the
destination responds with a non-200 response code or times out), then the
homeserver should relay this error to clients, but should also schedule retries
for this request in case the destination starts working again. In case one of
these retries succeeds (i.e. the destination responds with a 200 response code
and a device list for the target user), the homeserver would include the updated
device list in sync responses.
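The retry schedule is left to implementations; one typical choice is capped exponential backoff, sketched below (the base interval, cap, and attempt count are illustrative, not mandated by this proposal):

```python
def retry_delays(base=2.0, cap=3600.0, max_attempts=10):
    """Yield a capped exponential-backoff schedule (in seconds) for
    retrying the federation resync request after a failure.
    """
    delay = base
    for _ in range(max_attempts):
        yield min(delay, cap)
        delay *= 2
```

A homeserver would sleep for each yielded delay between attempts, stopping early as soon as the destination responds with a 200 and a device list.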

Clients can then make requests to this endpoint, either upon an explicit
request from the user (e.g. clicking a button) or while performing other actions
(e.g. creating an encrypted 1:1 chat, processing an invite in a group chat,
sending a verification request, etc.).

XXX: I'm unsure whether the spec should be mandating when a client uses this
endpoint or whether this is left as an implementation detail.
Comment on lines +77 to +78
Member

implementation detail. The spec is a toolbag, and it's your responsibility to grab a hammer to put a nail in the wall.

Contributor Author

Thanks for the clarification!



## Alternatives

An alternative to this proposal is to let homeservers figure out when a good
time to resync a device list is. However, it can be difficult for them to do so,
especially as they can't see some of the actions a resync would be most useful
for (e.g. verification requests, which are encrypted in encrypted rooms).
It therefore makes more sense to do this on the client side, which also gives
the user the ability to decide when a device list needs to be resynced.

Another approach would be to have the homeserver directly relay the response to
the client in the response to a `GET /_matrix/client/r0/user/{userId}/devices`
request. This approach has been dropped in order to stay consistent in how we
relay device list updates to users.


## Security considerations

This could provide a way for users to flood remote homeservers with federated
device list requests. However, this can be easily mitigated with rate-limiting
and a short-lived cache on the homeserver.
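One possible shape for that mitigation is sketched below: a hypothetical `ResyncCache` class with an illustrative 60-second TTL, so that repeated client requests for the same remote user within the TTL collapse into a single federation request:

```python
import time


class ResyncCache:
    """Short-lived per-target cache: suppress duplicate federation
    resync requests for the same remote user within `ttl` seconds.
    (Class name and TTL are illustrative, not part of the proposal.)
    """

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable for testing
        self._last = {}      # user_id -> time of last resync start

    def should_resync(self, user_id):
        """True if no resync for user_id started within the TTL;
        records the attempt so subsequent calls are suppressed."""
        now = self._clock()
        last = self._last.get(user_id)
        if last is not None and now - last < self._ttl:
            return False
        self._last[user_id] = now
        return True
```

A homeserver would check `should_resync` before hitting the federation endpoint, in addition to ordinary per-client rate limiting on the new endpoint itself.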


## Unstable prefix

Until this MSC is merged and included in a stable release of the spec,
implementations should use the route
`GET /_matrix/unstable/org.matrix.mscXXXX/user/{userId}/devices`.