OTK fetching failure woes (attempting to establish Olm sessions too often) #281
Comments
An alternative idea to the timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, and so on. The factor should be fine-tuned, of course.
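A minimal sketch of what that message-count-based backoff could look like (the types, names, and the doubling factor of 2 are placeholders, not actual SDK code):

```rust
use std::collections::HashMap;

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// Backoff state for one device whose one-time key claim failed.
struct BackoffState {
    /// Message number of the very first failure (the N in the comment above).
    first_failure: u64,
    /// Offset from `first_failure` at which the next retry is allowed,
    /// doubling on every failed attempt: N+1, N+2, N+4, N+8, ...
    next_offset: u64,
}

#[derive(Default)]
struct MessageCountBackoff {
    failures: HashMap<UserDevice, BackoffState>,
}

impl MessageCountBackoff {
    /// Should we retry claiming OTKs for this device while sending message `n`?
    fn should_retry(&self, device: &UserDevice, n: u64) -> bool {
        match self.failures.get(device) {
            Some(state) => n >= state.first_failure + state.next_offset,
            None => true,
        }
    }

    /// Record a failed claim at message `n`, doubling the retry offset.
    fn record_failure(&mut self, device: UserDevice, n: u64) {
        self.failures
            .entry(device)
            .and_modify(|state| state.next_offset *= 2)
            .or_insert(BackoffState { first_failure: n, next_offset: 1 });
    }

    /// Forget the backoff once an Olm session was finally established.
    fn record_success(&mut self, device: &UserDevice) {
        self.failures.remove(device);
    }
}
```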
Also, if it fails on message N but we only manage to fetch OTKs and establish an Olm channel on message N+M, we'll only send the room key in the N+M ratcheted state. We should instead send it in state N.
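A rough sketch of one way to remember the ratchet position from message N until the Olm channel finally gets established (all types and fields here are hypothetical, not the SDK's real ones):

```rust
use std::collections::BTreeMap;

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// What we'd need to remember per device whose Olm session couldn't be
/// established: which Megolm session to share and from which message index.
struct PendingRoomKeyShare {
    session_id: String,
    /// Ratchet index at message N, i.e. at the first failed share attempt.
    /// Sharing from this index lets the device decrypt messages N..=current.
    message_index: u32,
}

#[derive(Default)]
struct PendingShares {
    pending: BTreeMap<UserDevice, PendingRoomKeyShare>,
}

impl PendingShares {
    /// Record the first failure only, so the *earliest* index is kept.
    fn record_failure(&mut self, device: UserDevice, session_id: String, index: u32) {
        self.pending
            .entry(device)
            .or_insert(PendingRoomKeyShare { session_id, message_index: index });
    }

    /// Once an Olm session finally exists, take the remembered state so the
    /// key can be shared from `message_index` instead of the current state.
    fn take(&mut self, device: &UserDevice) -> Option<PendingRoomKeyShare> {
        self.pending.remove(device)
    }
}
```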
One flaw with this approach is that the app might be suspended for quite a while, so a significant amount of time would pass. The timeout approach would notice this and retry sooner, while exponential backoff counting message numbers would not. I agree that it would be neat to make this completely deterministic based on the message number, but I don't think it would produce the desired behavior.
See also element-hq/element-web#24138 for how not to do it in EW.
What I implemented, albeit not yet pushed, is an exponential backoff with a max timeout of 15 minutes. That handles the transient-failure case. The other failure mode that we still need to handle before this issue can be closed is OTK exhaustion. But it's better to retry too often than to do what EW does, especially in this age of fallback keys, where OTK exhaustion is a relic of history.
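In spirit, the wall-clock variant might look something like this (a sketch with illustrative names and base delay, not the actual patch):

```rust
use std::time::{Duration, Instant};

/// Retry delay is capped at 15 minutes, as described above.
const MAX_DELAY: Duration = Duration::from_secs(15 * 60);

/// Exponential backoff for a single device's one-time key claims.
struct ClaimBackoff {
    last_failure: Option<Instant>,
    delay: Duration,
}

impl ClaimBackoff {
    fn new() -> Self {
        Self { last_failure: None, delay: Duration::from_secs(1) }
    }

    /// Whether enough time has passed since the last failed claim to retry.
    fn should_retry(&self, now: Instant) -> bool {
        match self.last_failure {
            Some(failed_at) => now.duration_since(failed_at) >= self.delay,
            None => true,
        }
    }

    /// Double the delay on every failure, but never exceed the 15-minute cap.
    fn record_failure(&mut self, now: Instant) {
        self.last_failure = Some(now);
        self.delay = (self.delay * 2).min(MAX_DELAY);
    }

    /// Reset the backoff once a session was established.
    fn record_success(&mut self) {
        self.last_failure = None;
        self.delay = Duration::from_secs(1);
    }
}
```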
Re-opening since the OTK-exhaustion case is still posing problems and wasn't handled as part of #1315. Since the server won't tell us which user/device pairs don't have an OTK, we'll have to remember the users for which we sent out a one-time key claim request.

I think that one way to handle this nicely API-wise is to use the following pattern:

```rust
struct MissingSessions {
    users: BTreeMap<UserId, DeviceId>,
    machine: OlmMachine,
}

impl MissingSessions {
    pub fn request(&self) -> KeysClaimRequest {
        ...
    }

    pub fn mark_request_as_sent(&self, response: &KeysClaimResponse) -> Result<(), CryptoStoreError> {
        self.machine.receive_keys_query(self.users, response)
    }
}

impl OlmMachine {
    pub fn get_missing_sessions(
        &self,
        users: impl IntoIterator<Item = &UserId>,
    ) -> Result<MissingSessions, CryptoStoreError> {
        ...
    }
}
```
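If I'm reading the pattern right, the calling side would then look roughly like this (`send_to_server` is a placeholder for whatever transport the client uses, not a real SDK function):

```rust
// Hypothetical usage of the pattern sketched above.
fn establish_missing_sessions(
    machine: &OlmMachine,
    users: &[&UserId],
) -> Result<(), CryptoStoreError> {
    // Holding a `MissingSessions` ties the claim request and its response
    // handling together, so the "mark as sent" step can't be forgotten.
    let missing = machine.get_missing_sessions(users.iter().copied())?;

    let request = missing.request();
    let response = send_to_server(request); // placeholder transport call

    // Feeding the response back lets the machine remember which user/device
    // pairs still have no OTK, so they aren't retried on the very next message.
    missing.mark_request_as_sent(&response)
}
```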
For context, here is a rageshake from a user with a device that has no OTKs (including no fallback key). The problematic device is some EAX build.
This sounds very bad, and I don't see any explanation of it in the issue. Can you clarify?
@richvdh: It's referring to the contents of this comment.
Ah, the wrong Megolm ratchet. I thought it meant the Olm ratchet!
Out of interest, I looked at the anatomy of a slow key-claim request.
I am tempted to write an MSC to modify this behaviour. Will think on it a little more.
So I opened an MSC (matrix-org/matrix-spec-proposals#4072), but I think it's actually pointless, and we might as well fix this on the client side.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? I'm talking about our inability to correctly heal from a transient network failure, because we'll send the incorrect (advanced) Megolm ratchet state. I realise the original issue may have been a bit confusing due to bundling these concerns, but we shouldn't be losing information like this.
Sorry, I didn't realise there was any outstanding work left on this issue.
Sending a message in a room is a three-step process:

1. Claim one-time keys and establish Olm sessions with every device in the room we don't yet share a session with.
2. Share the Megolm room key with those devices over the established Olm sessions.
3. Encrypt the message with the room key and send it.

Steps 1 and 2 can later be skipped once all the devices have received the room key. Sadly, step 1 can fail to establish an Olm session if a particular device has been depleted of its one-time keys. This is fixed by introducing a fallback key, but not every device uploads a fallback key yet.
This means that step 1 will be retried for every message we send out if a room contains such a depleted device. In test scenarios this isn't terribly slow, but the request to claim one-time keys can take a while. This manifests itself as slowness while we send room messages.
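For illustration, the three steps as pseudocode (none of these function names are the real matrix-rust-sdk API):

```rust
// Illustrative sketch of the send flow described above.
fn send_encrypted_message(room_id: &str, plaintext: &str) -> Result<(), String> {
    // Step 1: claim one-time keys and establish Olm sessions with devices we
    // don't share one with yet. This is the step that currently gets retried
    // on *every* message while a depleted device is in the room.
    establish_missing_olm_sessions(room_id)?;

    // Step 2: share the Megolm room key over the established Olm sessions.
    share_room_key(room_id)?;

    // Step 3: encrypt the message with the room key and send it.
    send_megolm_encrypted(room_id, plaintext)
}

// Stubs so the sketch is self-contained; the real logic lives in the SDK.
fn establish_missing_olm_sessions(_room_id: &str) -> Result<(), String> { Ok(()) }
fn share_room_key(_room_id: &str) -> Result<(), String> { Ok(()) }
fn send_megolm_encrypted(_room_id: &str, _plaintext: &str) -> Result<(), String> { Ok(()) }
```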
The server doesn't notify us if a device uploaded a fresh set of one-time keys, so we'll have to introduce a timeout instead.
Remember when we claimed keys for given user/device pairs and, if a certain timeout hasn't passed yet, don't try again. All of this can stay in memory since we don't expect a large number of dead devices, and retrying when we restart the client sounds sensible as well.
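The in-memory bookkeeping could look something along these lines (names and the timeout value are placeholders, not the actual implementation):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// Remembers when we last claimed keys for a user/device pair and skips the
/// claim if the timeout hasn't passed yet. Purely in memory: on restart we
/// simply retry, which is the behaviour described above.
struct ClaimedKeysTracker {
    timeout: Duration,
    last_claimed: HashMap<UserDevice, Instant>,
}

impl ClaimedKeysTracker {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_claimed: HashMap::new() }
    }

    /// Filter out devices we claimed keys for too recently and record the
    /// claim time for the ones that will be included in the request.
    fn devices_to_claim(&mut self, candidates: Vec<UserDevice>, now: Instant) -> Vec<UserDevice> {
        let mut to_claim = Vec::new();
        for device in candidates {
            let recently_claimed = self
                .last_claimed
                .get(&device)
                .map_or(false, |t| now.duration_since(*t) < self.timeout);
            if !recently_claimed {
                // Record the claim time so the next message doesn't retry right away.
                self.last_claimed.insert(device.clone(), now);
                to_claim.push(device);
            }
        }
        to_claim
    }
}
```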