OTK fetching failure woes (attempting to establish Olm sessions too often) #281
Comments
An alternative idea to the timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, and so on. The factor should be fine-tuned, of course.
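A minimal sketch of what that message-count-based backoff could look like (the types, names, and the doubling factor of 2 are placeholders, not actual SDK code):

```rust
use std::collections::HashMap;

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// Backoff state for one device whose one-time key claim failed.
struct BackoffState {
    /// Message number of the very first failure (the N in the comment above).
    first_failure: u64,
    /// Offset from `first_failure` at which the next retry is allowed,
    /// doubling on every failed attempt: N+1, N+2, N+4, N+8, ...
    next_offset: u64,
}

#[derive(Default)]
struct MessageCountBackoff {
    failures: HashMap<UserDevice, BackoffState>,
}

impl MessageCountBackoff {
    /// Should we retry claiming OTKs for this device while sending message `n`?
    fn should_retry(&self, device: &UserDevice, n: u64) -> bool {
        match self.failures.get(device) {
            Some(state) => n >= state.first_failure + state.next_offset,
            None => true,
        }
    }

    /// Record a failed claim at message `n`, doubling the retry offset.
    fn record_failure(&mut self, device: UserDevice, n: u64) {
        self.failures
            .entry(device)
            .and_modify(|state| state.next_offset *= 2)
            .or_insert(BackoffState { first_failure: n, next_offset: 1 });
    }

    /// Forget the backoff once an Olm session was finally established.
    fn record_success(&mut self, device: &UserDevice) {
        self.failures.remove(device);
    }
}
```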
Also, if it fails on message N but we only manage to fetch OTKs and establish an Olm channel on message N+M, we'll only send the room key in the N+M ratcheted state. We should instead send it in state N.
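A rough sketch of one way to remember the ratchet position from message N until the Olm channel finally gets established (all types and fields here are hypothetical, not the SDK's real ones):

```rust
use std::collections::BTreeMap;

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// What we'd need to remember per device whose Olm session couldn't be
/// established: which Megolm session to share and from which message index.
struct PendingRoomKeyShare {
    session_id: String,
    /// Ratchet index at message N, i.e. at the first failed share attempt.
    /// Sharing from this index lets the device decrypt messages N..=current.
    message_index: u32,
}

#[derive(Default)]
struct PendingShares {
    pending: BTreeMap<UserDevice, PendingRoomKeyShare>,
}

impl PendingShares {
    /// Record the first failure only, so the *earliest* index is kept.
    fn record_failure(&mut self, device: UserDevice, session_id: String, index: u32) {
        self.pending
            .entry(device)
            .or_insert(PendingRoomKeyShare { session_id, message_index: index });
    }

    /// Once an Olm session finally exists, take the remembered state so the
    /// key can be shared from `message_index` instead of the current state.
    fn take(&mut self, device: &UserDevice) -> Option<PendingRoomKeyShare> {
        self.pending.remove(device)
    }
}
```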
One flaw with this approach is that the app might be suspended for quite a while, so a significant amount of time would pass. The timeout approach would notice this and retry sooner, while exponential backoff counting message numbers would not. I agree that it would be neat to make this completely deterministic based on the message number, but I don't think it would produce the desired behavior.
See also element-hq/element-web#24138 for how not to do it in EW.
What I implemented, albeit not yet pushed, is an exponential backoff with a max timeout of 15 minutes. That handles the transient-failure case. The other failure mode that we still need to handle before this issue can be closed is OTK exhaustion. But it's better to retry too often than to do what EW does, especially in this age of fallback keys, where OTK exhaustion is a relic of history.
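In spirit, the wall-clock variant might look something like this (a sketch with illustrative names and base delay, not the actual patch):

```rust
use std::time::{Duration, Instant};

/// Retry delay is capped at 15 minutes, as described above.
const MAX_DELAY: Duration = Duration::from_secs(15 * 60);

/// Exponential backoff for a single device's one-time key claims.
struct ClaimBackoff {
    last_failure: Option<Instant>,
    delay: Duration,
}

impl ClaimBackoff {
    fn new() -> Self {
        Self { last_failure: None, delay: Duration::from_secs(1) }
    }

    /// Whether enough time has passed since the last failed claim to retry.
    fn should_retry(&self, now: Instant) -> bool {
        match self.last_failure {
            Some(failed_at) => now.duration_since(failed_at) >= self.delay,
            None => true,
        }
    }

    /// Double the delay on every failure, but never exceed the 15-minute cap.
    fn record_failure(&mut self, now: Instant) {
        self.last_failure = Some(now);
        self.delay = (self.delay * 2).min(MAX_DELAY);
    }

    /// Reset the backoff once a session was established.
    fn record_success(&mut self) {
        self.last_failure = None;
        self.delay = Duration::from_secs(1);
    }
}
```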
Re-opening since the OTK-exhaustion case is still posing problems and wasn't handled as part of #1315. Since the server won't tell us which user/device pairs don't have an OTK, we'll have to remember the users for which we sent out a one-time key claim request.

I think that one way to handle this nicely API-wise is to use the following pattern:

```rust
struct MissingSessions {
    users: BTreeMap<UserId, DeviceId>,
    machine: OlmMachine,
}

impl MissingSessions {
    pub fn request(&self) -> KeysClaimRequest {
        ...
    }

    pub fn mark_request_as_sent(&self, response: &KeysClaimResponse) -> Result<(), CryptoStoreError> {
        self.machine.receive_keys_query(self.users, response)
    }
}

impl OlmMachine {
    pub fn get_missing_sessions(
        &self,
        users: impl IntoIterator<Item = &UserId>,
    ) -> Result<MissingSessions, CryptoStoreError> {
        ...
    }
}
```
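If I'm reading the pattern right, the calling side would then look roughly like this (`send_to_server` is a placeholder for whatever transport the client uses, not a real SDK function):

```rust
// Hypothetical usage of the pattern sketched above.
fn establish_missing_sessions(
    machine: &OlmMachine,
    users: &[&UserId],
) -> Result<(), CryptoStoreError> {
    // Holding a `MissingSessions` ties the claim request and its response
    // handling together, so the "mark as sent" step can't be forgotten.
    let missing = machine.get_missing_sessions(users.iter().copied())?;

    let request = missing.request();
    let response = send_to_server(request); // placeholder transport call

    // Feeding the response back lets the machine remember which user/device
    // pairs still have no OTK, so they aren't retried on the very next message.
    missing.mark_request_as_sent(&response)
}
```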
For context, here is a rageshake from a user with a device that has no OTKs (including no fallback key). The problematic device is some EAX build.
This sounds very bad, and I don't see any explanation of it in the issue. Can you clarify?
@richvdh: It's referring to the contents of this comment.
Ah, the wrong Megolm ratchet. I thought it meant the Olm ratchet!
Out of interest, I looked at the anatomy of a slow key-claim request.
I am tempted to write an MSC to modify this behaviour. Will think on it a little more.
So I opened an MSC (matrix-org/matrix-spec-proposals#4072), but I think it's actually pointless, and we might as well fix this on the client side.
Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? I'm talking about our inability to correctly heal from a transient network failure, because we'll send the incorrect (advanced) Megolm ratchet state. I realise the original issue may have been a bit confusing due to bundling these concerns, but we shouldn't be losing information like this.
Sorry, I didn't realise there was any outstanding work left on this issue.
Sending a message in a room is a three-step process:

1. Claim one-time keys and establish Olm sessions with every device in the room we don't yet share a session with.
2. Share the Megolm room key with those devices over the established Olm sessions.
3. Encrypt the message with the room key and send it.

Steps 1 and 2 can later be skipped once all the devices have received the room key. Sadly, step 1 can fail to establish an Olm session if a particular device has been depleted of its one-time keys. This is fixed by introducing a fallback key, but not every device uploads a fallback key yet.
This means that step 1 will be retried for every message we send out if a room contains such a depleted device. In test scenarios this isn't terribly slow, but the request to claim one-time keys can take a while. This manifests itself as slowness while we send room messages.
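For illustration, the three steps as pseudocode (none of these function names are the real matrix-rust-sdk API):

```rust
// Illustrative sketch of the send flow described above.
fn send_encrypted_message(room_id: &str, plaintext: &str) -> Result<(), String> {
    // Step 1: claim one-time keys and establish Olm sessions with devices we
    // don't share one with yet. This is the step that currently gets retried
    // on *every* message while a depleted device is in the room.
    establish_missing_olm_sessions(room_id)?;

    // Step 2: share the Megolm room key over the established Olm sessions.
    share_room_key(room_id)?;

    // Step 3: encrypt the message with the room key and send it.
    send_megolm_encrypted(room_id, plaintext)
}

// Stubs so the sketch is self-contained; the real logic lives in the SDK.
fn establish_missing_olm_sessions(_room_id: &str) -> Result<(), String> { Ok(()) }
fn share_room_key(_room_id: &str) -> Result<(), String> { Ok(()) }
fn send_megolm_encrypted(_room_id: &str, _plaintext: &str) -> Result<(), String> { Ok(()) }
```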
The server doesn't notify us if a device uploaded a fresh set of one-time keys, so we'll have to introduce a timeout instead.
Remember when we claimed keys for given user/device pairs and, if a certain timeout hasn't passed yet, don't try again. All of this can stay in memory since we don't expect a large number of dead devices, and retrying when we restart the client sounds sensible as well.
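The in-memory bookkeeping could look something along these lines (names and the timeout value are placeholders, not the actual implementation):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Placeholder for a (UserId, DeviceId) pair.
type UserDevice = (String, String);

/// Remembers when we last claimed keys for a user/device pair and skips the
/// claim if the timeout hasn't passed yet. Purely in memory: on restart we
/// simply retry, which is the behaviour described above.
struct ClaimedKeysTracker {
    timeout: Duration,
    last_claimed: HashMap<UserDevice, Instant>,
}

impl ClaimedKeysTracker {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_claimed: HashMap::new() }
    }

    /// Filter out devices we claimed keys for too recently and record the
    /// claim time for the ones that will be included in the request.
    fn devices_to_claim(&mut self, candidates: Vec<UserDevice>, now: Instant) -> Vec<UserDevice> {
        let mut to_claim = Vec::new();
        for device in candidates {
            let recently_claimed = self
                .last_claimed
                .get(&device)
                .map_or(false, |t| now.duration_since(*t) < self.timeout);
            if !recently_claimed {
                // Record the claim time so the next message doesn't retry right away.
                self.last_claimed.insert(device.clone(), now);
                to_claim.push(device);
            }
        }
        to_claim
    }
}
```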