`/keys/claim` is surprisingly slow (#16554)
I spent some time looking at this. On matrix.org I have ~100 keys across 2 devices:
Here is an attempt to select 2 keys from the first device and 4 keys from the second in a bulk-query form.
If we can change that SELECT into a `DELETE FROM ... RETURNING` then I think we should be able to batch this up. No idea if this works on SQLite, but I'm not interested in optimising the perf of the thing we say isn't suitable for production.
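As a rough sketch of that idea (assuming PostgreSQL and a table shaped like `e2e_one_time_keys_json` with `user_id`, `device_id`, `algorithm`, `key_id`, `key_json` columns; the names and example IDs here are assumptions, not the gist's actual query):

```python
# Hypothetical sketch (PostgreSQL dialect) of claiming a different number of
# one-time keys from several devices in a single statement, deleting the
# claimed rows in the same round trip via DELETE ... RETURNING.
# Table and column names are assumed to mirror Synapse's e2e_one_time_keys_json.
BULK_CLAIM_SQL = """
WITH wanted(user_id, device_id, algorithm, cnt) AS (
    -- 2 keys from the first device, 4 from the second (hard-coded for brevity)
    VALUES ('@alice:example.com', 'DEVICE1', 'signed_curve25519', 2),
           ('@alice:example.com', 'DEVICE2', 'signed_curve25519', 4)
),
ranked AS (
    SELECT k.user_id, k.device_id, k.algorithm, k.key_id, w.cnt,
           ROW_NUMBER() OVER (
               PARTITION BY k.user_id, k.device_id, k.algorithm
               ORDER BY k.key_id
           ) AS rn
    FROM e2e_one_time_keys_json AS k
    JOIN wanted AS w
      ON k.user_id = w.user_id
     AND k.device_id = w.device_id
     AND k.algorithm = w.algorithm
)
DELETE FROM e2e_one_time_keys_json AS k
USING ranked AS r
WHERE k.user_id = r.user_id
  AND k.device_id = r.device_id
  AND k.algorithm = r.algorithm
  AND k.key_id = r.key_id
  AND r.rn <= r.cnt
RETURNING k.user_id, k.device_id, k.algorithm, k.key_id, k.key_json;
"""


def claim_keys_bulk(txn):
    """Claim and delete the selected keys on an open DB cursor, in one query."""
    txn.execute(BULK_CLAIM_SQL)
    return txn.fetchall()
```

(For what it's worth, SQLite only gained `RETURNING` support in 3.35, hence the caveat above.)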
https://gist.github.com/DMRobertson/243121754aed82eff56fa8ec5181184a is my attempt to test this locally. It seems very promising.
We've got some metrics for this. Future me: on m.org, look at the …
Let's see how the current changes fare. Next step here is:
https://jaeger.proxy.matrix.org/trace/0182832586d1ede2 is an example of a request to claim 17 keys that took ~700ms. Drilling down, the bulk queries for OTKs and fallback keys took 40ms and 8ms, respectively. The rest of the (local) time was dominated by cache invalidation over replication. I would speculate that we can do better by:
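For illustration, one way the invalidation traffic could be cut down is to collect the affected devices and invalidate once per device (or once per request) at the end of the transaction, rather than once per claimed key. The helper and cache names below are assumptions modelled on Synapse's `_invalidate_cache_and_stream`, not the real code paths:

```python
# Sketch only: batch cache invalidations instead of issuing one replication
# poke per claimed key.  `store`, `txn`, `_invalidate_cache_and_stream` and
# `count_e2e_one_time_keys` are assumptions about the surrounding code.
def invalidate_claimed(store, txn, claimed_rows):
    # claimed_rows: iterable of (user_id, device_id, algorithm, key_id, key_json)
    affected_devices = {(row[0], row[1]) for row in claimed_rows}

    # One invalidation per affected device rather than one per key; a single
    # bulk invalidation covering every device would be better still.
    for user_id, device_id in affected_devices:
        store._invalidate_cache_and_stream(
            txn, store.count_e2e_one_time_keys, (user_id, device_id)
        )
```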
I had a look at the grafana metrics (p50, p90, p95) and I couldn't see much of an improvement, sadly. But perhaps that's to be expected: I assume that requests to claim many e2e keys are rare, so we're deep into the tails of the distribution. If anyone has a way to reproduce this I'd be interested to see if there's any user-perceptible improvement. I think Rich said that logging into a new session of …
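If it helps anyone trying to reproduce this, something along these lines exercises the endpoint with a large number of devices in one request; the request body follows the client-server spec for `POST /_matrix/client/v3/keys/claim`, but the homeserver URL, token and user/device IDs are placeholders you would need to replace with real ones:

```python
# Rough timing harness for /keys/claim with many devices in a single request.
import time

import requests

HOMESERVER = "https://matrix.example.com"   # placeholder
ACCESS_TOKEN = "syt_..."                    # placeholder

# Claim body shape: {user_id: {device_id: algorithm}}; replace with real IDs.
one_time_keys = {
    f"@user{i}:example.com": {f"DEVICE{i}": "signed_curve25519"}
    for i in range(300)
}

start = time.monotonic()
resp = requests.post(
    f"{HOMESERVER}/_matrix/client/v3/keys/claim",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"one_time_keys": one_time_keys, "timeout": 10_000},
)
print(f"{resp.status_code} in {time.monotonic() - start:.2f}s")
```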
`/keys/claim` requests often take multiple seconds when requesting keys for hundreds of devices.

Out of interest I looked at the anatomy of a slow `/keys/claim` request (https://jaeger.proxy.matrix.org/trace/62603ae20c639720). The request took 6.2 seconds altogether. In this case, we were just attempting to claim keys for devices for which we had previously failed to get one. (Due to matrix-org/matrix-rust-sdk#281, we do this a bit too often.) Anyway, the point is that pretty much all of the devices in this request have run out of OTKs - but I think it is still instructive.

What I see is:

- `db.claim_e2e_one_time_keys`. This is presumably one for each device for matrix.org users. These take us to about 1.8 seconds.
- `db._get_fallback_key`. Again one for each matrix.org device. Another 2.1 seconds, bringing us to 4.0 seconds.
- `claim_client_keys`. One per federated destination. These all happen in parallel, so the critical path is the slowest homeserver to respond. The pathological case here is servers that respond within the timeout (so don't get backed off from) but slowly - and then the device doesn't have any keys so we have to do it again. In this case the slowest server was 2.1 seconds.

What I see here is some easy performance improvements. In particular: