Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

/keys/claim is surprisingly slow #16554

Open
2 of 3 tasks
richvdh opened this issue Oct 26, 2023 · 7 comments
Open
2 of 3 tasks

/keys/claim is surprisingly slow #16554

richvdh opened this issue Oct 26, 2023 · 7 comments
Labels
A-E2EE End-to-end encryption for Matrix clients A-Performance Performance, both client-facing and admin-facing O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 26, 2023

/keys/claim requests often take multiple seconds when requesting keys for hundreds of devices.

Out of interest I looked at the anatomy of a slow /keys/claim request (https://jaeger.proxy.matrix.org/trace/62603ae20c639720). The request took 6.2 seconds altogether.

In this case, we were just attempting to claim keys for devices which we had previously failed to get one. (Due to matrix-org/matrix-rust-sdk#281, we do this a bit too often). Anyway the point is pretty much all of the devices in this request have run out of OTKs - but I think it is still instructive.

What I see is:

  • 321 calls to db.claim_e2e_one_time_keys. This is presumably one for each device for matrix.org users. These take us to about 1.8 seconds.
  • 321 calls to db._get_fallback_key. Again one for each matrix.org device. Another 2.1 seconds, bringing us to 4.0 seconds.
  • 21 calls to claim_client_keys. One per federated destination. These all happen in parallel, so the critical path is the slowest homeserver to respond. The pathological case here is servers that respond within the timeout (so don't get backed off from) but slowly - and then the device doesn't have any keys so we have to do it again. In this case the slowest server was 2.1 seconds.

What I see here is some easy performance improvements. In particular:

@DMRobertson DMRobertson added A-Performance Performance, both client-facing and admin-facing S-Minor Blocks non-critical functionality, workarounds exist. O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience A-E2EE End-to-end encryption for Matrix clients T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Oct 27, 2023
@DMRobertson
Copy link
Contributor

321 calls to db.claim_e2e_one_time_keys. This is presumably one for each device for matrix.org users. These take us to about 1.8 seconds.

I spent some time looking at this. On matrix.org I have ~100 keys across 2 devices:

matrix=> select count(*), device_Id from e2e_one_time_keys_json where user_id = '@dmrobertson:matrix.org' group by device_id;
 count │ device_id  
═══════╪════════════
    51 │ CNVWKSCNBO
    50 │ SNGLZABLEG

Here is an attempt to select 2 keys from the first device and 4 keys from the second in a bulk-query form.

matrix=> WITH claims(user_id, device_id, algorithm, count) AS (
    VALUES ('@dmrobertson:matrix.org', 'CNVWKSCNBO', 'signed_curve25519', 4),
           ('@dmrobertson:matrix.org', 'SNGLZABLEG', 'signed_curve25519', 2)
), ranked_keys AS (
    SELECT user_id,
           device_id,
           algorithm,
           key_id,
           claims.count,
           ROW_NUMBER() OVER (PARTITION BY (user_id, device_id, algorithm)) AS r
    FROM e2e_one_time_keys_json
        JOIN claims USING (user_id, device_id, algorithm)
)
SELECT
    k.user_id,
    k.device_id,
    k.algorithm
    -- , k.key_id, k.key_json omitted
FROM e2e_one_time_keys_json k
JOIN ranked_keys USING (user_id, device_id, algorithm, key_id)
WHERE ranked_keys.r <= ranked_keys.count;
         user_id         │ device_id  │     algorithm     
═════════════════════════╪════════════╪═══════════════════
 @dmrobertson:matrix.org │ CNVWKSCNBO │ signed_curve25519
 @dmrobertson:matrix.org │ CNVWKSCNBO │ signed_curve25519
 @dmrobertson:matrix.org │ CNVWKSCNBO │ signed_curve25519
 @dmrobertson:matrix.org │ CNVWKSCNBO │ signed_curve25519
 @dmrobertson:matrix.org │ SNGLZABLEG │ signed_curve25519
 @dmrobertson:matrix.org │ SNGLZABLEG │ signed_curve25519
(6 rows)

If we can change that SELECT into a DELETE FROM... RETURNING then I think we should be able to batch this up and call execute_values.

No idea if this works on SQLite, but I'm not interested in optimising the perf of the thing we say isn't suitable for production.

@DMRobertson
Copy link
Contributor

If we can change that SELECT into a DELETE FROM... RETURNING then I think we should be able to batch this up and call execute_values.

https://gist.github.com/DMRobertson/243121754aed82eff56fa8ec5181184a is my attempt to test this locally. It seems very promising.

@DMRobertson
Copy link
Contributor

We've got some metrics for this. Future me: on m.org, look at the encryption workers and search for "OneTimeKeyServlet".

@DMRobertson
Copy link
Contributor

Let's see how the current changes fare.

Next step here is:

Doing remote and local claims in parallel would roughly halve the time.

@DMRobertson
Copy link
Contributor

DMRobertson commented Oct 31, 2023

https://jaeger.proxy.matrix.org/trace/0182832586d1ede2 is an example of a request to claim 17 keys that took ~700ms.

High-level timing:
image

Drilling down, the bulk queries for OTKs and fallback keys took 40ms and 8ms, respectively. The rest of the (local) time was dominated by cache invalidation over replication. I would speculate that we can do better by:

@clokep
Copy link
Member

clokep commented Nov 9, 2023

Definitely seems a bit better with batching of cache invalidation, not an apples-to-apples comparison, but still seems promising.

Before

image

After

image

@DMRobertson
Copy link
Contributor

I had a look at the grafana metrics (p50, p90, p95) and I couldn't see much of an improvement, sadly. But perhaps that's to be expected: I assume that requests to claim many e2e keys are rare, so we're deep into the tails of the distribution.

If anyone has a way to reproduce this I'd be interested to see if there's any user-perceptible improvement. I think Rich said that logging into a new session of develop.element.io would suffice?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-E2EE End-to-end encryption for Matrix clients A-Performance Performance, both client-facing and admin-facing O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

3 participants