|
| 1 | +# MSC3970: Scope transaction IDs to devices |
| 2 | + |
| 3 | +Transaction identifiers in the Client-Server API are currently scoped to the concept of a "client session" which, when |
| 4 | +refresh tokens are used, can span a sequence of access tokens. |
| 5 | + |
| 6 | +The spec [reads](https://spec.matrix.org/v1.6/client-server-api/#transaction-identifiers): |
| 7 | + |
| 8 | +> The scope of a transaction ID is a “client session”, where that session is identified by a particular access token. |
| 9 | +> When refreshing an access token, the transaction ID’s scope is retained. This means that if a client with token `A` |
| 10 | +> uses `TXN1` as their transaction ID, refreshes the token to `B`, and uses `TXN1` again it’ll be assumed to be a |
| 11 | +> duplicate request and ignored. If the client logs out and back in between the `A` and `B` tokens, `TXN1` could be used |
| 12 | +> once for each. |
| 13 | +
|
| 14 | +The transaction IDs appear in two parts of the Client-Server API spec: |
| 15 | + |
| 16 | +1. As a identifier to allow the homeserver to make some `PUT` endpoints [idempotent](https://spec.matrix.org/v1.6/client-server-api/#transaction-identifiers) |
| 17 | +2. An unsigned field in the event data model for a client to tell if it sent an event or not. a.k.a. solving the ["local echo"](https://spec.matrix.org/v1.6/client-server-api/#local-echo) problem |
| 18 | + |
| 19 | +For reference, the `PUT` endpoitns that have the a `{txnId}` param are: |
| 20 | + |
| 21 | +- [`PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}`](https://spec.matrix.org/v1.6/client-server-api/#put_matrixclientv3roomsroomidsendeventtypetxnid) |
| 22 | +- [`PUT /_matrix/client/v3/rooms/{roomId}/redact/{eventId}/{txnId}`](https://spec.matrix.org/v1.6/client-server-api/#put_matrixclientv3roomsroomidredacteventidtxnid) |
| 23 | +- [`PUT /_matrix/client/v3/sendToDevice/{eventType}/{txnId}`](https://spec.matrix.org/v1.6/client-server-api/#put_matrixclientv3sendtodeviceeventtypetxnid) |
| 24 | + |
| 25 | +## Proposal |
| 26 | + |
| 27 | +It is proposed that the scope of transaction identifiers be changed from a "client session" to a "device". |
| 28 | + |
| 29 | +A "device" is typically represented by a `device_id` elsewhere in the spec. |
| 30 | + |
| 31 | +For idempotency, this means the homeserver changing the method of identifying a request from: |
| 32 | + |
| 33 | +- (`client session`, `HTTP path of request which includes the transaction ID`) |
| 34 | + |
| 35 | +to: |
| 36 | + |
| 37 | +- (`device_id`, `HTTP path of request which includes the transaction ID`) |
| 38 | + |
| 39 | +For local echo, the homeserver would now include the `transaction_id` in the event data when it is serving a sync |
| 40 | +request from the same `device_id` as determined from the access token. |
| 41 | + |
| 42 | +## Potential issues |
| 43 | + |
| 44 | +### This is technically a breaking change to the spec |
| 45 | + |
| 46 | +The main "issue" I see with this proposal is that this is technically a breaking change to the spec. |
| 47 | + |
| 48 | +Because a device ID could have multiple sequences of access tokens associated with it, this proposal would widen the |
| 49 | +scope of the transaction ID. |
| 50 | + |
| 51 | +Therefore it could potentially lead to a request being treated as "new" where before it would have been identified as a |
| 52 | +retransmission and deduplicated. |
| 53 | + |
| 54 | +However, the evidence suggests that nothing would be impacted in reality. |
| 55 | + |
| 56 | +Some evidences have been collated to support this claim: |
| 57 | + |
| 58 | +#### 1. Data from the matrix.org Homeserver suggests the change would have no impact |
| 59 | + |
| 60 | +The matrix.org Homeserver is a reasonable size deployment and could be considered reasonably representative of the |
| 61 | +diversity of Matrix client. |
| 62 | + |
| 63 | +The Synapse homeserver that runs matrix.org maintains a `event_txn_id` table that contains a rolling 24 hour window of |
| 64 | +(`user_id`, `token_id`, `room_id`, `txn_id`) tuples. |
| 65 | + |
| 66 | +Having analysed the contents of the table, it appears that there are no repeated transaction IDs for a given user, token |
| 67 | +and room. |
| 68 | + |
| 69 | +There are other `PUT` endpoints for which transaction IDs are not in the `event_txn_id` table, however because the event |
| 70 | + |
| 71 | +As such, the widening of the scope from token to device would not have caused any issues during the periods sampled. |
| 72 | + |
| 73 | +For reference the following is the schema of the `event_txn_id` table: |
| 74 | + |
| 75 | +```sql |
| 76 | + Table "matrix.event_txn_id" |
| 77 | + Column | Type | Collation | Nullable | Default |
| 78 | +-------------+--------+-----------+----------+--------- |
| 79 | + event_id | text | | not null | |
| 80 | + room_id | text | | not null | |
| 81 | + user_id | text | | not null | |
| 82 | + token_id | bigint | | not null | |
| 83 | + txn_id | text | | not null | |
| 84 | + inserted_ts | bigint | | not null | |
| 85 | +Indexes: |
| 86 | + "event_txn_id_token_id" btree (token_id) |
| 87 | + "event_txn_id_event_id" UNIQUE, btree (event_id) |
| 88 | + "event_txn_id_ts" btree (inserted_ts) |
| 89 | + "event_txn_id_txn_id" UNIQUE, btree (room_id, user_id, token_id, txn_id) |
| 90 | +Foreign-key constraints: |
| 91 | + "event_txn_id_event_id_fkey" FOREIGN KEY (event_id) REFERENCES matrix.events(event_id) ON DELETE CASCADE |
| 92 | + "event_txn_id_token_id_fkey" FOREIGN KEY (token_id) REFERENCES matrix.access_tokens(id) ON DELETE CASCADE |
| 93 | +``` |
| 94 | + |
| 95 | +And the query to look for repeated transaction IDs: |
| 96 | + |
| 97 | +```sql |
| 98 | +SELECT e1.txn_id, LEFT(e1.user_id, 5) AS user_id, e1.token_id, e2.token_id, e1.inserted_ts, e2.inserted_ts FROM matrix.event_txn_id e1, matrix.event_txn_id e2 WHERE e1.txn_id = e2.txn_id AND e1.event_id <> e2.event_id AND e1.event_id < e2.event_id AND e1.user_id = e2.user_id AND e1.room_id = e2.room_id ORDER BY e1.token_id; |
| 99 | + txn_id | user_id | token_id | token_id | inserted_ts | inserted_ts |
| 100 | +--------+---------+----------+----------+-------------+------------- |
| 101 | +(0 rows) |
| 102 | +``` |
| 103 | + |
| 104 | +#### 2. Conduit homeserver already scopes transaction IDs to devices |
| 105 | + |
| 106 | +As highlighted by the new Complement [tests](https://github.com/matrix-org/complement/pull/613) the Conduit homeserver |
| 107 | +is already scoping transaction IDs to devices. |
| 108 | + |
| 109 | +I can't find a related issue [listed](https://gitlab.com/famedly/conduit/-/issues), so presumably this non-compliant |
| 110 | +behaviour isn't causing a known issue for admins and users of the Conduit homeserver? |
| 111 | + |
| 112 | +#### 3. Synapse homeserver only checks for retransmits over a 30-60 minute window |
| 113 | + |
| 114 | +The Synapse homeserver uses an in-memory cache to check for idempotency of requests. The cache is vacated after |
| 115 | +[30-60 minutes](https://github.com/matrix-org/synapse/blob/adac949a417d064958039ae0918b97388413c824/synapse/rest/client/transactions.py#L50-L52) |
| 116 | +and is not persisted between restarts (or across a cluster). |
| 117 | + |
| 118 | +This means that for many existing deployments idempotency is only actually enforced over a 30-60 minutes and not |
| 119 | +eternally as the spec might suggest. |
| 120 | + |
| 121 | +#### 4. Synapse homeserver only supports local echo for the previous 24 hours |
| 122 | + |
| 123 | +The Synapse homeserver only supports local echo (by the presence of `transaction_id` on sync responses) for the previous |
| 124 | +24 hours. This is because the `event_txn_id` table is only kept for 24 hours. |
| 125 | + |
| 126 | +Again, this suggests that in reality the local echo semantics are not preserved eternally as the spec might suggest. |
| 127 | + |
| 128 | +### Is the "device" concept the right level of abstraction to use? |
| 129 | + |
| 130 | +One way to look at it is that device is already widely used in the end-to-end encryption parts of the spec and so why |
| 131 | +isn't it suitable for this use case too? |
| 132 | + |
| 133 | +### What about two clients masquerading as a single device ID? |
| 134 | + |
| 135 | +I don't know if this actually works in practice. If this was a concern then it could be mitigated by clarifying in the |
| 136 | +spec that if a client wishes to submit requests using the same `device_id` as another client session that it should |
| 137 | +choose transaction identifiers that are unique to that client session. |
| 138 | + |
| 139 | +## Alternatives |
| 140 | + |
| 141 | +### Do nothing |
| 142 | + |
| 143 | +We could leave the transaction ID scope as is. |
| 144 | + |
| 145 | +However, it makes it difficult to implement a features like [MSC3861: Matrix architecture change to delegate authentication via OIDC](https://github.com/matrix-org/matrix-spec-proposals/pull/3861) |
| 146 | +as the concept of a "client session" doesn't really exist in OIDC. |
| 147 | + |
| 148 | +As noted above, at least one homeserver implementation is also not implementing the spec as it is today. |
| 149 | + |
| 150 | +It also turns out that the current implementation of refresh tokens in Synapse breaks the transaction ID semantics |
| 151 | +already and needs to be [fixed](https://github.com/matrix-org/synapse/issues/15141). |
| 152 | + |
| 153 | +### Make a backwards compatible change |
| 154 | + |
| 155 | +A backwards compatible alternative could be something like: |
| 156 | + |
| 157 | +1. For idempotency have clients opt-in to a new scope of transaction ID, but support the current semantics too for compatibility |
| 158 | +2. Have clients opt-in (e.g. request param on the sync endpoint) to receiving transaction ID for all events in the sync response and make the client responsible for identifying which messages they sent |
| 159 | + |
| 160 | +The disadvantage of this is that we create a load of extra maintenance work to support both semantics for a period of |
| 161 | +time for (empirically) no gain in return. |
| 162 | + |
| 163 | +## Security considerations |
| 164 | + |
| 165 | +A malicious client can adopt an existing device ID of a user. This could possibly allow some kind of denial of service attack. |
| 166 | + |
| 167 | +However, if such an attack where possible it would be possible to do so without this MSC as device IDs are crucial to |
| 168 | +the implementation of end-to-end encryption. |
| 169 | + |
| 170 | +## Other recommendations |
| 171 | + |
| 172 | +I'm not suggesting that these recommendations are address in this proposal, but more notes for future proposals or spec clarifications. |
| 173 | + |
| 174 | +### Clarification on idempotency semantics |
| 175 | + |
| 176 | +I have separately prepared a [spec PR](https://github.com/matrix-org/matrix-spec/pull/1449) to clarify some of the |
| 177 | +idempotency semantics that doesn't modify the spec but is useful to understand the context of this proposal. |
| 178 | + |
| 179 | +### Clarification on transaction ID time scope |
| 180 | + |
| 181 | +I also suggest that the spec be clarified over what time periods the transaction ID is scoped for such that clients can |
| 182 | +be aware. This cloud simply be to say that the time period is not defined and so may vary by implementation. |
| 183 | + |
| 184 | +### Suggest a lease naive transaction ID format |
| 185 | + |
| 186 | +I think we should sense to update the recommendation on the format of transaction IDs from: |
| 187 | + |
| 188 | +> how is not specified; a monotonically increasing integer is recommended |
| 189 | +
|
| 190 | +To something more unique (in my research I have found some clients use a seconds since epoch which doesn't seem ideal) |
| 191 | + |
| 192 | +## Unstable prefix |
| 193 | + |
| 194 | +None needed. |
| 195 | + |
| 196 | +## Dependencies |
| 197 | + |
| 198 | +None. |
0 commit comments