-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Stabilize support for MSC3970: updated transaction semantics (scope to device_id
)
#15629
Conversation
I'm a bit confused about this, I see:
@sandhose Can you help me untangle this? From my reading of the above, I'm surprised anything is failing with this PR? |
@clokep I spoke with Q earlier and made a plan to untangle this. We now have a single Complement PR, matrix-org/complement#637, which shows the appropriate changes to assert the new transaction ID behaviour. However, as noted in comments in the PR the tests are currently skipped on Synapse and Dendrite due to neither supporting the new transaction behaviour yet. How should this be handled? Effectively both this PR and the one in Complement should be merged at the same time. (As discussed in the MSC, this is technically a breaking change hence the cyclical dependency) |
I repushed matrix-org/complement#637 as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing major; happy to leave any minor comment changes in your hands
# TODO Remove this once Synapse cannot be rolled back to a version without | ||
# device IDs being stored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a target version of SCHEMA_COMPAT_VERSION
you need? If you can find out, might be worth writing it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errr, so right now we write to both event_txn_id
and event_txn_id_device_id
. I think we want to leave this as-is so we can rollback if needed.
I guess in the future we would want to:
- Wait
- Stop writing (and reading to) to
event_txn_id
and bump theSCHEMA_VERSION
- Wait
- Bump the
SCHEMA_COMPAT_VERSION
, rip out the code
Does that sound like the right series of steps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about this a bit: https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$WQwonqO7etYpa0ik4aVzVsf32R_VOqIyPzBjTl3pjps?via=matrix.org&via=element.io&via=beeper.com
My tl;dr is that the above steps look reasonable, although you usually want to stop reading before writing. Unfortunately we do still use this for certain situations, e.g. appservice users don't have device IDs. So we would need to figure out what to do there?
My inclination is to merge this as-is (so as to not block Matrix 1.7 support) and file a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we do still use this for certain situations, e.g. appservice users don't have device IDs. So we would need to figure out what to do there?
Probably good to add a comment to the diff on the caveat for why the access token fallback needs to stay until that is figured out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was planning to point to an issue. Was waiting to see if others had thoughts before clicking merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a step back and re-evaluating my other threads with @sandhose (and re-reading the comments in this PR) I don't actually think we can remove this since it gets used for guests, appservice users, and access tokens from the admin API. 😢
I've made two changes:
- Consolidate some logic so we have fewer places we attempt to line up an event from a transaction ID.
- Added a lot more comments.
Could folks (in particular @sandhose) let me know if these still read OK? I feel like I'm missing something, but I've tried to include my current understanding in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, guests and appservice users don't have access token IDs anyway, so it basically won't do anything if we keep it. The only thing left is admin-minted, device-less tokens.
Keep in mind, there are two layers of things that ensure idempotency:
- the in-memory HTTP transaction cache, which properly covers guests, appservices and device-less tokens
- the db-persisted (for 24h) transaction_id -> event_id mapping, which only worked for access tokens with IDs in it, so no guests and no appservices
Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.
If I'm not mistaken, I think this would only affect
- device-less real users
- when they replay a request, either load-balanced to another worker, or after a Synapse restart
I don't mind keeping this, but also I don't think it's worth keeping it around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.
And I suppose the same rules apply to the event internal metadata info -- guests and appservices users don't have access token IDs so we can't be storing it there anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through this a bit more I think we need two separate approaches:
Database transaction mapping
This describes what to do with "the db-persisted (for 24h) transaction_id -> event_id mapping".
This only survives for 24 hours, but that's not really relevant to the conversation of rolling back (or forwards). As background, there's two tables (event_txn_id
and event_txn_id_device_id
) for access token IDs and device IDs, respectively. To gracefully handle up/downgrades I propose:
- (This PR = N) Always write to both and read data from either (first
event_txn_id_device_id
, fallback toevent_txn_id
). Bump the schema version to V. - In a future PR (N+2 versions), stop reading and writing to
event_txn_id
, bump the schema version to V+1.
At this point we have a database schema still compatible with today, but we're not using data event_txn_id
, so rolling back to e.g. N is fine, but not before N. To enforce this, bump the minimum schema version to V.
- In a future PR (N+4 versions), add a schema delta which drops the
event_txn_id
table.
At this point we no longer have a database schema compatible with N, but we can still rollback OK to N+2. So bump the minimum schema version to V+1.
Internal metadata
The device ID & access token ID are unconditionally written into the internal metadata of events. It then follows similar logic where the transaction ID is returned to a user when serializing the event if:
- The device ID exists in the internal metadata and matches that of the sender. 1
- The requester & sender user IDs match and:
- The token ID exists in the internal metadata and matches that of the sender.
- The requester is a guest.
- The requester is an appservice.
A similar thing should be followed:
- (This PR = N) Start using the device ID in-lieu of the token ID, if it is set. (Note that both are currently saved in the internal metadata.)
- In a future PR (N+2), stop writing the token ID into the internal metadata.
This implies that we would leave the logic for the fallback "forever", but I think we're already at am minimum schema version where we've always written both values, so we only need to be concerned with old data.
An option here would be to do a data migration for any event who has a token ID which still exists in the database, add the device ID. Then remove the fallback code. This seems a bit onerous though.
Footnotes
-
Device IDs matching should also be gated by the user ID, I think. There's more risk of those overlapping that the token IDs since they're client controllable. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16042 was filed to handle follow-up.
device_id
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, we just need to decide whether we should keep the (txn_id, token_id) -> event_id mapping in DB or not
# TODO Remove this once Synapse cannot be rolled back to a version without | ||
# device IDs being stored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, guests and appservice users don't have access token IDs anyway, so it basically won't do anything if we keep it. The only thing left is admin-minted, device-less tokens.
Keep in mind, there are two layers of things that ensure idempotency:
- the in-memory HTTP transaction cache, which properly covers guests, appservices and device-less tokens
- the db-persisted (for 24h) transaction_id -> event_id mapping, which only worked for access tokens with IDs in it, so no guests and no appservices
Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.
If I'm not mistaken, I think this would only affect
- device-less real users
- when they replay a request, either load-balanced to another worker, or after a Synapse restart
I don't mind keeping this, but also I don't think it's worth keeping it around.
I've put this back into the review queue -- I split some changes out and made a few additional changes to this PR. See above for an explanation of future work / how the database changes will work. |
…(scope to `device_id`) (matrix-org#15629)" This reverts commit d98a43d.
No significant changes since 1.90.0rc1. - Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629)) - Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868)) - Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791)) - Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012)) - Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043)) - Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046)) - Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009)) - Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015)) - Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016)) - Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027)) - Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964)) - Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017)) - Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525)) - Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754)) - Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991)) - Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992)) - Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993)) - Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013)) - Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019)) - Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031)) - Combine duplicated code. ([\matrix-org#16023](matrix-org#16023)) - Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028)) - Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068)) - Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069)) * Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011)) * Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048)) * Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077)) * Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034)) * Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044)) * Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081)) * Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076)) * Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073)) * Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982)) * Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033)) * Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074)) * Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032)) * Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038)) * Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037)) * Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036)) * Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035)) * Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078)) * Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079))
No significant changes since 1.90.0rc1. - Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629)) - Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868)) - Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791)) - Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012)) - Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043)) - Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046)) - Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009)) - Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015)) - Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016)) - Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027)) - Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964)) - Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017)) - Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525)) - Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754)) - Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991)) - Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992)) - Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993)) - Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013)) - Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019)) - Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031)) - Combine duplicated code. ([\matrix-org#16023](matrix-org#16023)) - Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028)) - Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068)) - Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069)) * Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011)) * Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048)) * Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077)) * Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034)) * Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044)) * Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081)) * Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076)) * Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073)) * Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982)) * Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033)) * Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074)) * Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032)) * Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038)) * Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037)) * Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036)) * Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035)) * Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078)) * Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079)) # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE8SRSDO7gYkSP4chELS76LzL74EcFAmTbUOEACgkQLS76LzL7 # 4Efskw/+J4X30PoqSvBWbilr8kTzwNSDXrkefYXR2sLburgowCyuAtKtCdbvnZUX # 3KRwii5/GDsduXiNY836oRxO/KWE43b1ce9C9qJM7V6NmgkJBgHRvnh69wdlmBqt # 6b6TQHoEByYS7yK90+QsRm1Bqrw7eoVO9oxcZ+4lb7Mjswf491Pga8kFJqdvjtTX # UXo4vAqYyP6Yn7sUrQmXy0N8gZ5ZFHhZEvZZ8+iEsNjPO468cSVGq8/iPB1EwBm2 # nbfZWMDnD2p7plJezXOPEBxnVR3RrWbCbK08SiiNMcQynCvBgAUfkd3GnsO726jb # 19i8p6tjuj2r41UgqYCTG2i2ij6uJquA/qq3rIiVNQVKG9aPHQ8hJfu9XOdEvaJe # Je9H6QFrU/hR640tFvb5Hdc/4ThabvtC5xgl4ZGT6y6I0s5LNwk8fJiz3sDFt0i1 # XKsqGVemBGopZnwjQIPFJaHjPT7of33PXLE/hf1vX+oXU/6MNbFYkDLY9nnnQeOx # 0GEbiYaxrj8SfxNmEykMLNCfxwJ719cSR1q8vPYn6r6TOS1pJMV0SgciXoaQ/VW6 # WlRpZolvXYSye34JW8Rg4ojAz0oYfJ2IiUpwY7eSEq4DtuosTjEECKrgB8DLqKlM # +qDd4/yeqVN5/sui5oGsR71aTMy/jdnzqmdmuFvsSwz9/7PfMEU= # =FWp5 # -----END PGP SIGNATURE----- # gpg: Signature made Tue Aug 15 11:18:09 2023 BST # gpg: using RSA key F124520CEEE062448FE1C8442D2EFA2F32FBE047 # gpg: Can't check signature: No public key # Conflicts: # .github/workflows/docker.yml # CHANGES.md # Cargo.lock # debian/changelog # docs/upgrade.md # flake.lock # poetry.lock # pyproject.toml # rust/src/push/base_rules.rs # synapse/events/utils.py # synapse/handlers/pagination.py # synapse/http/site.py # synapse/rest/client/devices.py # synapse/storage/databases/main/roommember.py # synapse/storage/schema/__init__.py # tests/rest/client/test_devices.py # tests/rest/client/test_redactions.py
MSC3970 passed FCP, we can rip out the unstable config for it now.