Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Rooms created without any membership (only m.room.create event) #15005

Open
MadLittleMods opened this issue Feb 6, 2023 · 9 comments
Open

Rooms created without any membership (only m.room.create event) #15005

MadLittleMods opened this issue Feb 6, 2023 · 9 comments
Labels
A-Create-Room A-DAG Directed acyclic graph of events (events connected by prev_events) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Feb 6, 2023

Description

Trying to tie up any loose-ends with the Gitter import as part of the Gitter migration process and noticing that we have some Matrix rooms recorded that seem to broken on the Synapse side. Some of these rooms only have a m.room.create event in it while others have no events which means we can't interact with it at all since permission denied errors everywhere (no membership or power-level events).

The only way we recorded this room on the Gitter side is if we got a 200-response from Synapse originally when it was created. If Synapse is returning with a error-free response, there should be much better guarantees on the room being created successfully (atomic transaction). Of course it's possible this could have been corrupted after the fact though. What most likely happened was the rooms were attempted to be created but timed out. Then we introduced code to reconcile the case where a Matrix room alias exists but we don't have the bridged Matrix room entry recorded, see #15005 (comment)

They all seem to be created on January 10th, 2023 which is probably in the middle of importing messages and the import scripts.

Here is the list of affected rooms that I know of:

  • !gHbYoGxiEuXmahoYeh:gitter.im (only m.room.create)
  • !OpjflmXuauidodzmJH:gitter.im (only m.room.create)
  • !QCVddUkjrHLjEqeXQd:gitter.im (no events)
  • !wJBkwhqwLssHVjKErI:gitter.im (only m.room.create)
  • !kkjacXiigblOffltrw:gitter.im (only m.room.create)
  • !etJKlYVRXEtkkMdEOJ:gitter.im (no events)
  • !VeVZEnGasXJXpAYECh:gitter.im (only m.room.create)
  • !iJIbqcgSzsAFnwjVBI:gitter.im (no events)
  • !koFtykttKGRpjbzflq:gitter.im (no events)
  • !mxfIojpZcyQDBbmmoY:gitter.im (only m.room.create)
  • !ezxePTXYHMSIkpGOiy:gitter.im (only m.room.create)
  • !QlEqqNHuTSIHswllOX:gitter.im (only m.room.create)
  • !XInKHVcDAmmyGBEUwr:gitter.im (no events)
  • !KIKPFcoDnhBeZhLNUr:gitter.im (only m.room.create)
  • !BcBAOZpxjTAvMDfSHf:gitter.im (only m.room.create)
  • !DuyTBdNdlhfXoBFNqr:gitter.im (only m.room.create)
  • !qgfkmcpByUudrFHTMX:gitter.im (no events)
  • !tFwVYIoUNZQhRhjFNO:gitter.im (no events)
  • !qoZZIEjvblGKIkqEdj:gitter.im (no events)

In terms of moving Gitter forward, I think we can safely reconcile this by removing the bridged room entry in the Gitter database and the Matrix room alias and have them re-created (tracked by https://gitlab.com/gitterHQ/webapp/-/issues/2858#note_1267698515). But I'm unable to remove the Matrix room alias from the old room (M_FORBIDDEN) 😬

And it would still be good to fix the Synapse bugs that allow this to happen.

Steps to reproduce

  1. Not sure how this happened but it's all Gitter appservice interactions against Synapse

Homeserver

gitter.im

Synapse Version

1.74.0

Installation Method

I don't know

Database

PostgreSQL

Workers

Multiple workers

Platform

gitter.im (gitter.ems.host) runs on EMS but I don't know the details

Python version: 3.11.1

Configuration

No response

Relevant log output

The only relevant logs for this room from the time of creation is this message:

https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-8w,to:now-3w))&_a=(columns:!(_source),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20AND%20message:%22!!gHbYoGxiEuXmahoYeh:gitter.im%22'),sort:!(!('@timestamp',desc)))

2023-01-09 17:33:53,203 - synapse.replication.http.send_events - 149 - INFO - POST-4249- Got batch of events to send, last ID of batch is: $JeDmcg6bYb0s-d6pxIbgmfPaJoChH-qWlnTZ3n8xGdU, sending into room: !gHbYoGxiEuXmahoYeh:gitter.im

Database information:

# Connect to the database: https://gitlab.matrix.org/new-vector/internal/-/wikis/EMS-postgres#connecting-to-an-ems-database
$ ems-db-debug gitter.im

select * FROM rooms WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
            room_id            | is_public |         creator          | room_version | has_auth_chain_index
-------------------------------+-----------+--------------------------+--------------+----------------------
 !gHbYoGxiEuXmahoYeh:gitter.im | t         | @gitter-badger:gitter.im | 9            | t

select * FROM room_memberships WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
 event_id | user_id | sender | room_id | membership | forgotten | display_name | avatar_url
----------+---------+--------+---------+------------+-----------+--------------+------------
(0 rows)

select * FROM events WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
 topological_ordering |                   event_id                   |     type      |            room_id            | content | unrecognized_keys | processed | outlier | depth | origin_server_ts |  received_ts  |
   sender          | contains_url |       instance_name       | stream_ordering | state_key | rejection_reason
----------------------+----------------------------------------------+---------------+-------------------------------+---------+-------------------+-----------+---------+-------+------------------+---------------+-------
-------------------+--------------+---------------------------+-----------------+-----------+------------------
                    1 | $JeDmcg6bYb0s-d6pxIbgmfPaJoChH-qWlnTZ3n8xGdU | m.room.create | !gHbYoGxiEuXmahoYeh:gitter.im |         |                   | t         | f       |     1 |    1673285632772 | 1673285633224 | @gitte
r-badger:gitter.im | f            | synapse-event-persister-3 |        94262741 |           |
(1 row)

select * FROM room_aliases WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
            room_alias            |            room_id            |         creator
----------------------------------+-------------------------------+--------------------------
 #GreenShoeGarage_Lobby:gitter.im | !gHbYoGxiEuXmahoYeh:gitter.im | @gitter-badger:gitter.im
(1 row)

Anything else that would be useful to know?

No response

@clokep
Copy link
Member

clokep commented Feb 7, 2023

This is due to us creating the room not all in one transaction, e.g. the alias is created here:

directory_handler = self.hs.get_directory_handler()
if room_alias:
await directory_handler.create_association(
requester=requester,
room_id=room_id,
room_alias=room_alias,
servers=[self.hs.hostname],
check_membership=False,
)

But if anything fails after that point it will still exist.

@clokep
Copy link
Member

clokep commented Feb 7, 2023

In order to clean-up this up safely I ended up using the purge room API with {"erase": true}.

@MadLittleMods
Copy link
Contributor Author

MadLittleMods commented Feb 7, 2023

@clokep Why does the create room endpoint return 200 ✅ before the room is totally created? (worker hand-off with no checking?)

@clokep
Copy link
Member

clokep commented Feb 7, 2023

Why does the create room endpoint return 200 ✅ before the room is totally created? (worker hand-off with no checking?)

I haven't found a code-path that could do this, are we sure we got a 200 response and not a timeout or something?

It looks like the various worker hand-offs are all awaited properly, but maybe an error is being swallowed here:

# We now persist the event (and update the cache in parallel, since we
# don't want to block on it).
event, context = events_and_context[0]
result, _ = await make_deferred_yieldable(
gather_results(
(
run_in_background(
self._persist_events,
requester=requester,
events_and_context=events_and_context,
ratelimit=ratelimit,
extra_users=extra_users,
),
run_in_background(
self.cache_joined_hosts_for_events, events_and_context
).addErrback(log_failure, "cache_joined_hosts_for_event failed"),
),
consumeErrors=True,
)
).addErrback(unwrapFirstError)

Or maybe something weird is happening where we don't eject things from the pre-cache if the persisting fails?

@clokep
Copy link
Member

clokep commented Feb 7, 2023

I don't see other open issues that /createRoom isn't idempotent.

@DMRobertson
Copy link
Contributor

I don't see other open issues that /createRoom isn't idempotent.

It can take a transaction id and be PUT in Synapse. I think this is unspecced. There's another issue somewhere for this, let me find it

def on_PUT(
self, request: SynapseRequest, txn_id: str
) -> Awaitable[Tuple[int, JsonDict]]:
set_tag("txn_id", txn_id)
return self.txns.fetch_or_execute_request(request, self.on_POST, request)

@DMRobertson
Copy link
Contributor

There's another issue somewhere for this, let me find it

#14535

@DMRobertson
Copy link
Contributor

And also matrix-org/matrix-spec#222

@MadLittleMods
Copy link
Contributor Author

MadLittleMods commented Feb 7, 2023

I haven't found a code-path that could do this, are we sure we got a 200 response and not a timeout or something?

Good call!

It's possible that some recent code filled this in. We introduced some code to resolve a conflict where the Matrix alias exists but we don't have it stored as a bridged room entry yet: https://gitlab.com/gitterHQ/webapp/-/merge_requests/2357 (2023-01-21)

This looks very likely when I extract the date from the bridged room entries in the database since the dates come after that PR:

const mongoUtils = require('gitter-web-persistence-utils/lib/mongo-utils');

const bridgedRoomEntryIds = ['63e17d0e6da0373984be0760', '63ce5f4e6da0373984bd6166', '63ce57ba6da0373984bd612b', '63ce75b26da0373984bd6257', '63cca9586da0373984bd5910', '63ce771a6da0373984bd625a', '63ce59116da0373984bd6137', '63ccaf3a6da0373984bd591d', '63ce591d6da0373984bd6138', '63ce77586da0373984bd625b', '63ce697b6da0373984bd61c1', '63ce78886da0373984bd6264', '63ce79576da0373984bd6265', '63ce6acb6da0373984bd61c5', '63ce5e1f6da0373984bd6158', '63ccb4266da0373984bd5927', '63ce72496da0373984bd622b', '63ce92246da0373984bd6383', '63ccc1716da0373984bd593c'];

bridgedRoomEntryIds.map((bridgedRoomEntryId) => {
  return mongoUtils.getDateFromObjectId(bridgedRoomEntryId);
});

[
  2023-02-06T22:19:58.000Z,
  2023-01-23T10:19:58.000Z,
  2023-01-23T09:47:38.000Z,
  2023-01-23T11:55:30.000Z,
  2023-01-22T03:11:20.000Z,
  2023-01-23T12:01:30.000Z,
  2023-01-23T09:53:21.000Z,
  2023-01-22T03:36:26.000Z,
  2023-01-23T09:53:33.000Z,
  2023-01-23T12:02:32.000Z,
  2023-01-23T11:03:23.000Z,
  2023-01-23T12:07:36.000Z,
  2023-01-23T12:11:03.000Z,
  2023-01-23T11:08:59.000Z,
  2023-01-23T10:14:55.000Z,
  2023-01-22T03:57:26.000Z,
  2023-01-23T11:40:57.000Z,
  2023-01-23T13:56:52.000Z,
  2023-01-22T04:54:09.000Z
]
db.matricesbridgedroom (bridged room entries) of problem rooms
[
  {
    "_id" : ObjectId("63e17d0e6da0373984be0760"),
    "matrixRoomId" : "!gHbYoGxiEuXmahoYeh:gitter.im",
    "troupeId" : ObjectId("59a20334d73408ce4f731f30"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce5f4e6da0373984bd6166"),
    "matrixRoomId" : "!OpjflmXuauidodzmJH:gitter.im",
    "troupeId" : ObjectId("594d0c86d73408ce4f698f0f"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce57ba6da0373984bd612b"),
    "matrixRoomId" : "!QCVddUkjrHLjEqeXQd:gitter.im",
    "troupeId" : ObjectId("5a420ea4d73408ce4f84d8ef"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce75b26da0373984bd6257"),
    "matrixRoomId" : "!wJBkwhqwLssHVjKErI:gitter.im",
    "troupeId" : ObjectId("5a425341d73408ce4f84df73"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63cca9586da0373984bd5910"),
    "matrixRoomId" : "!kkjacXiigblOffltrw:gitter.im",
    "troupeId" : ObjectId("5a426551d73408ce4f84e18b"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce771a6da0373984bd625a"),
    "matrixRoomId" : "!etJKlYVRXEtkkMdEOJ:gitter.im",
    "troupeId" : ObjectId("5a68c628d73408ce4f89f2f6"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce59116da0373984bd6137"),
    "matrixRoomId" : "!VeVZEnGasXJXpAYECh:gitter.im",
    "troupeId" : ObjectId("5a690eecd73408ce4f89fcee"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ccaf3a6da0373984bd591d"),
    "matrixRoomId" : "!iJIbqcgSzsAFnwjVBI:gitter.im",
    "troupeId" : ObjectId("5a6a3541d73408ce4f8a25f9"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce591d6da0373984bd6138"),
    "matrixRoomId" : "!koFtykttKGRpjbzflq:gitter.im",
    "troupeId" : ObjectId("5a6a5b82d73408ce4f8a2aed"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce77586da0373984bd625b"),
    "matrixRoomId" : "!mxfIojpZcyQDBbmmoY:gitter.im",
    "troupeId" : ObjectId("5a6e1afad73408ce4f8a886f"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce697b6da0373984bd61c1"),
    "matrixRoomId" : "!ezxePTXYHMSIkpGOiy:gitter.im",
    "troupeId" : ObjectId("5a7d2122d73408ce4f8c5b93"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce78886da0373984bd6264"),
    "matrixRoomId" : "!QlEqqNHuTSIHswllOX:gitter.im",
    "troupeId" : ObjectId("5a8ec7e7d73408ce4f8e56a2"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce79576da0373984bd6265"),
    "matrixRoomId" : "!XInKHVcDAmmyGBEUwr:gitter.im",
    "troupeId" : ObjectId("5aa5a2fad73408ce4f90e688"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce6acb6da0373984bd61c5"),
    "matrixRoomId" : "!KIKPFcoDnhBeZhLNUr:gitter.im",
    "troupeId" : ObjectId("5aa61f1dd73408ce4f90eff0"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce5e1f6da0373984bd6158"),
    "matrixRoomId" : "!BcBAOZpxjTAvMDfSHf:gitter.im",
    "troupeId" : ObjectId("5b0e6243d73408ce4f9b3833"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ccb4266da0373984bd5927"),
    "matrixRoomId" : "!DuyTBdNdlhfXoBFNqr:gitter.im",
    "troupeId" : ObjectId("5b4d9450d73408ce4fa126e6"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce72496da0373984bd622b"),
    "matrixRoomId" : "!qgfkmcpByUudrFHTMX:gitter.im",
    "troupeId" : ObjectId("5bbf3ca6d73408ce4fab1b64"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ce92246da0373984bd6383"),
    "matrixRoomId" : "!tFwVYIoUNZQhRhjFNO:gitter.im",
    "troupeId" : ObjectId("618418896da037398489cf1f"),
    "__v" : 0
  },
  {
    "_id" : ObjectId("63ccc1716da0373984bd593c"),
    "matrixRoomId" : "!qoZZIEjvblGKIkqEdj:gitter.im",
    "troupeId" : ObjectId("63a5beea6da0373984a14ba5"),
    "__v" : 0
  }
]

Otherwise, normally I think it would have to be a 200 since we would need to receive a 200 with the room_id returned. But the above situation looks more likely 🕵️‍♀️

@MatMaul MatMaul added A-DAG Directed acyclic graph of events (events connected by prev_events) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Feb 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Create-Room A-DAG Directed acyclic graph of events (events connected by prev_events) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

4 participants