POST join from matrix bridge sometimes takes very long (> 30 sec.) #13461
Comments
I also seem to get a lot of these messages, and Synapse times out (goes down):
I am using the Ansible Docker image for Matrix. UPDATE:
Thank you for the detailed report. Do you see a pattern with these slow joins occurring for either local room joins (where a user from your homeserver is already in the room) or remote room joins (where your homeserver is joining this room for the first time)? We're actively working on making both cases faster - see these milestones: https://github.com/matrix-org/synapse/milestone/8, https://github.com/matrix-org/synapse/milestone/10. I'll pass your report on to the people working on the issue directly in case it's helpful, but my suggestion would be to try again once some of the above work lands in a released version of Synapse. You can follow the milestones and our changelog for updates.
Yes, all requests to join the same room will be de-duplicated: only one such request will actually do any work. One of the requests (the first one to arrive) will actually have done the work: it should show significantly higher DB/CPU time (and lots more log lines). If you can find that request, and share the logs for it, we can begin to find out why it was so slow.
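As a rough illustration of that de-duplication (Synapse itself is written in Python; this TypeScript sketch is purely conceptual, and all names in it are made up):

```typescript
// Conceptual sketch only: all concurrent join requests for the same room await
// a single in-flight operation; only the first caller does the actual work.
const inFlightJoins = new Map<string, Promise<void>>();

async function deduplicatedJoin(roomId: string, userId: string): Promise<void> {
  const existing = inFlightJoins.get(roomId);
  if (existing) {
    // Later callers do no work themselves; they just wait for the first one.
    return existing;
  }
  const join = doActualJoin(roomId, userId).finally(() => {
    inFlightJoins.delete(roomId);
  });
  inFlightJoins.set(roomId, join);
  return join;
}

// Placeholder for the real join work (state resolution, DB writes, ...).
async function doActualJoin(roomId: string, userId: string): Promise<void> {
  /* ... */
}
```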
@richvdh: the three join requests in the example of my original post above are from the bridge bot for three different rooms (!), all taking a long time and finishing at the same time. So your explanation (de-duplication of join requests for the same room) does not apply here.
These slow joins appear for local rooms on our homeserver. We manage the rooms via a matrix-bridge-based admin tool. Rooms are created by a central admin bot, and in subsequent runs of the admin tool, the central admin bot updates the rooms (adds/removes members, adjusts power levels, etc.). The slow joins appear (sometimes) when this bridge function is called: it makes sure that the central admin bot is currently a member of the room before it does other things with the room. The bot should always already be in the room, as the bot created the room and added itself in the first place. But
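For context, a rough sketch of that per-room flow (function and variable names here are illustrative, not the actual admin-tool code; the only call taken from this thread is the _ensureJoined call quoted in the description below):

```typescript
// Rough sketch of the admin tool's per-room update flow. Names are illustrative;
// only the _ensureJoined call is taken from this thread.
import { Bridge } from "matrix-appservice-bridge";

async function updateRoom(bridge: Bridge, roomAlias: string): Promise<void> {
  const intent = bridge.getIntent();

  // Make sure the central admin bot is in the room before touching it.
  // _ensureJoined is a private method of the Intent class, hence the cast;
  // this is the call that sometimes blocks for 30+ seconds.
  await (intent as any)._ensureJoined(roomAlias, false, undefined, true);

  // ... then adjust membership, power levels, avatar, etc. (hypothetical steps)
}
```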
And sorry for the long "timeout" in this thread; I was on vacation 😁
Any ideas on this? This is still happening for us on a regular basis (several times per day).
I know this isn't necessarily convenient, but it'd probably help to use some more tools to see what's going on.
When "/join"ing a room, there are two "lock"s. The first ensures that an application service can only do one /join at a time. The second ensures that a room can only process one /join from any user at a time. We're aware of slow local room joins for large rooms. That is, if one or more local users were to join a large local room, those joins would be queued one after the other. In the meantime, if the bridge were to /join that room, it would have to wait behind those users. Any additional /joins the bridge does would have to wait behind the /join for that big slow room too.
Going by these logs, it's the /join to
Probably due to #13965.
@squahtx I checked today's homeserver.log and found several occurrences of long-running joins (each around 80 seconds) that led to gateway timeouts on the bridge side. I checked two of those and had a look at the rooms; in both cases the room had only one member, the bridge user. Here is the complete log regarding one of those rooms, with the long-running join (blocking other operations) near the beginning of the log:
What puzzles me: how can "Failed to find any events in room !bGitvSgjvnaVLPhMVy" happen, right in the middle of other operations?
Just to confirm, were the rooms you checked the ones which took the very longest out of the entire group of joins?
It's hard to say more without tracing set up. If possible, I would avoid calling
There can be multiple requests being processed at the same time, so their log messages are interleaved. Searching for "GET-4933638", for example, will yield all log lines from that request.
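For example, a small helper along these lines (illustrative only; the log path is an assumption based on the Debian package layout) can pull out all lines for one request tag:

```typescript
// Illustrative helper to extract all log lines for one request tag
// (e.g. "POST-9744812" or "GET-4933638") from an interleaved homeserver.log.
import { readFileSync } from "node:fs";

function linesForRequest(logPath: string, requestTag: string): string[] {
  return readFileSync(logPath, "utf8")
    .split("\n")
    .filter((line) => line.includes(requestTag));
}

// Log path assumed from the Debian package layout; adjust for your installation.
console.log(
  linesForRequest("/var/log/matrix-synapse/homeserver.log", "GET-4933638").join("\n")
);
```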
@squahtx Yes, the rooms I checked are the ones with the longest running joins in the group of joins. I am pretty sure that the room size is not the problem here. The
I will try this, thank you! |
Was this issue resolved or does it still persist? |
I am still struggling with the Matrix-AppService-Bridge SDK. I changed my calls to /join for the
To give you an impression, our bridge handles ~15,000 room updates (members in/out, avatar check, ...) per day; during these, around 20 locks occur during
For us the issue is no longer pressing, as we could now prevent the calls to /join for the
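One possible shape of such a workaround (the thread does not show the exact code; whether Intent exposes getStateEvent and join in this form depends on the matrix-appservice-bridge version in use):

```typescript
// Hypothetical workaround: skip /join entirely when the bot is already a member.
// The membership check is a plain state lookup and does not go through the slow
// join path. getStateEvent/join signatures may differ between SDK versions.
import { Bridge } from "matrix-appservice-bridge";

async function ensureBotInRoom(
  bridge: Bridge,
  roomId: string,
  botUserId: string
): Promise<void> {
  const intent = bridge.getIntent();
  try {
    const member = await intent.getStateEvent(roomId, "m.room.member", botUserId);
    if (member?.membership === "join") {
      return; // Already joined; no /join needed.
    }
  } catch (err) {
    // No membership event found (or state lookup failed); fall through and join.
  }
  await intent.join(roomId);
}
```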
Description
Hi,
Our own Matrix bridge sometimes runs into long-running join requests (> 30 sec.), which trigger a timeout on the bridge side.
This is what the bridge executes:
return bridge.getIntent()._ensureJoined(roomAlias, false, undefined, true);
Typically this takes 0.03 sec., but sometimes it takes a very long time!
Here is the result from homeserver.log (filtered only for POST-9744812):
As you can see, the connection from the bridge is closed after 30 seconds (timeout); the POST request takes 70 seconds in total.
What we notice is that these long-running join requests sometimes occur in parallel with other, similar join requests.
Here, for example, are three join requests from the bridge bot for three different rooms, all taking a long time and finishing at the same time! The above request POST-9744812 is one of them:
It looks like something is blocking in Synapse (or the database, PostgreSQL in our case); when the lock is released, all requests finish at the same time. Runtimes of the three requests above: 40, 70 and 100 seconds.
Here are the POST join requests in the bigger context:
We have activated Prometheus metrics on our Synapse server, but could not find anything abnormal when these delays occurred.
Steps to reproduce
bridge.getIntent()._ensureJoined(roomAlias, false, undefined, true);
Homeserver
https://chat.church.tools/
Synapse Version
{"server_version":"1.63.1","python_version":"3.8.10"}
Installation Method
Debian packages from packages.matrix.org
Platform
Ubuntu 18.04
PostgreSQL Version
postgres (PostgreSQL) 10.20 (Ubuntu 10.20-0ubuntu0.18.04.1)
Relevant log output
Anything else that would be useful to know?
No response