Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Bridges take very long time to answer (~20min-2h) #13423

Closed
Tropaion opened this issue Jul 29, 2022 · 15 comments
Closed

Bridges take very long time to answer (~20min-2h) #13423

Tropaion opened this issue Jul 29, 2022 · 15 comments
Labels
A-Application-Service Related to AS support X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@Tropaion
Copy link

Description

Hello!
I'm running an homeserver on Debian11 Proxmox LXC with 2 CPUs and 4GB RAM (never peaked out until now).
On the homeserver I have Discord, Signal, Telegram and WhatsApp Bridge running.
When no one interacted with bridge bot (no matter which one) for a while and initiates a conversation (invite or message) the bot takes a really long time for the first answer to appear or for it to join the chat.
But after the first reply of the bot and the user answers thereafter, it works without problems.
When not contacting the bot for a while, the problem repeats. Weirdly, the bridging itself works the whole time.
Since this issue is with every bot I thougt it should be a problem with synapse.

I looked into the logs, but didn't find anything, in my opinion relevant. It looks like the messages take a long to be sent to the bridge bot, rebooting the bridge doesn't help, but after rebooting synapse, the bots are answering instantly.

log

Steps to reproduce

  1. Install Synapse via deb package manager
  2. Install bridges according to official instructions
  3. Enable double puppeting according to official instructions
  4. Write message to bot and wait

Homeserver

https://chat.flowerhouse.at

Synapse Version

{"server_version":"1.63.1","python_version":"3.9.2"}

Installation Method

Debian packages from packages.matrix.org

Platform

Debian11 ProxMox LXC

Relevant log output

Didn't find anything relevant in the logs

Anything else that would be useful to know?

No response

@reivilibre
Copy link
Contributor

Hi,

I think it would be good to have some of Synapse's logs, notably ones that show:

  • when the requests from clients to send a message are being handled
  • when requests are being sent (or attempts are made to send requests) to the application service

If you could include those and point out which of the messages are the slow-to-be-processed ones and which are OK, that would help in seeing whether this is an issue in Synapse (and if so, what it is) or otherwise figuring out where the problem is coming from. Thanks.

@reivilibre reivilibre added the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 3, 2022
@hubermat

This comment was marked as off-topic.

@Tropaion
Copy link
Author

Tropaion commented Aug 4, 2022

Hi,

I think it would be good to have some of Synapse's logs, notably ones that show:

  • when the requests from clients to send a message are being handled
  • when requests are being sent (or attempts are made to send requests) to the application service

If you could include those and point out which of the messages are the slow-to-be-processed ones and which are OK, that would help in seeing whether this is an issue in Synapse (and if so, what it is) or otherwise figuring out where the problem is coming from. Thanks.

Hi, I tried to get more information. I changed log level to debug and wrote to the SignalBot at 14:47, this time the bot took only three minutes and answered at 14:50.

I'm not sure what is relevant so here is the whole log: https://filebox.fhooecloud.at/index.php/s/rtFZnWCyFCd2iBe

I also looked in the old logs, I don't know if it is relevant but they are extremely huge, even though they were set to warning level.
The logs are nearly completely filled only with this message, many times a second (the last number differs):

synapse.logging.context - 81 - WARNING - federation_transaction_transmission_loop-227606 - Re-starting finished log context federation_transaction_transmission_loop-227606

@reivilibre
Copy link
Contributor

reivilibre commented Aug 4, 2022

@Tropaion

I think I may see your problem, take a look at this:

2022-08-04 14:47:57,925 - synapse.http.client - 401 - DEBUG - as-sender-signal-1 - Sending request PUT http://localhost:29328/transactions/10722?access_token=<redacted>
2022-08-04 14:47:57,926 - synapse.http.proxyagent - 223 - DEBUG - as-sender-signal-1 - Requesting b'http://localhost:29328/transactions/10722?access_token=leaked' via <HostnameEndpoint localhost:29328>

I don't really know what to suggest; possibly your use of LXC is causing some DNS trouble (I had some mysterious DNS trouble when using LXC a long long time ago).

I'd be tempted to check:

  • What does /etc/hosts say? I would expect that this file should normally have an entry for localhost in it!
  • What about /etc/resolv.conf?
  • If you see any lines like nameserver x.y.z.w, try dig @x.y.z.w localhost and see what it says.

If you get very very desperate, you can probably change the addresses of your bridges (in the registration YAML files) to say 127.0.0.1 rather than localhost, but it's not a pretty solution :/.

@reivilibre
Copy link
Contributor

reivilibre commented Aug 4, 2022

@hubermat I think it will be best if you open a new issue, please. Even though it may turn out to be the same in the end, there's a very good chance that it won't be the same, so there's a good chance we'll make this issue's history quite confusing to follow.

@Tropaion
Copy link
Author

Tropaion commented Aug 4, 2022

@reivilibre Thanks for your help.
This is /etc/hosts:

127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters
# --- BEGIN PVE ---
127.0.1.1 ChatMatrix.local ChatMatrix
# --- END PVE ---

And this is /etc/resolv.conf

nameserver 8.8.8.8
nameserver 8.8.4.4

BTW I don't know if it matters, but I have configured an localhosted AdGuard DNS on my router.
Digging 1:

root@ChatMatrix:/var/log/matrix-synapse# dig @8.8.8.8 localhost

; <<>> DiG 9.16.27-Debian <<>> @8.8.8.8 localhost
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49504
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;localhost.                     IN      A

;; AUTHORITY SECTION:
.                       85986   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022080400 1800 900 604800 86400

;; Query time: 12 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Aug 04 17:02:22 CEST 2022
;; MSG SIZE  rcvd: 113

Digging 2:

root@ChatMatrix:/var/log/matrix-synapse# dig @8.8.4.4 localhost

; <<>> DiG 9.16.27-Debian <<>> @8.8.4.4 localhost
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 32311
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;localhost.                     IN      A

;; AUTHORITY SECTION:
.                       85841   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022080400 1800 900 604800 86400

;; Query time: 28 msec
;; SERVER: 8.8.4.4#53(8.8.4.4)
;; WHEN: Thu Aug 04 17:03:18 CEST 2022
;; MSG SIZE  rcvd: 113

@reivilibre
Copy link
Contributor

@Tropaion

I'm definitely not sure if this is the problem or not, but you have two entries for localhost:

127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback

The ::1 entry is for IPv6. IPv6 triggers an immediate suspicion alert for me here.
My local machine has this in its hosts file:

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback

In other words, it doesn't call ::1 localhost.

Try removing localhost from the ::1 line and then trying again. I don't think you need to restart Synapse, but you can try in case something is stuck because it can't contact your application service.

Let me know how you get on.

@Tropaion
Copy link
Author

Tropaion commented Aug 4, 2022

@reivilibre For now it looks like most of the bots work (the bots are all online for now), but heisenbridge not anymore. It joins the room and writes the first info message but after that the bot is dead and never online.
This is my logs, maybe there is something? https://filebox.fhooecloud.at/index.php/s/jKAkMZECQB8j5fM
I wrote messages to it at the time around 1:28

@Tropaion
Copy link
Author

Tropaion commented Aug 8, 2022

@reivilibre Do you have an idea why?

@H-Shay
Copy link
Contributor

H-Shay commented Aug 9, 2022

A few questions: when you say

bot is dead and never online

are you determining this via presence or are there other indications that the bot is not working?

Also just to make sure this is clear, you removed localhost from the ::1 line and it solved the issue with the slow bot responses, but now heisenbridge (which was working before) is no longer working?

Finally, you mention writing messages to heisenbridge around 1:28, but the last timestamp on the logs provided is 1:15:02, could you provide logs that cover sending messages to heisenbridge?

@Tropaion
Copy link
Author

Tropaion commented Aug 9, 2022

are you determining this via presence or are there other indications that the bot is not working?

Bot is not answering anymore, not even after waiting for a while, like before. It isn't online either, the online indicator of the other bots is now constantly green.

Also just to make sure this is clear, you removed localhost from the ::1 line and it solved the issue with the slow bot responses, but now heisenbridge (which was working before) is no longer working?

That's correct.

Finally, you mention writing messages to heisenbridge around 1:28, but the last timestamp on the logs provided is 1:15:02, could you provide logs that cover sending messages to heisenbridge?

I will try to get another log tomorrow, it's too late tonight.

@Tropaion
Copy link
Author

Tropaion commented Aug 10, 2022

@H-Shay Here are the new logs: homeserver.log
23:18: joined room with bot
23:20 wrote VERSION command in chat
23:23 sent NETWORKS command

@Tropaion
Copy link
Author

@H-Shay Do these logs help? Can you find something?

@MadLittleMods MadLittleMods added the A-Application-Service Related to AS support label Sep 20, 2022
MadLittleMods added a commit that referenced this issue Sep 20, 2022
Saw that an example leak in
#13423 (comment)

```
2022-08-04 14:47:57,925 - synapse.http.client - 401 - DEBUG - as-sender-signal-1 - Sending request PUT http://localhost:29328/transactions/10722?access_token=<redacted>
2022-08-04 14:47:57,926 - synapse.http.proxyagent - 223 - DEBUG - as-sender-signal-1 - Requesting b'http://localhost:29328/transactions/10722?access_token=leaked' via <HostnameEndpoint localhost:29328>
```
@H-Shay
Copy link
Contributor

H-Shay commented Sep 20, 2022

Is this still an issue or has this been resolved?

@Tropaion
Copy link
Author

It got resolved, changing hosts did work, but after a completely new installation the bug was gone too.

MadLittleMods added a commit that referenced this issue Sep 23, 2022
This can happen specifically with an application service `/transactions/10722?access_token=leaked` request

Fix #13010

---

Saw an example leak in #13423 (comment)

```
2022-08-04 14:47:57,925 - synapse.http.client - 401 - DEBUG - as-sender-signal-1 - Sending request PUT http://localhost:29328/transactions/10722?access_token=<redacted>
2022-08-04 14:47:57,926 - synapse.http.proxyagent - 223 - DEBUG - as-sender-signal-1 - Requesting b'http://localhost:29328/transactions/10722?access_token=leaked' via <HostnameEndpoint localhost:29328>
```
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Application-Service Related to AS support X-Needs-Info This issue is blocked awaiting information from the reporter
Projects
None yet
Development

No branches or pull requests

5 participants