Description
Synapse 1.118.0 was working fine with PostgreSQL 11. After upgrading PostgreSQL to version 13 with pg_upgradecluster -m link, sync stopped working (we tried upgrading Synapse to 1.135.0, but the same problem persists).
We have two other instances working fine with PostgreSQL 13, but both of those run Synapse and PostgreSQL on the same server.
Initially the PostgreSQL logs showed errors like
2025-08-04 08:23:22.288 PDT [5973] synapse_user@synapse ERROR: canceling statement due to statement timeout
and homeserver.log had errors like,
grep "Connection from client lost" /opt/synapse/homeserver.log
2025-08-05 00:48:19,073 - synapse.http.site - 385 - INFO - POST-394 - Connection from client lost before response was sent
So this looked to be related to how PostgreSQL 13 handles statement timeouts:
https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
"The timeout is measured from the time a command arrives at the server until it is completed by the server. If multiple SQL statements appear in a single simple-query message, the timeout is applied to each statement separately. (PostgreSQL versions before 13 usually treated the timeout as applying to the whole query string.)"
Adding statement_timeout = 120000
resolved this issue on our staging server (an exact replica of the production server, accessed via an /etc/hosts entry; this lets us test login and sending messages, both of which worked after setting statement_timeout).
On the production server, however, the "Connection from client lost" error was still there, even though there are no longer any "canceling statement" errors in the PostgreSQL logs.
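To rule out a difference between staging and production, the value that the Synapse role actually sees can be checked from the Synapse host, since a per-role or per-database override would take precedence over postgresql.conf. The following is only a sketch: it assumes psycopg2 is available in the Synapse virtualenv, the host name and password are placeholders, and `effective_setting` and `main` are hypothetical helper names.

```python
def effective_setting(conn, name):
    """Return a PostgreSQL setting as seen by this session (DB-API connection)."""
    cur = conn.cursor()
    try:
        # current_setting() reports the value in effect for this session,
        # including any ALTER ROLE / ALTER DATABASE overrides.
        cur.execute("SELECT current_setting(%s)", (name,))
        return cur.fetchone()[0]
    finally:
        cur.close()


def main():
    import psycopg2  # assumption: installed alongside Synapse

    conn = psycopg2.connect(
        host="db.example.internal",  # placeholder for the database server
        dbname="synapse",            # database name as it appears in the log prefix
        user="synapse_user",         # role name as it appears in the log prefix
        password="CHANGE_ME",        # placeholder
    )
    try:
        print(effective_setting(conn, "statement_timeout"))
    finally:
        conn.close()
```

Calling `main()` on both staging and production should print the same value if the setting is applied the same way on both.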
This looked to be similar to matrix-org/synapse#12971 but none of the changes we tried so far fixed the issue.
Adding
sync_response_cache_duration: 15m to homeserver.yaml did not help.
Other things we tried: tweaking the cp_min/cp_max values (6/12 and 3/6).
We also added the example values from https://github.com/element-hq/synapse/blob/develop/docs/postgres.md#synapse-config and then the values below:
# seconds of inactivity after which TCP should send a keepalive message to the server
keepalives_idle: 60
# the number of seconds after which a TCP keepalive message that is not
# acknowledged by the server should be retransmitted
keepalives_interval: 60
# the number of TCP keepalives that can be lost before the client's connection
# to the server is considered dead
keepalives_count: 6
We changed the PostgreSQL parallelism settings (and also tried half of these values):
effective_io_concurrency = 200 # 1-1000; 0 disables prefetching
max_worker_processes = 16 # (change requires restart)
max_parallel_maintenance_workers = 4 # taken from max_parallel_workers
max_parallel_workers_per_gather = 6 # taken from max_parallel_workers
#parallel_leader_participation = on
max_parallel_workers = 16 # maximum number of max_worker_processes that
We upgraded the CPU from DigitalOcean regular to premium, and changed the nginx HTTP timeout values as per https://medium.com/@madhok.simran8/how-to-handle-request-timeouts-on-nginx-web-server-42905df2ae6c
http{
...
proxy_read_timeout 420;
proxy_connect_timeout 420;
proxy_send_timeout 420;
send_timeout 420;
...
}
If we revert to PostgreSQL 11 using DigitalOcean Droplet/volume snapshots, everything continues working. But the upcoming security update needs PostgreSQL 13 (from Synapse 1.120.0), so going back to PostgreSQL 11 is not a good option either.
Is there another setting to increase the timeout? Or a way to get a clearer error than
2025-08-05 00:48:19,073 - synapse.http.site - 385 - INFO - POST-394 - Connection from client lost before response was sent
If I understood this correctly, this refers to a timeout on the Matrix client side: possibly the database query took longer than the client's timeout.
Steps to reproduce
- synapse 1.118.0 with postgresql 11 was working fine
- upgrade postgresql to 13 on the dedicated database server
- now sync is broken, although the database logs still have entries and PostgreSQL seems to be processing things normally
- reverting back to PostgreSQL 11 fixes the issue
- but upcoming security fix will need postgresql 13
Homeserver
librem.one
Synapse Version
1.135.0 (as well as 1.118.0 and 1.134.0)
Installation Method
pip (from PyPI)
Database
PostgreSQL was upgraded from 11 to 13 (pg_upgradecluster -m link), a single PostgreSQL
Workers
Multiple workers
Platform
Two DigitalOcean Droplets (virtual machines with 32 GB RAM and 8 vCPUs), one for Synapse and one for PostgreSQL. The Synapse server runs on Debian bookworm, PostgreSQL on Debian bullseye (recently upgraded from Debian buster).
Configuration
message retention is set for 2y
use_presence: False
using ldap for authentication
Relevant log output
grep "Connection from client lost" /opt/synapse/homeserver.log
2025-08-05 00:48:19,073 - synapse.http.site - 385 - INFO - POST-394 - Connection from client lost before response was sent
This keeps repeating multiple times every minute
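To put a number on "multiple times every minute", the log can be summarised per minute. This is a rough sketch, assuming the default log line layout shown above (timestamp first, then " - "); `lost_connections_per_minute` and `main` are hypothetical helper names.

```python
import re
from collections import Counter

# Timestamp prefix of a homeserver.log line, captured down to the minute,
# e.g. "2025-08-05 00:48:19,073 - synapse.http.site - ..."
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} - ")
NEEDLE = "Connection from client lost before response was sent"


def lost_connections_per_minute(lines):
    """Count matching log lines, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = _TS.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def main(path="/opt/synapse/homeserver.log"):
    # errors="replace" keeps the scan going if the log has stray bytes
    with open(path, encoding="utf-8", errors="replace") as fh:
        counts = lost_connections_per_minute(fh)
    for minute, n in sorted(counts.items()):
        print(minute, n)
```

Running `main()` before and after a configuration change gives a quick way to see whether the rate of dropped requests actually moved.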
Anything else that would be useful to know?
No response