Enabling Metrics while having a federation_sender sharded causes an error #14112
Any idea when? How far back do your logs go? Can you pin it down to a Synapse version?
Is there a Synapse worker which actually crashes? You mention that Synapse doesn't appear to stop functioning.
Additionally:
These aren't supported and aren't likely to cause the problem... but have you tried without these options?
I'm sorry to say I don't have an exact date; I only noticed it for the first time around Oct 5, 2022. I did take a 2-minute debug log sample of all the workers and the master on Oct 6, but I haven't combed through it yet for privacy, and since it's a lot, I wasn't sure where I could put it here without it being read through by random members of the public.
Poor choice of words on my part. An exception occurs, but Synapse doesn't seem to actually crash.
I'm aware they aren't supported, but they were convenient and appear to work well for what they do. I will try again without them and edit this post afterwards. Edit: the exception still occurs, but I do have additional info I missed before. More below.
Context: the logs from before and after have 3 federation_senders defined. It looks like it's just manifesting as a metrics issue, but this time I saw a _save_and_send_ack error first. I went back and looked, and this error was also in the original log sample I grabbed.
Afterwards, the metrics exception from the original issue (specifically, process_event_queue_for_federation) does repeat. This error shows up in 2 of the federation_sender logs. In the sample from Oct 6, it was in sender logs 1 and 3, but none of these errors were in sender log 2. Instead, sender log 2 had:
Which did repeat once more. I'll dump what I think is the relevant portion, edited down, below. federation_sender1 log 10-06-2022
federation_sender2 log 10-06-2022
federation_sender3 log 10-06-2022
I think I may have sorted this out, but it still raises some questions that may need to be addressed. The root of the problem appears to be configure_workers_and_start.py, which builds an instance_map and the sharding configuration, but then doesn't actually write any of it out. Adding those entries to my shared config manually made the exceptions disappear. This might just be my deployment; I'll check more on it.
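For illustration, a minimal sketch of the kind of entries I ended up adding to the shared config by hand; the worker names, hosts, and ports here are placeholders, not values generated by the script:

```yaml
# Hypothetical shared-config entries added manually; hosts/ports are placeholders.
instance_map:
  federation_sender1:
    host: localhost
    port: 18009
  federation_sender2:
    host: localhost
    port: 18010
  federation_sender3:
    host: localhost
    port: 18011

# Tell Synapse which workers share the federation-sending load.
federation_sender_instances:
  - federation_sender1
  - federation_sender2
  - federation_sender3
```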
It was my deployment, but not completely. There is a method in configure_workers_and_start.py that calls strip() and split(), but it does so in the wrong order. If there are spaces in the string of worker types (like "worker1, worker2, worker3" instead of "worker1,worker2,worker3"), then it doesn't parse completely right when converting from string to object list to JSON (I think; Python isn't my native language 😅). Anyway, change
to
and it's all good. Just some sanitizing. A sketch of the kind of change follows below.
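To make the intent concrete, here is a rough sketch of the change described above; the variable names are illustrative, not necessarily the exact ones in configure_workers_and_start.py:

```python
import os

raw = os.environ.get("SYNAPSE_WORKER_TYPES", "")

# Before (roughly): stripping the whole string first leaves the spaces that
# follow each comma, so "worker1, worker2" becomes ["worker1", " worker2"].
broken = raw.strip().split(",")

# After: split first, then strip each element, so spacing in the env var
# no longer matters.
worker_types = [w.strip() for w in raw.split(",")]
```

With the split done first, "worker1, worker2, worker3" and "worker1,worker2,worker3" parse to the same list.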
It's not the exact spot I thought it was. I've further tracked this down to line 448.
Description
My homeserver uses workers extensively, and I like to see the graphs, so I enabled metrics. Somewhat recently, I started noticing this log output. This crash does not seem to happen with only a single federation_sender. Synapse doesn't appear to stop functioning; it just doesn't provide the actual metric data.
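For context, this is roughly how metrics are switched on in my setup (a minimal sketch; the port is a placeholder):

```yaml
# homeserver.yaml (shared config)
enable_metrics: true
```

```yaml
# per-worker config, e.g. for a federation_sender
worker_listeners:
  - type: metrics
    port: 9101
```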
Steps to reproduce
Homeserver
matrix.littlevortex.net
Synapse Version
1.69.0rc2 from the develop branch
Installation Method
Other (please mention below)
Platform
Custom Docker image assembled from your Dockerfile and Dockerfile-workers. View it here. Python 3.10, and I run with TEST_ONLY_IGNORE_POETRY_LOCKFILE and SYNAPSE_USE_EXPERIMENTAL_FORKING_LAUNCHER enabled, so I always get the latest dependencies and faster start-up.
I run an Unraid server with a 48-core dual-Xeon setup and 128 GB of RAM, and my normal worker set has 30 members (including the master).
Relevant log output
Anything else that would be useful to know?
When two federation_senders are used, this error does occur in both logs.