Rate Limiter: RecursionError: maximum recursion depth exceeded
#14480
There's some very weird stuff going on with the ratelimiter on this process. It seems like lots of requests stack up, and then all suddenly get released at once:
Some requests are going missing altogether:
(note that it is never responded to). Also, er: (that's actually a different worker; federation-reader-1, from which the above logs are taken, also shows an increase, though less extreme)
Re: the graph: a few servers in the wild have been reporting backlogs in matrix.org->them traffic (their server falls behind) - is it possible something happened in the wild, or synapse changed something to do with transaction handling?
Sorry if this is off-topic, but could https://status.matrix.org/ be updated to reflect this issue?
I hadn't realised we also tracked the status of the libera bridge there. Status page updated.
Libera seems happier these days. I'll leave this issue open in case we see this failure mode again.
RecursionError: maximum recursion depth exceeded
This is basically #2532 but for the per-host rate limiters. The federation rate limiter is per-host, and its usage is in synapse/federation/transport/server/_base.py, lines 341 to 352 in d8cc86e.
When the rate limiter context manager is `__exit__`ed, the next request is allowed to continue execution out of synapse/util/ratelimitutils.py, lines 365 to 375 in d8cc86e.
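To make the shape of the problem concrete, here is a minimal, self-contained sketch of that pattern. The `TinyPerHostLimiter` class and its names are hypothetical (this is not Synapse's `_PerHostRatelimiter`); the point it illustrates is that each completing request fires the next waiter's Deferred synchronously, so every queued request is resumed a few frames deeper in the same stack:

```python
# Hypothetical sketch of the synchronous-resumption pattern; not Synapse code.
import inspect

from twisted.internet import defer


class TinyPerHostLimiter:
    """Lets one request per host run at a time; later requests queue as Deferreds."""

    def __init__(self):
        self._in_flight = False
        self._queue = []

    def wait(self):
        d = defer.Deferred()
        if not self._in_flight:
            # Slot is free: let this request run straight away.
            self._in_flight = True
            d.callback(None)
        else:
            # Slot is busy: park the request until an earlier one exits.
            self._queue.append(d)
        return d

    def on_exit(self):
        # The problematic part: the next waiter is resumed *synchronously*,
        # so its handler runs inside the completing request's stack frame.
        if self._queue:
            self._queue.pop(0).callback(None)
        else:
            self._in_flight = False


limiter = TinyPerHostLimiter()

# The first request acquires the slot; we hold it open so a queue builds up.
limiter.wait().addCallback(lambda _: None)


def abandoned_handler(_):
    # An abandoned request has nothing to do: it just releases the limiter.
    print("resumed at stack depth", len(inspect.stack(0)))
    limiter.on_exit()


# Queue up requests whose clients have already given up.
for _ in range(100):
    limiter.wait().addCallback(abandoned_handler)

# When the first request finally exits, the whole queue unwinds recursively:
# each on_exit() starts the next request deeper in the same stack, and the
# printed stack depth climbs with every resumed request.
limiter.on_exit()
```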
If the requesting homeserver has also given up on the next request, the next request after that is resumed, and so on. Each level of abandoned request adds 17 stack frames*. At around 50 abandoned requests, that comes out to 850 stack frames, which is close enough to Python's default recursion limit of 1,000 that the rest of the request processing tips it over.

* determined through experiment

Experiment: the stack depth was logged at the point where an abandoned request was resumed:
```python
# requires `import inspect` in the module where this warning is logged
logger.warning(
    "client disconnected before we started processing "
    "request [stack depth: %d]",
    len(inspect.stack(0)),
)
```
```yaml
# homeserver.yaml: aggressive federation rate limiting used for the experiment.
# concurrent: 1 forces requests from a host to be processed one at a time, so
# every additional request queues; reject_limit: 50 caps how many may wait.
rc_federation:
  window_size: 1000
  sleep_limit: 10000
  sleep_delay: 500
  reject_limit: 50
  concurrent: 1
```
```sh
scripts-dev/federation_client.py --destination "test.homeserver" "/_matrix/federation/v1/query/profile?user_id=%40test%3Atest.homeserver&field=displayname"
```
```sh
#!/usr/bin/sh
# Repeatedly fire off 100 concurrent federation requests, then kill the curl
# processes so the requests are abandoned while still queued in the rate limiter.
while true; do
    for i in `seq 100`; do
        curl "https://test.homeserver/_matrix/federation/v1/query/profile?user_id=%40test%3Atest.homeserver&field=displayname" -H 'Authorization: X-Matrix origin=...' &
    done
    pkill curl
done
```
In short, this failure mode happens when close to 50 (all) of the requests in the rate limiter queue have been abandoned.
Is there a way to clear the abandoned requests from the stack?
Yep, we can either
both of which resume the next request later. The first option adds another reactor tick to the response time, which I don't like.
This was fixed by #14812:

When there are many synchronous requests waiting on a `_PerHostRatelimiter`, each request will be started recursively just after the previous request has completed. Under the right conditions, this leads to stack exhaustion. A common way for requests to become synchronous is when the remote client disconnects early, because the homeserver is overloaded and slow to respond. Avoid stack exhaustion under these conditions by deferring subsequent requests until the next reactor tick. Fixes #14480.

Signed-off-by: Sean Quah <seanq@matrix.org>
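As a rough illustration of the approach described in that fix (a sketch of the general pattern only, not the actual patch; it reuses the shape of the hypothetical `TinyPerHostLimiter` from the earlier sketch), the exit path hands the next waiter to the reactor instead of firing it synchronously:

```python
# Sketch only: defer the next waiter to the next reactor tick so it starts from
# a fresh stack rather than nesting inside the completing request's frame.
from twisted.internet import reactor


def on_exit(self):
    if self._queue:
        next_waiter = self._queue.pop(0)
        # callLater(0, ...) runs the callback on the next reactor tick.
        reactor.callLater(0, next_waiter.callback, None)
    else:
        self._in_flight = False
```

This trades a little extra latency per queued request (the extra reactor tick noted above) for a bounded stack depth.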
This can't be good (observed in the federation-reader logs on libera.chat).