Servers may sometimes temporarily blackhole federation traffic #1733
Comments
This is probably somehow related to #1732, which was also happening at the same time
the darmstadt->arasphere requests finally unblocked after 6 hours:
Looks like this is indeed #1729, and the attempt to get_missing_events was somehow blocked on another one.
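For anyone following along, here is a toy sketch of how one slow request can hold up every later request for the same room while other rooms carry on. It assumes handling is serialized per room by a lock; `room_locks`, `handle_request` and the room names are made up for illustration and this is not Synapse's actual code.

```python
import asyncio
from collections import defaultdict

# Hypothetical per-room serializer: one asyncio.Lock per room ID.
room_locks = defaultdict(asyncio.Lock)
completed = []


async def handle_request(room_id, name, work_seconds):
    # Requests for the same room run strictly one at a time.
    async with room_locks[room_id]:
        await asyncio.sleep(work_seconds)
        completed.append(name)


async def demo():
    # A slow request (standing in for a stuck get_missing_events) takes
    # the lock for room-a first.
    slow = asyncio.create_task(handle_request("room-a", "slow", 0.2))
    await asyncio.sleep(0)  # yield so the slow request acquires the lock

    # A later request for the same room has to wait behind it; a request
    # for a different room is unaffected.
    same_room = asyncio.create_task(handle_request("room-a", "same-room", 0.0))
    other_room = asyncio.create_task(handle_request("room-b", "other-room", 0.0))

    await asyncio.gather(slow, same_room, other_room)
    return list(completed)


order = asyncio.run(demo())
print(order)  # ['other-room', 'slow', 'same-room']
```

If the slow request takes hours rather than 0.2 seconds, everything queued behind it for that room waits for hours too.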
Yes, this is precisely what's happening. Looking at the first darmstadt->arasphere request to unblock, it seems that it was stacked up on an event_auth check for arasphere.net->productionservers.net for curbaf finally unblocking, triggered by a productionservers.net->arasphere.net transaction:
Meanwhile, looking at this blocking request (PUT-12409) - it started at 05:31 and was rattling around for ages (like, 8.4 hours) whilst blocking other stuff... ugh:
It looks like some of those blocks were presumably due to dependent requests being similarly blackholed, despite arasphere being on 0.18.6-rc1. The nightmare logs of this request as seen by arasphere are here:
In fact, every time arasphere received this txn from productionservers.net it triggered a multi-hour meltdown:
Looking at the first of these txns (PUT-12095), it seems arasphere was trying to track down a bunch of events from arasphere... which are not present in its own DB:
The full list of referenced events is:
These are all events of erik doing things on arasphere in Nov 2014. Some of them show up on matrix.org; none of them are on arasphere itself, probably because arasphere.net's DB is dead since then. So it looks like the original event that productionservers.net referenced was @erikj:arasphere.net leaving curbaf ($14168370037dJyNv:arasphere.net), which then triggers a meltdown of arasphere trying to find this event and similar ones from other servers. It goes around different broken servers (including ones which are down?!) trying to find the ancient events, which of course nobody has other than matrix.org.
A workaround for disabling spidering servers at random in get_missing_events has landed in #1734 and was released in 0.18.6-rc2. We're hoping these are the only things which were locking the room linearizer and so blackholing federation traffic for a given room. N.B. after 3 concurrent federation requests from a given server get stuck, all future ones get queued or 429'd by the federation rate limiting logic, which can then starve out the remote server from being able to federate to the local one entirely.
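To illustrate the starvation effect, here is a rough sketch of a limiter for a single remote server. Only the limit of 3 concurrent requests comes from the description above; the class, the method names and the queue bound of 2 are assumptions for the example, not Synapse's real rate limiter.

```python
class FederationLimiter:
    """Toy limiter for one remote server (illustrative only)."""

    def __init__(self, max_concurrent=3, max_queued=2):
        self.max_concurrent = max_concurrent  # limit mentioned in this issue
        self.max_queued = max_queued          # hypothetical queue bound
        self.in_flight = 0
        self.queued = 0

    def admit(self):
        # Run immediately while a slot is free, then queue, then reject.
        if self.in_flight < self.max_concurrent:
            self.in_flight += 1
            return "processing"
        if self.queued < self.max_queued:
            self.queued += 1
            return "queued"
        return "429"

    def finish(self):
        # A completed request frees its slot; a queued request, if any,
        # takes that slot straight away.
        if self.queued:
            self.queued -= 1
        else:
            self.in_flight -= 1


limiter = FederationLimiter()
# Seven requests arrive and none of them ever finishes (they are all stuck).
outcomes = [limiter.admit() for _ in range(7)]
print(outcomes)
# ['processing', 'processing', 'processing', 'queued', 'queued', '429', '429']
```

Because `finish()` is never called while the first three requests are stuck, every later request from that server ends up queued or rejected, which is the starvation described above.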
The problem is still there; productionservers.net just had a 25 minute hiatus whilst trying to find missing events. The first failed req from matrix.org was:
Meanwhile, during the outage, it still seems to have been querying lots of different servers for missing events:
I think this is basically covered by #1729 now
Possibly related to #1729, I'm seeing arasphere failing to respond to federation requests entirely from darmstadt. This leaks inbound connections on arasphere and outbound connections on darmstadt (which is running pre-0.18.6, so doesn't clean up long-lived outbound connections as per #1725).
See https://matrix.org/_matrix/media/v1/download/matrix.org/RciZsIFyASPpHBLFOaUcxhum for logs.
This is probably the root cause of the leaking FDs on HSes which caused widespread problems between Dec 26-30.