Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federated joins failing on 0.19.3 #1893

Closed
kroeckx opened this issue Feb 8, 2017 · 8 comments
Closed

Federated joins failing on 0.19.3 #1893

kroeckx opened this issue Feb 8, 2017 · 8 comments
Assignees

Comments

@kroeckx
Copy link

kroeckx commented Feb 8, 2017

I've joined #riot:matrix.org over 24 hours ago, but other home servers are still not showing it. After a few hours I tried to leave and join again, and the other servers seem to show that I joined and leaved at that time, so still didn't join. I didn't have this problem with joining any other room. The account is @kurt:roeckx.be.

When I open that room in riot I get: Matrix error: Forbidden.

@kroeckx
Copy link
Author

kroeckx commented Apr 2, 2017

So I have the same problem now. I joined #megolm:matrix.org at 2017-03-29 22:21:12 UTC, and now almost 4 days later people still don't see I'm in the room. I also still get the same error message when using riot on android, but riot-web doesn't seem to show an error. I'm using synapse 0.19.3. I've saved the log of that day if it's useful.

@ara4n ara4n changed the title Join not showing up Federated joins failing on 0.19.3 Apr 2, 2017
@ara4n
Copy link
Member

ara4n commented Apr 2, 2017

@kroeckx sent me the full logs from that day, and here's the request which spectacularly fails. The first server it tries to call /make_join isn't even alive (although i'm assuming the new federation retry fixes in 0.20 will solve that), but then the request fails entirely when it tries to talk use kolm.io for the make_join. So: why doesn't it retry further? And why does kolm.io's signature check fail?

@richvdh this feels similar to the join problems you were looking at the other week; is this related or a new issue?

2017-03-29 22:21:11,807 - synapse.access.https.8448 - 59 - INFO - POST-1114148- None - 8448 - Received request: POST /_matrix/client/r0/join/%23megolm%3Amatrix.org?access_token=<redacted>
2017-03-29 22:21:11,810 - synapse.http.outbound - 124 - INFO - POST-1114148- {GET-O-254447} [matrix.org] Sending request: GET matrix://matrix.org/_matrix/federation/v1/query/directory?room_alias=%23megolm%3Amatrix.org
2017-03-29 22:21:11,832 - synapse.http.outbound - 209 - INFO - POST-1114148- {GET-O-254447} [matrix.org] Result: 200 OK
2017-03-29 22:21:11,833 - synapse.util.async - 201 - INFO - POST-1114148- Acquired linearizer lock 'state_resolve_lock' for key frozenset([])
2017-03-29 22:21:11,834 - synapse.state - 320 - INFO - POST-1114148- Resolving state for !UCnwUWwIKhcpaPTHtR:sw1v.org with 0 groups
2017-03-29 22:21:11,834 - synapse.util.async - 208 - INFO - POST-1114148- Releasing linearizer lock 'state_resolve_lock' for key frozenset([])
2017-03-29 22:21:11,836 - synapse.util.async - 201 - INFO - POST-1114148- Acquired linearizer lock 'member' for key ('!UCnwUWwIKhcpaPTHtR:sw1v.org',)
2017-03-29 22:21:11,837 - synapse.util.async - 201 - INFO - POST-1114148- Acquired linearizer lock 'state_resolve_lock' for key frozenset([])
2017-03-29 22:21:11,837 - synapse.state - 320 - INFO - POST-1114148- Resolving state for !UCnwUWwIKhcpaPTHtR:sw1v.org with 0 groups
2017-03-29 22:21:11,837 - synapse.util.async - 208 - INFO - POST-1114148- Releasing linearizer lock 'state_resolve_lock' for key frozenset([])
2017-03-29 22:21:11,877 - synapse.http.outbound - 124 - INFO - POST-1114148- {GET-O-254448} [dmp-develop.com] Sending request: GET matrix://dmp-develop.com/_matrix/federation/v1/make_join/!UCnwUWwIKhcpaPTHtR:sw1v.org/@kurt:roeckx.be
2017-03-29 22:21:11,878 - twisted - 131 - INFO - - DNSDatagramProtocol starting on 31891
2017-03-29 22:21:11,878 - twisted - 131 - INFO - - Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dc43b6390>
2017-03-29 22:21:11,902 - twisted - 131 - INFO - POST-1114148- DNSDatagramProtocol starting on 43580
2017-03-29 22:21:11,903 - twisted - 131 - INFO - POST-1114148- Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dc43b6dd0>
2017-03-29 22:21:11,904 - twisted - 131 - INFO - POST-1114148- (UDP Port 31891 Closed)
2017-03-29 22:21:11,904 - twisted - 131 - INFO - POST-1114148- Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dc43b6390>
2017-03-29 22:21:11,904 - synapse.metrics - 162 - INFO - POST-1114148- Collecting gc 0
2017-03-29 22:21:11,930 - synapse.metrics - 162 - INFO - POST-1114148- Collecting gc 1
2017-03-29 22:21:12,012 - twisted - 131 - INFO - POST-1114148- DNSDatagramProtocol starting on 59282
2017-03-29 22:21:12,013 - twisted - 131 - INFO - POST-1114148- Starting protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dbfa734d0>
2017-03-29 22:21:12,015 - twisted - 131 - INFO - POST-1114148- (UDP Port 43580 Closed)
2017-03-29 22:21:12,015 - twisted - 131 - INFO - POST-1114148- Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dc43b6dd0>
2017-03-29 22:21:12,157 - synapse.http.endpoint - 248 - INFO - POST-1114148- Connecting to 96.126.102.63:8448
2017-03-29 22:21:12,159 - twisted - 131 - INFO - POST-1114148- (UDP Port 59282 Closed)
2017-03-29 22:21:12,160 - twisted - 131 - INFO - POST-1114148- Stopping protocol <twisted.names.dns.DNSDatagramProtocol object at 0x7f7dbfa734d0>
2017-03-29 22:21:12,161 - twisted - 131 - INFO - POST-1114148- Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f7dbc2e4050>
2017-03-29 22:21:12,317 - twisted - 131 - INFO - POST-1114148- Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f7dbc2e4050>
2017-03-29 22:21:12,461 - synapse.http.matrixfederationclient - 183 - WARNING - POST-1114148- {GET-O-254448} Sending request failed to dmp-develop.com: GET matrix://dmp-develop.com/_matrix/federation/v1/make_join/!UCnwUWwIKhcpaPTHtR:sw1v.org/@kurt:roeckx.be: ConnectionR
efusedError - ConnectionRefusedError: Connection refused
2017-03-29 22:21:12,462 - synapse.http.outbound - 209 - INFO - POST-1114148- {GET-O-254448} [dmp-develop.com] Result: ConnectionRefusedError - ConnectionRefusedError: Connection refused
2017-03-29 22:21:12,463 - synapse.federation.federation_client - 528 - WARNING - POST-1114148- Failed to make_join via dmp-develop.com: Connection refused
2017-03-29 22:21:12,463 - synapse.http.outbound - 124 - INFO - POST-1114148- {GET-O-254449} [darmstadt.ccc.de] Sending request: GET matrix://darmstadt.ccc.de/_matrix/federation/v1/make_join/!UCnwUWwIKhcpaPTHtR:sw1v.org/@kurt:roeckx.be
2017-03-29 22:21:12,465 - synapse.http.endpoint - 248 - INFO - - Connecting to 130.83.177.132:443
2017-03-29 22:21:12,468 - twisted - 131 - INFO - POST-1114148- Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f7dcc729b48>
2017-03-29 22:21:12,801 - synapse.handlers.typing - 79 - INFO - POST-1114148- Checking for typing timeouts
2017-03-29 22:21:12,802 - synapse.handlers.presence - 328 - INFO - POST-1114148- Handling presence timeouts
2017-03-29 22:21:13,329 - twisted - 131 - INFO - POST-1114148- Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f7df0364440>
2017-03-29 22:21:13,381 - synapse.http.outbound - 209 - INFO - POST-1114148- {GET-O-254449} [darmstadt.ccc.de] Result: 200 OK
2017-03-29 22:21:13,384 - synapse.http.outbound - 124 - INFO - POST-1114148- {PUT-O-254450} [darmstadt.ccc.de] Sending request: PUT matrix://darmstadt.ccc.de/_matrix/federation/v1/send_join/!UCnwUWwIKhcpaPTHtR:sw1v.org/$149082607314AhSnp:roeckx.be
2017-03-29 22:21:15,699 - synapse.http.outbound - 209 - INFO - POST-1114148- {PUT-O-254450} [darmstadt.ccc.de] Result: 200 OK
2017-03-29 22:21:15,817 - synapse.metrics - 162 - INFO - - Collecting gc 0
2017-03-29 22:21:15,870 - synapse.federation.federation_base - 151 - WARNING - POST-1114148- Signature check failed for $14799950451389Opbqa:kolm.io
2017-03-29 22:21:15,974 - synapse.http.outbound - 124 - INFO - POST-1114148- {GET-O-254451} [kolm.io] Sending request: GET matrix://kolm.io/_matrix/federation/v1/event/$14799950451389Opbqa:kolm.io/
2017-03-29 22:21:15,976 - synapse.http.endpoint - 248 - INFO - - Connecting to 91.134.136.82:8448
2017-03-29 22:21:15,977 - twisted - 131 - INFO - - Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f7dbdc425f0>
2017-03-29 22:21:16,059 - synapse.http.outbound - 209 - INFO - POST-1114148- {GET-O-254451} [kolm.io] Result: 403 Forbidden
2017-03-29 22:21:16,061 - synapse.federation.federation_client - 278 - INFO - POST-1114148- Failed to get PDU $14799950451389Opbqa:kolm.io from kolm.io because 403: Forbidden
2017-03-29 22:21:16,061 - synapse.federation.federation_base - 91 - WARNING - POST-1114148- Failed to find copy of $14799950451389Opbqa:kolm.io with valid signature

(...other unrelated stuff happens...)

2017-03-29 22:21:24,881 - synapse.util.async - 208 - INFO - POST-1114148- Releasing linearizer lock 'member' for key ('!UCnwUWwIKhcpaPTHtR:sw1v.org',)
2017-03-29 22:21:24,882 - synapse.access.https.8448 - 91 - INFO - POST-1114148- None - 8448 - {@kurt:roeckx.be} Processed request: 13075ms (320ms, 0ms) (8701ms/9) 62B 200 "POST /_matrix/client/r0/join/%23megolm%3Amatrix.org?access_token=<redacted> HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Riot/0.9.7 Chrome/53.0.2785.143 Electron/1.4.14 Safari/537.36"

@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

The first server it tries to call /make_join isn't even alive (although i'm assuming the new federation retry fixes in 0.20 will solve that),

Yes, there is more reuse of the 'backoffs' list in 0.20, which means that it is less likely to pick a dead server for a join. In this case it's not much of a problem because it switches to another server within 300ms.

but then the request fails entirely when it tries to talk use kolm.io for the make_join. So: why doesn't it retry further?

It doesn't use kolm.io for the make_join: it uses darmstadt.ccc.de; successfully, afaict.

And why does kolm.io's signature check fail?

That is almost certainly due to #2034 (also fixed in 0.20). It shouldn't be a significant problem here.

So from the point of view of roeckx.be, this looks like a successful join. The question is why that join didn't get propagated to other servers in the federation.

@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

On sw1v.org:

2017-03-29 23:21:13,943 - synapse.federation.federation_server - 157 - INFO - PUT-659608- Discarding PDU $149082607314AhSnp:roeckx.be from invalid origin darmstadt.ccc.de

The join event is rejected. I have a horrid feeling we might have broken joining over federation in 0.20.

@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

Hum, no, the join event is supposed to come from the joining server rather than the helper, so it's correct that sw1v.org rejected the join when it came from darmstadt.ccc.de.

@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

So it looks like joining rooms over federation has been somewhat broken ever since synapse 0.18.7, which included commit e10c527.

When a join happens over federation, the joining server creates a join event, and sends it to another server; it is that other server which is responsible for sending out the join event to the rest of the federation. The reason it's done that way round are slightly shrouded in the mysteries of time, but the point is, e10c527 broke it, so that all the other servers in the federation will reject the join, and only the two servers involved know that the new user has joined the room.

We've mostly been getting away with this because there's a high chance that the "other" server involved is matrix.org, which means that most other users know that the user has joined, and there's enough traffic in the room to trigger the "missing state" resolution mechanism. In this case, the problem is exacerbated by the fact that we have chosen darmstadt.ccc.de, where there are no remaining active users.

The fix is relatively easy: change the code affected in e10c527 to accept these join events. Once a few servers in the federation are updated, new join events should at least get propagated to those servers, and hopefully that should be enough to get traffic flowing and the missing-state mechanism will get the join to other servers.

@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

(it also looks like sending a message in the room is enough to make the join propagate to other servers, after a few minutes).

richvdh added a commit that referenced this issue Apr 3, 2017
Make sure that we accept join events from any server, rather than just the
origin server, to make the federation join dance work correctly.

(Fixes #1893).
@richvdh
Copy link
Member

richvdh commented Apr 3, 2017

Fixed by #2094, hopefully

@richvdh richvdh closed this as completed Apr 3, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants