
Large amounts of repeated DB queries when using multiple federation sender workers #9113

Closed
emgrav opened this issue Jan 14, 2021 · 11 comments · Fixed by #9114

Labels
A-Performance Performance, both client-facing and admin-facing

Comments

emgrav (Contributor) commented Jan 14, 2021

Description

wake_destinations_needing_catchup performs massive amounts of repeated SELECT queries when using multiple federation_sender workers. Using a single federation_sender works well.

Here are the debug logs from the federation senders, from startup up until the issue starts occurring. I don't think there's any value in providing more of the log, as the output just repeats itself after this point.

The actual destination in the query seems to vary, though; I've seen three destinations so far: toofat.ru, wc20.tencapsule.com and fpoe.info. The output of a sample query with each of the three destinations follows below:

synapse_prod=# SELECT destination FROM destinations WHERE destination IN ( SELECT destination FROM destination_rooms WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering ) AND destination > 'toofat.ru' AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < 1610625475512 ) ORDER BY destination LIMIT 25;
 destination
-------------
 wolffy.eu
(1 row)

synapse_prod=# SELECT destination FROM destinations WHERE destination IN ( SELECT destination FROM destination_rooms WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering ) AND destination > 'fpoe.info' AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < 1610625676424 ) ORDER BY destination LIMIT 25;
     destination
---------------------
 komputernerds.com
 matrix.insw.cz
 matrix.noppesict.nl
 supersandro.de
 thedreamer.nl
 toofat.ru
 wolffy.eu
(7 rows)

synapse_prod=# SELECT destination FROM destinations WHERE destination IN ( SELECT destination FROM destination_rooms WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering ) AND destination > 'wc20.tencapsule.com' AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < 1610631393252 ) ORDER BY destination LIMIT 25;
 destination
-------------
 wolffy.eu
(1 row)
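
For anyone digging into this further, the same query can be profiled directly in psql with EXPLAIN ANALYZE. The sketch below just reuses the parameters from the first sample query above; it is a generic Postgres profiling step, not something Synapse itself runs.

-- Generic Postgres profiling of the catch-up query, using the parameters
-- from the first sample above ('toofat.ru' and its timestamp).
EXPLAIN ANALYZE
SELECT destination FROM destinations
WHERE destination IN (
    SELECT destination FROM destination_rooms
    WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering
)
AND destination > 'toofat.ru'
AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < 1610625475512 )
ORDER BY destination
LIMIT 25;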

Steps to reproduce

  • Enable two or more federation_sender workers
  • Start the main process and all workers
  • Wait ~30 seconds

Of course, this is probably specific to me and the contents of my db, so naturally I'll provide any further information as needed. Just let me know! :)

Version information

  • Homeserver: queersin.space

If not matrix.org:

  • Version: 1.25 (note: problem occurred on 1.24 as well)

  • Install method: Via apt from the matrix.org repo

  • Platform: An LXC container running Ubuntu 20.04.1

clokep (Member) commented Jan 14, 2021

Looks like we just noticed this as well -- it should be fixed by #9114.

clokep added the A-Performance label on Jan 14, 2021

richvdh (Member) commented Jan 15, 2021

> Here are the debug logs

the link seems to be broken?

> performs massive amounts of repeated SELECT queries

why would the problem in #9114 cause lots of repeated queries (as opposed to a small number of very slow ones)?

emgrav (Contributor, Author) commented Jan 15, 2021

> the link seems to be broken?

Sorry, I deleted the logs when the issue was closed, in case there was any sensitive info in there I missed. The logs are back up now.

richvdh (Member) commented Jan 15, 2021

Thanks.

TL;DR: there is a lot of this going on:

2021-01-14 14:52:27,332 - synapse.storage.SQL - 289 - DEBUG - wake_destinations_needing_catchup-0 - [SQL values] {get_catch_up_outstanding_destinations-1a} ('toofat.ru', 1610632347331)
2021-01-14 14:52:27,514 - synapse.storage.SQL - 303 - DEBUG - wake_destinations_needing_catchup-0 - [SQL time] {get_catch_up_outstanding_destinations-1a} 0.181786 sec
2021-01-14 14:52:27,514 - synapse.storage.txn - 585 - DEBUG - wake_destinations_needing_catchup-0 - [TXN END] {get_catch_up_outstanding_destinations-1a} 0.182408 sec
2021-01-14 14:52:27,514 - synapse.storage.txn - 498 - DEBUG - wake_destinations_needing_catchup-0 - [TXN START] {get_catch_up_outstanding_destinations-1b}
2021-01-14 14:52:27,515 - synapse.storage.SQL - 284 - DEBUG - wake_destinations_needing_catchup-0 - [SQL] {get_catch_up_outstanding_destinations-1b} SELECT destination FROM destinations WHERE destination IN ( SELECT destination FROM destination_rooms WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering ) AND destination > ? AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < ? ) ORDER BY destination LIMIT 25
2021-01-14 14:52:27,515 - synapse.storage.SQL - 289 - DEBUG - wake_destinations_needing_catchup-0 - [SQL values] {get_catch_up_outstanding_destinations-1b} ('toofat.ru', 1610632347514)
2021-01-14 14:52:27,689 - synapse.storage.SQL - 303 - DEBUG - wake_destinations_needing_catchup-0 - [SQL time] {get_catch_up_outstanding_destinations-1b} 0.174550 sec
2021-01-14 14:52:27,689 - synapse.storage.txn - 585 - DEBUG - wake_destinations_needing_catchup-0 - [TXN END] {get_catch_up_outstanding_destinations-1b} 0.175038 sec
2021-01-14 14:52:27,690 - synapse.storage.txn - 498 - DEBUG - wake_destinations_needing_catchup-0 - [TXN START] {get_catch_up_outstanding_destinations-1c}
2021-01-14 14:52:27,690 - synapse.storage.SQL - 284 - DEBUG - wake_destinations_needing_catchup-0 - [SQL] {get_catch_up_outstanding_destinations-1c} SELECT destination FROM destinations WHERE destination IN ( SELECT destination FROM destination_rooms WHERE destination_rooms.stream_ordering > destinations.last_successful_stream_ordering ) AND destination > ? AND ( retry_last_ts IS NULL OR retry_last_ts + retry_interval < ? ) ORDER BY destination LIMIT 25
2021-01-14 14:52:27,690 - synapse.storage.SQL - 289 - DEBUG - wake_destinations_needing_catchup-0 - [SQL values] {get_catch_up_outstanding_destinations-1c} ('toofat.ru', 1610632347690)
2021-01-14 14:52:27,873 - synapse.storage.SQL - 303 - DEBUG - wake_destinations_needing_catchup-0 - [SQL time] {get_catch_up_outstanding_destinations-1c} 0.182199 sec
2021-01-14 14:52:27,873 - synapse.storage.txn - 585 - DEBUG - wake_destinations_needing_catchup-0 - [TXN END] {get_catch_up_outstanding_destinations-1c} 0.182860 sec

Again, I'm not sure that's consistent with #9114, so I'm going to reopen this.

[edited, sorry I copied the wrong bit of log initially!]

richvdh reopened this on Jan 15, 2021

erikjohnston (Member) commented:

Oh, right. This is it just iterating through all the destinations that need to be retried; since there are multiple federation senders, each one rapidly chews through the list until it hits destinations it actually needs to retry (I believe).

At some point we may want to clear out entries from the list that are, e.g., more than a month old, to stop the list from growing forever.
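
Purely as an illustration of the kind of cleanup meant here (not something Synapse currently does), and assuming the column layout visible in the queries above with retry_last_ts stored as milliseconds since the epoch, a prune of month-old entries might look roughly like this:

-- Illustrative sketch only, not part of Synapse: drop catch-up entries for
-- destinations whose last retry attempt was more than ~30 days ago.
-- Assumes retry_last_ts is milliseconds since the epoch, as the timestamps
-- earlier in this issue suggest.
DELETE FROM destination_rooms
WHERE destination IN (
    SELECT destination FROM destinations
    WHERE retry_last_ts IS NOT NULL
      AND retry_last_ts < EXTRACT(EPOCH FROM NOW() - INTERVAL '30 days') * 1000
);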

emgrav (Contributor, Author) commented Jan 15, 2021

Just to clarify how much of this is going on, I should add that the wake_destinations_needing_catchup job on each federation sender jumps to 100% DB use in the "DB usage by background jobs (including scheduling time)" metric.

erikjohnston (Member) commented:

Yeah, we're seeing this on matrix.org as well. I'm hoping that changing the query makes it fast enough that we don't care too much that there are so many of them, but we'll see.
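
For anyone wanting to check locally whether the catch-up query can be served from an index, the indexes on the two tables involved can be listed with a standard Postgres catalog query (the actual index set depends on your Synapse schema version):

-- Generic Postgres: list the indexes on the tables the catch-up query touches.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('destinations', 'destination_rooms')
ORDER BY tablename, indexname;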

emgrav (Contributor, Author) commented Jan 15, 2021

Oh, I should've been clearer at the start - it keeps going indefinitely. The three destinations I mentioned in the original issue are the ones I found repeated queries for. I only noticed it yesterday morning after it had been going throughout the night. Sorry, it's my first time opening an issue.

[screenshot attachment: 2021-01-15-114733_1234x284_scrot]

emgrav (Contributor, Author) commented Jan 28, 2021

Hi, just wanted to confirm that this was not related to #9114, and that it persists after upgrading to 1.26.0.

clokep (Member) commented Jan 28, 2021

#9198 might help this a bit too, although the cause seems a bit different? (Note that it will be in 1.27.0 and requires running workers via Redis.)

emgrav (Contributor, Author) commented Feb 19, 2021

This appears to be fixed in 1.27. Thanks a bunch for all your hard work! 😄

emgrav closed this as completed on Feb 19, 2021