Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix rare bug that broke looping calls #16210

Merged
merged 2 commits into from
Aug 30, 2023
Merged

Conversation

erikjohnston
Copy link
Member

We can't interact with the reactor from the main thread via looping call. If you do then you can break existing looping calls (due to changing list size during iteration).

Introduced in v1.90.0 / #15791.

Example stack trace:

2023-08-30 07:44:09,624 - twisted - 275 - CRITICAL - sentinel - 
Traceback (most recent call last):
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/task.py", line 238, in cb
    self._scheduleFrom(self.clock.seconds())
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/task.py", line 291, in _scheduleFrom
    self.call = self.clock.callLater(howLong(), self)
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/asyncioreactor.py", line 289, in callLater
    abs_time = self._asyncioEventloop.time() + self.timeout()
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/base.py", line 936, in timeout
    self._insertNewDelayedCalls()
  File "/home/erikj/.synapse39/lib/python3.9/site-packages/twisted/internet/base.py", line 924, in _insertNewDelayedCalls
    heappush(self._pendingTimedCalls, call)
RuntimeError: list changed size during iteration

We can't interact with the reactor from the main thread via looping
call.

Introduced in v1.90.0 / #15791.
@@ -0,0 +1 @@
Fix rare bug that broke looping calls, which could lead to e.g. linearly increasing memory usage. Introduced in v1.90.0.
Copy link
Contributor

@DMRobertson DMRobertson Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check: the symptom here is that GC never runs successfully?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but this could hit any looping call, so its not just hitting GC.

@erikjohnston erikjohnston marked this pull request as ready for review August 30, 2023 11:03
@erikjohnston erikjohnston requested a review from a team as a code owner August 30, 2023 11:03
Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems plausible. Is there any quick way to write a test that would capture this/ensure we don't screw up looping calls in the future?

@erikjohnston
Copy link
Member Author

Seems plausible. Is there any quick way to write a test that would capture this/ensure we don't screw up looping calls in the future?

I'm not sure how TBH, given its so rare to hit it. I wonder if we could add guards to Clock to handle main vs non-main threads? Not sure there is an easy way to tell?

@DMRobertson
Copy link
Contributor

I'm not sure how TBH, given its so rare to hit it. I wonder if we could add guards to Clock to handle main vs non-main threads? Not sure there is an easy way to tell?

Fair enough. Let's try it out and see if the "object sized changed during iteration" errors you were seeing vanish.

@erikjohnston erikjohnston merged commit a2e0d4c into develop Aug 30, 2023
35 of 37 checks passed
@erikjohnston erikjohnston deleted the erikj/fix_looping_calls branch August 30, 2023 13:18
hughns pushed a commit that referenced this pull request Sep 4, 2023
* Fix rare bug that broke looping calls

We can't interact with the reactor from the main thread via looping
call.

Introduced in v1.90.0 / #15791.

* Newsfile
DMRobertson pushed a commit that referenced this pull request Sep 5, 2023
- Add configuration setting for CAS protocol version. Contributed by Aurélien Grimpard. ([\#15816](#15816))
- Suppress notifications from message edits per [MSC3958](matrix-org/matrix-spec-proposals#3958). ([\#16113](#16113))
- Return a `Retry-After` with `M_LIMIT_EXCEEDED` error responses. ([\#16136](#16136))
- Add `last_seen_ts` to the [admin users API](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html). ([\#16218](#16218))
- Improve resource usage when sending data to a large number of remote hosts that are marked as "down". ([\#16223](#16223))

- Fix IPv6-related bugs on SMTP settings, adding groundwork to fix similar issues. Contributed by @evilham and @telmich (ungleich.ch). ([\#16155](#16155))
- Fix a spec compliance issue where requests to the `/publicRooms` federation API would specify `include_all_networks` as a string. ([\#16185](#16185))
- Fix inaccurate error message while attempting to ban or unban a user with the same or higher PL by spliting the conditional statements. Contributed by @leviosacz. ([\#16205](#16205))
- Fix a rare bug that broke looping calls, which could lead to e.g. linearly increasing memory usage. Introduced in v1.90.0. ([\#16210](#16210))
- Fix a long-standing bug where uploading images would fail if we could not generate thumbnails for them. ([\#16211](#16211))
- Fix a long-standing bug where we did not correctly back off from servers that had "gone" if they returned 4xx series error codes. ([\#16221](#16221))

- Update links to the [matrix.org blog](https://matrix.org/blog/). ([\#16008](#16008))
- Document which [admin APIs](https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/index.html) are disabled when experimental [MSC3861](matrix-org/matrix-spec-proposals#3861) support is enabled. ([\#16168](#16168))
- Document [`exclude_rooms_from_sync`](https://matrix-org.github.io/synapse/v1.92/usage/configuration/config_documentation.html#exclude_rooms_from_sync) configuration option. ([\#16178](#16178))

- Prepare unit tests for Python 3.12. ([\#16099](#16099))
- Fix nightly CI jobs. ([\#16121](#16121), [\#16213](#16213))
- Describe which rate limiter was hit in logs. ([\#16135](#16135))
- Simplify presence code when using workers. ([\#16170](#16170))
- Track per-device information in the presence code. ([\#16171](#16171), [\#16172](#16172))
- Stop using the `event_txn_id` table. ([\#16175](#16175))
- Use `AsyncMock` instead of custom code. ([\#16179](#16179), [\#16180](#16180))
- Improve error reporting of invalid data passed to `/_matrix/key/v2/query`. ([\#16183](#16183))
- Task scheduler: add replication notify for new task to launch ASAP. ([\#16184](#16184))
- Improve type hints. ([\#16186](#16186), [\#16188](#16188), [\#16201](#16201))
- Bump black version to 23.7.0. ([\#16187](#16187))
- Log the details of background update failures. ([\#16212](#16212))
- Cache device resync requests over replication. ([\#16241](#16241))

* Bump anyhow from 1.0.72 to 1.0.75. ([\#16141](#16141))
* Bump furo from 2023.7.26 to 2023.8.19. ([\#16238](#16238))
* Bump phonenumbers from 8.13.18 to 8.13.19. ([\#16237](#16237))
* Bump psycopg2 from 2.9.6 to 2.9.7. ([\#16196](#16196))
* Bump regex from 1.9.3 to 1.9.4. ([\#16195](#16195))
* Bump ruff from 0.0.277 to 0.0.286. ([\#16198](#16198))
* Bump sentry-sdk from 1.29.2 to 1.30.0. ([\#16236](#16236))
* Bump serde from 1.0.184 to 1.0.188. ([\#16194](#16194))
* Bump serde_json from 1.0.104 to 1.0.105. ([\#16140](#16140))
* Bump types-psycopg2 from 2.9.21.10 to 2.9.21.11. ([\#16200](#16200))
* Bump types-pyyaml from 6.0.12.10 to 6.0.12.11. ([\#16199](#16199))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants