Skip to content

Conversation

@MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 17, 2025

Before #18849, we we're using our own custom LogContextScopeManager which tied the tracing scope to the LoggingContext. Since we created a new BackgroundProcessLoggingContext any time we run_as_background_process(...), the trace for the background work was separate from the trace that kicked of the work as expected (e.g. request trace is separate from the background process we kicked to fetch more messages from the federation).

Since we've now switched to the ContextVarsScopeManager (in #18849), the tracing scope now crosses the LoggingContext boundaries (and thread boundaries) without a problem. This means we end up with request traces that include all of the background work that we've kicked off bloating the trace and making it hard to understand what's going on.

This PR separates the traces again to how things were before. Additionally, things are even better now since I added some cross-link references between the traces to easily be able to jump between.

Follow-up to #18849


In the before, you can see that the trace is blown up by the background process (bgproc.qwer).

In the after, we now only have a little cross-link marker span (start_bgproc.qwer) to jump to background process trace.

Before After
before after

Testing strategy

  1. Run a Jaeger instance (https://www.jaegertracing.io/docs/1.6/getting-started/)
    $ docker run -d --name jaeger \
      -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
      -p 5775:5775/udp \
      -p 6831:6831/udp \
      -p 6832:6832/udp \
      -p 5778:5778 \
      -p 16686:16686 \
      -p 14268:14268 \
      -p 9411:9411 \
      jaegertracing/all-in-one:1.59.0
  2. Configure Synapse to use tracing:
    homeserver.yaml
    ## Tracing ##
    opentracing:
     enabled: true
     jaeger_config:
       sampler:
         type: const
         param: 1
       logging:
         false
  3. Make sure the optional opentracing dependency is installed: poetry install --extras all
  4. In the VersionsRestServlet, modify it to kick off a dummy background process (easy to test this way)
    from synapse.metrics.background_process_metrics import run_as_background_process
    
    async def _qwer() -> None:
        await self.clock.sleep(1)
    
    run_as_background_process("qwer", "test_server", _qwer)
  5. Run Synapse: poetry run synapse_homeserver --config-path homeserver.yaml
  6. Fire off a version requests: curl http://localhost:8008/_matrix/client/versions
  7. Visit http://localhost:16686/search to view the traces
    • Select the correct service
    • Look for the VersionsRestServlet operation
    • Press 'Find Traces' button
    • Select the relevant trace
    • Notice how the trace isn't bloated
    • Look for the start_bgproc.qwer span cross-linking to the background process
    • Jump to the other trace using the cross-link reference -> bgproc.qwer

Dev notes

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@MadLittleMods MadLittleMods marked this pull request as ready for review September 17, 2025 17:27
@MadLittleMods MadLittleMods requested a review from a team as a code owner September 17, 2025 17:27
Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that we're able to jump between the spans! I appreciate the work to make our observability stack nicer to work with.

Some small requests below, but otherwise this LGTM.

MadLittleMods and others added 3 commits September 25, 2025 16:16
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
See #18932 (comment)

Copied from `opentracing.Tracer.start_active_span`
@@ -0,0 +1 @@
Disconnect background process work from request trace.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recorded the flaky tests on #18537

❌ TestThreadReceiptsInSyncMSC4102 ->
https://github.com/element-hq/synapse/actions/runs/18021157721/job/51278706782

@MadLittleMods MadLittleMods merged commit 04721c8 into develop Sep 26, 2025
108 of 112 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/trace-run-as-background-process branch September 26, 2025 02:45
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @anoadragon453 🦈

MadLittleMods added a commit that referenced this pull request Oct 2, 2025
This functionality was originally introduced in
#18932
but previously worked fine because we were using
the `ContextVarsScopeManager` which had no problem
tracking the tracing scope across logcontext boundaries.
Now that we've switched back to our own custom
`LogContextScopeManager` which tracks the tracing scope
in the logcontext itself, we need to get the active
tracing span before we switch logcontext. (we switch
to the sentinel logcontext because of the `PreserveLoggingContext()`)
@MadLittleMods MadLittleMods added the A-Tracing OpenTracing (maybe OpenTelemetry in the future) label Oct 2, 2025
MadLittleMods added a commit that referenced this pull request Oct 2, 2025
…19007)

Revert #18849

Go back to our custom `LogContextScopeManager` after trying
OpenTracing's `ContextVarsScopeManager`.

Fix #19004

### Why revert?

For reference, with the normal reactor, `ContextVarsScopeManager` worked
just as good as our custom `LogContextScopeManager` as far as I can tell
(and even better in some cases). But since Twisted appears to not fully
support `ContextVar`'s, it doesn't work as expected in all cases.
Compounding things, `ContextVarsScopeManager` was causing errors with
the experimental `SYNAPSE_ASYNC_IO_REACTOR` option.

Since we're not getting the full benefit that we originally desired, we
might as well revert and figure out alternatives for extending the
logcontext lifetimes to support the use case we were trying to unlock
(c.f. #18804).

See
#19004 (comment)
for more info.


### Does this require backporting and patch releases?

No. Since `ContextVarsScopeManager` operates just as good with the
normal reactor and was only causing actual errors with the experimental
`SYNAPSE_ASYNC_IO_REACTOR` option, I don't think this requires us to
backport and make patch releases at all.



### Maintain cross-links between main trace and background process work

In order to maintain the functionality introduced in #18932 (cross-links between the background process trace and currently active trace), we also needed a small change.

Previously, when we were using `ContextVarsScopeManager`, it tracked the tracing scope across the logcontext changes without issue. Now that we're using our own custom `LogContextScopeManager` again, we need to capture the active span from the logcontext before we reset to the sentinel context because of the `PreserveLoggingContext()` below.

Added some tests to ensure we maintain the `run_as_background` tracing behavior regardless of the tracing scope manager we use.
Michael-Ixo pushed a commit to ixoworld/synapse that referenced this pull request Oct 23, 2025
Deployments that make use of the
[synapse-s3-storage-provider](https://github.com/matrix-org/synapse-s3-storage-provider)
module must upgrade to
[v1.6.0](https://github.com/matrix-org/synapse-s3-storage-provider/releases/tag/v1.6.0).
Using older versions of the module with this release of Synapse will prevent
users from being able to upload or download media.

No significant changes since 1.140.0rc1.

- Add [a new Media Query by ID Admin API](https://element-hq.github.io/synapse/v1.140/admin_api/media_admin_api.html#query-a-piece-of-media-by-id) that allows server admins to query and investigate the metadata of local or cached remote media via
  the `origin/media_id` identifier found in a [Matrix Content URI](https://spec.matrix.org/v1.14/client-server-api/#matrix-content-mxc-uris). ([\element-hq#18911](element-hq#18911))
- Add [a new Fetch Event Admin API](https://element-hq.github.io/synapse/v1.140/admin_api/fetch_event.html) to fetch an event by ID. ([\element-hq#18963](element-hq#18963))
- Update [MSC4284: Policy Servers](matrix-org/matrix-spec-proposals#4284) implementation to support signatures when available. ([\element-hq#18934](element-hq#18934))
- Add experimental implementation of the `GET /_matrix/client/v1/rtc/transports` endpoint for the latest draft of [MSC4143: MatrixRTC](matrix-org/matrix-spec-proposals#4143). ([\element-hq#18967](element-hq#18967))
- Expose a `defer_to_threadpool` function in the Synapse Module API that allows modules to run a function on a separate thread in a custom threadpool. ([\element-hq#19032](element-hq#19032))

- Fix room upgrade `room_config` argument and documentation for `user_may_create_room` spam-checker callback. ([\element-hq#18721](element-hq#18721))
- Compute a user's last seen timestamp from their devices' last seen timestamps instead of IPs, because the latter are automatically cleared according to `user_ips_max_age`. ([\element-hq#18948](element-hq#18948))
- Fix bug where ephemeral events were not filtered by room ID. Contributed by @frastefanini. ([\element-hq#19002](element-hq#19002))
- Update Synapse main process version string to include git info. ([\element-hq#19011](element-hq#19011))

- Explain how `Deferred` callbacks interact with logcontexts. ([\element-hq#18914](element-hq#18914))
- Fix documentation for `rc_room_creation` and `rc_reports` to clarify that a `per_user` rate limit is not supported. ([\element-hq#18998](element-hq#18998))

- Remove deprecated `LoggingContext.set_current_context`/`LoggingContext.current_context` methods which already have equivalent bare methods in `synapse.logging.context`. ([\element-hq#18989](element-hq#18989))
- Drop support for unstable field names from the long-accepted [MSC2732](matrix-org/matrix-spec-proposals#2732) (Olm fallback keys) proposal. ([\element-hq#18996](element-hq#18996))

- Cleanly shutdown `SynapseHomeServer` object, allowing artifacts of embedded small hosts to be properly garbage collected. ([\element-hq#18828](element-hq#18828))
- Update OEmbed providers to use 'X' instead of 'Twitter' in URL previews, following a rebrand. Contributed by @HammyHavoc. ([\element-hq#18767](element-hq#18767))
- Fix `server_name` in logging context for multiple Synapse instances in one process. ([\element-hq#18868](element-hq#18868))
- Wrap the Rust HTTP client with `make_deferred_yieldable` so it follows Synapse logcontext rules. ([\element-hq#18903](element-hq#18903))
- Fix the GitHub Actions workflow that moves issues labeled "X-Needs-Info" to the "Needs info" column on the team's internal triage board. ([\element-hq#18913](element-hq#18913))
- Disconnect background process work from request trace. ([\element-hq#18932](element-hq#18932))
- Reduce overall number of calls to `_get_e2e_cross_signing_signatures_for_devices` by increasing the batch size of devices the query is called with, reducing DB load. ([\element-hq#18939](element-hq#18939))
- Update error code used when an appservice tries to masquerade as an unknown device using [MSC4326](matrix-org/matrix-spec-proposals#4326). Contributed by @tulir @ Beeper. ([\element-hq#18947](element-hq#18947))
- Fix `no active span when trying to log` tracing error on startup (when OpenTracing is enabled). ([\element-hq#18959](element-hq#18959))
- Fix `run_coroutine_in_background(...)` incorrectly handling logcontext. ([\element-hq#18964](element-hq#18964))
- Add debug logs wherever we change current logcontext. ([\element-hq#18966](element-hq#18966))
- Update dockerfile metadata to fix broken link; point to documentation website. ([\element-hq#18971](element-hq#18971))
- Note that the code is additionally licensed under the [Element Commercial license](https://github.com/element-hq/synapse/blob/develop/LICENSE-COMMERCIAL) in SPDX expression field configs. ([\element-hq#18973](element-hq#18973))
- Fix logcontext handling in `timeout_deferred` tests. ([\element-hq#18974](element-hq#18974))
- Remove internal `ReplicationUploadKeysForUserRestServlet` as a follow-up to the work in element-hq#18581 that moved device changes off the main process. ([\element-hq#18988](element-hq#18988))
- Switch task scheduler from raw logcontext manipulation to using the dedicated logcontext utils. ([\element-hq#18990](element-hq#18990))
- Remove `MockClock()` in tests. ([\element-hq#18992](element-hq#18992))
- Switch back to our own custom `LogContextScopeManager` instead of OpenTracing's `ContextVarsScopeManager` which was causing problems when using the experimental `SYNAPSE_ASYNC_IO_REACTOR` option with tracing enabled. ([\element-hq#19007](element-hq#19007))
- Remove `version_string` argument from `HomeServer` since it's always the same. ([\element-hq#19012](element-hq#19012))
- Remove duplicate call to `hs.start_background_tasks()` introduced from a bad merge. ([\element-hq#19013](element-hq#19013))
- Split homeserver creation (`create_homeserver`) and setup (`setup`). ([\element-hq#19015](element-hq#19015))
- Swap near-end-of-life `macos-13` GitHub Actions runner for the `macos-15-intel` variant. ([\element-hq#19025](element-hq#19025))
- Introduce `RootConfig.validate_config()` which can be subclassed in `HomeServerConfig` to do cross-config class validation. ([\element-hq#19027](element-hq#19027))
- Allow any command of the `release.py` script to accept a `--gh-token` argument. ([\element-hq#19035](element-hq#19035))

* Bump Swatinem/rust-cache from 2.8.0 to 2.8.1. ([\element-hq#18949](element-hq#18949))
* Bump actions/cache from 4.2.4 to 4.3.0. ([\element-hq#18983](element-hq#18983))
* Bump anyhow from 1.0.99 to 1.0.100. ([\element-hq#18950](element-hq#18950))
* Bump authlib from 1.6.3 to 1.6.4. ([\element-hq#18957](element-hq#18957))
* Bump authlib from 1.6.4 to 1.6.5. ([\element-hq#19019](element-hq#19019))
* Bump bcrypt from 4.3.0 to 5.0.0. ([\element-hq#18984](element-hq#18984))
* Bump docker/login-action from 3.5.0 to 3.6.0. ([\element-hq#18978](element-hq#18978))
* Bump lxml from 6.0.0 to 6.0.2. ([\element-hq#18979](element-hq#18979))
* Bump phonenumbers from 9.0.13 to 9.0.14. ([\element-hq#18954](element-hq#18954))
* Bump phonenumbers from 9.0.14 to 9.0.15. ([\element-hq#18991](element-hq#18991))
* Bump prometheus-client from 0.22.1 to 0.23.1. ([\element-hq#19016](element-hq#19016))
* Bump pydantic from 2.11.9 to 2.11.10. ([\element-hq#19017](element-hq#19017))
* Bump pygithub from 2.7.0 to 2.8.1. ([\element-hq#18952](element-hq#18952))
* Bump regex from 1.11.2 to 1.11.3. ([\element-hq#18981](element-hq#18981))
* Bump serde from 1.0.224 to 1.0.226. ([\element-hq#18953](element-hq#18953))
* Bump serde from 1.0.226 to 1.0.228. ([\element-hq#18982](element-hq#18982))
* Bump setuptools-rust from 1.11.1 to 1.12.0. ([\element-hq#18980](element-hq#18980))
* Bump twine from 6.1.0 to 6.2.0. ([\element-hq#18985](element-hq#18985))
* Bump types-pyyaml from 6.0.12.20250809 to 6.0.12.20250915. ([\element-hq#19018](element-hq#19018))
* Bump types-requests from 2.32.4.20250809 to 2.32.4.20250913. ([\element-hq#18951](element-hq#18951))
* Bump typing-extensions from 4.14.1 to 4.15.0. ([\element-hq#18956](element-hq#18956))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Tracing OpenTracing (maybe OpenTelemetry in the future)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants