
Commit 04721c8

Disconnect background process work from request trace (#18932)
Before #18849, we were using our own custom `LogContextScopeManager`, which tied the tracing scope to the `LoggingContext`. Since we created a new `BackgroundProcessLoggingContext` any time we called `run_as_background_process(...)`, the trace for the background work was separate from the trace that kicked off the work, as expected (e.g. the request trace is separate from the background process we kicked off to fetch more messages from federation). Since we've now switched to the `ContextVarsScopeManager` (in #18849), the tracing scope crosses `LoggingContext` boundaries (and thread boundaries) without a problem. This means we end up with request traces that include all of the background work we've kicked off, bloating the trace and making it hard to understand what's going on.

This PR separates the traces again, restoring the previous behaviour. Additionally, things are even better now since I added some cross-link references between the traces to make it easy to jump between them.

Follow-up to #18849

---

In the before, you can see that the trace is blown up by the background process (`bgproc.qwer`). In the after, we now only have a little cross-link marker span (`start_bgproc.qwer`) to jump to the background process trace.

Before | After
--- | ---
<some image> | <some image>

### Testing strategy

1. Run a Jaeger instance (https://www.jaegertracing.io/docs/1.6/getting-started/)

   ```shell
   $ docker run -d --name jaeger \
     -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
     -p 5775:5775/udp \
     -p 6831:6831/udp \
     -p 6832:6832/udp \
     -p 5778:5778 \
     -p 16686:16686 \
     -p 14268:14268 \
     -p 9411:9411 \
     jaegertracing/all-in-one:1.59.0
   ```
1. Configure Synapse to use tracing in `homeserver.yaml`:

   ```yaml
   ## Tracing ##
   opentracing:
     enabled: true
     jaeger_config:
       sampler:
         type: const
         param: 1
       logging: false
   ```
1. Make sure the optional `opentracing` dependency is installed: `poetry install --extras all`
1. Modify the `VersionsRestServlet` to kick off a dummy background process (easy to test this way):

   ```python
   from synapse.metrics.background_process_metrics import run_as_background_process

   async def _qwer() -> None:
       await self.clock.sleep(1)

   run_as_background_process("qwer", "test_server", _qwer)
   ```
1. Run Synapse: `poetry run synapse_homeserver --config-path homeserver.yaml`
1. Fire off a versions request: `curl http://localhost:8008/_matrix/client/versions`
1. Visit http://localhost:16686/search to view the traces
   - Select the correct service
   - Look for the `VersionsRestServlet` operation
   - Press the 'Find Traces' button
   - Select the relevant trace
   - Notice how the trace isn't bloated
   - Look for the `start_bgproc.qwer` span cross-linking to the background process
   - Jump to the other trace using the cross-link reference -> `bgproc.qwer`
1 parent d2a966f commit 04721c8

5 files changed (+116, -9 lines changed)

changelog.d/18932.misc

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+Disconnect background process work from request trace.
```

synapse/events/auto_accept_invites.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -119,7 +119,6 @@ async def on_new_event(self, event: EventBase, *args: Any) -> None:
                 event.state_key,
                 event.room_id,
                 "join",
-                bg_start_span=False,
             )
 
             if is_direct_message:
```

synapse/logging/opentracing.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -576,7 +576,9 @@ def start_active_span_follows_from(
     operation_name: str,
     contexts: Collection,
     child_of: Optional[Union["opentracing.Span", "opentracing.SpanContext"]] = None,
+    tags: Optional[Dict[str, str]] = None,
     start_time: Optional[float] = None,
+    ignore_active_span: bool = False,
     *,
     inherit_force_tracing: bool = False,
     tracer: Optional["opentracing.Tracer"] = None,
@@ -591,9 +593,16 @@ def start_active_span_follows_from(
         span will be the parent. (If there is no currently active span, the first
         span in `contexts` will be the parent.)
 
+        tags: an optional dictionary of span tags. The caller gives up ownership of that
+            dictionary, because the :class:`Tracer` may use it as-is to avoid extra data
+            copying.
+
         start_time: optional override for the start time of the created span. Seconds
             since the epoch.
 
+        ignore_active_span: an explicit flag that ignores the current active
+            scope and creates a root span.
+
         inherit_force_tracing: if set, and any of the previous contexts have had tracing
             forced, the new span will also have tracing forced.
         tracer: override the opentracing tracer. By default the global tracer is used.
@@ -606,7 +615,9 @@ def start_active_span_follows_from(
         operation_name,
         child_of=child_of,
         references=references,
+        tags=tags,
         start_time=start_time,
+        ignore_active_span=ignore_active_span,
         tracer=tracer,
     )
```

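For illustration, a minimal sketch of calling `start_active_span_follows_from` with the two new parameters; `other_span_context` and `do_work` are placeholders rather than names from this commit:

```python
from synapse.logging.opentracing import start_active_span_follows_from

# Start a disconnected root span (ignoring whatever span is currently active)
# that still carries a FOLLOWS_FROM reference to `other_span_context`
# (placeholder), and attach some tags to it.
with start_active_span_follows_from(
    "example_operation",
    contexts=[other_span_context],
    tags={"example.tag": "value"},
    ignore_active_span=True,
):
    do_work()  # placeholder for the traced work
```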
synapse/metrics/background_process_metrics.py

Lines changed: 97 additions & 8 deletions
```diff
@@ -20,15 +20,17 @@
 
 import logging
 import threading
-from contextlib import nullcontext
+from contextlib import contextmanager, nullcontext
 from functools import wraps
 from types import TracebackType
 from typing import (
     TYPE_CHECKING,
     Any,
     Awaitable,
     Callable,
+    ContextManager,
     Dict,
+    Generator,
     Iterable,
     Optional,
     Protocol,
@@ -49,7 +51,12 @@
     LoggingContext,
     PreserveLoggingContext,
 )
-from synapse.logging.opentracing import SynapseTags, start_active_span
+from synapse.logging.opentracing import (
+    SynapseTags,
+    active_span,
+    start_active_span,
+    start_active_span_follows_from,
+)
 from synapse.metrics import SERVER_NAME_LABEL
 from synapse.metrics._types import Collector
 
@@ -264,15 +271,97 @@ async def run() -> Optional[R]:
 
         with BackgroundProcessLoggingContext(
             name=desc, server_name=server_name, instance_id=count
-        ) as context:
+        ) as logging_context:
             try:
                 if bg_start_span:
-                    ctx = start_active_span(
-                        f"bgproc.{desc}", tags={SynapseTags.REQUEST_ID: str(context)}
-                    )
+                    original_active_tracing_span = active_span()
+
+                    # If there is already an active span (e.g. because this background
+                    # process was started as part of handling a request, for example),
+                    # because this is a long-running background task that may serve a
+                    # broader purpose than the request that kicked it off, we don't want
+                    # it to be a direct child of the currently active trace connected to
+                    # the request. We only want a loose reference to jump between the
+                    # traces.
+                    #
+                    # For example, when making a `/messages` request, when approaching a
+                    # gap, we may kick off a background process to fetch missing events
+                    # from federation. The `/messages` request trace shouldn't include
+                    # the entire time taken and details around fetching the missing
+                    # events since the request doesn't rely on the result; it was just
+                    # part of the heuristic to initiate things.
+                    #
+                    # We don't care about the value from the context manager as it's not
+                    # used (so we just use `Any` for the type). Ideally, we'd be able to
+                    # mark this as unused like an `assert_never` of sorts.
+                    tracing_scope: ContextManager[Any]
+                    if original_active_tracing_span is not None:
+                        # With the OpenTracing client that we're using, it's impossible to
+                        # create a disconnected root span while also providing `references`,
+                        # so we first create a bare root span, then create a child span that
+                        # includes the references that we want.
+                        root_tracing_scope = start_active_span(
+                            f"bgproc.{desc}",
+                            tags={SynapseTags.REQUEST_ID: str(logging_context)},
+                            # Create a root span for the background process (disconnected
+                            # from other spans)
+                            ignore_active_span=True,
+                        )
+
+                        # Also add a span in the original request trace that cross-links
+                        # to the background process trace. We immediately finish the span
+                        # as this is just a marker to follow where the real work is being
+                        # done.
+                        #
+                        # In OpenTracing, `FOLLOWS_FROM` indicates a parent-child
+                        # relationship whereas we just want a cross-link to the
+                        # downstream trace. This is a bit hacky, but the closest we
+                        # can get to in OpenTracing land. If we ever migrate to
+                        # OpenTelemetry, we should use a normal `Link` for this.
+                        with start_active_span_follows_from(
+                            f"start_bgproc.{desc}",
+                            child_of=original_active_tracing_span,
+                            ignore_active_span=True,
+                            # Points to the background process span.
+                            contexts=[root_tracing_scope.span.context],
+                        ):
+                            pass
+
+                        # Then start the tracing scope that we're going to use for
+                        # the duration of the background process within the root
+                        # span we just created.
+                        child_tracing_scope = start_active_span_follows_from(
+                            f"bgproc_child.{desc}",
+                            child_of=root_tracing_scope.span,
+                            ignore_active_span=True,
+                            tags={SynapseTags.REQUEST_ID: str(logging_context)},
+                            # Create the `FOLLOWS_FROM` reference to the request's
+                            # span so there is a loose coupling between the two
+                            # traces and it's easy to jump between.
+                            contexts=[original_active_tracing_span.context],
+                        )
+
+                        # For easy usage down below, we create a context manager that
+                        # combines both scopes.
+                        @contextmanager
+                        def combined_context_manager() -> Generator[None, None, None]:
+                            with root_tracing_scope, child_tracing_scope:
+                                yield
+
+                        tracing_scope = combined_context_manager()
+
+                    else:
+                        # Otherwise, when there is no active span, we will be creating
+                        # a disconnected root span already and we don't have to
+                        # worry about cross-linking to anything.
+                        tracing_scope = start_active_span(
+                            f"bgproc.{desc}",
+                            tags={SynapseTags.REQUEST_ID: str(logging_context)},
+                        )
                 else:
-                    ctx = nullcontext()  # type: ignore[assignment]
-                with ctx:
+                    tracing_scope = nullcontext()
+
+                with tracing_scope:
                     return await func(*args, **kwargs)
             except Exception:
                 logger.exception(
```

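For reference, a minimal standalone sketch of the combine-two-scopes trick used above; the generic `combined(...)` helper is illustrative only and not part of this commit:

```python
from contextlib import contextmanager, nullcontext
from typing import Any, ContextManager, Generator


@contextmanager
def combined(
    outer: ContextManager[Any], inner: ContextManager[Any]
) -> Generator[None, None, None]:
    # Entering this context manager enters `outer` then `inner`; exiting unwinds
    # them in reverse order, exactly like a nested `with outer, inner:` block.
    with outer, inner:
        yield


# One object that can be built now and entered later, mirroring how
# `tracing_scope` above is only entered around the `await func(...)` call.
tracing_scope: ContextManager[Any] = combined(nullcontext(), nullcontext())
with tracing_scope:
    pass
```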
synapse/replication/tcp/redis.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -250,6 +250,13 @@ def send_command(self, cmd: Command) -> None:
             self.server_name,
             self._async_send_command,
             cmd,
+            # We originally started tracing background processes to avoid `There was no
+            # active span` errors but this change meant we started generating 15x the
+            # number of spans we did before (this is one of the most heavily called
+            # instances of `run_as_background_process`).
+            #
+            # Since we don't log or tag a tracing span in the downstream
+            # code, we can safely disable this.
             bg_start_span=False,
         )
 
```

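As the comment above explains, callers on hot paths can opt out of span creation entirely. A minimal sketch of such a call, assuming the same `run_as_background_process(desc, server_name, func, ...)` signature used elsewhere in this commit; the `send_heartbeat` description, `my.server.name`, and `_noop` are hypothetical:

```python
from synapse.metrics.background_process_metrics import run_as_background_process


async def _noop() -> None:
    # Hypothetical frequently-fired work that neither logs to nor tags a tracing span.
    pass


# `bg_start_span=False` skips creating a `bgproc.*` span for this invocation.
run_as_background_process("send_heartbeat", "my.server.name", _noop, bg_start_span=False)
```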