Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Be able to correlate timeouts in reverse-proxy layer in front of Synapse (pull request ID from header) #13801

1 change: 1 addition & 0 deletions changelog.d/13801.feature
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add `listeners[x].request_id_header` config to specify which request header to extract and use as the request ID in order to correlate timeouts in reverse-proxy layers in front of Synapse when tracing.
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
8 changes: 8 additions & 0 deletions docs/usage/configuration/config_documentation.md
Original file line number Diff line number Diff line change
Expand Up @@ -436,6 +436,14 @@ Sub-options for each listener include:
* `x_forwarded`: Only valid for an 'http' listener. Set to true to use the X-Forwarded-For header as the client IP. Useful when Synapse is
behind a reverse-proxy.

* `request_id_header`: The header extracted from each incoming request that is
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
used as the basis for the request ID. The request ID used in logs and tracing to
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
correlate and match up requests. When unset, Synapse will generate automatic
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
sequential request IDs. This option is useful when Synapse is behind a
reverse-proxy. For example, if you use Cloudflare in front of Synapse, you
can set this as `"cf-ray"` to match up requests even when the Cloudflare
layer times out before Synapse is done.
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved

* `resources`: Only valid for an 'http' listener. A list of resources to host
on this port. Sub-options for each resource are:

Expand Down
4 changes: 3 additions & 1 deletion synapse/config/server.py
Original file line number Diff line number Diff line change
Expand Up @@ -206,6 +206,7 @@ class HttpListenerConfig:
resources: List[HttpResourceConfig] = attr.Factory(list)
additional_resources: Dict[str, dict] = attr.Factory(dict)
tag: Optional[str] = None
request_id_header: Optional[str] = None


@attr.s(slots=True, frozen=True, auto_attribs=True)
Expand Down Expand Up @@ -887,7 +888,7 @@ def read_gc_thresholds(
)


def parse_listener_def(num: int, listener: Any) -> ListenerConfig:
def parse_listener_def(num: int, listener: JsonDict) -> ListenerConfig:
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
"""parse a listener config from the config file"""
listener_type = listener["type"]
# Raise a helpful error if direct TCP replication is still configured.
Expand Down Expand Up @@ -928,6 +929,7 @@ def parse_listener_def(num: int, listener: Any) -> ListenerConfig:
resources=resources,
additional_resources=listener.get("additional_resources", {}),
tag=listener.get("tag"),
request_id_header=listener.get("request_id_header"),
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
)

return ListenerConfig(port, bind_addresses, listener_type, tls, http_config)
Expand Down
14 changes: 13 additions & 1 deletion synapse/http/site.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,10 +72,12 @@ def __init__(
site: "SynapseSite",
*args: Any,
max_request_body_size: int = 1024,
request_id_header: Optional[str] = None,
**kw: Any,
):
super().__init__(channel, *args, **kw)
self._max_request_body_size = max_request_body_size
self.request_id_header = request_id_header
self.synapse_site = site
self.reactor = site.reactor
self._channel = channel # this is used by the tests
Expand Down Expand Up @@ -172,7 +174,14 @@ def set_opentracing_span(self, span: "opentracing.Span") -> None:
self._opentracing_span = span

def get_request_id(self) -> str:
return "%s-%i" % (self.get_method(), self.request_seq)
request_id_value = None
if self.request_id_header:
request_id_value = self.getHeader(self.request_id_header)

if request_id_value is None:
request_id_value = str(self.request_seq)

return "%s-%s" % (self.get_method(), request_id_value)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't use the simple sequential ID when this config is defined, it might harder to follow each request in the logs (have to remember a random string like GET-74a2179568cf26ff-MSP vs an incrementing integer, GET-17). But when the sequential number is sufficiently big, it's hard to remember as well.

In tests, we still use the same sequential request ID's since this config won't be defined there. We're not changing any default either so it's basically opt-in but the plan is to define this on matrix.org so we can correlate CloudFlare timeouts with the trace in Synapse.

Is there any strong preference for keeping the GET-17, POST-541 , etc? Any real world usefulness from debugging a big Synapse instance?

Copy link
Member

@turt2live turt2live Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's often easier in my experience to know the ordering of requests from their number rather than timestamp, given punctuation can obscure the problem: 2022-09-13 12:15:08 and 2022-09-13 12:16:09 look very similar in a pile of logs, but they're actually a full minute apart. The request ID would be able to show that there was a thousand requests in the meantime, for example.

Not that it's a super strong argument though - would just be sad to see it go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I usually just copy a one of those and grep the logs to find matching log lines, I'm not sure there's much gained by them being incrementing.

As you mention this is also "opt-in", but I find it much simpler to have a single request ID you can use across all logs related to a request (e.g. this would let you pivot across logs + jaeger easier than #13801).

Copy link
Contributor

@squahtx squahtx Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing the discussion in #synapse-devs:

  • reivilibre points out that the Cloudflare ray IDs are longer and more unwieldy than the current IDs. They'll take a bit more memory, make log lines longer and they may be harder to juggle as a human reading the logs.
  • babolivier notes that Cloudflare ray IDs are more unique than the current incrementing IDs. Right now, if you grep through logs for a low request number, you'll get unrelated requests if the homeserver has been restarting a lot or if you're grepping through logs for multiple workers.
  • There wasn't any objection from the Synapse team to losing the incrementing property of the current IDs (apart from turt2live above).

Overall, no strong objections.


def get_redacted_uri(self) -> str:
"""Gets the redacted URI associated with the request (or placeholder if the URI
Expand Down Expand Up @@ -611,12 +620,15 @@ def __init__(
proxied = config.http_options.x_forwarded
request_class = XForwardedForRequest if proxied else SynapseRequest

request_id_header = config.http_options.request_id_header

def request_factory(channel: HTTPChannel, queued: bool) -> Request:
return request_class(
channel,
self,
max_request_body_size=max_request_body_size,
queued=queued,
request_id_header=request_id_header,
)

self.requestFactory = request_factory # type: ignore
Expand Down