Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Be able to correlate timeouts in reverse-proxy layer in front of Synapse (pull request ID from header) #13801

Prev Previous commit
Next Next commit
Fix variable reference
  • Loading branch information
MadLittleMods committed Sep 13, 2022
commit f6fb0c8e4f209eec02f4855548c0592a47f8b011
3 changes: 2 additions & 1 deletion synapse/http/site.py
Original file line number Diff line number Diff line change
Expand Up @@ -174,10 +174,11 @@ def set_opentracing_span(self, span: "opentracing.Span") -> None:
self._opentracing_span = span

def get_request_id(self) -> str:
request_id_value = None
if self.request_id_header:
request_id_value = self.getHeader(self.request_id_header)

if not request_id_value:
if request_id_value is None:
request_id_value = str(self.request_seq)

return "%s-%s" % (self.get_method(), request_id_value)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't use the simple sequential ID when this config is defined, it might harder to follow each request in the logs (have to remember a random string like GET-74a2179568cf26ff-MSP vs an incrementing integer, GET-17). But when the sequential number is sufficiently big, it's hard to remember as well.

In tests, we still use the same sequential request ID's since this config won't be defined there. We're not changing any default either so it's basically opt-in but the plan is to define this on matrix.org so we can correlate CloudFlare timeouts with the trace in Synapse.

Is there any strong preference for keeping the GET-17, POST-541 , etc? Any real world usefulness from debugging a big Synapse instance?

Copy link
Member

@turt2live turt2live Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's often easier in my experience to know the ordering of requests from their number rather than timestamp, given punctuation can obscure the problem: 2022-09-13 12:15:08 and 2022-09-13 12:16:09 look very similar in a pile of logs, but they're actually a full minute apart. The request ID would be able to show that there was a thousand requests in the meantime, for example.

Not that it's a super strong argument though - would just be sad to see it go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I usually just copy a one of those and grep the logs to find matching log lines, I'm not sure there's much gained by them being incrementing.

As you mention this is also "opt-in", but I find it much simpler to have a single request ID you can use across all logs related to a request (e.g. this would let you pivot across logs + jaeger easier than #13801).

Copy link
Contributor

@squahtx squahtx Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing the discussion in #synapse-devs:

  • reivilibre points out that the Cloudflare ray IDs are longer and more unwieldy than the current IDs. They'll take a bit more memory, make log lines longer and they may be harder to juggle as a human reading the logs.
  • babolivier notes that Cloudflare ray IDs are more unique than the current incrementing IDs. Right now, if you grep through logs for a low request number, you'll get unrelated requests if the homeserver has been restarting a lot or if you're grepping through logs for multiple workers.
  • There wasn't any objection from the Synapse team to losing the incrementing property of the current IDs (apart from turt2live above).

Overall, no strong objections.

Expand Down