Add a Histogram of Transport Worker Time that is Spent per-Message #80428

Closed
@original-brownbear

Description

We'd like to add information to TransportStats (and thus the node stats) that gives us insight into the performance of transport threads and whether or not they might get blocked by a heavy task (like deserializing a large message) for too long.
This should be easy to build by tracking the existing timings recorded by slow-logging in InboundHandler and OutboundHandler. We do not need a very sophisticated histogram here. We effectively only care about recording the number of problematic messages and a rough idea of how long they take, so we can make do with a couple of fixed buckets to count timings into.
I would suggest we record the following separately for requests and responses (just powers of 2):
<2ms, <4ms, <8ms, <16ms and so on up to <65536ms (and one more for everything longer than that). This gives us a good measure of how much time we spend on the transport thread for the trivial cost of 17 counters times two.
We can then add those numbers to the TransportStats message and its serialization into node stats.
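
A minimal sketch of the bucketing described above, to make the counter layout concrete. The class name `HandlingTimeHistogram` and its methods are hypothetical and are not taken from the Elasticsearch code base. The sketch assumes timings arrive as whole milliseconds and that one instance would be kept for requests and another for responses.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only; not the actual Elasticsearch implementation.
final class HandlingTimeHistogram {
    // 16 buckets for <2ms, <4ms, ..., <65536ms, plus one overflow bucket.
    static final int BUCKET_COUNT = 17;

    // LongAdder keeps increments cheap when many transport threads record concurrently.
    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    HandlingTimeHistogram() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    // Maps a duration to its bucket: 0 for <2ms, 1 for <4ms, ..., 15 for <65536ms, 16 otherwise.
    static int bucketIndex(long millis) {
        if (millis < 2) {
            return 0; // also covers zero and negative durations
        }
        // floor(log2(millis)) for millis >= 2
        int log2 = 63 - Long.numberOfLeadingZeros(millis);
        return Math.min(log2, BUCKET_COUNT - 1);
    }

    // Counts one message whose handling took the given number of milliseconds.
    void record(long millis) {
        buckets[bucketIndex(millis)].increment();
    }

    // Returns a point-in-time copy of the counters, e.g. for serialization into stats.
    long[] snapshot() {
        long[] counts = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            counts[i] = buckets[i].sum();
        }
        return counts;
    }
}
```

With this layout, a 3ms message lands in the `<4ms` bucket (index 1), and anything at or above 65536ms lands in the overflow bucket (index 16).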

We mainly need this for benchmarking in #77466 but this should be quite useful in debugging as well.

relates and asks for part of #36127
relates #77466
