Description
We'd like to add information to TransportStats (and thus the node stats) that gives us insight into the performance of transport threads and whether or not they might get blocked by a heavy task (like deserializing a large message) for too long.
This should be easy to build by tracking the existing timings recorded by slow-logging in InboundHandler and OutboundHandler. We do not need a very sophisticated histogram here. We effectively only care about recording the number of problematic messages and a rough idea of how long they take, so we can make do with a couple of fixed buckets to count timings into.
I would suggest we record the following separately for requests and responses (just powers of 2): <2ms, <4ms, <8ms, <16ms and so on up to <65536ms (and one more for everything longer than that). This gives us a good measure of how much time we spend on the transport thread for the trivial cost of 17 counters times two.
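To make the bucket layout concrete, here is a minimal sketch of how such a fixed-bucket counter could look. The class name HandlingTimeHistogram and its methods are hypothetical and not existing Elasticsearch code; the sketch only assumes that the handling time is already available in milliseconds from the slow-logging path.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not an existing Elasticsearch class.
final class HandlingTimeHistogram {
    // 16 buckets for <2ms, <4ms, ..., <65536ms plus one overflow bucket.
    static final int BUCKET_COUNT = 17;

    // LongAdder keeps increments cheap when many transport threads record concurrently.
    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    HandlingTimeHistogram() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    // Bucket i counts timings below 2^(i+1) ms; the last bucket takes everything else.
    static int bucketIndex(long millis) {
        if (millis < 2) {
            return 0; // also covers 0 and any negative value
        }
        // floor(log2(millis)) for millis >= 2
        int log2 = 63 - Long.numberOfLeadingZeros(millis);
        return Math.min(log2, BUCKET_COUNT - 1);
    }

    void record(long millis) {
        buckets[bucketIndex(millis)].increment();
    }

    // Point-in-time copy of the counters, e.g. for building a stats message.
    long[] snapshot() {
        long[] counts = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            counts[i] = buckets[i].sum();
        }
        return counts;
    }
}
```

Under this sketch, one instance would be kept for requests and one for responses, which gives the 17 counters times two mentioned above.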
We can then add those numbers to the TransportStats message and its serialization into node stats.
We mainly need this for benchmarking in #77466, but this should be quite useful in debugging as well.