
add a way to measure and report IO latency #476

Merged
HippoBaro merged 5 commits into DataDog:master from the latencies branch on Dec 27, 2021

Conversation

HippoBaro
Member

Add the plumbing necessary to measure various kinds of IO latency.

We now keep track of two kinds of IO latency:

  • The time sources spend in the ring, including the submission queue;
  • The post-reactor scheduler latency, i.e., the time the scheduler takes to
    come back to the tasks that consume the IO.

We now export them as distributions in the IoStats struct, and this PR
adds distribution support to Glommio to make that possible. We use
"sketches" that give a statistical approximation of quantiles because we
want the overhead (space and time) of recording latencies to be low.

A side effect of this is that pulling the stats from the reactor now
clears them (because a distribution needs to be cleared regularly to be
helpful). I don't expect this to be problematic because, in practice,
users want rates from the stats and already keep previous values in
memory to compute deltas. This should, therefore, greatly simplify that
logic.
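As a rough sketch of how a consumer might rely on this clear-on-pull behavior: because each pull already covers only the interval since the previous pull, there is no delta bookkeeping left to do. The `pull_io_stats()` helper and the quantile field names below are placeholders for illustration, not the actual IoStats API.

```rust
use std::{thread, time::Duration};

// Placeholder for the quantiles a caller might extract from the IoStats
// distributions; the real struct and field names are not shown in this PR text.
struct IoLatencySnapshot {
    ring_latency_p99_us: f64,
    post_reactor_delay_p99_us: f64,
}

// Stand-in for pulling IoStats from the reactor. In the scheme described in
// this PR, pulling clears the underlying sketches, so each snapshot describes
// only the interval since the previous pull.
fn pull_io_stats() -> IoLatencySnapshot {
    IoLatencySnapshot {
        ring_latency_p99_us: 0.0,
        post_reactor_delay_p99_us: 0.0,
    }
}

fn main() {
    loop {
        thread::sleep(Duration::from_secs(1));
        let snap = pull_io_stats();
        // No need to remember the previous snapshot and compute deltas:
        // the reactor already reset its distributions on the last pull.
        println!(
            "p99 ring latency: {:.0}us, p99 post-reactor delay: {:.0}us",
            snap.ring_latency_p99_us, snap.post_reactor_delay_p99_us
        );
    }
}
```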

@glommer
Collaborator

glommer commented Dec 24, 2021

Is there any way we can make this something the user enables with a flag?
When I was playing with the stall detector, I made it so that it would be enabled/disabled dynamically.

For this, we should at least do it at executor creation time, but not unconditionally. Measuring time can be incredibly expensive.

The rest of the code looks ok and ready to go.

@HippoBaro
Member Author

> Is there any way we can make this something the user enables with a flag? When I was playing with the stall detector, I made it so that it would be enabled/disabled dynamically.
>
> For this, we should at least do it at executor creation time, but not unconditionally. Measuring time can be incredibly expensive.
>
> The rest of the code looks ok and ready to go.

Fair point, I'll add an executor config to disable all that stuff!
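For illustration, here is a minimal sketch of what such an executor-level knob could look like. The `record_io_latencies(true)` call is an assumption made for this example; the builder method this PR actually adds may be named differently, and the surrounding code follows the usual LocalExecutorBuilder flow.

```rust
use glommio::LocalExecutorBuilder;

fn main() {
    // Hypothetical opt-in flag: latency recording stays off unless the user
    // asks for it at executor creation time, since measuring time is costly.
    // The method name `record_io_latencies` is assumed for this sketch.
    let ex = LocalExecutorBuilder::new()
        .record_io_latencies(true)
        .make()
        .expect("failed to create executor");

    ex.run(async {
        // Issue IO here; with recording enabled, the reactor tracks ring time
        // and post-reactor scheduler delay in its latency sketches.
    });
}
```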

Add the plumbing necessary to measure the post-reactor scheduler delay,
i.e., the time it takes for the scheduler to invoke the task that
consumes the result of a fulfilled source. I suspect a bug in Glommio
creates very high latency spikes; this commit is part of my effort to
find it.

There are three kinds of latency measurements I would like Glommio to
have:
* Pre-reactor delay: the delay between the moment an IO is scheduled and
  the moment it enters the ring;
* IO latency: the time a source spends in the kernel;
* Post-reactor delay: the delay this commit measures, as explained
  above.

Measure the time a source spends in the ring, including the submission
queue.

We now export IO and scheduler latencies as distributions in the IO
stats. This commit adds distribution support to Glommio. We use
"sketches" that give a statistical approximation of quantiles because we
want the overhead (space and time) of recording latencies to be low.

A side effect of this is that pulling the stats from the reactor now
clears them (because a distribution needs to be cleared regularly to be
useful). I don't expect this to be problematic because, in practice,
users want rates from the stats and already keep previous values in
memory to compute deltas. This should, therefore, greatly simplify that
logic.

Not strictly necessary right now, but consistency with the IO stats is
desirable.

Recording latency can be expensive, so gate this feature behind a config
knob that's disabled by default.
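To show why a sketch keeps recording cheap, here is a small standalone example of approximating latency quantiles with a DDSketch-style structure, using the `sketches-ddsketch` crate. Whether this PR uses that exact crate or configuration is not stated in this thread, so treat this as an illustration of the technique rather than Glommio's internals.

```rust
use sketches_ddsketch::{Config, DDSketch};

fn main() {
    // A sketch with the crate's default relative-accuracy configuration.
    // Space usage stays bounded regardless of how many samples we record,
    // and recording a sample is a cheap, constant-time operation.
    let mut latencies = DDSketch::new(Config::defaults());

    // Record some fake IO latencies, in microseconds.
    for us in [120.0, 95.0, 430.0, 88.0, 2050.0, 101.0, 99.0, 87.0] {
        latencies.add(us);
    }

    // Quantiles are approximate (within the configured relative error)
    // but cheap to compute and to store.
    let p50 = latencies.quantile(0.50).unwrap();
    let p99 = latencies.quantile(0.99).unwrap();
    println!("p50 ≈ {:?}us, p99 ≈ {:?}us", p50, p99);

    // "Clearing" after a pull amounts to starting a fresh sketch, which is
    // why each snapshot only describes the interval since the previous pull.
    latencies = DDSketch::new(Config::defaults());
    let _ = latencies;
}
```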
HippoBaro merged commit 5c483ee into DataDog:master on Dec 27, 2021
HippoBaro deleted the latencies branch on December 27, 2021 at 03:22
@github-actions

Greetings @HippoBaro!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
This is likely an indication that one of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow.

Thank you!
