
add a way to measure and report IO latency #476

Merged
HippoBaro merged 5 commits into DataDog:master from the latencies branch on Dec 27, 2021

Conversation

HippoBaro
Member

Add the plumbing necessary to measure various kinds of IO latency.

We now keep track of two kinds of IO latency:

  • The time sources spend in the ring, including the submission queue;
  • The post-reactor scheduler latency, i.e., the time the scheduler takes to
    come back to the tasks that consume the IO.

We now export them as distributions in the IoStats struct, and this PR
adds distribution support to Glommio to make that possible. We use
"sketches" that give a statistical approximation of quantiles because we
want the overhead (space and time) of recording latencies to be low.

A side effect of this is that pulling the stats from the reactor now
clears them (because a distribution needs to be cleared regularly to be
helpful). I don't expect this to be problematic because, in practice,
users want rates from the stats and already keep previous values in
memory to compute deltas. This should, therefore, greatly simplify that
logic.
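As a rough sketch of how a consumer might rely on this clear-on-pull behavior: because each pull already covers only the interval since the previous pull, there is no delta bookkeeping left to do. The `pull_io_stats()` helper and the quantile field names below are placeholders for illustration, not the actual IoStats API.

```rust
use std::{thread, time::Duration};

// Placeholder for the quantiles a caller might extract from the IoStats
// distributions; the real struct and field names are not shown in this PR text.
struct IoLatencySnapshot {
    ring_latency_p99_us: f64,
    post_reactor_delay_p99_us: f64,
}

// Stand-in for pulling IoStats from the reactor. In the scheme described in
// this PR, pulling clears the underlying sketches, so each snapshot describes
// only the interval since the previous pull.
fn pull_io_stats() -> IoLatencySnapshot {
    IoLatencySnapshot {
        ring_latency_p99_us: 0.0,
        post_reactor_delay_p99_us: 0.0,
    }
}

fn main() {
    loop {
        thread::sleep(Duration::from_secs(1));
        let snap = pull_io_stats();
        // No need to remember the previous snapshot and compute deltas:
        // the reactor already reset its distributions on the last pull.
        println!(
            "p99 ring latency: {:.0}us, p99 post-reactor delay: {:.0}us",
            snap.ring_latency_p99_us, snap.post_reactor_delay_p99_us
        );
    }
}
```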

@glommer
Collaborator

glommer commented Dec 24, 2021

Is there any way we can make this something the user enables with a flag?
When I was playing with the stall detector, I made it so that it would be enabled/disabled dynamically.

For this, we should at least do it at executor creation time, but not unconditionally. Measuring time can be incredibly expensive.

The rest of the code looks ok and ready to go.

@HippoBaro
Member Author

> Is there any way we can make this something the user enables with a flag? When I was playing with the stall detector, I made it so that it would be enabled/disabled dynamically.
>
> For this, we should at least do it at executor creation time, but not unconditionally. Measuring time can be incredibly expensive.
>
> The rest of the code looks ok and ready to go.

Fair point, I'll add an executor config to disable all that stuff!
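For illustration, here is a minimal sketch of what such an executor-level knob could look like. The `record_io_latencies(true)` call is an assumption made for this example; the builder method this PR actually adds may be named differently, and the surrounding code follows the usual LocalExecutorBuilder flow.

```rust
use glommio::LocalExecutorBuilder;

fn main() {
    // Hypothetical opt-in flag: latency recording stays off unless the user
    // asks for it at executor creation time, since measuring time is costly.
    // The method name `record_io_latencies` is assumed for this sketch.
    let ex = LocalExecutorBuilder::new()
        .record_io_latencies(true)
        .make()
        .expect("failed to create executor");

    ex.run(async {
        // Issue IO here; with recording enabled, the reactor tracks ring time
        // and post-reactor scheduler delay in its latency sketches.
    });
}
```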

Add the plumbing necessary to measure the post-reactor scheduler delay,
i.e., the time it takes for the scheduler to invoke the task that
consumes the result of a fulfilled source. I suspect a bug in Glommio
creates very high latency spikes; this commit is part of my effort to
find it.

There are three kinds of latency measurements I would like Glommio to
have:
* Pre-reactor delay: the delay between the moment an IO is scheduled and
  the moment it enters the ring;
* IO latency: the time a source spends in the kernel;
* Post-reactor delay: the delay this commit measures, as explained
  above.

Measure the time a source spends in the ring, including the submission
queue.

We now export IO and scheduler latencies as distributions in the IO
stats. This commit adds distribution support to Glommio. We use
"sketches" that give a statistical approximation of quantiles because we
want the overhead (space and time) of recording latencies to be low.

A side effect of this is that pulling the stats from the reactor now
clears them (because a distribution needs to be cleared regularly to be
useful). I don't expect this to be problematic because, in practice,
users want rates from the stats and already keep previous values in
memory to compute deltas. This should, therefore, greatly simplify that
logic.

Not strictly necessary right now, but consistency with the IO stats is
desirable.

Recording latency can be expensive, so gate this feature behind a config
knob that's disabled by default.
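To show why a sketch keeps recording cheap, here is a small standalone example of approximating latency quantiles with a DDSketch-style structure, using the `sketches-ddsketch` crate. Whether this PR uses that exact crate or configuration is not stated in this thread, so treat this as an illustration of the technique rather than Glommio's internals.

```rust
use sketches_ddsketch::{Config, DDSketch};

fn main() {
    // A sketch with the crate's default relative-accuracy configuration.
    // Space usage stays bounded regardless of how many samples we record,
    // and recording a sample is a cheap, constant-time operation.
    let mut latencies = DDSketch::new(Config::defaults());

    // Record some fake IO latencies, in microseconds.
    for us in [120.0, 95.0, 430.0, 88.0, 2050.0, 101.0, 99.0, 87.0] {
        latencies.add(us);
    }

    // Quantiles are approximate (within the configured relative error)
    // but cheap to compute and to store.
    let p50 = latencies.quantile(0.50).unwrap();
    let p99 = latencies.quantile(0.99).unwrap();
    println!("p50 ≈ {:?}us, p99 ≈ {:?}us", p50, p99);

    // "Clearing" after a pull amounts to starting a fresh sketch, which is
    // why each snapshot only describes the interval since the previous pull.
    latencies = DDSketch::new(Config::defaults());
    let _ = latencies;
}
```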
HippoBaro merged commit 5c483ee into DataDog:master on Dec 27, 2021
HippoBaro deleted the latencies branch on December 27, 2021 at 03:22
@github-actions

Greetings @HippoBaro!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
This is likely an indication that one of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow.

Thank you!
