add a stall detector that logs stacktraces of unyielding tasks, redux #499

Merged: 26 commits into DataDog:master, Jan 26, 2022

Conversation

davidblewett (Member)

What does this PR do?

This PR supersedes #417. I believe all the concerns raised there are addressed in this implementation.

This PR adds a rudimentary stall detection mechanism to Glommio.

It measures the runtime of task queues and logs if they don't yield when
they are due.

Knowing if a queue is stalling the reactor is nice, but finding the
exact code location where the stall occurs remains very hard, especially
if a task queue hosts many concurrent fibers.

To help with that, this PR introduces a stall detection mechanism that
records stack traces of stalling tasks. It works as follows:

  • When a task queue is scheduled for execution, we set up a timer that
    triggers some time after the queue is expected to yield (we add a 10%
    error margin to avoid false positives). A background thread collocated
    with the local executor waits on the timer at all times.
  • When the timer fires, the thread sends a signal (SIGUSR1) to the
    local executor thread. Upon receiving the signal, the local executor
    records a complete trace of the local stack. Here we take advantage of
    the fact that, by default, the kernel invokes signal handlers on top of
    the existing stack, i.e. the frames we record are those of the
    problematic user code that was meant to yield. The recorded frames are
    pushed onto a non-blocking communication channel that links the signal
    handler and the local executor (see the sketch after this list).
  • When a task queue yields, the local executor disarms the timer and
    checks the communication channel for recorded frames; if there are
    any, we conclude that the queue stalled and log them.
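
For illustration, here is a minimal, hedged sketch of the signal-handler side of this mechanism. It is not the code in this PR: it assumes the `libc` and `backtrace` crates, records raw instruction pointers into a fixed, pre-allocated buffer so the handler stays close to async-signal-safe, and leaves symbol resolution and logging to the executor thread after the queue yields.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_FRAMES: usize = 64;
const ZERO: AtomicUsize = AtomicUsize::new(0);

// Pre-allocated storage the handler writes into; the executor reads and
// symbolizes it after the stalled queue finally yields.
static FRAMES: [AtomicUsize; MAX_FRAMES] = [ZERO; MAX_FRAMES];
static FRAME_COUNT: AtomicUsize = AtomicUsize::new(0);

extern "C" fn on_sigusr1(_sig: libc::c_int) {
    // The kernel runs this handler on top of the interrupted stack, so the
    // frames walked here belong to the user code that failed to yield.
    let mut n = 0;
    unsafe {
        backtrace::trace_unsynchronized(|frame| {
            if n < MAX_FRAMES {
                FRAMES[n].store(frame.ip() as usize, Ordering::Relaxed);
                n += 1;
                true // keep walking
            } else {
                false // buffer full, stop
            }
        });
    }
    FRAME_COUNT.store(n, Ordering::Release);
}

fn install_stall_handler() {
    unsafe {
        // Register the handler for SIGUSR1; the background timer thread
        // sends this signal to the executor thread when the deadline passes.
        libc::signal(
            libc::SIGUSR1,
            on_sigusr1 as extern "C" fn(libc::c_int) as libc::sighandler_t,
        );
    }
}
```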

This code works in practice but has two major drawbacks:

  • The timer dance is expensive; expect a high number of syscalls.
    Because of this runtime overhead, the stall detector is disabled by
    default. To opt in, you must pass in something that implements
    StallDetectionHandler when constructing the LocalExecutor. There is a
    concrete DefaultStallDetectionHandler that uses SIGUSR1 and logs a
    warning message (see the usage sketch after this list).
  • We log stalls only after the queue yields. Therefore, if there is a bug
    in your code and your queue never yields, the stall detector will never
    log the code location that's at fault (even though we probably have
    recorded the trace by then). The reason for this is that logging from a
    signal handler is illegal.
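
A hedged usage sketch follows. The builder method name and exact signature shown here (`detect_stalls` taking an optional boxed handler) are assumptions based on the description above, not text quoted from this PR; the point is simply that opting in means handing a `StallDetectionHandler` implementation to the executor builder.

```rust
use glommio::{DefaultStallDetectionHandler, LocalExecutorBuilder, Placement};

fn main() {
    let ex = LocalExecutorBuilder::new(Placement::Unbound)
        // Opt into stall detection (disabled by default). The default
        // handler uses SIGUSR1 and logs a warning with the captured stack
        // trace once the stalled queue yields.
        .detect_stalls(Some(Box::new(DefaultStallDetectionHandler {})))
        .make()
        .expect("failed to build LocalExecutor");

    ex.run(async {
        // Any task queue here that runs past its expected yield point (plus
        // the error margin) gets its stack trace recorded and logged.
    });
}
```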

Glauber Costa and others added 8 commits January 11, 2022 11:48
We have a rudimentary stall detector that triggers if a queue doesn't
yield within 5ms of being scheduled for execution. Using the preempt
timer duration instead will limit the false positives we log (if the
user specifies a preempt timer of 100ms, then they are fine with a task
queue running for that much time).
Knowing if a queue is stalling the reactor is nice, but finding the
exact code location where the stall occurs remains very hard, especially
if a task queue hosts many concurrent fibers.

To help with that, this commit introduces a stall detection mechanism
that records stack traces of stalling tasks. It works as
follows:
* When a task queue is scheduled for execution, we set up a timer that
triggers some time after the queue is expected to yield (we add a 10%
error margin to avoid false positives). A background thread collocated
with the local executor waits on the timer at all times.
* When the timer fires, the thread sends a signal (SIGUSR1) to the local
executor thread. Upon receiving the signal, the local executor records a
complete trace of the local stack. Here we take advantage of the fact
that by default, the kernel invokes signal handlers on top of the
existing stack. i.e. the frames we record are those of the problematic
user code that was meant to yield. The recorded frames are pushed on a
non-blocking communication channel that links the signal handler and the
local executor.
* When a task queue yields, the local executor disarms the timer and
checks the communication channel for recorded frames; if there are any,
we conclude that the queue stalled and log them.

This code works in practice but has two major drawbacks:
* The timer dance is expensive; expect a high number of syscalls. Because
of this runtime overhead, the stall detector is disabled by default. To
opt in, the feature `stall-detection` must be enabled at compile time.
* We log stalls only after the queue yields. Therefore, if there is a bug
in your code and your queue never yields, the stall detector will never
log the code location that's at fault (even though we probably have
recorded the trace by then). The reason for this is that logging from a
signal handler is illegal.
@glommer (Collaborator) commented Jan 14, 2022

Generally good, minor comments only. The most important one is the heuristics about when to trigger. I think it is impossible that we'll get this right at the framework level, and the user has to tell us when to fire as they turn this on.

(also CI is not passing)

@HippoBaro (Member) left a comment

Some nits.

My one concern is that now that I am reading the code, I realize why I used TLS for the channels in my original PR. The reason is that signal handlers are global entities. Therefore I see two issues with this version of the code:

  • First, you install a signal handler every time an executor is created. So either (1) it is replaced each time or (2) the library does something clever underneath. Let's make sure it's the former.
  • Second, only a single executor will ever be signaled (the last one), and it will forward the traces to itself (because the handler captures the channels to send to). The only way for this to work is for the signal handler to fetch the channel to send to using TLS. Fortunately, the executor is hosted using TLS, so this should be trivial (Look at LOCAL_EX).

I think the best way to install the signal handler is to use some global entity with lazy initialization (you need it to be lazy since we can configure what signal to use at run time); see the sketch after this comment.

One last thing... What happens if two executors specify different signals? That's a problem. I say panic in that case. If they are doing that, they were looking for trouble.
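
To illustrate the suggestion, here is a hedged sketch (not the code in this PR): the handler is installed lazily exactly once, and it locates the per-thread buffer through TLS, mirroring how the executor itself lives in thread-local storage. The names `STALL_FRAMES` and `install_global_handler` are hypothetical.

```rust
use std::cell::RefCell;
use std::sync::Once;

thread_local! {
    // One buffer per executor thread, mirroring how the executor itself is
    // hosted in TLS (LOCAL_EX in Glommio).
    static STALL_FRAMES: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

static INSTALL: Once = Once::new();

extern "C" fn on_stall_signal(_sig: libc::c_int) {
    // The handler is process-global, but the lookup goes through TLS, so
    // whichever executor thread receives the signal records into its own
    // buffer. (A real handler must also stick to async-signal-safe work.)
    STALL_FRAMES.with(|frames| unsafe {
        backtrace::trace_unsynchronized(|frame| {
            frames.borrow_mut().push(frame.ip() as usize);
            true
        });
    });
}

fn install_global_handler(signum: libc::c_int) {
    // Lazy, one-time installation: later executors reuse the same handler
    // rather than replacing it. A real implementation would also have to
    // detect a mismatched `signum` and panic, as suggested above.
    INSTALL.call_once(|| unsafe {
        libc::signal(
            signum,
            on_stall_signal as extern "C" fn(libc::c_int) as libc::sighandler_t,
        );
    });
}
```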

Review comments on: clippy.toml, glommio/Cargo.toml, glommio/src/executor/mod.rs, glommio/src/executor/stall.rs (resolved)
@HippoBaro (Member) left a comment

Some very final comments, but overall this is looking really good.

Review comments on: glommio/src/executor/stall.rs (resolved)
@glommer (Collaborator) left a comment

Very minor comments at this point.
In the trait vs tunables question, I prefer the traits and you seem to as well, so let's do that.

Let me know what you think about my argument against the percentage-based threshold.
I think we should provide a concrete class with, say, 10ms flat default and then allow overriding through traits.

@davidblewett (Member, Author)

Let me know what you think about my argument against the percentage-based threshold. I think we should provide a concrete class with, say, 10ms flat default and then allow overriding through traits.

This seems reasonable to me. It's very easy to implement the exact behavior you want if this doesn't fit your use case.
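
As a rough illustration of that flexibility, here is a hypothetical sketch (the trait and names below are not Glommio's actual API): a trait supplies the stall threshold, the provided default returns a flat 10 ms, and a user who prefers the percentage-based behavior overrides the one method.

```rust
use std::time::Duration;

trait StallThreshold {
    // How far past its expected yield point a queue may run before we
    // consider it stalled and capture a trace.
    fn threshold(&self, expected_runtime: Duration) -> Duration {
        let _ = expected_runtime;
        Duration::from_millis(10) // flat 10ms default, as suggested
    }
}

// The concrete default: just uses the provided 10ms.
struct DefaultThreshold;
impl StallThreshold for DefaultThreshold {}

// A user who wants the original percentage-based margin overrides it.
struct ProportionalThreshold;
impl StallThreshold for ProportionalThreshold {
    fn threshold(&self, expected_runtime: Duration) -> Duration {
        expected_runtime / 10 // 10% of the expected runtime
    }
}
```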

@davidblewett (Member, Author) commented Jan 24, 2022

The test failures are due to newer lints in Rust 1.58.0 that seem unrelated to this PR. FWIW, the tests pass on 1.58.1.

@github-actions (bot)

Greetings @davidblewett!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
This is likely an indication that one of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow.

Thank you!

@HippoBaro (Member) left a comment

Almost there!!

Review comments on: glommio/src/executor/mod.rs, glommio/src/sys/uring.rs (resolved)
need to write tests to validate enabling/disabling at runtime.
@HippoBaro merged commit fe33e30 into DataDog:master on Jan 26, 2022