add a stall detector that logs stacktraces of unyielding tasks, redux #499

Merged: 26 commits into DataDog:master, Jan 26, 2022

Conversation

davidblewett (Member)

What does this PR do?

This PR supersedes #417. I believe all the concerns raised there are addressed in this implementation.

This PR adds a rudimentary stall detection mechanism to Glommio.

It measures the runtime of task queues and logs if they don't yield when
they are due.

Knowing if a queue is stalling the reactor is nice, but finding the
exact code location where the stall occurs remains very hard, especially
if a task queue hosts many concurrent fibers.

To help with that, this PR introduces a stall detection mechanism that
records stack traces of stalling tasks. It works as follows:

  • When a task queue is scheduled for execution, we set up a timer that
    triggers some time after the queue is expected to yield (we add a 10%
    error margin to avoid false positives). A background thread collocated
    with the local executor waits on the timer at all times.
  • When the timer fires, the thread sends a signal (SIGUSR1) to the
    local executor thread. Upon receiving the signal, the local executor
    records a complete trace of the local stack. Here we take advantage of
    the fact that, by default, the kernel invokes signal handlers on top of
    the existing stack, i.e. the frames we record are those of the
    problematic user code that was meant to yield. The recorded frames are
    pushed onto a non-blocking communication channel that links the signal
    handler and the local executor (see the sketch after this list).
  • When a task queue yields, the local executor disarms the timer and
    checks the communication channel for recorded frames; if there are
    any, we conclude that the queue stalled and log them.
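
For illustration, here is a minimal, hedged sketch of the signal-handler side of this mechanism. It is not the code in this PR: it assumes the `libc` and `backtrace` crates, records raw instruction pointers into a fixed, pre-allocated buffer so the handler stays close to async-signal-safe, and leaves symbol resolution and logging to the executor thread after the queue yields.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_FRAMES: usize = 64;
const ZERO: AtomicUsize = AtomicUsize::new(0);

// Pre-allocated storage the handler writes into; the executor reads and
// symbolizes it after the stalled queue finally yields.
static FRAMES: [AtomicUsize; MAX_FRAMES] = [ZERO; MAX_FRAMES];
static FRAME_COUNT: AtomicUsize = AtomicUsize::new(0);

extern "C" fn on_sigusr1(_sig: libc::c_int) {
    // The kernel runs this handler on top of the interrupted stack, so the
    // frames walked here belong to the user code that failed to yield.
    let mut n = 0;
    unsafe {
        backtrace::trace_unsynchronized(|frame| {
            if n < MAX_FRAMES {
                FRAMES[n].store(frame.ip() as usize, Ordering::Relaxed);
                n += 1;
                true // keep walking
            } else {
                false // buffer full, stop
            }
        });
    }
    FRAME_COUNT.store(n, Ordering::Release);
}

fn install_stall_handler() {
    unsafe {
        // Register the handler for SIGUSR1; the background timer thread
        // sends this signal to the executor thread when the deadline passes.
        libc::signal(
            libc::SIGUSR1,
            on_sigusr1 as extern "C" fn(libc::c_int) as libc::sighandler_t,
        );
    }
}
```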

This code works in practice but has two major drawbacks:

  • The timer dance is expensive; expect a high number of syscalls.
    Because of this runtime overhead, the stall detector is disabled by
    default. To opt in, you must pass in something that implements
    StallDetectionHandler when constructing the LocalExecutor. There is a
    concrete DefaultStallDetectionHandler that uses SIGUSR1 and logs a
    warning message (see the usage sketch after this list).
  • We log stalls only after the queue yields. Therefore, if there is a bug
    in your code and your queue never yields, the stall detector will never
    log the code location that's at fault (even though we probably have
    recorded the trace by then). The reason for this is that logging from a
    signal handler is illegal.
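
A hedged usage sketch follows. The builder method name and exact signature shown here (`detect_stalls` taking an optional boxed handler) are assumptions based on the description above, not text quoted from this PR; the point is simply that opting in means handing a `StallDetectionHandler` implementation to the executor builder.

```rust
use glommio::{DefaultStallDetectionHandler, LocalExecutorBuilder, Placement};

fn main() {
    let ex = LocalExecutorBuilder::new(Placement::Unbound)
        // Opt into stall detection (disabled by default). The default
        // handler uses SIGUSR1 and logs a warning with the captured stack
        // trace once the stalled queue yields.
        .detect_stalls(Some(Box::new(DefaultStallDetectionHandler {})))
        .make()
        .expect("failed to build LocalExecutor");

    ex.run(async {
        // Any task queue here that runs past its expected yield point (plus
        // the error margin) gets its stack trace recorded and logged.
    });
}
```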

Glauber Costa and others added 8 commits January 11, 2022 11:48
We have a rudimentary stall detector that triggers if a queue doesn't
yield within 5ms of being scheduled for execution. Using the preempt
timer duration instead will limit the false positives we log (if the
user specifies a preempt timer of 100ms, then they are fine with a task
queue running for that much time).
Knowing if a queue is stalling the reactor is nice, but finding the
exact code location where the stall occurs remains very hard, especially
if a task queue hosts many concurrent fibers.

To help with that, this commit introduces a stall detection mechanism
that records stack traces of stalling tasks. It works as
follows:
* When a task queue is scheduled for execution, we set up a timer that
triggers some time after the queue is expected to yield (we add a 10%
error margin to avoid false positives). A background thread collocated
with the local executor waits on the timer at all times.
* When the timer fires, the thread sends a signal (SIGUSR1) to the local
executor thread. Upon receiving the signal, the local executor records a
complete trace of the local stack. Here we take advantage of the fact
that by default, the kernel invokes signal handlers on top of the
existing stack. i.e. the frames we record are those of the problematic
user code that was meant to yield. The recorded frames are pushed on a
non-blocking communication channel that links the signal handler and the
local executor.
* When a task queue yields, the local executor disarms the timer and
checks the communication channel for recorded frames; if there are any,
we conclude that the queue stalled and log them.

This code works in practice but has two major drawbacks:
* The timer dance is expensive; expect a high number of syscalls. Because
of this runtime overhead, the stall detector is disabled by default. To
opt in, the feature `stall-detection` must be enabled at compile time.
* We log stalls only after the queue yields. Therefore, if there is a bug
in your code and your queue never yields, the stall detector will never
log the code location that's at fault (even though we probably have
recorded the trace by then). The reason for this is that logging from a
signal handler is illegal.
@glommer (Collaborator) commented Jan 14, 2022

Generally good, minor comments only. The most important one is the heuristics about when to trigger. I think it is impossible that we'll get this right at the framework level, and the user has to tell us when to fire as they turn this on.

(also CI is not passing)

@HippoBaro (Member) left a comment

Some nits.

My one concern is that now that I am reading the code, I realize why I used TLS for the channels in my original PR. The reason is that signal handlers are global entities. Therefore I see two issues with this version of the code:

  • First, you install a signal handler every time an executor is created. So either (1) it is replaced each time or (2) the library does something clever underneath. Let's make sure it's the former.
  • Second, only a single executor will ever be signaled (the last one), and it will forward the traces to itself (because the handler captures the channels to send to). The only way for this to work is for the signal handler to fetch the channel to send to using TLS. Fortunately, the executor is hosted using TLS, so this should be trivial (Look at LOCAL_EX).

I think the best way to install the signal handler is to use some global entity with lazy initialization (you need it to be lazy since we can configure what signal to use at run time); see the sketch after this comment.

One last thing... What happens if two executors specify different signals? That's a problem. I say panic in that case. If they are doing that, they were looking for trouble.
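
To illustrate the suggestion, here is a hedged sketch (not the code in this PR): the handler is installed lazily exactly once, and it locates the per-thread buffer through TLS, mirroring how the executor itself lives in thread-local storage. The names `STALL_FRAMES` and `install_global_handler` are hypothetical.

```rust
use std::cell::RefCell;
use std::sync::Once;

thread_local! {
    // One buffer per executor thread, mirroring how the executor itself is
    // hosted in TLS (LOCAL_EX in Glommio).
    static STALL_FRAMES: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

static INSTALL: Once = Once::new();

extern "C" fn on_stall_signal(_sig: libc::c_int) {
    // The handler is process-global, but the lookup goes through TLS, so
    // whichever executor thread receives the signal records into its own
    // buffer. (A real handler must also stick to async-signal-safe work.)
    STALL_FRAMES.with(|frames| unsafe {
        backtrace::trace_unsynchronized(|frame| {
            frames.borrow_mut().push(frame.ip() as usize);
            true
        });
    });
}

fn install_global_handler(signum: libc::c_int) {
    // Lazy, one-time installation: later executors reuse the same handler
    // rather than replacing it. A real implementation would also have to
    // detect a mismatched `signum` and panic, as suggested above.
    INSTALL.call_once(|| unsafe {
        libc::signal(
            signum,
            on_stall_signal as extern "C" fn(libc::c_int) as libc::sighandler_t,
        );
    });
}
```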

Review comments on: clippy.toml, glommio/Cargo.toml, glommio/src/executor/mod.rs, glommio/src/executor/stall.rs (resolved)
@HippoBaro (Member) left a comment

Some very final comments, but overall this is looking really good.

Review comments on: glommio/src/executor/stall.rs (resolved)
@glommer (Collaborator) left a comment

Very minor comments at this point.
In the trait vs tunables question, I prefer the traits and you seem to as well, so let's do that.

Let me know what you think about my argument against the percentage-based threshold.
I think we should provide a concrete class with, say, 10ms flat default and then allow overriding through traits.

@davidblewett (Member, Author)

Let me know what you think about my argument against the percentage-based threshold. I think we should provide a concrete class with, say, 10ms flat default and then allow overriding through traits.

This seems reasonable to me. It's very easy to implement the exact behavior you want if this doesn't fit your use case.
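
As a rough illustration of that flexibility, here is a hypothetical sketch (the trait and names below are not Glommio's actual API): a trait supplies the stall threshold, the provided default returns a flat 10 ms, and a user who prefers the percentage-based behavior overrides the one method.

```rust
use std::time::Duration;

trait StallThreshold {
    // How far past its expected yield point a queue may run before we
    // consider it stalled and capture a trace.
    fn threshold(&self, expected_runtime: Duration) -> Duration {
        let _ = expected_runtime;
        Duration::from_millis(10) // flat 10ms default, as suggested
    }
}

// The concrete default: just uses the provided 10ms.
struct DefaultThreshold;
impl StallThreshold for DefaultThreshold {}

// A user who wants the original percentage-based margin overrides it.
struct ProportionalThreshold;
impl StallThreshold for ProportionalThreshold {
    fn threshold(&self, expected_runtime: Duration) -> Duration {
        expected_runtime / 10 // 10% of the expected runtime
    }
}
```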

@davidblewett (Member, Author) commented Jan 24, 2022

The test failures are due to newer lints in Rust 1.58.0 that seem unrelated to this PR. FWIW, the tests pass on 1.58.1.

@github-actions (bot)

Greetings @davidblewett!

It looks like your PR added a new dependency or changed an existing one, and CI has failed to validate your changes.
This is likely an indication that one of the dependencies you added uses a restricted license. See deny.toml for a list of licenses we allow.

Thank you!

@HippoBaro (Member) left a comment

Almost there!!

Review comments on: glommio/src/executor/mod.rs, glommio/src/sys/uring.rs (resolved)
need to write tests to validate enabling/disabling at runtime.
@HippoBaro merged commit fe33e30 into DataDog:master on Jan 26, 2022