add a stall detector that logs stacktraces of unyielding tasks, redux #499

Merged 26 commits on Jan 26, 2022

Commits on Jan 11, 2022

  1. measure and log unyielding task queues

    Glauber Costa authored and davidblewett committed Jan 11, 2022
    9605100
  2. use the preempt timer duration as the stall detector threshold

    We have a rudimentary stall detector that triggers if a queue doesn't
    yield within 5ms of being scheduled for execution. Using the preempt
    timer duration instead will limit the false positives we log (if the
    user specifies a preempt timer of 100ms, then they are fine with a task
    queue running for that long).
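
As a rough illustration of the threshold change described in this commit message, the check could look like the sketch below. It is not the code from this commit: the `overrun` function and its parameters are hypothetical names, and it only assumes that the elapsed run time of a queue is compared against the user-configured preempt timer rather than a fixed 5ms.

```rust
use std::time::Duration;

/// Hypothetical helper (not the PR's API): returns by how much a task queue
/// overran its budget, or `None` if it yielded in time.
///
/// `elapsed` is how long the queue ran since it was scheduled;
/// `preempt_timer` is the user-configured preempt timer duration, used as
/// the stall threshold instead of a hard-coded 5ms.
fn overrun(elapsed: Duration, preempt_timer: Duration) -> Option<Duration> {
    elapsed
        .checked_sub(preempt_timer) // None when elapsed < preempt_timer
        .filter(|over| !over.is_zero()) // running exactly on budget is not a stall
}
```

With a 100ms preempt timer, a queue that ran for 20ms is not reported under this sketch, whereas a fixed 5ms threshold would have flagged it.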
    HippoBaro authored and davidblewett committed Jan 11, 2022
    f60f577
  3. 849ab2f
  4. asynchronously record stack traces when a task queue goes over budget

    Knowing that a queue is stalling the reactor is nice, but finding the
    exact code location where the stall occurs remains very hard, especially
    if a task queue hosts many concurrent fibers.
    
    To help with that, this commit introduces a stall detection mechanism
    that records stack traces of stalling tasks. It works as
    follows:
    * When a task queue is scheduled for execution, we set up a timer that
    triggers some time after the queue is expected to yield (we add a 10%
    error margin to avoid false positives). A background thread collocated
    with the local executor waits on the timer at all times.
    * When the timer fires, the thread sends a signal (SIGUSR1) to the local
    executor thread. Upon receiving the signal, the local executor records a
    complete trace of the local stack. Here we take advantage of the fact
    that, by default, the kernel invokes signal handlers on top of the
    existing stack, i.e. the frames we record are those of the problematic
    user code that was meant to yield. The recorded frames are pushed on a
    non-blocking communication channel that links the signal handler and the
    local executor.
    * When a task queue yields, the local executor disarms the timer and
    checks the communication channel for recorded frames. If there are any,
    we can conclude that the queue stalled, so we log them.
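
The arm/disarm protocol in the three steps above can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the implementation in this commit: `StallWatchdog`, `Cmd`, `arm`, and `disarm` are hypothetical names, a channel with a receive timeout stands in for the timer, and, because `std` has no signal API, the watchdog thread pushes a placeholder message where the real code would send SIGUSR1 and let the signal handler record the executor's stack frames.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Messages from the executor to the watchdog thread (hypothetical names).
enum Cmd {
    Arm(Duration), // a queue was scheduled with this time budget
    Disarm,        // the queue yielded
}

struct StallWatchdog {
    cmds: Sender<Cmd>,
    traces: Receiver<String>, // stands in for the channel of recorded frames
}

impl StallWatchdog {
    fn spawn() -> Self {
        let (cmd_tx, cmd_rx) = mpsc::channel::<Cmd>();
        let (trace_tx, trace_rx) = mpsc::channel::<String>();
        thread::spawn(move || {
            let mut pending: Option<Cmd> = None;
            loop {
                // Take a command received while waiting, or block for one.
                let cmd = match pending.take() {
                    Some(cmd) => cmd,
                    None => match cmd_rx.recv() {
                        Ok(cmd) => cmd,
                        Err(_) => break, // executor side was dropped
                    },
                };
                let budget = match cmd {
                    Cmd::Arm(budget) => budget,
                    Cmd::Disarm => continue, // nothing armed, nothing to do
                };
                // 10% error margin on top of the budget, as in the description.
                let deadline = budget + budget / 10;
                match cmd_rx.recv_timeout(deadline) {
                    // Disarmed or re-armed in time: handle it on the next turn.
                    Ok(next) => pending = Some(next),
                    Err(RecvTimeoutError::Timeout) => {
                        // Real mechanism: send SIGUSR1 to the executor thread,
                        // whose handler records a stack trace. Here we only
                        // push a placeholder record.
                        let _ = trace_tx.send(format!("queue exceeded {:?}", deadline));
                    }
                    Err(RecvTimeoutError::Disconnected) => break,
                }
            }
        });
        Self { cmds: cmd_tx, traces: trace_rx }
    }

    /// Called when a task queue is scheduled for execution.
    fn arm(&self, budget: Duration) {
        let _ = self.cmds.send(Cmd::Arm(budget));
    }

    /// Called when the queue yields: stop the timer and drain any records
    /// without blocking. A non-empty result means the queue stalled.
    fn disarm(&self) -> Vec<String> {
        let _ = self.cmds.send(Cmd::Disarm);
        self.traces.try_iter().collect()
    }
}
```

In this model the executor would call `arm` with the preempt timer duration before running a queue and log whatever `disarm` returns once the queue yields. The sketch cannot show the key trick of the real mechanism, which is that the signal handler runs on the executor thread's own stack and can therefore capture the frames of the stalling code.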
    
    This code works in practice but has two major drawbacks:
    * The timer dance is expensive; expect a high number of syscalls. Because
    of this runtime overhead, the stall detector is disabled by default. To
    opt in, the feature `stall-detection` must be enabled at compile-time.
    * We log stalls only after the queue yields. Therefore, if there is a bug
    in your code and your queue never yields, the stall detector will never
    log the code location that's at fault (even though we probably have
    recorded the trace by then). The reason for this is that logging from a
    signal handler is illegal.
    HippoBaro authored and davidblewett committed Jan 11, 2022
    75681e7

Commits on Jan 13, 2022

  1. 77662c4
  2. 9ae41f7
  3. Add trait-based handler, to allow for customizing signal

    and trace collection eventually.
    davidblewett committed Jan 13, 2022
    77e005c
  4. 186a4d1

Commits on Jan 14, 2022

  1. Update docs.

    davidblewett committed Jan 14, 2022
    647ce55
  2. PR feedback

    davidblewett committed Jan 14, 2022
    aee9682
  3. c24ad19

Commits on Jan 20, 2022

  1. PR feedback

    davidblewett committed Jan 20, 2022
    99237c1
  2. Refactor stall detector tests and add coverage

    for checking that incoming signals match expected executor.
    davidblewett committed Jan 20, 2022
    af66c8d
  3. 8a1ca54
  4. 5c7a31b
  5. b479ab5
  6. PR feedback fixes.

    davidblewett committed Jan 20, 2022
    0721f20
  7. Make StallDetector completely self-contained

    by moving signal_id from `LocalExecutor`.
    davidblewett committed Jan 20, 2022
    0477a0b
  8. Doctest fixes.

    davidblewett committed Jan 20, 2022
    bdaf8ee
  9. Rename DefaultStallDetectionHandler -> LoggingStallDetectionHandler

    and make all knobs configurable.
    davidblewett committed Jan 20, 2022
    04fddd6

Commits on Jan 21, 2022

  1. PR feedback

    davidblewett committed Jan 21, 2022
    b06a369
  2. Simplify Debug output.

    davidblewett committed Jan 21, 2022
    79bfca1

Commits on Jan 24, 2022

  1. dec6adb

Commits on Jan 25, 2022

  1. 515781b
  2. 65f305d

Commits on Jan 26, 2022

  1. Don't export LocalExecutor::detect_stalls for now;

    need to write tests to validate enabling/disabling at runtime.
    davidblewett committed Jan 26, 2022
    9be026d