eBPF event loop blockage finder #569

Closed
@kvakil

Hi -- I'd like to share an eBPF use-case which I found quite useful in
my day job. (Please let me know if there is a better forum to do this.)

We were experiencing long (10s+) event loop blockages that were
affecting our performance. We were alerted to this issue by
node-blocked.

We had two initial ideas:

  1. CPU profiles: but the overhead felt too high, especially since the
    blockages happened only rarely.
  2. async_hooks (specifically blocked-at): but the overhead was
    unacceptable.

The solution we landed on used eBPF, particularly bpftrace:

/* Whenever any thread enters uv__run_timers, record the current time
   in nanoseconds in a map. */
u:NODE_PATH:uv__run_timers { @[tid] = nsecs; }

/* Whenever any thread returns from uv__run_check, clear its time from
   the map. */
ur:NODE_PATH:uv__run_check /@[tid]/ { delete(@[tid]); }

/* 99 times a second, check if any running thread has been blocked
   for longer than 10 seconds. If so, take a core dump and stop
   this script. Note that system() requires running bpftrace with
   --unsafe. */
p:hz:99 /@[tid]/ {
    if (nsecs - @[tid] > 10000000000) {
        system("gcore %d", pid);
        exit();
    }
}

We ran this script on a bunch of machines, and eventually it spit out a
coredump. We opened the coredump with llnode and found the cause
via v8 backtrace.

Questions for this group

  • I will also create a separate issue about how llnode is no longer
    supported, but I think the eBPF approach is still useful on its
    own. For example, you can use it to get histograms of event loop
    blockages, which are independently useful for workload
    characterization.

  • On Node's side, it would be nicer if event loop stages were exposed as
    stable tracepoints instead of uprobes. This would make it easier for
    people to package similar tools.

  • One could also imagine a weaker version of this functionality being
    built into Node.js: collecting a JavaScript backtrace when the event
    loop has been blocked for too long. From talking with other
    engineers, I've heard that attributing event loop blockages is a
    common problem when running Node.js at scale. Is there interest in
    having this in Node.js core?
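
As a rough illustration of the histogram use case from the first bullet, here is a minimal sketch that reuses the same two probes as the script above. The map names @start and @iteration_ms are made up for this example, and NODE_PATH is the same placeholder for the node binary. One caveat, as far as I can tell: the measured interval also covers time the loop spends waiting in the poll phase, so this shows loop iteration duration rather than strictly blocked time.

```
/* Record when a thread enters uv__run_timers. @start is an
   illustrative map name, not something from the original script. */
u:NODE_PATH:uv__run_timers { @start[tid] = nsecs; }

/* When the thread returns from uv__run_check, add the elapsed time
   (in milliseconds) to a power-of-2 histogram and clear the entry. */
ur:NODE_PATH:uv__run_check /@start[tid]/ {
    @iteration_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}
```

bpftrace prints the contents of its maps when the script exits (for example on Ctrl-C), so the histogram shows up without any extra code. Unlike the core dump script, this one does not call system() and so should not need --unsafe.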
