eBPF event loop blockage finder #569

Hi -- I'd like to share an eBPF use-case which I found quite useful in
my day job. (Please let me know if there is a better forum to do this.)

We were experiencing long (10s+) event loop blockages, which were
affecting our performance. We were alerted to this issue by
[node-blocked][nb].

[nb]: https://github.com/tj/node-blocked

We had two initial ideas:

1. CPU profiles: but the overhead felt too high, especially since the
   blockages happened only rarely.
1. `async_hooks` (specifically [blocked-at][ba]): but the overhead was
   unacceptable.

[ba]: https://github.com/naugtur/blocked-at

The solution we landed on used eBPF, particularly bpftrace:

```c
/* Whenever any thread enters uv__run_timers, record the current time
   in nanoseconds in a map, keyed by thread ID. NODE_PATH is a
   placeholder for the path to the node binary. */
u:NODE_PATH:uv__run_timers { @[tid] = nsecs; }

/* Whenever any thread returns from uv__run_check, clear its time from
   the map. */
ur:NODE_PATH:uv__run_check /@[tid]/ { delete(@[tid]); }

/* 99 times a second, check if any running thread has been blocked
   for longer than 10 seconds (10000000000 ns). If so, take a core dump
   and stop this script. Note that system() requires running bpftrace
   with --unsafe. */
p:hz:99 /@[tid]/ {
    if (nsecs - @[tid] > 10000000000) {
        system("gcore %d", pid);
        exit();
    }
}
```

We ran this script on a bunch of machines, and eventually it spit out a
coredump. We opened the coredump with [llnode][lln] and found the cause
via `v8 backtrace`.

[lln]: https://github.com/nodejs/llnode

## Questions for this group

* I will also create a separate issue about how llnode is no longer
  supported, but I think this is still useful functionality. For
  example, you can use it to get histograms of event loop blockages
  which is independently useful for workload characterization.

* On Node's side, it would be nicer if event loop stages were exposed as
  stable tracepoints instead of uprobes. This would make it easier for
  people to package similar tools.

* One could also imagine a weaker version of this functionality being
  built into Node.js: collecting a JavaScript backtrace when the event
  loop has been blocked for too long. From talking with other
  engineers, I've heard that attributing event loop blockages is a
  common problem when running Node.js at scale. Is there interest in
  having this in Node.js core?
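
To illustrate the histogram idea from the first bullet, here is a minimal sketch that reuses the same two uprobes as the script above. `NODE_PATH` is again a placeholder for the path to the node binary, and the map names `@start` and `@loop_ms` are made up for illustration. One caveat, assuming I am reading libuv's loop order correctly: the measured interval also spans the poll phase, so time spent idle waiting for I/O is counted along with genuine blockages.

```c
/* Record when each thread enters uv__run_timers.
   (@start is an illustrative map name.) */
u:NODE_PATH:uv__run_timers { @start[tid] = nsecs; }

/* On return from uv__run_check, add the elapsed time in milliseconds
   to a power-of-two histogram, then clear the thread's start time.
   (@loop_ms is an illustrative map name.) */
ur:NODE_PATH:uv__run_check /@start[tid]/ {
    @loop_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}

/* On exit, drop any leftover start times so that only the histogram
   is printed. */
END { clear(@start); }
```

bpftrace prints the `@loop_ms` histogram when the script exits (for example on Ctrl-C). Since this variant does not call `system()`, it should not need `--unsafe`.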
