Description
Hi -- I'd like to share an eBPF use-case which I found quite useful in
my day job. (Please let me know if there is a better forum to do this.)
We were experiencing long (10s+) event loop blockages which were
affecting our performance. We were alerted to this issue by
node-blocked.
We had two initial ideas:
- CPU profiles: but the overhead felt too high, especially since the
  blockages only happened rarely.
- async_hooks (specifically blocked-at): but the overhead was
  unacceptable.
The solution we landed on used eBPF, particularly bpftrace:

    /* Whenever any thread enters uv__run_timers, record the current time
       in nanoseconds in a map. */
    u:NODE_PATH:uv__run_timers { @[tid] = nsecs; }

    /* Whenever any thread returns from uv__run_check, clear its time from
       the map. */
    ur:NODE_PATH:uv__run_check /@[tid]/ { delete(@[tid]); }

    /* 99 times a second, check if any running thread has been blocked
       for longer than 10 seconds. If so, take a core dump and stop
       this script. */
    p:hz:99 /@[tid]/ {
      if (nsecs - @[tid] > 10000000000) {
        system("gcore %d", pid);
        exit();
      }
    }

We ran this script on a bunch of machines, and eventually it spit out a
core dump. We opened the core dump with llnode and found the cause
via v8 backtrace.
Questions for this group
- I will also create a separate issue about how llnode is no longer
  supported, but I think this is still useful functionality. For
  example, you can use it to get histograms of event loop blockages,
  which are independently useful for workload characterization.
- On Node's side, it would be nicer if event loop stages were exposed as
  stable tracepoints instead of uprobes. This would make it easier for
  people to package similar tools.
- One could also imagine a weaker version of this functionality being
  built into Node.js: collecting a JavaScript backtrace when the event
  loop has been blocked for too long. From talking with other
  engineers, I've heard that attributing event loop blockages is a
  common problem when running Node.js at scale. Is there interest in
  having this in Node.js core?
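
To illustrate the histogram idea from the first question, here is a
minimal sketch that reuses the two uprobes from the script above. It is
untested against any particular Node build. The map names @start and
@elapsed_ms are made up for this example, and NODE_PATH is the same
placeholder for the node binary as before.

```
/* Sketch only: same probes as the core dump script, but instead of
   dumping core, record how long each pass between the probes took. */
u:NODE_PATH:uv__run_timers { @start[tid] = nsecs; }

ur:NODE_PATH:uv__run_check /@start[tid]/ {
  /* Power-of-two histogram of elapsed time in milliseconds. */
  @elapsed_ms = hist((nsecs - @start[tid]) / 1000000);
  delete(@start[tid]);
}
```

bpftrace prints the histogram when the script exits (for example on
Ctrl-C). Depending on the libuv version, the interval between the two
probes may include time the loop spends idle waiting for I/O, so the
buckets are probably best read as an upper bound on blocked time.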