Description
We’d like to get feedback on the status of the profiling tools landscape in Node.js today. In particular, we want to align on a roadmap for a free, open source, and cross-platform set of tools that are part of the Node/V8 API (i.e. maintained across LTS versions) and that together provide a powerful suite to debug and diagnose Node.js issues in production.
Production Challenges
Debugging and diagnosing issues in production poses some unique challenges, especially for large, critical deployments. The production setting imposes the following constraints:
- Generally, most solutions need to be low impact on the process. Attaching debuggers or tracers is often impractical when the application is taking production traffic. Debuggers will pause the process, causing all inflight requests to hang. Tracers and debuggers often introduce large performance penalties, which can impact customers.
- Tools need to be safe and not cause kernel panics or process crashes.
- It’s often hard to reproduce issues seen at production scale, so programmatic access to tooling via APIs can help diagnose problems as they occur, e.g. by starting a profile as soon as some condition is met.
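The last point can be sketched as a small trigger. This is only an illustration, assuming event-loop lag is the condition of interest. The names `createLagTrigger`, `monitorEventLoop`, and the `onTrigger` callback are hypothetical and not part of any Node.js API.

```javascript
// Hypothetical trigger: call `onTrigger` when observed event-loop lag reaches
// a threshold, but at most once per cooldown window so that a sustained
// incident does not start profile after profile.
function createLagTrigger({ thresholdMs, cooldownMs, onTrigger }) {
  let lastFired = -Infinity;
  // Feed one lag observation; returns true if the trigger fired.
  return function observe(lagMs, nowMs) {
    if (lagMs < thresholdMs) return false;
    if (nowMs - lastFired < cooldownMs) return false;
    lastFired = nowMs;
    onTrigger(lagMs);
    return true;
  };
}

// Estimate lag as how late a repeating timer fires relative to its schedule.
function monitorEventLoop(observe, intervalMs) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    observe(Math.max(0, now - expected), now);
    expected = now + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer); // call to stop monitoring
}
```

In a service, `monitorEventLoop(observe, 100)` would feed the trigger, and `onTrigger` would start whichever profiler ends up being supported. Keeping the decision logic separate from the timer makes it easy to test with synthetic lag values.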
Profiling
One of the most useful ways to optimize CPU performance in a running application is to sample the CPU stack frames (CPU profiling) and then visualize the samples, typically as a flamegraph. This technique shows the hot code paths on CPU, which gives you the opportunity to optimize the relevant source code.
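To make the sampling-then-visualizing step concrete, here is a minimal sketch of how stack samples are commonly aggregated before rendering. It assumes each sample is an array of frame names ordered from root to leaf, and it emits the "folded" text format (semicolon-joined stack, a space, then a count) that flamegraph tools typically accept. The function name `foldStacks` and the sample data are made up for illustration.

```javascript
// Collapse stack samples into folded lines of the form "root;child;leaf count".
// Identical stacks are merged and their sample counts summed.
function foldStacks(samples) {
  const counts = new Map();
  for (const frames of samples) {
    const key = frames.join(';');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Sort for stable, diff-friendly output.
  return [...counts].map(([stack, n]) => `${stack} ${n}`).sort();
}

// Illustrative samples only; real samples would come from a profiler.
const samples = [
  ['main', 'handleRequest', 'parseBody'],
  ['main', 'handleRequest', 'parseBody'],
  ['main', 'handleRequest', 'render'],
  ['main', 'gc'],
];

console.log(foldStacks(samples).join('\n'));
```

The width of each box in a flamegraph is proportional to these counts, which is why a frequently sampled path such as `parseBody` above stands out as a hot path.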
CPU sampling can be done in production with OS level profilers such as perf, DTrace, SystemTap, or eBPF with very low overhead. On their own, these profilers lack the information to resolve JS frames, so JS frames show up as unhelpful memory addresses. V8 solves this problem by dumping a mapping of native frame addresses to JS source and line number.
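As an illustration of how such a mapping is used, the sketch below resolves a sampled address against a perf-style map, where each line is `<hex start> <hex size> <symbol name>`. This is the layout of the `/tmp/perf-<pid>.map` files that perf reads and that V8 can write, for example via its `--perf-basic-prof` flag. The helper names and the sample entries are assumptions for the example, and the lookup assumes entries do not overlap, which may not hold when code is recompiled or moved.

```javascript
// Parse perf map text. Each valid line is "<hex start> <hex size> <symbol name>".
function parsePerfMap(text) {
  const entries = [];
  for (const line of text.split('\n')) {
    const match = /^([0-9a-f]+) ([0-9a-f]+) (.+)$/i.exec(line.trim());
    if (!match) continue; // skip blank or malformed lines
    const start = parseInt(match[1], 16);
    const size = parseInt(match[2], 16);
    entries.push({ start, end: start + size, name: match[3] });
  }
  return entries.sort((a, b) => a.start - b.start);
}

// Binary search for the entry whose [start, end) range contains the address.
// Assumes non-overlapping ranges; returns null for unmapped addresses.
function resolveAddress(entries, address) {
  let lo = 0;
  let hi = entries.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = entries[mid];
    if (address < entry.start) hi = mid - 1;
    else if (address >= entry.end) lo = mid + 1;
    else return entry.name;
  }
  return null;
}

// Illustrative map contents, not real V8 output.
const sampleMap = [
  '2a1c000 40 LazyCompile:~handleRequest /app/server.js:10',
  '2a1c100 80 LazyCompile:*parseBody /app/body.js:42',
].join('\n');

const entries = parsePerfMap(sampleMap);
console.log(resolveAddress(entries, 0x2a1c120));
```

Without the map, the profiler can only report the raw address `0x2a1c120`; with it, the same sample is attributed to a JS function and source location.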
It’s worth stressing that we need access to all stack frames, whether native (V8, libc, syscalls, libuv, native modules) or JS. Problems can occur anywhere in the stack, and we want to be able to profile Node with complete stack frames. For example, we heavily use gRPC, which is a native module, so without access to native frames we would have no visibility into this critical part of our application.
There are a few issues with this implementation:
- perf(1) support is now deprecated in V8 and will not be supported starting in Node 8 -- which effectively means we’re losing the ability to profile JS stacks.
- Lack of cross-platform tooling. Even if perf support wasn’t deprecated, this solution only works on Linux.
We’d like to contribute and collaborate on a set of comprehensive, cross-platform, and open source CPU profiling tools with the Node and V8 teams. The V8 team has advised us that they plan to support the V8 profiler and the V8 CPU profiler API going forward, and we want to unlock CPU profiling capabilities in Node using these supported frameworks.
Roadmap:
- Determine which tools -- V8 profiler (--prof), CPU profiler, or something else -- to support and maintain for profiling Node applications.
- Currently, these tools do not support native C/C++ frames, which include V8, native modules, libuv, and syscall stack frames. Note that because these tools run in process, they will never be able to show syscalls, as those can only be sampled from within the kernel. Supporting native frames will require implementing a stack walker in the V8 profiler or the CPU profiler.
- Determine the overhead of profiling using these tools. With the now deprecated OS level profilers, the overhead was quite low -- and thus suitable to use in a production environment without outsized impact to customers.
- Dynamically toggle profiling in a production environment. It’s not practical to profile all the time in production due to the performance impact -- being able to dynamically enable and disable profiling via an API would be ideal.
- Provide an API to programmatically consume profiling data once enabled.
- Add documentation so these tools can be adopted by users.
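As one possible shape for the toggling and consumption items above, the sketch below drives the V8 CPU profiler through Node's built-in `inspector` module, assuming a Node version that ships it (8 and later). It is not a proposal for the final API. The function names `startCpuProfile`, `stopCpuProfile`, and `topSelfTime` are hypothetical, while the `Profiler.enable`, `Profiler.start`, and `Profiler.stop` messages come from the inspector protocol.

```javascript
const inspector = require('inspector');

let session = null; // non-null while a profile is running

// Start sampling. Calls back with an error if a profile is already running.
function startCpuProfile(callback) {
  if (session) return callback(new Error('profile already running'));
  const s = new inspector.Session();
  s.connect();
  session = s;
  const fail = (err) => {
    session = null;
    s.disconnect();
    callback(err);
  };
  s.post('Profiler.enable', (err) => {
    if (err) return fail(err);
    s.post('Profiler.start', (startErr) => {
      if (startErr) return fail(startErr);
      callback(null);
    });
  });
}

// Stop sampling and hand the collected profile object to the callback.
function stopCpuProfile(callback) {
  if (!session) return callback(new Error('no profile running'));
  const active = session;
  session = null;
  active.post('Profiler.stop', (err, result) => {
    active.disconnect();
    callback(err, err ? undefined : result.profile);
  });
}

// Consume a profile: rank functions by self samples (hitCount per node).
function topSelfTime(profile, limit) {
  const hits = new Map();
  for (const node of profile.nodes) {
    const name = node.callFrame.functionName || '(anonymous)';
    hits.set(name, (hits.get(name) || 0) + (node.hitCount || 0));
  }
  return [...hits]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([name, count]) => ({ name, count }));
}
```

A caller would invoke `startCpuProfile`, wait for the window of interest, then call `stopCpuProfile` and pass the result to `topSelfTime` or write it out as JSON for later analysis. Note that a profile collected this way contains JS frames only, which is exactly the native-frame gap described in the list above.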
We’re looking for feedback and alignment with the community on this subject before proceeding with the design and implementation -- please let us know your thoughts.