Skip to content

monitorEventLoopDelay metrics are completely off especially the count #56064

Open
@rochdev

Description

@rochdev

Version

v23.3.0

Platform

Darwin Rochs-MBP-3710.localdomain 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64

Subsystem

perf_hooks

What steps will reproduce the bug?

const { monitorEventLoopDelay } = require('perf_hooks')
const h = monitorEventLoopDelay({ resolution: 1 })

h.enable()

setInterval(() => {
  // Just to trigger an event loop iteration 10x per second.
}, 100)

setTimeout(() => {
  console.log(h)
}, 1000)

How often does it reproduce? Is there a required condition?

Every time.

What is the expected behavior? Why is that the expected behavior?

The expected behaviour would be to get accurate metrics about event loop delay without impacting the event loop itself. In the above example, I would expect to see ~10 iterations.

What do you see instead?

Using monitorEventLoopDelay alters the behaviour of the event loop, causing it to run more than necessary and preventing it from idling. This means that after 1 second with a resolution of 1, you get ~1000 iterations in 1 second for an app that is mostly idle because of the added timer.

Another similar issue is that even the metrics themselves are wildly inaccurate as described in #34661

Additional information

It seems to me that this is a fundamental issue with the approach that was used to capture these metrics, and the function would need to be rewritten in order to provide accurate information.

One way to do this that we've been using successfully for years at this point is to collect timing information with 2 hooks on libuv around the non-IO part of the loop to isolate user code. While it's not 100% accurate, it's definitely way more accurate than the current approach and doesn't skew regardless of very low (0-10) or very high (thousands to millions) iterations. Here is a code snippet of the alternative approach: https://github.com/DataDog/dd-native-metrics-js/blob/4b326d5a1669e7a69c5c84a4d3e036163e60b9d5/src/metrics/EventLoop.hpp#L72-L96

I'm willing to open a PR to fix the issue by reworking the internals to monitor libuv directly instead of relying on a timer, but I first wanted to open this issue to see what others think, and if there may be better approaches or subtle potential issue with the proposed new approach. If it sounds good to everyone then I'll just go ahead with a PR. We've been waiting for a long time to get this out of the box, and thought that we could finally remove our native addon but this ended up not being the case, so I'm pretty motivated to help any way I can to fix this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    perf_hooksIssues and PRs related to the implementation of the Performance Timing API.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions