Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[profiling] Associate tracing span IDs with samples in the wall profiler #3371

Merged
merged 4 commits into from
Jul 14, 2023

Conversation

nsavoire
Copy link
Collaborator

@nsavoire nsavoire commented Jul 7, 2023

What does this PR do?

Uses the new functionality of wall profiler in DataDog/pprof-nodejs#105 to associate wall profiler samples with span IDs.

Motivation

Implement the code hotspots (ability to link traces to profiling samples) for wall profiler.

Additional Notes

Includes #3368.

Profiler takes samples through a signal handler and needs to retrieve the current span ID and local root span ID in this signal handler.
The current design is to notify profiler on tracer side of all changes of current span. Change of active span can occur for two different reasons:

  • Node.js change of asynchronous context: notification is done by using async hooks before/after and publishing to dd-trace:storage:before and dd-trace:storage:after channels.
  • tracer changes current active span through AsyncResourceStorage.enterWith or AsyncResourceStorage.run: notification is done through dd-trace:storage:enter channel.
    Upon notification and if a sample has been taken by profiler since last notification, profiler tracer side calls profiler native side to update span id / root span id of previous sample.

Implementation notes:

  • Repeated native calls are avoided by sharing an array between JS and native addon. This array currently only stores the number of samples taken by the profiler.
  • JS code use this information to determine if it needs to compute and update context for previous samples and then reset it to 0.

Open questions:

  • There is perhaps a better place than AsyncResourceStorage.enterWith / AsyncResourceStorage.run to detect active span changes ?

@nsavoire nsavoire requested a review from a team as a code owner July 7, 2023 13:21
@github-actions
Copy link

github-actions bot commented Jul 7, 2023

Overall package size

Self size: 4.88 MB
Deduped: 57.98 MB
No deduping: 58.07 MB

Dependency sizes

name version self size total size
@datadog/native-iast-taint-tracking 1.5.0 14.86 MB 14.86 MB
@datadog/native-appsec 3.2.0 13.38 MB 13.39 MB
@datadog/pprof 3.1.0 10.66 MB 11.5 MB
protobufjs 7.2.4 2.74 MB 6.52 MB
@datadog/native-iast-rewriter 2.0.1 2.09 MB 2.1 MB
@opentelemetry/core 1.14.0 872.87 kB 1.47 MB
@datadog/native-metrics 2.0.0 898.77 kB 1.3 MB
@opentelemetry/api 1.4.1 780.32 kB 780.32 kB
msgpack-lite 0.1.26 201.16 kB 281.59 kB
opentracing 0.14.7 194.81 kB 194.81 kB
semver 7.5.3 93.39 kB 123.79 kB
@datadog/sketches-js 2.1.0 109.9 kB 109.9 kB
lodash.sortby 4.7.0 75.76 kB 75.76 kB
lru-cache 7.14.0 74.95 kB 74.95 kB
ipaddr.js 2.0.1 59.52 kB 59.52 kB
int64-buffer 0.1.10 49.18 kB 49.18 kB
ignore 5.2.0 48.87 kB 48.87 kB
import-in-the-middle 1.3.5 34.34 kB 38.81 kB
istanbul-lib-coverage 3.2.0 29.34 kB 29.34 kB
retry 0.10.1 27.44 kB 27.44 kB
lodash.uniq 4.5.0 25.01 kB 25.01 kB
limiter 1.1.5 23.17 kB 23.17 kB
lodash.kebabcase 4.1.1 17.75 kB 17.75 kB
lodash.pick 4.4.0 16.33 kB 16.33 kB
node-abort-controller 3.0.1 14.33 kB 14.33 kB
crypto-randomuuid 1.0.0 11.18 kB 11.18 kB
diagnostics_channel 1.1.0 7.07 kB 7.07 kB
path-to-regexp 0.1.7 6.78 kB 6.78 kB
koalas 1.0.2 6.47 kB 6.47 kB
methods 1.1.2 5.29 kB 5.29 kB
module-details-from-path 1.0.3 4.47 kB 4.47 kB

🤖 This report was automatically generated by heaviest-objects-in-the-universe

Qard
Qard previously approved these changes Jul 7, 2023
Copy link
Contributor

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor nit.

durationMillis: this._flushIntervalMillis,
sourceMapper: this._mapper,
customLabels: this._codeHotspotsEnabled,
lineNumbers: false })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lineNumbers: false })
lineNumbers: false
})

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!
I already asked but I forgot what was the answer: why isn't it flagged by linter? which auto-formatter should I use to make sure that code complies with repo coding standards ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're probably missing an eslint config for it. https://eslint.org/docs/latest/rules/object-curly-newline appears to be what we're looking for?

@codecov
Copy link

codecov bot commented Jul 10, 2023

Codecov Report

Merging #3371 (1b00677) into master (d450941) will increase coverage by 19.01%.
The diff coverage is 25.31%.

@@             Coverage Diff             @@
##           master    #3371       +/-   ##
===========================================
+ Coverage   65.12%   84.13%   +19.01%     
===========================================
  Files         205      211        +6     
  Lines        7794     8328      +534     
  Branches       33       33               
===========================================
+ Hits         5076     7007     +1931     
+ Misses       2718     1321     -1397     
Impacted Files Coverage Δ
packages/dd-trace/src/profiling/profiler.js 92.92% <0.00%> (+80.80%) ⬆️
packages/dd-trace/src/span_processor.js 48.75% <ø> (-1.25%) ⬇️
packages/dd-trace/src/profiling/profilers/wall.js 39.17% <21.62%> (+35.17%) ⬆️
packages/dd-trace/src/plugins/util/web.js 90.66% <100.00%> (+0.03%) ⬆️
packages/dd-trace/src/profiling/config.js 96.36% <100.00%> (+84.43%) ⬆️

... and 66 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pr-commenter
Copy link

pr-commenter bot commented Jul 10, 2023

Benchmarks

Benchmark execution time: 2023-07-13 13:43:30

Comparing candidate commit 1b00677 in PR branch nsavoire/code_hotspots with baseline commit d450941 in branch master.

Found 0 performance improvements and 21 performance regressions! Performance is the same for 449 metrics, 22 unstable metrics.

scenario:exporting-pipeline-0.4-16

  • 🟥 cpu_user_time [+50.441ms; +55.523ms] or [+20.808%; +22.904%]
  • 🟥 execution_time [+54.549ms; +59.178ms] or [+19.917%; +21.607%]
  • 🟥 instructions [+120M instructions; +121M instructions] or [+17.281%; +17.382%]
  • 🟥 max_rss_usage [+3.336KB; +3.413KB] or [+5.992%; +6.131%]

scenario:exporting-pipeline-0.4-18

  • 🟥 cpu_user_time [+55.344ms; +60.794ms] or [+21.240%; +23.332%]
  • 🟥 execution_time [+60.400ms; +64.003ms] or [+20.503%; +21.726%]
  • 🟥 instructions [+134M instructions; +135M instructions] or [+17.332%; +17.412%]
  • 🟥 max_rss_usage [+3.386KB; +3.437KB] or [+5.398%; +5.479%]

scenario:exporting-pipeline-0.4_with_stats-16

  • 🟥 cpu_user_time [+51.527ms; +56.609ms] or [+21.134%; +23.218%]
  • 🟥 execution_time [+54.828ms; +59.563ms] or [+19.851%; +21.565%]
  • 🟥 instructions [+123M instructions; +123M instructions] or [+17.547%; +17.616%]
  • 🟥 max_rss_usage [+3.471KB; +3.562KB] or [+6.219%; +6.382%]

scenario:exporting-pipeline-0.4_with_stats-18

  • 🟥 cpu_user_time [+54.970ms; +60.281ms] or [+20.950%; +22.974%]
  • 🟥 execution_time [+62.631ms; +65.714ms] or [+21.156%; +22.197%]
  • 🟥 instructions [+135M instructions; +136M instructions] or [+17.372%; +17.451%]
  • 🟥 max_rss_usage [+3.631KB; +3.680KB] or [+5.785%; +5.863%]

scenario:exporting-pipeline-0.5-16

  • 🟥 instructions [+58M instructions; +58M instructions] or [+10.283%; +10.314%]

scenario:exporting-pipeline-0.5-18

  • 🟥 instructions [+73M instructions; +73M instructions] or [+12.681%; +12.756%]

scenario:exporting-pipeline-0.5_with_stats-16

  • 🟥 instructions [+58M instructions; +59M instructions] or [+10.227%; +10.278%]

scenario:exporting-pipeline-0.5_with_stats-18

  • 🟥 execution_time [+12.972ms; +15.689ms] or [+5.121%; +6.194%]
  • 🟥 instructions [+73M instructions; +73M instructions] or [+12.596%; +12.642%]

@nsavoire nsavoire force-pushed the nsavoire/code_hotspots branch 2 times, most recently from 616ce46 to 7338234 Compare July 10, 2023 13:06
@nsavoire nsavoire requested review from szegedi and Qard July 10, 2023 13:26
@nsavoire nsavoire force-pushed the nsavoire/code_hotspots branch 9 times, most recently from b0d8686 to 42e0d49 Compare July 13, 2023 13:04
Comment on lines 141 to 142
if (this._profilerState[kSampleCount] !== 0) {
this._profilerState[kSampleCount] = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of resetting here, I'd rather keep a last sample count and compare to it

const sampleCount = this._profilerState[kSampleCount]
if (sampleCount != this._lastSampleCount) {
  ...
}
this._lastSampleCount = sampleCount

This way JS code never writes to it, only reads it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like that !

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks 
before channel and to AsyncResourceStorage enter channel that 
notifies when current store is changed with enterWith/run. 
Profiler remember previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample 
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow 
computation of endpoint later on.
Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.
For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
@nsavoire nsavoire requested a review from szegedi July 13, 2023 13:36
Copy link
Contributor

@szegedi szegedi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!
🚢

@nsavoire
Copy link
Collaborator Author

nsavoire commented Jul 13, 2023

Benchmark regression is expected:
benchmark repeatedly exports the same spans, therefore with the previous behaviour (clearing span tags on export), after the first export, all spans would have had empty tags.
Now that tags are not cleared, initial tags are present during all subsequent exports, which are then slower.

@@ -5,6 +5,7 @@ const { channel } = require('../../../diagnostics_channel')

const beforeCh = channel('dd-trace:storage:before')
const afterCh = channel('dd-trace:storage:after')
const enterCh = channel('dd-trace:storage:enter')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, we're aiming long-term to eliminate use of async_hooks, so it would be ideal if we can eliminate these async_hooks lifecycle event channels. Not going to block this now as this will be a longer-term effort, but it's worth thinking about how you can do this without these channels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dd-trace:storage:enter is not async hook specific per se, its event fires when the value in the DD wrapper for AsyncResourceStorage is synchronously modified. I know AsyncResourceStorage is part of the async hooks library, but if we started using Node's AsyncLocalStorage under the hood for DD storage, we could still be emitting this enter event.

That said, I don't see us being able to avoid observing async context switches with the before event as we need to capture the active span on every context switch in case we need to associate it with a taken sample.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it will be possible to wrap the run method for our use of AsyncLocalStorage to get the enter event, but the before event will likely go away or change substantially at some point so we should be thinking about how we can achieve that in some other way.

It's worth noting that my current work rewriting AsyncLocalStorage is to implement it on the native side which could mean we expose a native interface to it which could enable you to just track your data in an AsyncLocalStorage instance and access that from the C++ side on your end whenever you need it. It would mean needing to enter the isolate so if you're doing that in a signal it probably needs verification that it can function safely. If you can review nodejs/node#48528 to see if that works for you, that would be a big help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting! Looking at implementation of AsyncContextFrame::current and AsyncContextFrame::get it might be safe to call from a signal handler. We'll need to be storing a v8::Global on native side for the key and convert it in a v8::Local for the AsyncContextFrame::get call. So we'd still maybe the enter event, and derive some data from the span into our own ALS. Unfortunately, we'll probably lose the ability to do lazy backfilling of our sampling context like we now do on context switches, although considering we'll only have to do this when a logical async context starts, and not on every switch it might work well.

@szegedi szegedi merged commit c45c087 into master Jul 14, 2023
@szegedi szegedi deleted the nsavoire/code_hotspots branch July 14, 2023 13:13
szegedi pushed a commit that referenced this pull request Jul 20, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks 
before channel and to AsyncResourceStorage enter channel that 
notifies when current store is changed with enterWith/run. 
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample 
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow 
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
@szegedi szegedi mentioned this pull request Jul 20, 2023
szegedi pushed a commit that referenced this pull request Jul 20, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks 
before channel and to AsyncResourceStorage enter channel that 
notifies when current store is changed with enterWith/run. 
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample 
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow 
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
@szegedi szegedi mentioned this pull request Jul 20, 2023
szegedi pushed a commit that referenced this pull request Jul 20, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks
before channel and to AsyncResourceStorage enter channel that
notifies when current store is changed with enterWith/run.
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
This was referenced Jul 20, 2023
tlhunter pushed a commit that referenced this pull request Jul 21, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks
before channel and to AsyncResourceStorage enter channel that
notifies when current store is changed with enterWith/run.
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
tlhunter pushed a commit that referenced this pull request Jul 21, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks 
before channel and to AsyncResourceStorage enter channel that 
notifies when current store is changed with enterWith/run. 
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample 
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow 
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
tlhunter pushed a commit that referenced this pull request Jul 21, 2023
…ler (#3371)

* Initial code for code hotspots and endpoint aggregation in wall profiler

Hide functionality behind DD_PROFILING_EXPERIMENTAL_CODEHOTSPOTS_ENABLED
and DD_PROFILING_EXPERIMENTAL_ENDPOINT_COLLECTION_ENABLED flags.

Profiler tracks changes of active span by subscribing to async hooks 
before channel and to AsyncResourceStorage enter channel that 
notifies when current store is changed with enterWith/run. 
Profiler remembers previous span and previous started spans, upon change
of active span, it checks if a sample has been taken by native profiler
addon, then if that's the case, it updates the context of the sample 
with span id and root span id computed from previous span / started spans.
It also add the tags from the last web span to the context to allow 
computation of endpoint later on.

Minor changes to other parts of the code to make it all work:
===

* Report error if start fails

* Avoid clearing span context tags upon export

Profiler may need the tags to determine span resource name for endpoint
aggregation: profiler keeps a reference on web span that was active when
each profiling sample is taken. Then during profile export, span tags
are used to determine endpoint. Endpoint cannot be determined right away
because sometimes tags necessary to determine endpoint are only set just
before span is finished.

* Add request tags just after span creation

For web spans, add request tags just after span creation instead of when
span finishes. This is needed for profiler code hotspots implementation
to determine which span are web spans and should be used to find current
endpoint.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants