[PROF-8917] Add support for the libdatadog crash tracker #3384

ivoanjo · 2024-01-16T18:40:37Z

What does this PR do?

This PR adds support for the libdatadog crash tracker feature (off by default).

The crash tracker works by detecting when the Ruby VM reaches a segmentation fault, reporting the crash information as a last profile before the VM dies.

All of the interesting work is in DataDog/libdatadog#282; this PR basically just wires things up.

Motivation:

This will be a useful tool when debugging VM crashes.

Additional Notes:

~~I'm opening this PR as a draft as the libdatadog support has not yet landed/been released. Also, there's a few open questions on:~~

~~fork handling~~
~~when to shut down~~

How to test the change?

(You'll need to build DataDog/libdatadog#282 until the crash tracker gets included in a libdatadog release)

To test the crash tracker with an actual crash, try running the following on Ruby 2.6:

$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddprofrb exec ruby -e 'Process.detach(fork { exit! }).instance_variable_get(:@foo)'

~~This should also work in the future but right now it doesn't work correctly; still looking into why:~~

This also works on every Ruby:

$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddprofrb exec ruby -e 'Process.kill("SEGV", Process.pid)'

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.

…bdatadog helpers These are going to be needed also for the crash tracker code. While doing this extraction, I've gone ahead and made the failure logging on `convert_tags` use the ddtrace logger directly, rather than having to rely on an extra method to do this conversion. Since these methods are covered by the tests in `http_transport_spec.rb`, I've chosen not to expose them in other ways for testing.




**What does this PR do?**

This PR upgrades dd-trace-rb to use libdatadog 7. There was a small breaking API change -- the rename of `ddog_Endpoint` APIs to `ddog_prof_Endpoint`.

**Motivation:**

Make sure Ruby is up-to-date on libdatadog.

**Additional Notes:**

As far as Ruby is impacted, the main changes in libdatadog 7 are a number of fixes to the crash tracker (getting us closer to merging #3384) as well as some potential memory improvements from (DataDog/libdatadog#303).

I'm opening this as a draft PR as libdatadog 7 is not yet available on rubygems.org, and I'll come back to re-trigger CI and mark this as non-draft once it is.

**How to test the change?**

Our existing test coverage includes libdatadog testing, so a green CI is good here :)


ivoanjo · 2024-05-09T15:46:43Z

Update: I've changed this PR to be on top of #3627 ( libdatadog v9 upgrade) + updated it for libdatadog v9 compatibility.

I think once libdatadog 9 support lands in master, this is finally 100% good to go :)

danielsn

LGTM. A few comments/questions

danielsn · 2024-05-09T18:36:03Z

ext/datadog_profiling_native_extension/crashtracker.c

+    .additional_files = {},
+    // The Ruby VM already uses an alt stack to detect stack overflows so the crash handler must not overwrite it.
+    //
+    // @ivoanjo: Specifically, with `create_alt_stack = true` I saw a segfault, such as Ruby 2.6's bug with


Makes sense, this is why this is an option here :)

danielsn · 2024-05-09T18:39:59Z

ext/datadog_profiling_native_extension/crashtracker.c

+
+  ddog_prof_CrashtrackerReceiverConfig receiver_config = {
+    .args = {},
+    .env = {.ptr = &ld_library_path_env, .len = 1},


I think this might override the env rather than appending. Which would you expect to happen?

Good question -- I don't particularly have a preference.

On one side, inheriting the full env would be useful if the crash tracker wanted to log some extra info from there. On the other hand, we can always add anything else we want extra later.

Dealer's choice? Do you think it'd be useful for me to pass along the env that Ruby has + with the addition of the ld_library_path?
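
To make the override-versus-append distinction concrete, here is a small sketch using Ruby's own process spawning. It is an analogy only and does not use the libdatadog API. With `unsetenv_others: true` the child sees only the variables passed explicitly (the "override" behaviour); without it, the variables are merged into the inherited environment (the "append" behaviour). The variable `CRASHTRACKER_DEMO_VAR` and the path `/opt/example/lib` are made up for illustration.

```ruby
require 'rbconfig'

# Hypothetical values, used only for this illustration.
ENV['CRASHTRACKER_DEMO_VAR'] = 'from-parent'
extra_env = { 'LD_LIBRARY_PATH' => '/opt/example/lib' }

# The child prints whether it inherited the parent's variable, then its LD_LIBRARY_PATH.
child = [
  RbConfig.ruby, '--disable-gems', '-e',
  'puts ENV.key?("CRASHTRACKER_DEMO_VAR"); puts ENV["LD_LIBRARY_PATH"]'
]

# "Override": the child sees only the variables listed in extra_env.
overridden = IO.popen(extra_env, child, unsetenv_others: true, &:read).split("\n")

# "Append": extra_env is merged on top of the inherited environment.
appended = IO.popen(extra_env, child, &:read).split("\n")

puts "override: inherited=#{overridden[0]} LD_LIBRARY_PATH=#{overridden[1]}"
puts "append:   inherited=#{appended[0]} LD_LIBRARY_PATH=#{appended[1]}"
```

Passing along Ruby's full environment plus `LD_LIBRARY_PATH` would correspond to the second call.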

danielsn · 2024-05-09T18:42:26Z

ext/datadog_profiling_native_extension/libdatadog_helpers.c

@@ -60,3 +62,87 @@ size_t read_ddogerr_string_and_drop(ddog_Error *error, char *string, size_t capa
  ddog_Error_drop(error);
  return error_msg_size;
 }
+
+__attribute__((warn_unused_result))
+ddog_prof_Endpoint endpoint_from(VALUE exporter_configuration) {


Would you want a way to send to file endpoints?

I can imagine some uses for it in some weird debugging cases... 🤔

From my side, I'm fine with not having that ability, so I don't particularly think it's worth the effort.

danielsn · 2024-05-09T18:50:06Z

lib/datadog/core/configuration/settings.rb

+            # otherwise `false`
+            option :experimental_crash_tracking_enabled do |o|
+              o.type :bool
+              o.env 'DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED'


Should we document this publicly?

Yeah, that's a good idea. Did you have something in mind? I'm thinking I could open a PR to add to https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/ .
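
As a starting point for such documentation, a hedged sketch of the programmatic equivalent of the environment variable might look as follows. The option path comes from the settings spec in this PR (`settings.profiling.advanced.experimental_crash_tracking_enabled`). The `require 'datadog'` line is an assumption about the gem name in the 2.x series (1.x releases were loaded with `require 'ddtrace'`), and since the option is experimental its name may change.

```ruby
# Assumption: 2.x gem name; 1.x releases were loaded with `require 'ddtrace'`.
require 'datadog'

Datadog.configure do |c|
  c.profiling.enabled = true
  # Off by default; intended to match DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true.
  c.profiling.advanced.experimental_crash_tracking_enabled = true
end
```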

danielsn · 2024-05-09T18:51:30Z

lib/datadog/profiling/component.rb

+
+        unless transport.respond_to?(:exporter_configuration)
+          Datadog.logger.warn(
+            'Cannot enable profiling crash tracking as a custom settings.profiling.exporter.transport is configured'


Just FYI, what is this and why does it prevent crashtracking?

Yes, really good question -- I've added a big comment to the code to explain this in 09976cc :)
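
For readers skimming the thread, the mechanics of that guard can be sketched in isolation. The classes below are hypothetical stand-ins, not the profiler's real transports, and the explanation of why the method is required lives in 09976cc rather than in this sketch; only the `respond_to?(:exporter_configuration)` check and the warning text are taken from the diff.

```ruby
# Hypothetical stand-ins: only the presence or absence of
# #exporter_configuration matters for this check.
DefaultTransport = Struct.new(:exporter_configuration)

class CustomTransport
  def export(_flush)
    true
  end
end

# Mirrors the guard in the diff: crash tracking is skipped (with a warning)
# when the transport does not respond to #exporter_configuration.
def crash_tracking_possible?(transport)
  return true if transport.respond_to?(:exporter_configuration)

  warn 'Cannot enable profiling crash tracking as a custom settings.profiling.exporter.transport is configured'
  false
end

# :placeholder_configuration is a made-up value for illustration.
puts crash_tracking_possible?(DefaultTransport.new(:placeholder_configuration))
puts crash_tracking_possible?(CustomTransport.new)
```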

danielsn · 2024-05-09T19:01:48Z

spec/datadog/core/configuration/settings_spec.rb

+      describe '#experimental_crash_tracking_enabled=' do
+        it 'updates the #experimental_crash_tracking_enabled setting' do
+          expect { settings.profiling.advanced.experimental_crash_tracking_enabled = true }
+            .to change { settings.profiling.advanced.experimental_crash_tracking_enabled }
+            .from(false)
+            .to(true)


Ruby is kinda a fun language

danielsn · 2024-05-09T19:04:39Z

spec/datadog/profiling/crashtracker_spec.rb

+        expect_in_fork do
+          crashtracker.reset_after_fork
+
+          expect(`pgrep -f libdatadog-crashtracking-receiver`.lines.size).to be 2


Is it possible to have other processes running that could be captured here?

It is, although we never run our tests in parallel inside the same container in CI, so only if you're running the test suite locally would this potentially happen.

I've added an extra check at the beginning of these tests to check that no existing crash tracker was already running: 5431d2b .

danielsn · 2024-05-09T19:11:00Z

spec/datadog/profiling/crashtracker_spec.rb

+      expect_in_fork(fork_expectations: fork_expectations) do
+        crashtracker.start
+
+        Process.kill('SEGV', Process.pid)


Would it also make sense to test an actual crash (e.g. null ptr deref)

As far as I know, there's no Ruby-level way to cause a segfault. I even checked the built-in fiddle FFI gem, which is actually neat enough to handle its own potential segfaults cleanly and turn them into Ruby exceptions:

spec/datadog/profiling/crashtracker_spec.rb:188:in `[]': NULL pointer dereference (Fiddle::DLError)

So to actually do this, I would need to add a cause_crash method on the native side of the Profiler, which.... seems like a sharp edge that is not great to leave lying around >_> ?

You didn't try hard enough? :D

$ ruby -r fiddle -e 'Fiddle.free(42)'
-e:1: [BUG] Segmentation fault at 0x0000000000000022

Ah, interesting, I didn't think of that :D . Thanks @lloeki for the suggestion, I'll make a note to shoot a small PR with your suggestion :)
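
A minimal sketch of how such a crash can be contained and observed from a parent process, independent of the profiler and of the `expect_in_fork` helper used in the specs. The helper name `status_of_crashing_child` is made up for this example, and it assumes a platform where `fork` is available.

```ruby
# Runs the block in a forked child and returns the child's Process::Status.
def status_of_crashing_child
  pid = fork do
    # Silence the "[BUG] Segmentation fault" report the VM prints on the way down.
    $stderr.reopen(File::NULL, 'w')
    yield
    exit!(0) # only reached if the block did not crash the process
  end
  _pid, status = Process.wait2(pid)
  status
end

# Same trigger as the spec in this PR: send SEGV to the current process.
status = status_of_crashing_child { Process.kill('SEGV', Process.pid) }
puts "signaled: #{status.signaled?}, signal number: #{status.termsig.inspect}"
```

Swapping the block for `{ require 'fiddle'; Fiddle.free(42) }` should exercise the invalid-free crash suggested above, although whether that particular call faults may depend on the platform's allocator.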

danielsn · 2024-05-09T19:13:30Z

spec/datadog/profiling/crashtracker_spec.rb

+
+      crash_report = JSON.parse(request.body, symbolize_names: true)[:payload].first
+
+      expect(crash_report[:stack_trace]).to_not be_empty


Could also check that the correct signal type was reported

Added in a2ae730

danielsn · 2024-05-09T19:14:00Z

spec/datadog/profiling/profiler_spec.rb

+    context 'when a crash tracker instance is provided' do
+      let(:optional_crashtracker) { instance_double(Datadog::Profiling::Crashtracker) }
+
+      it 'signals the crash tracker to start before other components' do



codecov-commenter · 2024-05-10T12:05:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.11%. Comparing base (bcb6785) to head (42f6ca5).
Report is 9 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3384      +/-   ##
==========================================
- Coverage   98.13%   98.11%   -0.02%     
==========================================
  Files        1223     1225       +2     
  Lines       72139    72379     +240     
  Branches     3421     3433      +12     
==========================================
+ Hits        70795    71018     +223     
- Misses       1344     1361      +17


ivoanjo · 2024-05-10T12:07:34Z

CI is now green on this! Took a while but we made it :D

**What does this PR do?**

This PR adds an additional test case to the profiling crashtracker as discussed in #3384 (comment) (thanks @lloeki for the suggestion).

**Motivation:**

Improve test coverage for the feature.

**Additional Notes:**

The diff looks noisier than it is because of whitespace changes; I recommend reviewing this PR without them.

**How to test the change?**

Check that the new test passes!

github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jan 16, 2024

ivoanjo added 9 commits March 11, 2024 10:40

Introduce setting to control crash tracker option

e1340de

Extract ddtrace_version from HttpTransport to ruby_helpers.h

8afc9f2

This is also going to be needed by the crash tracking feature.

Add TODO about integration spec

d89e876

Update crash_tracker.c with latest libdatadog API

d3bc290

Add experimental spec

4889d05

Minor: Remove unused/outdated type declaration

37f8082

`EventGroup` has long since been removed from the codebase

Redesign crash tracker to behave as regular object

18758de

ivoanjo force-pushed the ivoanjo/prof-8917-crash-tracker-ruby branch from 5efb033 to 18758de Compare March 11, 2024 17:14

ivoanjo added 6 commits March 12, 2024 09:01

Merge branch 'master' into ivoanjo/prof-8917-crash-tracker-ruby

02d8691

Wire new crash tracker design into profiler

bc7d72c

Minor: Remove redundant log message

5312591

When the profiler starts up, a number of other log messages are also printed, so let's get rid of a few. (Also, when it's disabled, it shows up on the environment logger, so that's already covered too.)

Avoid leaking threads and outputting errors during spec

45f5daa

Rename CrashTracker to Crashtracker to match libdatadog naming

a5c0ba2

Match fixed case for crashtracker APIs

fb45524

This was referenced Mar 12, 2024

[PROF-8917] Fix case on crashtracker ffi APIs + add Ruby helper + add example DataDog/libdatadog#351

Merged

[PROF-8917] Misc fixes from crashtracker branch #3517

Merged

Merge branch 'master' into ivoanjo/prof-8917-crash-tracker-ruby

6f83a4a

ivoanjo mentioned this pull request Mar 19, 2024

[NO-TICKET] Upgrade to libdatadog 7 #3536

Merged

2 tasks

ivoanjo added 3 commits March 22, 2024 10:57

Update to libdatadog 7 APIs

6af57da

Merge branch 'master' into ivoanjo/prof-8917-crash-tracker-ruby

d19ecde

Enable frame resolution

e670699

ivoanjo added 3 commits March 28, 2024 17:04

Use in-receiver resolve frames

d37cbb8

The in-process option has known issues that are yet to be solved, so let's go with the much safer option of having the receiver do the hard work.

Merge branch 'master' into ivoanjo/prof-8917-crash-tracker-ruby

d8b4122

Adjust to latest libdatadog crash tracker changes

6f3f8be

ivoanjo added 8 commits May 9, 2024 13:06

Revert "Update gemfiles with libdatadog 7 -> 8 upgrade"

f029683

This reverts commit 32f02b4.

Merge branch 'alexjf/libdatadog9' into ivoanjo/prof-8917-crash-tracker-ruby

7083455

Update Ruby crashtracker to libdatadog v9 API

49e9f31

There's still two FIXMEs in here, which I'll fix in follow-up commits.

Setup ld_library_path argument for crashtracker

5197792

Use profiling.upload.timeout_seconds for crashtracker timeout

7cda332

Remove temporary libdatadog monkey patch

b87f171

This will be integrated into the libdatadog v9 release in DataDog/libdatadog#423 .

Fix upload_timeout_seconds being a float by default

3acf413

Make rubocop happy

bd7de08

ivoanjo changed the base branch from master to alexjf/libdatadog9 May 9, 2024 15:44

github-actions bot removed the core Involves Datadog core libraries label May 9, 2024

danielsn reviewed May 9, 2024


ivoanjo added 3 commits May 10, 2024 10:59

Minor: Add explanation for why we're skipping crash tracker with custom transports

09976cc

Assert that no crashtracker is running before each test

5431d2b

This makes the test fail early if there's an unexpected crash tracker instance running, rather than failing later in a more confusing way.

Assert that correct signal name is reported

a2ae730

github-actions bot added the core Involves Datadog core libraries label May 10, 2024

ivoanjo force-pushed the alexjf/libdatadog9 branch from 3c27b38 to 3a6b242 Compare May 10, 2024 10:37

Base automatically changed from alexjf/libdatadog9 to master May 10, 2024 11:48

Minor: Empty commit to re-trigger CI

42f6ca5

danielsn approved these changes May 10, 2024


ivoanjo merged commit 3bd8b05 into master May 13, 2024
166 checks passed

ivoanjo deleted the ivoanjo/prof-8917-crash-tracker-ruby branch May 13, 2024 10:53

github-actions bot added this to the 2.0.0 milestone May 13, 2024

marcotc mentioned this pull request May 24, 2024

Bump to version 2.0.0.rc1 #3666

Merged

TonyCTHsu mentioned this pull request Jun 6, 2024

Bump to version 2.0.0 #3687

Merged

ivoanjo mentioned this pull request Jun 12, 2024

[NO-TICKET] Minor: Use fiddle to simulate crash in spec #3708

Merged

ivoanjo mentioned this pull request Aug 2, 2024

[PROF-10241] Extract libdatadog crashtracker telemetry into separate extension #3824

Merged
