Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-8917] Add support for the libdatadog crash tracker #3384

Merged
merged 48 commits into from
May 13, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jan 16, 2024

What does this PR do?

This PR adds support for the libdatadog crash tracker feature (off by default).

The crash tracker works by detecting when the Ruby VM reaches a segmentation fault, reporting the crash information as a last profile before the VM dies.

All of the interesting work is in DataDog/libdatadog#282, this PR basically just wires things up.

Motivation:

This will be a useful tool when debugging VM crashes.

Additional Notes:

I'm opening this PR as a draft as the libdatadog support has not yet landed/been released. Also, there's a few open questions on:

  • fork handling
  • when to shut down

How to test the change?

(You'll need to build <DataDog/libdatadog#282 until the crash tracker gets included in a libdatadog release)

To test the crash tracker with an actual crash, try running the following on Ruby 2.6:

$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddprofrb exec ruby -e 'Process.detach(fork { exit! }).instance_variable_get(:@foo)'

This should also work in the future but right now it doesn't work correctly; still looking into why:

This also works on every Ruby:

$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddprofrb exec ruby -e 'Process.kill("SEGV", Process.pid)'

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jan 16, 2024
…bdatadog helpers

These are going to be needed also for the crash tracker code.

While doing this extraction, I've gone ahead and made the failure
logging on `convert_tags` use the ddtrace logger directly, rather than
having to rely on an extra method to do this conversion.

Since these methods are covered by the tests in
`http_transport_spec.rb`, I've chosen not to expose them in other ways
for testing.
This is also going to be needed by the crash tracking feature.
**What does this PR do?**

This PR adds support for the libdatadog crash tracker feature (off
by default).

The crash tracker works by detecting when the Ruby VM reaches a
segmentation fault, reporting the crash information as a last profile
before the VM dies.

All of the interesting work is in
<DataDog/libdatadog#282>, this PR basically
just wires things up.

**Motivation:**

This will be a useful tool when debugging VM crashes.

**Additional Notes:**

I'm opening this PR as a draft as the libdatadog support has not
yet landed/been released. Also, there's a few open questions on:

* fork handling
* when to shut down

**How to test the change?**

(You'll need to build <<DataDog/libdatadog#282>
until the crash tracker gets included in a libdatadog release)

To test the crash tracker with an actual crash, try running the
following on Ruby 2.6:

```bash
$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e 'Process.detach(fork { exit! }).instance_variable_get(:@foo)'
```

This should also work in the future but right now it doesn't work
correctly; still looking into why:

```bash
$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e 'Process.kill("SEGV", Process.pid)'
```
`EventGroup` has long since been removed from the codebase
@ivoanjo ivoanjo force-pushed the ivoanjo/prof-8917-crash-tracker-ruby branch from 5efb033 to 18758de Compare March 11, 2024 17:14
ivoanjo added a commit that referenced this pull request Mar 19, 2024
**What does this PR do?**

This PR upgrades dd-trace-rb to use libdatadog 7.

There was a small breaking API change -- the rename of `ddog_Endpoint`
APIs to `ddog_prof_Endpoint`.

**Motivation:**

Make sure Ruby is up-to-date on libdatadog.

**Additional Notes:**

As far as Ruby is impacted, the main changes in libdatadog 7 are
a number of fixes to the crash tracker (getting us closer to merging
 #3384) as well as some potential memory improvements from
(DataDog/libdatadog#303).

I'm opening this as a draft PR as libdatadog 7 is not yet available on
rubygems.org, and I'll come back to re-trigger CI and mark this as
non-draft once it is.

**How to test the change?**

Our existing test coverage includes libdatadog testing, so a green CI
is good here :)
@ivoanjo ivoanjo mentioned this pull request Mar 19, 2024
2 tasks
ivoanjo added a commit that referenced this pull request Mar 27, 2024
**What does this PR do?**

This PR upgrades dd-trace-rb to use libdatadog 7.

There was a small breaking API change -- the rename of `ddog_Endpoint`
APIs to `ddog_prof_Endpoint`.

**Motivation:**

Make sure Ruby is up-to-date on libdatadog.

**Additional Notes:**

As far as Ruby is impacted, the main changes in libdatadog 7 are
a number of fixes to the crash tracker (getting us closer to merging
 #3384) as well as some potential memory improvements from
(DataDog/libdatadog#303).

I'm opening this as a draft PR as libdatadog 7 is not yet available on
rubygems.org, and I'll come back to re-trigger CI and mark this as
non-draft once it is.

**How to test the change?**

Our existing test coverage includes libdatadog testing, so a green CI
is good here :)
The in-process option has known though issues to be solved, so let's
go with the much safer option of having the receiver do the hard work.
@ivoanjo ivoanjo changed the base branch from master to alexjf/libdatadog9 May 9, 2024 15:44
@github-actions github-actions bot removed the core Involves Datadog core libraries label May 9, 2024
@ivoanjo
Copy link
Member Author

ivoanjo commented May 9, 2024

Update: I've changed this PR to be on top of #3627 ( libdatadog v9 upgrade) + updated it for libdatadog v9 compatibility.

I think once libdatadog 9 support lands in master, this is finally 100% good to go :)

Copy link

@danielsn danielsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A few comments/questions

.additional_files = {},
// The Ruby VM already uses an alt stack to detect stack overflows so the crash handler must not overwrite it.
//
// @ivoanjo: Specifically, with `create_alt_stack = true` I saw a segfault, such as Ruby 2.6's bug with
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, this is why this is an option here :)


ddog_prof_CrashtrackerReceiverConfig receiver_config = {
.args = {},
.env = {.ptr = &ld_library_path_env, .len = 1},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might override the env rather than appending. Which would you expect to happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- I don't particularly have a preference.

On one side, inheriting the full env would be useful if the crash tracker wanted to log some extra info from there. On the other hand, we can always add anything else we want extra later.

Dealer's choice? Do you think it'd be useful for me to pass along the env that Ruby has + with the addition of the ld_library_path?

@@ -60,3 +62,87 @@ size_t read_ddogerr_string_and_drop(ddog_Error *error, char *string, size_t capa
ddog_Error_drop(error);
return error_msg_size;
}

__attribute__((warn_unused_result))
ddog_prof_Endpoint endpoint_from(VALUE exporter_configuration) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you want a way to send to file endpoints?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine some uses for it in some weird debugging cases... 🤔

From my side, I'm fine with not having that ability, so I don't particularly think it's worth the effort.

# otherwise `false`
option :experimental_crash_tracking_enabled do |o|
o.type :bool
o.env 'DD_PROFILING_EXPERIMENTAL_CRASH_TRACKING_ENABLED'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document this publicly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good idea. Did you have something in mind? I'm thinking I could open a PR to add to https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/ .


unless transport.respond_to?(:exporter_configuration)
Datadog.logger.warn(
'Cannot enable profiling crash tracking as a custom settings.profiling.exporter.transport is configured'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, what is this and why does it prevent crashtracking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, really good question -- I've added a big comment to the code to explain this in 09976cc :)

Comment on lines +795 to +800
describe '#experimental_crash_tracking_enabled=' do
it 'updates the #experimental_crash_tracking_enabled setting' do
expect { settings.profiling.advanced.experimental_crash_tracking_enabled = true }
.to change { settings.profiling.advanced.experimental_crash_tracking_enabled }
.from(false)
.to(true)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby is kinda a fun language

expect_in_fork do
crashtracker.reset_after_fork

expect(`pgrep -f libdatadog-crashtracking-receiver`.lines.size).to be 2
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have other processes running that could be captured here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, although we never run our tests in parallel inside the same container in CI, so only if you're running the test suite locally would this potentially happen.

I've added an extra check at the beginning of these tests to check that no existing crash tracker was already running: 5431d2b .

expect_in_fork(fork_expectations: fork_expectations) do
crashtracker.start

Process.kill('SEGV', Process.pid)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also make sense to test an actual crash (e.g. null ptr deref)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That I know of, there's no Ruby-level way to cause a segfault. I even thought to check the built-in ffi fiddle gem, which actually is neat enough to handle its own potential segfaults cleanly and turn them into Ruby exceptions:

spec/datadog/profiling/crashtracker_spec.rb:188:in `[]': NULL pointer dereference (Fiddle::DLError)

So to actually do this, I would need to add a cause_crash method on the native side of the Profiler, which.... seems like a sharp edge that is not great to leave lying around >_> ?

Copy link
Contributor

@lloeki lloeki May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't try hard enough? :D

$ ruby -r fiddle -e 'Fiddle.free(42)'
-e:1: [BUG] Segmentation fault at 0x0000000000000022

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting, I didn't think of that :D . Thanks @lloeki for the suggestion, I'll make a note to shoot a small PR with your suggestion :)


crash_report = JSON.parse(request.body, symbolize_names: true)[:payload].first

expect(crash_report[:stack_trace]).to_not be_empty
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also check that the correct signal type was reported

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a2ae730

context 'when a crash tracker instance is provided' do
let(:optional_crashtracker) { instance_double(Datadog::Profiling::Crashtracker) }

it 'signals the crash tracker to start before other components' do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

This makes the test fail early if there's an unexpected crash tracker
instance running, rather than failing later in a more confusing way.
@github-actions github-actions bot added the core Involves Datadog core libraries label May 10, 2024
Base automatically changed from alexjf/libdatadog9 to master May 10, 2024 11:48
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.11%. Comparing base (bcb6785) to head (42f6ca5).
Report is 9 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3384      +/-   ##
==========================================
- Coverage   98.13%   98.11%   -0.02%     
==========================================
  Files        1223     1225       +2     
  Lines       72139    72379     +240     
  Branches     3421     3433      +12     
==========================================
+ Hits        70795    71018     +223     
- Misses       1344     1361      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ivoanjo
Copy link
Member Author

ivoanjo commented May 10, 2024

CI is now green on this! Took a while but we made it :D

@ivoanjo ivoanjo merged commit 3bd8b05 into master May 13, 2024
166 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-8917-crash-tracker-ruby branch May 13, 2024 10:53
@github-actions github-actions bot added this to the 2.0.0 milestone May 13, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Jun 6, 2024
ivoanjo added a commit that referenced this pull request Jun 12, 2024
**What does this PR do?**

This PR adds an additional test case to the profiling crashtracker
as discussed in
#3384 (comment)
(thanks @lloeki for the suggestion).

**Motivation:**

Improve test coverage for the feature.

**Additional Notes:**

The diff looks more noiser than it is because of whitespace changes,
I recommend reviewing this PR without them.

**How to test the change?**

Check that the new test passes!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants