[NO-TICKET] Check extension dir as well when loading profiler #3582

Merged
merged 1 commit into from
Apr 11, 2024

@ivoanjo ivoanjo commented Apr 11, 2024

**What does this PR do?**

This PR tweaks our existing logic for loading the profiling native
extension to try to find the file in two directories:

1. The lib/ directory
2. The Ruby extensions directory

Previously we would only check 1.

**Motivation:**

I'll admit I don't 100% understand when one or the other folder
gets used.

For instance, on a fresh install of `ddtrace` on my machine I get

```
$ bundle exec gem contents ddtrace | grep so$
.rvm/gems/ruby-3.1.4/gems/ddtrace-1.21.1/lib/datadog_profiling_loader.3.1.4_x86_64-linux.so
.rvm/gems/ruby-3.1.4/gems/ddtrace-1.21.1/lib/datadog_profiling_native_extension.3.1.4_x86_64-linux.so
```

(this is 1., above)

but also...

```
$ find `bundle exec ruby -e "require 'ddtrace'; puts Gem::loaded_specs['ddtrace'].extension_dir"` | grep so$
.rvm/gems/ruby-3.1.4/extensions/x86_64-linux/3.1.0/ddtrace-1.21.1/datadog_profiling_native_extension.3.1.4_x86_64-linux.so
.rvm/gems/ruby-3.1.4/extensions/x86_64-linux/3.1.0/ddtrace-1.21.1/datadog_profiling_loader.3.1.4_x86_64-linux.so
```

(this is 2., above)

(Aside: These are not leftovers -- if I uninstall the gem, the files get
removed, and get recreated when I install them.)

So, on my machine, the files get installed to two different folders.

And for most of our customers, I'm pretty sure they get installed into
folder 1., because otherwise the profiler wouldn't work.

But one customer reported the following error:

```
Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load ddtrace_profiling_native_extension.3.2.2_x86_64-linux due to /var/app/current/vendor/bundle/ruby/3.2.0/gems/ddtrace-1.20.0/lib/datadog/profiling/../../ddtrace_profiling_native_extension.3.2.2_x86_64-linux.so: cannot open shared object file: No such file or directory' at '/var/app/current/vendor/bundle/ruby/3.2.0/gems/ddtrace-1.20.0/lib/datadog/profiling/load_native_extension.rb:26:in `<top (required)>''
```

My first intuition was that something had gone wrong and the
native extension didn't get built. But that's not the case, and
here's a very important detail that took me a while to fully
understand: the error message says that
**ddtrace_profiling_native_extension.3.2.2_x86_64-linux.so** was
not found, not **datadog_profiling_loader.3.2.2_x86_64-linux.so**.

This distinction is important, because it's the loader that
attempts to load the extension. So if the loader didn't fail first,
it means the loader **was found**; but the extension **wasn't**.

With a bit more input from the customer, and some testing on my side,
it looks like Ruby puts both folders above (1. and 2.)
in the `$LOAD_PATH`, so when using `require` to load a file, both
get checked.
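
As a rough illustration of that lookup, here is a minimal sketch in which a file exists only in the second of two `$LOAD_PATH` entries and `require` still finds it. The directory names, the file name and the `first_dir_providing` helper are made up for the example, and a plain `.rb` file stands in for the compiled loader:

```ruby
require 'tmpdir'

# Hypothetical helper: return the first directory in `load_path` that holds
# `feature`.rb, roughly mimicking the directory scan done by `require`.
def first_dir_providing(feature, load_path)
  load_path.find { |dir| File.file?(File.join(dir, "#{feature}.rb")) }
end

Dir.mktmpdir do |root|
  lib_dir = File.join(root, 'lib')               # stands in for folder 1.
  extension_dir = File.join(root, 'extensions')  # stands in for folder 2.
  [lib_dir, extension_dir].each { |dir| Dir.mkdir(dir) }

  # The file only exists in the second directory.
  File.write(
    File.join(extension_dir, 'only_in_extension_dir.rb'),
    "ONLY_IN_EXTENSION_DIR_LOADED = true\n"
  )

  $LOAD_PATH.unshift(lib_dir, extension_dir)
  begin
    require 'only_in_extension_dir' # found via the second entry
    puts first_dir_providing('only_in_extension_dir', [lib_dir, extension_dir])
  ensure
    # Leave the global load path as we found it.
    $LOAD_PATH.delete(lib_dir)
    $LOAD_PATH.delete(extension_dir)
  end
end
```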

And for this one customer, the extension only existed in folder 2.,
not 1. (even after reinstalling the gem).

Since we use a regular `require` to load the loader, it was still
found, regardless of where it was. BUT for the extension we build
a specific path ourselves, and that path only pointed at lib (1.),
so the extension failed to load.

This is why the customer saw an error message that pointed to the
loader working, but the extension not being found.

Our previous logic of only checking folder 1. worked for all other
customers we have, so I'm not really sure why this one customer had
the extension only in folder 2. and not in 1.
But regardless, having either folder in use is something that Ruby
does, so I think it's reasonable for us to emulate that logic.
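
A minimal sketch of the "check both folders" idea, assuming a hypothetical `find_extension_file` helper (this is not the gem's actual code, and the file name is a placeholder): try each candidate directory in order and return the first path that exists.

```ruby
require 'tmpdir'

# Hypothetical helper: return the first existing path for `file_name` among
# `candidate_dirs`, or nil when none of the directories contains it.
def find_extension_file(file_name, candidate_dirs)
  candidate_dirs
    .map { |dir| File.join(dir, file_name) }
    .find { |path| File.exist?(path) }
end

Dir.mktmpdir do |root|
  lib_dir = File.join(root, 'lib')               # folder 1.
  extension_dir = File.join(root, 'extensions')  # folder 2.
  [lib_dir, extension_dir].each { |dir| Dir.mkdir(dir) }

  file_name = 'example_native_extension.so' # placeholder name
  File.write(File.join(extension_dir, file_name), '')

  # lib/ is checked first, but the file is only in the extensions directory.
  puts find_extension_file(file_name, [lib_dir, extension_dir])
end
```

With a real install, the second candidate could come from `Gem.loaded_specs['ddtrace'].extension_dir`, as in the `find` command earlier in this description.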

**Additional Notes:**

If you're curious why we are reimplementing Ruby logic for loading
extensions, check the comments on `datadog_profiling_loader.c`.

TL;DR is, we want to load the library with specific system flags,
and Ruby doesn't provide a way to do so, so we need to basically
duplicate a bunch of Ruby's library loading logic ourselves to
be able to emulate Ruby's behavior but change the system flags.
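
To make "system flags" a bit more concrete, here is a small sketch using Ruby's `fiddle` library, which passes flags straight through to `dlopen`. This is an illustration only: the real loader is written in C, the two flags below are arbitrary examples rather than the ones the profiler uses, and opening a library this way does not register it as a Ruby extension.

```ruby
require 'fiddle'

# Flags handed to dlopen(3). These two are just examples.
flags = Fiddle::Handle::RTLD_LAZY | Fiddle::Handle::RTLD_GLOBAL

# Passing nil opens the running process itself instead of a file on disk,
# which keeps this sketch independent of any particular shared library.
handle = Fiddle::Handle.new(nil, flags)

# Look up a libc symbol to show that the handle is usable.
strlen_address = handle['strlen']
puts format('strlen is at 0x%x', strlen_address)
```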

**How to test the change?**

I was able to reproduce the issue locally
by installing ddtrace, and then manually deleting the files in 1.,
leaving them only in 2.

Without this change, I saw exactly the same error message as the
customer did; with this change, profiling works.

On top of the added specs, I manually tested
this fix on Ruby versions 2.3, 3.1 and 3.3.
@ivoanjo ivoanjo requested review from a team as code owners April 11, 2024 10:17
@github-actions github-actions bot added the profiling Involves Datadog profiling label Apr 11, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 97.91667% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 98.24%. Comparing base (07c75b8) to head (c6270c9).
Report is 3 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...ec/datadog/profiling/load_native_extension_spec.rb | 97.56% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3582      +/-   ##
==========================================
- Coverage   98.24%   98.24%   -0.01%     
==========================================
  Files        1254     1255       +1     
  Lines       74355    74402      +47     
  Branches     3529     3536       +7     
==========================================
+ Hits        73052    73098      +46     
- Misses       1303     1304       +1     


@ivoanjo ivoanjo added this to the 1.22.0 milestone Apr 11, 2024
Contributor

@AlexJF AlexJF left a comment

Love these weird behaviours! LGTM!

@ivoanjo ivoanjo merged commit 51c16e7 into master Apr 11, 2024
207 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/check-extension-dir-as-well branch April 11, 2024 10:47
ivoanjo added a commit that referenced this pull request Jun 5, 2024
…ension dir

**What does this PR do?**

This PR is a follow-up to
#3582 .

In that PR, we fixed loading the profiling native extension so that
it could be loaded from the Ruby extensions directory (see the original
PR for more details).

It turns out this was not enough! Specifically, the customer reported
that they saw the following error

> Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling
> native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.2.2_x86_64-linux
> due to libdatadog_profiling.so: cannot open shared object file: No such file or directory

Specifically, what this message tells us is that we're finding the
profiling native extension BUT it's failing to load BECAUSE the dynamic
loader is not able to find its `libdatadog_profiling.so` dependency.

From debugging the issue with the customer, I suspect that what
we're seeing here is a repeat of
#2067 /
#2125 , that is, the
paths where the profiler is compiled are changed at deployment, and
so we also need to adjust the relative rpath to account for this.
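
As a sketch of what a relative rpath looks like, assuming a hypothetical `relative_rpath` helper (not the gem's real code) and the directory layout from the test transcript in this message: the entry is computed relative to the extension's own directory, so it stays valid when the whole bundle folder is moved.

```ruby
require 'pathname'

# Hypothetical helper: build an rpath entry that locates `library_dir`
# relative to the directory holding the extension. `$ORIGIN` is expanded by
# the Linux dynamic loader to the directory of the object being loaded.
def relative_rpath(extension_dir, library_dir)
  relative = Pathname.new(library_dir).relative_path_from(Pathname.new(extension_dir))
  File.join('$ORIGIN', relative.to_s)
end

bundle_root = '/working/rpathtest/vendor/bundle/ruby/3.2.0'
extension_dir = "#{bundle_root}/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1"
library_dir = "#{bundle_root}/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/" \
              'x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib'

puts relative_rpath(extension_dir, library_dir)
```

The `../../../../gems/...` portion is the same shape that shows up in the `ldd` output at the end of the transcript.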

I haven't yet confirmed with the customer that this is their issue,
BUT I was able to reproduce the exact problem by moving the
installation of the library after it was built (see "How to test
the change?", below).

**Motivation:**

Fix this weird corner case that made the profiler not load.

**Additional Notes:**

This is a really really weird corner case, so I'm happy to further
describe what the issue is if my description above + the comments in the
code are still too cryptic to understand.

**How to test the change?**

I've added test code for the helper, but actually validating the whole
rpath thing is a bit annoying.

Here's how I triggered the issue myself, and then used it to validate
the fix:

```
 # Build fixed gem into folder, will be used later
$ bundle exec rake build
datadog 2.0.0.rc1 built to pkg/datadog-2.0.0.rc1.gem.

 # Open a clean Ruby docker installation
$ docker run --network=host -ti -v `pwd`:/working ruby:3.2.2-bookworm /bin/bash

 # I've created a minimal test gemfile ahead of time
/working/rpathtest# cat gems.rb
source 'https://rubygems.org'

gem 'datadog'
 # Tell bundler to install the gem into a folder
/working/rpathtest# bundle config set --local path 'vendor/bundle'
/working/rpathtest# bundle install

 # Confirm profiler works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
 # ... No errors loading profiler ...

 # Now let's simulate the native extension being loaded from the
 # extensions directory:
/working/rpathtest# find | grep \.so$ | grep datadog
./vendor/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_loader.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so
./vendor/bundle/ruby/3.2.0/gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux-musl/libdatadog-x86_64-alpine-linux-musl/lib/libdatadog_profiling.so
./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so
/working/rpathtest# rm ./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so  ./vendor/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so

 # Confirm profiler still works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
 # ... No errors loading profiler ...

 # Now let's simulate the folders being moved (the issue being fixed):
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor/bundle"
 # Update this to vendor2...
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor2/bundle"
 # and move the folder
/working/rpathtest# mv vendor/ vendor2

 # Now we've triggered the exact same error message as reported by the
 # customer
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
W, [2024-06-05T15:51:12.488843 #517]  WARN -- datadog: [datadog] Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.2.2_x86_64-linux due to libdatadog_profiling.so: cannot open shared object file: No such file or directory' at '/working/rpathtest/vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog/profiling/load_native_extension.rb:41:in `<top (required)>''

 # Now let's test the fix. Let's start by recreating the issue:
 # Put the fixed version into the bundler cache...
/working/rpathtest# cp /working/pkg/datadog-2.0.0.rc1.gem vendor2/bundle/ruby/3.2.0/cache/datadog-2.0.0.rc1.gem
 # force bundler to reinstall...
/working/rpathtest# rm -rf vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/
/working/rpathtest# bundle install
 # Force gem to be loaded from extension directory
/working/rpathtest# rm ./vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_native_extension.3.2.2_x86_64-linux.so  ./vendor2/bundle/ruby/3.2.0/gems/datadog-2.0.0.rc1/lib/datadog_profiling_loader.3.2.2_x86_64-linux.so
 # Confirm it works:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
 # ... No errors loading profiler ...

 # Let's now change the vendor folder again:
/working/rpathtest# cat /usr/local/bundle/config
---
BUNDLE_PATH: "vendor3/bundle"
/working/rpathtest# mv vendor2/ vendor3

 # And it now doesn't fail:
/working/rpathtest# DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby -e "sleep 1"
 # ... No errors loading profiler ...

 # And extra confirmation that the relative paths are working:
/working/rpathtest# ldd ./vendor3/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/datadog_profiling_native_extension.3.2.2_x86_64-linux.so
	libdatadog_profiling.so => /working/rpathtest/./vendor3/bundle/ruby/3.2.0/extensions/x86_64-linux/3.2.0/datadog-2.0.0.rc1/../../../../gems/libdatadog-9.0.0.1.0-x86_64-linux/vendor/libdatadog-9.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so (0x00007ff127c00000)
```
@@ -18,7 +18,20 @@
 end

 extension_name = "datadog_profiling_native_extension.#{RUBY_VERSION}_#{RUBY_PLATFORM}"
-full_file_path = "#{__dir__}/../../#{extension_name}.#{RbConfig::CONFIG['DLEXT']}"
+file_name = "#{extension_name}.#{RbConfig::CONFIG['DLEXT']}"
+full_file_path = "#{__dir__}/../../#{file_name}"
Contributor

Should this not use File.join for best results?

Member Author

Indeed... The profiler really only targets Linux/Unix right now, so a number of linux/unixisms have crept in.

The native extension loader that's getting called using this path, for instance, uses dlopen, so that's another thing that wouldn't work.

So yeah, I'm not sure it's worth trying to piecemeal support all OSs for the profiler in preparation for maybe supporting them in the future, I think it's easier to evaluate the whole thing once we want that (e.g. for instance not using the load_native_extension.rb on Windows at all).
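
For reference, a small sketch of the two ways of building the path discussed in this thread. The directory and file name are placeholders. Since Ruby's `File.join` joins with `/`, both forms yield the same string here:

```ruby
# Placeholder values, for illustration only.
base_dir = '/gems/example/lib/datadog/profiling'
file_name = 'example_native_extension.so'

# String interpolation, as in the diff above.
interpolated = "#{base_dir}/../../#{file_name}"

# The File.join alternative suggested in the review.
joined = File.join(base_dir, '..', '..', file_name)

puts interpolated == joined
puts File.expand_path(joined) # same path with the "../.." segments resolved
```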

ivoanjo added a commit to DataDog/prof-correctness that referenced this pull request Jun 12, 2024
…d relative rpath is needed

**What does this PR do?**

This PR adds a new test case that validates that
DataDog/dd-trace-rb#3582 and
DataDog/dd-trace-rb#3683 keep working fine.

**Motivation:**

As described in DataDog/dd-trace-rb#3683, this is
a somewhat annoying thing to test, but important to avoid regressing.

**Additional Notes:**

You can actually see the evolution of both of those fixes in
this test.

E.g. here's dd-trace-rb 1.21.1 (prior to
DataDog/dd-trace-rb#3582 ):

```
W, [2024-06-12T09:34:08.759519 #7]  WARN -- ddtrace: [ddtrace] (/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/core/configuration/components.rb:115:in `startup!') Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.2_x86_64-linux due to /app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/profiling/../../datadog_profiling_native_extension.3.3.2_x86_64-linux.so: cannot open shared object file: No such file or directory' at '/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.21.1/lib/datadog/profiling/load_native_extension.rb:26:in `<top (required)>''
    --- FAIL: TestScenarios/scenarios/ruby_extension_dir_and_rpath (14.86s)
```

In this version, we failed because we couldn't load the native
extension.

Then here's dd-trace-rb 1.23.1 (without
DataDog/dd-trace-rb#3683 ) and if we
don't move the `vendor` folder (but still delete the so from the
lib folder):

```
    --- PASS: TestScenarios/scenarios/ruby_extension_dir_and_rpath (18.96s)
```

...but if we additionally move the vendor folder (aka what this PR
does in the Dockerfile):

```
W, [2024-06-12T09:37:33.517188 #6]  WARN -- ddtrace: [ddtrace] (/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.23.1/lib/datadog/core/configuration/components.rb:116:in `startup!') Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.2_x86_64-linux due to libdatadog_profiling.so: cannot open shared object file: No such file or directory' at '/app/vendor-moved/bundle/ruby/3.3.0/gems/ddtrace-1.23.1/lib/datadog/profiling/load_native_extension.rb:39:in `<top (required)>''
    --- FAIL: TestScenarios/scenarios/ruby_extension_dir_and_rpath (3.25s)
```

Notice it fails BUT the error is now different from the one above --
the error is relating to loading `libdatadog_profiling.so`, not
`datadog_profiling_native_extension.3.3.2_x86_64-linux.so`.

And with the change in DataDog/dd-trace-rb#3683
(which will be in 1.23.2):

```
    --- PASS: TestScenarios/scenarios/ruby_extension_dir_and_rpath (9.60s)
```

**NOTE**: For this test, unlike other Ruby tests we have, we're pulling
in the latest **released** gem version (e.g. with `gem 'datadog'` on the
`gems.rb` file), not the latest from git (as we do for other Ruby
tests).

This is because gems get installed in different paths when bundler
downloads them directly from git, and we want to validate the path when
a stable version is installed.

This also means that this PR will show up as failing until the next
datadog release (which will be 2.2.0) is published. (Or 1.23.2, but
I left the test setup to test the latest 2.x releases, not the 1.x ones,
although I used 1.x in my tests above to show the evolution of the
issue).