
Patch Sinatra only once #3515

Draft · lloeki wants to merge 2 commits into master
Conversation

@lloeki (Contributor) commented Mar 8, 2024:

What does this PR do?

Sinatra needs to be patched once only.

Motivation:

Sinatra sets things up at first-request time, which may cause concurrent requests to each call `configure` and break Remote Configuration (RC) with a thread leak.

Additional Notes:

The integration hooks onto `Sinatra::Base#setup_middleware`, which appears to be invoked only at first request, i.e. much later than load time. The hook then calls our `Contrib::Sinatra::Framework.setup`, which calls `Datadog.configure` and `activate_rack!`; the latter invokes `config.tracing.instrument :rack`, which ultimately inserts the Rack middleware.

The trouble appears to be that this `Datadog.configure` call is not sufficiently concurrency-protected, which leads to RC workers being started and then, mid-way, the RC component being rebuilt, with inconsistent access through `active_remote`.
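For illustration, a minimal sketch of the "only once" idea this PR is after, using a hypothetical helper (not this PR's actual code): serialize concurrent first requests so the setup body runs exactly once.

```ruby
# Hypothetical once-only guard (illustrative, not this PR's actual code).
module SetupOnce
  MUTEX = Mutex.new
  @done = false

  def self.run_once
    MUTEX.synchronize do
      return if @done # concurrent first requests skip already-completed setup

      yield
      @done = true
    end
  end
end

# The hook on Sinatra::Base#setup_middleware would then do something like:
#   SetupOnce.run_once { Contrib::Sinatra::Framework.setup }
```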

How to test the change?

  • CI
  • test in a Sinatra app

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@lloeki lloeki requested a review from a team as a code owner March 8, 2024 13:52
@github-actions github-actions bot added the integrations (Involves tracing integrations) and tracing labels Mar 8, 2024
@lloeki lloeki marked this pull request as draft March 8, 2024 14:07
@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.23%. Comparing base (44b9c6c) to head (9164ae1).
Report is 5 commits behind head on master.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #3515      +/-   ##
==========================================
- Coverage   98.23%   98.23%   -0.01%
==========================================
  Files        1275     1275
  Lines       75226    75228       +2
  Branches     3552     3552
==========================================
+ Hits        73901    73902       +1
- Misses       1325     1326       +1
```

@lloeki (Contributor, Author) commented Mar 8, 2024:

There are two things happening here that should be split:

  • setting the configuration settings for Sinatra+Rack, which should be done immediately (not delayed to a later time):
    • saying that this integration is enabled
    • setting any passed options, or setting to default values
  • inserting our middlewares into the Rack stack, which can only be done once there is a Rack stack (Sinatra appears to build it lazily, hence the hook on `setup_middleware`, and hence this happening only when requests start to be received)

The former goes through the configure block; the latter should be done outside of a configure block.

IOW, THERE SHOULD BE NO CALL TO CONFIGURE AT APP RUNTIME (WHILE SERVING REQUESTS). The design of configure is just not amenable to that currently.

That would go a long way towards making this class of issue disappear. A rough sketch of the split follows.
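In the sketch below, the settings lookup follows the current public API, while the middleware-insertion step is pseudocode with a hypothetical class name:

```ruby
require 'ddtrace'

# Rough sketch of the proposed split (partly pseudocode; the middleware
# class name below is an assumption, not the real one).

# 1. Eager, at Datadog.configure time: record that the integration is
#    enabled and store any passed options (or defaults). No Rack stack
#    is needed for this.
Datadog.configure do |c|
  c.tracing.instrument :sinatra, service_name: 'my-sinatra-app'
end

# 2. Lazy, at first-request time: only insert the middleware into the
#    Rack stack Sinatra just built, reading the settings recorded above.
#    Crucially, no Datadog.configure call happens here:
#
#      settings = Datadog.configuration.tracing[:sinatra]
#      builder.use SomeRackTraceMiddleware if settings.enabled
```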

Comment on lines +66 to +69
```ruby
# Configuration should be considered RO past this point:
# - Patcher's `patch` must not modify configuration.
# - use Integration's `subconfigure` instead
```

@lloeki (Contributor, Author):

I was wondering about calling `subconfigure` here instead:

  • it gives access to the full config, not just tracing's, which would allow cross-product integration subconfiguration (e.g. AppSec could enable tracing)
  • the config is in its final state after the whole `configure` &block has executed, versus at the time the `instrument :sinatra` line runs

```diff
@@ -31,6 +31,14 @@ def new_configuration
     Configuration::Settings.new
   end

+  def subconfigure(tracing, sinatra)
```
@lloeki (Contributor, Author):

Lousy name; `configure` is already used.

I considered merely `def configure` and calling `super`, but that requires passing arguments that are sort of internal with respect to this Integration DSL-ish descriptive class. So I preferred not to leak internal details.

Member:

Maybe instrument_dependencies or instrument_components? It looks like in most cases this is what we end up doing here.

@lloeki (Contributor, Author):

Hmm.

Indeed, in this case we do `c.instrument`, but that may not always be the case; it could also do `c.some_setting = :foo`, in which case `instrument_*` is misleading.

Technically this is really a callback within the `Datadog.configure` block, introduced so that we don't nest `configure` anymore. In that case I believe the name should include "configure" in some way for clarity. Since integrations already have a `def configure` that is called before this, maybe `def after_configure`? That would make it clear that it happens after `Integration#configure` has been called.

Functionally this is about setting up dependencies, so it may go beyond just `c.something = :foo` or `c.instrument`. So `def dependencies` or `def configure_dependencies` might cut it.

E.g. I'll need to apply similar changes to AppSec, and I'm as yet unsure how that will unfold, so I'm trying to keep an open-ended approach where integrations could have dependencies across products. E.g. AppSec or CI may decide that they'd want to enable tracing and/or, when tracing is enabled, enable specific tracing integrations, to ensure Things Just Work.
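To make the naming concrete, here is the body this PR gives `subconfigure`, under the proposed `after_configure` name (shape taken from this PR's own diff):

```ruby
# The current `subconfigure` from this PR, under the proposed name:
# invoked after the whole Datadog.configure &block has executed, so the
# `sinatra` settings it reads are in their final state.
def after_configure(tracing, sinatra)
  tracing.instrument(
    :rack,
    service_name: sinatra.service_name,
    distributed_tracing: sinatra.distributed_tracing,
  )
end
```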

Member:

Good points. I like after_configure; it seems like a reasonable description of what's happening.

Comment on lines +35 to +39
```ruby
tracing.instrument(
  :rack,
  service_name: sinatra.service_name,
  distributed_tracing: sinatra.distributed_tracing,
)
```
@lloeki (Contributor, Author):

Most probably:

  • other integrations are affected by this: Rails, Grape, etc.
  • a similar pattern should be used by e.g. AppSec integrations.

@ivoanjo (Member) left a comment:

This seems quite reasonable!

I'll admit that I'm less confident about some of the specific details of the changes to how we configure integrations.

Having said that, I think it's a good insight that we have these ad-hoc hooks for "oh I'm setting up sinatra, and now need to setup rack; or I'm setting up rails and now need to setup active_...whatever" and making them part of the API avoids integrations needing to solve the problem on their own.

Thoughts on next steps? Perhaps pull in the tracing folks to make a decision on moving forward (or not) with this change?

Comment on lines -56 to 45
```ruby
  setup_tracer
end

def register_tracer
```
Member:

⬇️ Don't forget to remove setup_tracer, below

@lloeki (Contributor, Author):

Good catch, removed.


```ruby
def subconfigure(tracing, sinatra)
  tracing.instrument(
    :rack,
    service_name: sinatra.service_name,
```
Member:

I wonder if we should deprecate deriving the rack `service_name` from the Sinatra service name for 2.0?

E.g. suggest customers configure a top-level service name instead, or, if they really want to, configure a `service_name` on rack directly.
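Illustratively, with the public configuration API (the service names here are made up):

```ruby
require 'ddtrace'

# What the suggested alternatives look like for users (illustrative).

# Preferred: one top-level service name, shared by all integrations.
Datadog.configure do |c|
  c.service = 'my-app'
end

# Or, if the rack spans really should have their own name, set it directly:
Datadog.configure do |c|
  c.tracing.instrument :rack, service_name: 'my-app-rack'
end
```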

@lloeki (Contributor, Author) commented Apr 4, 2024:

It's not so much about passing a manually configured service name; it's about the cases where `sinatra.service_name` is defaulted and inferred, e.g. from the app class or something similar.

Same for Rails, where the application name is the default.

Cue autoinstrumentation as well.

Tangentially, there may be a conundrum to resolve in such default cases: ordering may matter when both Sinatra and Rails are in the same app, as both would set their service name and one would win. But that's a problem for another day.

```diff
@@ -41,7 +41,7 @@ def configurations

   # Create or update configuration associated with `matcher` with
   # the provided `options` and `&block`.
-  def configure(matcher = :default, options = {}, &block)
+  def configure(matcher = :default, options = {}, tracing = nil, &block)
```
Member:

Should this be tracing_configuration or tracing_instrumentation instead maybe?

@lloeki (Contributor, Author) commented Apr 4, 2024:

Yeah, it's a bit ambiguous. Since this is a `Configuration::Settings` object, and the other name here is `config`, I went with `tracing_config`.

ivoanjo added a commit that referenced this pull request Mar 12, 2024
**What does this PR do?**

This PR fixes the remote configuration worker thread leak that is
discussed in more detail in the internal
"dd-trace-rb Remote::Worker thread leak investigation" doc.

In a nutshell, there's a potential race between threads that call
`Datadog::Core::Remote::Tie.boot` and threads that call
`Datadog.configure` that can, in really unlucky situations,
mean that the `Remote::Worker` thread gets leaked during library
reconfiguration.

The odds of hitting this rare issue scale with the number of times
`Datadog.configure` gets called while requests are being processed in
a web app.

I was only able to reproduce it with the weird reproducer below, and
by adding `sleep` to some parts of our code, so while we have observed
it in production, we expect it to be rare.

Being more specific, this issue is related to a race between the remote
configuration thread being shut down, and other parts of the code base
trying to restart it.

In practice, the remote configuration would previously always get
started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`.
During dd-trace-rb reconfiguration, it’s possible for the remote
configuration thread to be shut down, and before the new configuration
kicks in, for any other thread to call
`Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the
remote configuration thread that belonged to the old configuration.

This PR fixes this issue by introducing a `@stop_requested` variable
that flags a "terminal state" for the remote configuration worker.

Thus, the default behavior of `Remote::Worker.start` is changed
from always starting the thread to not starting the thread if the
same worker was stopped before.
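A simplified sketch of this guard (illustrative only; the real
`Remote::Worker` differs in detail):

```ruby
# Simplified sketch of the terminal-state guard (not the exact code;
# the block passed to `start` stands in for the RC polling loop).
class Worker
  def initialize
    @mutex = Mutex.new
    @stop_requested = false
    @thread = nil
  end

  def start(&perform)
    @mutex.synchronize do
      if @stop_requested
        Datadog.logger.debug { 'remote worker: refusing to restart after previous stop' }
        return
      end

      @thread ||= Thread.new(&perform)
    end
  end

  def stop
    @mutex.synchronize do
      @stop_requested = true # terminal: this worker will never start again
      @thread&.kill
      @thread = nil
    end
  end
end
```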

**Motivation:**

Fix issue.

**Additional Notes:**

We are also working to remove cases where `Datadog.configure` gets
called during request handling, e.g. #3515.

**How to test the change?**

I've added unit test coverage for the worker not restarting itself
again.

To validate that this fixes the worker thread leak, the following
reproducer can be used:

```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
         components = Components.new(settings)

         old.shutdown!(components)
+        puts "After shutdown, sleeping!"
+
+        sleep 1
+
+        puts "Woke up after shutdown!"
+
         components.startup!(settings)
         components
       end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+  15.times {
+    Datadog.configure { Thread.pass }
+    Thread.pass
+  }
+end
+
+trigger_rc_thread = Thread.new do
+  loop {
+    sleep 0.5
+    Datadog::Core::Remote.active_remote.barrier(:once)
+    Thread.pass
+  }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```

Before the fix, the list of threads printed would show a lot of
leaked remote configuration worker threads. With the fix, the
threads no longer leak, and if you enable debug logging you will
see that the restart is being prevented, as the debug log I added
shows up:

```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace]
(dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start')
remote worker: refusing to restart after previous stop
```

This debug message is harmless and shows the fix is working fine.
Labels: integrations (Involves tracing integrations), tracing

3 participants