Patch Sinatra only once #3515
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #3515      +/-   ##
==========================================
- Coverage   98.23%   98.23%   -0.01%
==========================================
  Files        1275     1275
  Lines       75226    75228       +2
  Branches     3552     3552
==========================================
+ Hits        73901    73902       +1
- Misses       1325     1326       +1

☔ View full report in Codecov by Sentry.
There are two things happening here that should be split: configuring the integration's settings, and activating the instrumentation (patching).

The former goes through the configure block; the latter should be done outside of a configure block. IOW, there should be no call to `configure` at app runtime (while serving requests). The design of `configure` is just not amenable to that currently. Fixing that would go a long way towards making this class of issue disappear.
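For illustration, a minimal sketch of that split (hypothetical code, not the actual dd-trace-rb implementation):

```ruby
# Hypothetical illustration of the suggested split:

# 1. Configuration: declarative, runs once at boot, inside the configure block.
Datadog.configure do |c|
  c.tracing.instrument :sinatra
end

# 2. Activation/patching: happens at app runtime (e.g. on first request) and
#    should only *read* the settings -- never re-enter Datadog.configure.
sinatra_settings = Datadog.configuration.tracing[:sinatra]
```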
# Configuration should be considered RO past this point:
# - Patcher's `patch` must not modify configuration.
# - use Integration's `subconfigure` instead
I was wondering about calling `subconfigure` here instead:

- gives access to the full config, not just tracing's, which would allow cross-product integration subconfiguration (e.g. AppSec could enable tracing)
- config is in its final state after the whole `configure` `&block` has executed, vs. at the time the `instrument :sinatra` line runs
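A rough sketch of that alternative (parameter names assumed): `subconfigure` would receive the root settings object rather than just the tracing branch.

```ruby
# Sketch only: subconfigure receiving the full config root (names assumed).
def subconfigure(config, sinatra)
  config.tracing.instrument(
    :rack,
    service_name: sinatra.service_name,
    distributed_tracing: sinatra.distributed_tracing,
  )
  # Cross-product subconfiguration becomes possible, e.g. (hypothetically):
  # config.appsec.instrument :rack
end
```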
@@ -31,6 +31,14 @@ def new_configuration
        Configuration::Settings.new
      end

      def subconfigure(tracing, sinatra)
Lousy name; `configure` is already used.

I considered merely `def configure` and `super`ing, but that requires giving arguments here that are sorta internal WRT this `Integration` DSL-ish descriptive class. So I preferred not leaking internal details.
Maybe `instrument_dependencies` or `instrument_components`? It looks like in most cases this is what we end up doing here.
Hmm.

Indeed in this case we do `c.instrument`, but that may not always be the case; it could also do `c.some_setting = :foo`, in which case `instrument_*` is misleading.

Technically this is really a callback within the `Datadog.configure` block, so that we don't nest configure anymore. In that case I believe the name should include `configure` in some way for clarity. Since for `Integration`s there's a `def configure` that is called before, maybe `def after_configure`? That would make it clear that it happens after `Integration#configure` has been called.

Functionally this is about setting up dependencies, so it may go beyond just `c.something = :foo` or `c.instrument`. So `def dependencies` or `def configure_dependencies` might cut it.

E.g. I'll need to apply similar changes to `AppSec`, and I'm yet unsure how that would unfold, so I'm trying to keep an open-ended approach where integrations could have dependencies across products. E.g. `AppSec` or `CI` may decide that they'd want to enable tracing and/or enable specific tracing integrations when tracing is enabled, to ensure Things Just Work.
Good points. I like `after_configure`; it seems like a reasonable description of what's happening.
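For illustration, the hook from this PR under the agreed name; a sketch assuming the same `(tracing, sinatra)` arguments:

```ruby
# Sketch: same body as this PR's subconfigure, renamed. Runs after
# Integration#configure, once the whole Datadog.configure block has
# been evaluated.
def after_configure(tracing, sinatra)
  tracing.instrument(
    :rack,
    service_name: sinatra.service_name,
    distributed_tracing: sinatra.distributed_tracing,
  )
end
```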
tracing.instrument(
  :rack,
  service_name: sinatra.service_name,
  distributed_tracing: sinatra.distributed_tracing,
)
Most probably:

- other integrations are affected by this: Rails, Grape, etc.
- a similar pattern should be used by e.g. `AppSec` integrations
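E.g. a hypothetical Grape version of the same hook (options assumed for illustration, not taken from the actual integration):

```ruby
# Hypothetical: Grape also sits on top of Rack, so its hook could
# follow the same shape as Sinatra's.
def subconfigure(tracing, grape)
  tracing.instrument :rack, service_name: grape.service_name
end
```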
This seems quite reasonable!
I'll admit that I'm less confident about some of the specific details of the changes to how we configure integrations.
Having said that, I think it's a good insight that we have these ad-hoc hooks for "oh, I'm setting up sinatra and now need to set up rack; or I'm setting up rails and now need to set up active_...whatever", and making them part of the API avoids integrations needing to solve the problem on their own.
Thoughts on next steps? Perhaps pull in the tracing folks to make a decision on moving forward (or not) with this change?
  setup_tracer
end

def register_tracer
⬇️ Don't forget to remove `setup_tracer`, below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, removed.
def subconfigure(tracing, sinatra)
  tracing.instrument(
    :rack,
    service_name: sinatra.service_name,
I wonder if we should deprecate setting the rack `service_name` from the Sinatra service name for 2.0? E.g. suggest customers configure a top-level service name instead, or, if they really want to, configure a service name in rack directly.
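For concreteness, the two suggested alternatives sketched with today's configuration API (service names are placeholders):

```ruby
# 1. Configure a top-level service name, which integrations inherit:
Datadog.configure do |c|
  c.service = 'my-app'
end

# 2. Or explicitly set the service name on rack itself:
Datadog.configure do |c|
  c.tracing.instrument :rack, service_name: 'my-app-rack'
end
```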
It's not so much about passing a manually configured service name as it is about when that `sinatra.service_name` is defaulted and inferred, e.g. from the app class or something. Same for Rails, where the application name is the default. Cue autoinstrumentation as well.

Tangentially, there may be a conundrum to resolve here in such default cases: ordering may matter when there's both Sinatra and Rails in the same app, as both would set their service name and one would win. But that's a problem for another day.
@@ -41,7 +41,7 @@ def configurations

      # Create or update configuration associated with `matcher` with
      # the provided `options` and `&block`.
-     def configure(matcher = :default, options = {}, &block)
+     def configure(matcher = :default, options = {}, tracing = nil, &block)
Should this be `tracing_configuration` or `tracing_instrumentation` instead maybe?
Yeah, it's a bit ambiguous. Since this is a `Configuration::Settings` object and the other name here is `config`, I went with `tracing_config`.
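So the signature from the diff above would presumably end up as (sketch):

```ruby
# Sketch of the rename: tracing_config is the tracing
# Configuration::Settings object, distinct from the integration-local
# `config` used elsewhere in this method.
def configure(matcher = :default, options = {}, tracing_config = nil, &block)
  # ...
end
```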
**What does this PR do?**

This PR fixes the remote configuration worker thread leak that is discussed in more detail in the internal "dd-trace-rb Remote::Worker thread leak investigation" doc.

In a nutshell, there's a potential race between threads that call `Datadog::Core::Remote::Tie.boot` and threads that call `Datadog.configure` that can, in really unlucky situations, mean that the `Remote::Worker` thread gets leaked during library reconfiguration.

This rare issue is O(number of times Datadog.configure gets called) while requests are being processed in a web-app. I was only able to reproduce it with the weird reproducer below, and by adding `sleep` to some parts of our code, so while we have observed it in production, we expect it to be rare.

Being more specific, this issue is related to a race between the remote configuration thread being shut down and other parts of the code base trying to restart it. In practice, the remote configuration thread would previously always get started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`. During dd-trace-rb reconfiguration, it's possible for the remote configuration thread to be shut down and, before the new configuration kicks in, for any other thread to call `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the remote configuration thread that belonged to the old configuration.

This PR fixes this issue by introducing a `@stop_requested` variable that flags a "terminal state" for the remote configuration worker. Thus, the default behavior of `Remote::Worker.start` is changed from always starting the thread to not starting the thread if the same worker was stopped before.

**Motivation:**

Fix issue.

**Additional Notes:**

We are also working to remove cases where `Datadog.configure` gets called during request handling, e.g. #3515.

**How to test the change?**

I've added unit test coverage for the worker not restarting itself again. To validate that this fixes the worker thread leak, the following reproducer can be used:

```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
         components = Components.new(settings)

         old.shutdown!(components)
+        puts "After shutdown, sleeping!"
+
+        sleep 1
+
+        puts "Woke up after shutdown!"
+
         components.startup!(settings)
         components
       end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+  15.times {
+    Datadog.configure { Thread.pass }
+    Thread.pass
+  }
+end
+
+trigger_rc_thread = Thread.new do
+  loop {
+    sleep 0.5
+    Datadog::Core::Remote.active_remote.barrier(:once)
+    Thread.pass
+  }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```

Before the fix, the list of threads printed would show a lot of leaked remote worker configuration threads. With the fix, the threads no longer leak, and if you enable debug logging you will see that the restart is being prevented and the debug log I added will show up:

```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace] (dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start') remote worker: refusing to restart after previous stop
```

This debug message is harmless and shows the fix is working fine.
What does this PR do?

Sinatra needs to be patched only once.

Motivation:

Sinatra sets things up at first-request time, which may cause concurrent requests to call `configure` and break RC (thread leak).

Additional Notes:

The integration hooks onto `Sinatra::Base#setup_middleware`, which then (much later, as `Sinatra::Base#setup_middleware` appears to be invoked at first request) calls our `Contrib::Sinatra::Framework.setup`, which calls `Datadog.configure` and `activate_rack!`, which invokes `config.tracing.instrument :rack`, which ultimately inserts the Rack middleware.

The trouble appears to be that this `Datadog.configure` call is not concurrency-protected enough, which leads to RC workers being started and then, mid-way, the RC component being rebuilt with inconsistent access through `active_remote`.
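The "only once" behavior could be as simple as an idempotency guard around the runtime setup. A minimal sketch, assuming a module-level flag (the PR's actual mechanism may differ):

```ruby
# Illustrative only -- an idempotent, thread-safe setup guard; not
# necessarily how this PR implements it.
module Framework
  @setup_done = false
  @setup_mutex = Mutex.new

  def self.setup
    @setup_mutex.synchronize do
      return if @setup_done
      # ... one-time activation (e.g. inserting the Rack middleware),
      # performed without re-entering Datadog.configure ...
      @setup_done = true
    end
  end
end
```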
.How to test the change?