Description
openedon Mar 8, 2020
I've noticed a small runtime metrics issue. I am including a simple reproduction below and also a draft PR with a patch for this, just to demonstrate one way to resolve it.
The issue: In the Datadog.configure
block, when configuring runtime_metrics
, if a call to configure the tracer host or port (c.tracer port: ...
) occurs after the call to c.runtime_metrics
, any runtime metrics configuration will get lost and the trace writer will use the default statsd client and configuration. What it appears is happening is that all that runtime metrics configuration is getting attached to the writer
itself but, for reasons i'm not entirely clear on but appear to be pretty complex and an active area of development right now, that writer gets dropped and a new one gets built/rebuilt when c.tracer
is called with certain configuration options.
There's a few ways users can be impacted by this, because it's not documented anywhere that the ordering of c.runtime_metrics
within the configure
block matters. One way is that if a user passes in their own statsd client with their own namespace, that namespace will get lost when the writer is dropped from c.tracer
kicking off a rebuild. Another is that if a user is passing in their own statsd client c.runtime_metrics statsd: <statsd_client_used_elsewhere_in_app>
, then if that statsd client needs to point to a port different than 8125 or non-local host, then that config will get lost and runtime metrics as a feature won't work. Or, it'll technically work just fine, but just fire a bunch of udp packets off into space to a host/port that isn't listening for them. Needless to say since it's UDP it was quite difficult to debug this behavior.
In the PR I'm making i've included an, admittedly, pretty hacky patch. It just shoves the existing runtime metrics config into the writer_options
when rebuilding the writer, if that writer's options don't contain any runtime metrics specific options already. I think it works fine though it could use a unit test, but I also realize that y'all are already working on areas that touch this, so didn't want to go too far down the rabbit hole here if this will just sort've solve itself from the internal refactoring/simplification of the trace writer. It's also possible there's a cleaner way to patch this, what I attached was just the easiest thing I could get working.
An alternative (or short term remediation step) to this would be to simply document the constraints of runtime_metrics configuration at present, which is c.runtime_metrics
needs to be called last in the Datadog.configure
block, but ideally the ordering in the configure block shouldn't really matter and it would appear this isn't intentional.
I've included a little sample reproduction minitest suite here: https://gist.github.com/ericmustin/1d1c228053078923e7f9b19c0add8ce7
Draft PR here: #968
Sorry for the opposite of succinct writing, thanks for taking the time, cheers!