Fix for broken mongodb spans #658

zachmccormick · 2018-12-18T00:02:42Z

I noticed with debug logging on that a number of our sidekiq traces were
being lost (sub-spans were marked as incomplete, so it was dropping many
of our sidekiq spans). I was able to get it working when I disabled the
Mongo spans, so I hunted around and figured out that the
:datadog_mongo_span thread-local variable was being overwritten when a
sidekiq job had multiple Mongo queries in it. I patched the code here to
use a hash with the event.request_id as the key and I no longer see any
of our traces being dropped!

The ticket in question that I raised originally is here: https://help.datadoghq.com/hc/en-us/requests/184670?page=1

delner · 2018-12-18T00:12:04Z

Thanks for bringing this to our attention @zachmccormick! Is this because the job (running on a thread) had multiple Mongo queries whose instrumentation was nested in one another? (Like a Russian doll?)

The changes from an eyeball test look good, just want to correctly understand the cause. Do you think you could add one that replicates this multi-query scenario? It would be good if we added a test for this so we don't regress.

zachmccormick · 2018-12-18T00:16:53Z

I'm actually not certain exactly how to reproduce this error. We do use structures that look something like (this is a total approximation):

MyCollection.with(:secondary_preferred => true) do |coll|
  MyOtherCollection.with(:secondary_preferred => true) do |coll2|
    first_thing = coll.where(:id => some_var).to_a[0]
    second_thing = coll2.where(:id => first_thing.some_field).to_a[0]
  end
end

but those would be two queries that ought to be running in sequence (i.e. there should be no nesting or anything in those). Also the with calls don't have anything to do with the queries themselves. I'm not too certain why the instrumentation is getting nested like it is

delner · 2018-12-18T00:23:26Z

If I had to make a quick (unverified) guess, it's because we're using event subscriptions as described here. Perhaps that #started function gets called during the with? (Maybe the with produces its own query?) Instead of when the where query is actually produced? Such a thing doesn't make much sense to me either, but maybe that's what's happening.

You could pretty easily add that structure (like you have there in your code sample) to the tests, then run them before you apply your fix to at least verify it reproduces the problem. Then once your fix is in, we can be confident the issue is resolved, and that we don't break it again, even if we don't fully understand it at this point.

zachmccormick · 2018-12-18T00:30:39Z

It's not happening in .with(). The relevant code in the mongo-ruby-driver is here: https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/monitoring/publishable.rb#L39

command_started, command_completed, and command_failed are all called before/after the command is issued over the wire to the server, not on any arbitrary method call like with.

delner · 2018-12-18T17:14:44Z

I guess the most plausible explanation I can think of right now is that publish_command is being nested within each other somehow, chaining because whatever happens in the yield block causes another command to execute? (I'm thinking some kind of query structured a bit like a SQL transaction, I don't know if that would be a thing here or not.)

Anyways, I think the best way forward on this is to make a test that replicates the issue, which we could use to pretty easily expose the backtrace with caller and deconstruct how this is chaining.

zachmccormick · 2018-12-27T17:42:07Z

https://www.dropbox.com/s/i0dnr74g17qq9zl/Screenshot%202018-12-27%2012.39.50.png?dl=0

Actually now I think it's the saslStart and saslContinue commands being used to read from a cursor in multiple chunks, rather than any threading-related issue.

delner · 2018-12-27T18:09:21Z

@zachmccormick I was trying to experiment with your code example, but you use Collection#where which is not a function in mongo. Are you using Mongoid? If so, maybe something in Mongoid is contributing to this.

zachmccormick · 2018-12-27T18:11:10Z

Yes we're using Mongoid. I can check through some traces later to see if there are instances of this where we're using the lower-level driver or if it's isolated to only cases where we're using Mongoid (and reading from a cursor instead of pulling all of the results back at once)

delner · 2018-12-27T18:15:09Z

I'm not sure I'm understanding your last correctly (regarding SASL), but is that to say that Mongo is trying to simultaneously read from a few different sources (e.g. start 3 commands before waiting for any to finish) which causes the span we store in the Thread variable to be overwritten incorrectly?

zachmccormick · 2018-12-27T18:19:38Z

Oops my bad on the explanation - it's not related to reading from a cursor. It's related to authorization. Reading about this now to try to think of a way to handle it: https://github.com/mongodb/specifications/blob/master/source/auth/auth.rst#sasl-mechanisms

zachmccormick · 2018-12-27T18:23:00Z

Maybe we want to combine the saslStart, the first saslContinue, through the final saslContinue, then consider that, in aggregate, to be the act of connecting to the Mongo instance at all.

It makes sense where we see this (and where we saw the tracer not working) because we access numerous different instances with different credentials

delner · 2018-12-27T18:37:51Z

If these are distinct Mongo commands, and each is sent to distinct sources, I don't think combining them is as simple nor as accurate as it should be. Each of these commands should be faithfully represented as they occurred, so I'd prefer to see spans 1-1 with Mongo commands.

Just to be clear, your PR as its written right now solves the problem for you, correct?

If so, I think the Hash approach is acceptable as a strategy, but I'm particularly interested in creating a test which replicates this authentication problem, where multiple starts will fire prior to any succeed/fail. The best way to do that I think would to mimic the authentication in a test that produces these spans. But if that doesn't work or is too complex to implement, we can simply mock this and force the "start" event to trigger twice by some other more aggressive means and verify the tracing code doesn't produce unfinished spans.

I noticed with debug logging on that a number of our sidekiq traces were being lost (sub-spans were marked as incomplete, so it was dropping many of our sidekiq spans). I was able to get it working when I disabled the Mongo spans, so I hunted around and figured out that the :datadog_mongo_span thread-local variable was being overwritten when a sidekiq job had multiple Mongo queries in it. I patched the code here to use a hash with the event.request_id as the key and I no longer see any of our traces being dropped!

delner · 2018-12-27T20:15:32Z

Rebased on the latest master and added a spec that reproduces the issue using LDAP/SASL. I verified your change fixes the problem.

@zachmccormick Can you give this PR a try again and verify it fixes your issue? If so, we'll merge this and put it in the next 0.18.2 release.

brettlangdon

This seems like a reasonable strategy.

zachmccormick · 2018-12-28T14:13:23Z

Looks good to me! FYI there are also saslContinue commands that get run after the saslStart, but I think the test case covers both.

delner self-assigned this Dec 18, 2018

delner added bug Involves a bug integrations Involves tracing integrations community Was opened by a community member labels Dec 18, 2018

delner added this to the 0.18.1 milestone Dec 18, 2018

delner self-requested a review December 18, 2018 17:15

delner assigned zachmccormick and unassigned delner Dec 18, 2018

delner modified the milestones: 0.18.1, 0.18.2 Dec 20, 2018

zachmccormick and others added 2 commits December 27, 2018 15:12

Added: Mongo LDAP/SASL client spec.

95af457

delner requested review from pawelchcki and brettlangdon December 27, 2018 20:32

brettlangdon approved these changes Dec 28, 2018

View reviewed changes

delner merged commit fa5abaf into DataDog:master Dec 28, 2018

delner mentioned this pull request Jan 17, 2019

MongoDB tracing error upon truncating for specs #674

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for broken mongodb spans #658

Fix for broken mongodb spans #658

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018 •

edited

Loading

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018 •

edited

Loading

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018 •

edited

Loading

zachmccormick commented Dec 27, 2018

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018 •

edited

Loading

delner commented Dec 27, 2018

brettlangdon left a comment

zachmccormick commented Dec 28, 2018

Fix for broken mongodb spans #658

Fix for broken mongodb spans #658

Conversation

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018 • edited Loading

zachmccormick commented Dec 18, 2018

delner commented Dec 18, 2018 • edited Loading

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018 • edited Loading

zachmccormick commented Dec 27, 2018

zachmccormick commented Dec 27, 2018

delner commented Dec 27, 2018 • edited Loading

delner commented Dec 27, 2018

brettlangdon left a comment

Choose a reason for hiding this comment

zachmccormick commented Dec 28, 2018

delner commented Dec 18, 2018 •

edited

Loading

delner commented Dec 18, 2018 •

edited

Loading

delner commented Dec 27, 2018 •

edited

Loading

delner commented Dec 27, 2018 •

edited

Loading