Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for broken mongodb spans #658

Merged
merged 2 commits into from
Dec 28, 2018
Merged

Fix for broken mongodb spans #658

merged 2 commits into from
Dec 28, 2018

Conversation

zachmccormick
Copy link
Contributor

I noticed with debug logging on that a number of our sidekiq traces were
being lost (sub-spans were marked as incomplete, so it was dropping many
of our sidekiq spans). I was able to get it working when I disabled the
Mongo spans, so I hunted around and figured out that the
:datadog_mongo_span thread-local variable was being overwritten when a
sidekiq job had multiple Mongo queries in it. I patched the code here to
use a hash with the event.request_id as the key and I no longer see any
of our traces being dropped!

The ticket in question that I raised originally is here: https://help.datadoghq.com/hc/en-us/requests/184670?page=1

@delner
Copy link
Contributor

delner commented Dec 18, 2018

Thanks for bringing this to our attention @zachmccormick! Is this because the job (running on a thread) had multiple Mongo queries whose instrumentation was nested in one another? (Like a Russian doll?)

The changes from an eyeball test look good, just want to correctly understand the cause. Do you think you could add one that replicates this multi-query scenario? It would be good if we added a test for this so we don't regress.

@delner delner self-assigned this Dec 18, 2018
@delner delner added bug Involves a bug integrations Involves tracing integrations community Was opened by a community member labels Dec 18, 2018
@delner delner added this to the 0.18.1 milestone Dec 18, 2018
@zachmccormick
Copy link
Contributor Author

I'm actually not certain exactly how to reproduce this error. We do use structures that look something like (this is a total approximation):

MyCollection.with(:secondary_preferred => true) do |coll|
  MyOtherCollection.with(:secondary_preferred => true) do |coll2|
    first_thing = coll.where(:id => some_var).to_a[0]
    second_thing = coll2.where(:id => first_thing.some_field).to_a[0]
  end
end

but those would be two queries that ought to be running in sequence (i.e. there should be no nesting or anything in those). Also the with calls don't have anything to do with the queries themselves. I'm not too certain why the instrumentation is getting nested like it is

@delner
Copy link
Contributor

delner commented Dec 18, 2018

If I had to make a quick (unverified) guess, it's because we're using event subscriptions as described here. Perhaps that #started function gets called during the with? (Maybe the with produces its own query?) Instead of when the where query is actually produced? Such a thing doesn't make much sense to me either, but maybe that's what's happening.

You could pretty easily add that structure (like you have there in your code sample) to the tests, then run them before you apply your fix to at least verify it reproduces the problem. Then once your fix is in, we can be confident the issue is resolved, and that we don't break it again, even if we don't fully understand it at this point.

@zachmccormick
Copy link
Contributor Author

It's not happening in .with(). The relevant code in the mongo-ruby-driver is here: https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/monitoring/publishable.rb#L39

command_started, command_completed, and command_failed are all called before/after the command is issued over the wire to the server, not on any arbitrary method call like with.

@delner
Copy link
Contributor

delner commented Dec 18, 2018

I guess the most plausible explanation I can think of right now is that publish_command is being nested within each other somehow, chaining because whatever happens in the yield block causes another command to execute? (I'm thinking some kind of query structured a bit like a SQL transaction, I don't know if that would be a thing here or not.)

Anyways, I think the best way forward on this is to make a test that replicates the issue, which we could use to pretty easily expose the backtrace with caller and deconstruct how this is chaining.

@delner delner self-requested a review December 18, 2018 17:15
@delner delner assigned zachmccormick and unassigned delner Dec 18, 2018
@delner delner modified the milestones: 0.18.1, 0.18.2 Dec 20, 2018
@zachmccormick
Copy link
Contributor Author

https://www.dropbox.com/s/i0dnr74g17qq9zl/Screenshot%202018-12-27%2012.39.50.png?dl=0

Actually now I think it's the saslStart and saslContinue commands being used to read from a cursor in multiple chunks, rather than any threading-related issue.

@delner
Copy link
Contributor

delner commented Dec 27, 2018

@zachmccormick I was trying to experiment with your code example, but you use Collection#where which is not a function in mongo. Are you using Mongoid? If so, maybe something in Mongoid is contributing to this.

@zachmccormick
Copy link
Contributor Author

Yes we're using Mongoid. I can check through some traces later to see if there are instances of this where we're using the lower-level driver or if it's isolated to only cases where we're using Mongoid (and reading from a cursor instead of pulling all of the results back at once)

@delner
Copy link
Contributor

delner commented Dec 27, 2018

I'm not sure I'm understanding your last correctly (regarding SASL), but is that to say that Mongo is trying to simultaneously read from a few different sources (e.g. start 3 commands before waiting for any to finish) which causes the span we store in the Thread variable to be overwritten incorrectly?

@zachmccormick
Copy link
Contributor Author

Oops my bad on the explanation - it's not related to reading from a cursor. It's related to authorization. Reading about this now to try to think of a way to handle it: https://github.com/mongodb/specifications/blob/master/source/auth/auth.rst#sasl-mechanisms

@zachmccormick
Copy link
Contributor Author

Maybe we want to combine the saslStart, the first saslContinue, through the final saslContinue, then consider that, in aggregate, to be the act of connecting to the Mongo instance at all.

It makes sense where we see this (and where we saw the tracer not working) because we access numerous different instances with different credentials

@delner
Copy link
Contributor

delner commented Dec 27, 2018

If these are distinct Mongo commands, and each is sent to distinct sources, I don't think combining them is as simple nor as accurate as it should be. Each of these commands should be faithfully represented as they occurred, so I'd prefer to see spans 1-1 with Mongo commands.

Just to be clear, your PR as its written right now solves the problem for you, correct?

If so, I think the Hash approach is acceptable as a strategy, but I'm particularly interested in creating a test which replicates this authentication problem, where multiple starts will fire prior to any succeed/fail. The best way to do that I think would to mimic the authentication in a test that produces these spans. But if that doesn't work or is too complex to implement, we can simply mock this and force the "start" event to trigger twice by some other more aggressive means and verify the tracing code doesn't produce unfinished spans.

zachmccormick and others added 2 commits December 27, 2018 15:12
I noticed with debug logging on that a number of our sidekiq traces were
being lost (sub-spans were marked as incomplete, so it was dropping many
of our sidekiq spans). I was able to get it working when I disabled the
Mongo spans, so I hunted around and figured out that the
:datadog_mongo_span thread-local variable was being overwritten when a
sidekiq job had multiple Mongo queries in it. I patched the code here to
use a hash with the event.request_id as the key and I no longer see any
of our traces being dropped!
@delner
Copy link
Contributor

delner commented Dec 27, 2018

Rebased on the latest master and added a spec that reproduces the issue using LDAP/SASL. I verified your change fixes the problem.

@zachmccormick Can you give this PR a try again and verify it fixes your issue? If so, we'll merge this and put it in the next 0.18.2 release.

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable strategy.

@zachmccormick
Copy link
Contributor Author

Looks good to me! FYI there are also saslContinue commands that get run after the saslStart, but I think the test case covers both.

@delner delner merged commit fa5abaf into DataDog:master Dec 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Involves a bug community Was opened by a community member integrations Involves tracing integrations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants