-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for broken mongodb spans #658
Fix for broken mongodb spans #658
Conversation
Thanks for bringing this to our attention @zachmccormick! Is this because the job (running on a thread) had multiple Mongo queries whose instrumentation was nested in one another? (Like a Russian doll?) The changes from an eyeball test look good, just want to correctly understand the cause. Do you think you could add one that replicates this multi-query scenario? It would be good if we added a test for this so we don't regress. |
I'm actually not certain exactly how to reproduce this error. We do use structures that look something like (this is a total approximation):
but those would be two queries that ought to be running in sequence (i.e. there should be no nesting or anything in those). Also the |
If I had to make a quick (unverified) guess, it's because we're using event subscriptions as described here. Perhaps that You could pretty easily add that structure (like you have there in your code sample) to the tests, then run them before you apply your fix to at least verify it reproduces the problem. Then once your fix is in, we can be confident the issue is resolved, and that we don't break it again, even if we don't fully understand it at this point. |
It's not happening in command_started, command_completed, and command_failed are all called before/after the command is issued over the wire to the server, not on any arbitrary method call like |
I guess the most plausible explanation I can think of right now is that Anyways, I think the best way forward on this is to make a test that replicates the issue, which we could use to pretty easily expose the backtrace with |
https://www.dropbox.com/s/i0dnr74g17qq9zl/Screenshot%202018-12-27%2012.39.50.png?dl=0 Actually now I think it's the |
@zachmccormick I was trying to experiment with your code example, but you use |
Yes we're using Mongoid. I can check through some traces later to see if there are instances of this where we're using the lower-level driver or if it's isolated to only cases where we're using Mongoid (and reading from a cursor instead of pulling all of the results back at once) |
I'm not sure I'm understanding your last correctly (regarding SASL), but is that to say that Mongo is trying to simultaneously read from a few different sources (e.g. start 3 commands before waiting for any to finish) which causes the span we store in the Thread variable to be overwritten incorrectly? |
Oops my bad on the explanation - it's not related to reading from a cursor. It's related to authorization. Reading about this now to try to think of a way to handle it: https://github.com/mongodb/specifications/blob/master/source/auth/auth.rst#sasl-mechanisms |
Maybe we want to combine the saslStart, the first saslContinue, through the final saslContinue, then consider that, in aggregate, to be the act of connecting to the Mongo instance at all. It makes sense where we see this (and where we saw the tracer not working) because we access numerous different instances with different credentials |
If these are distinct Mongo commands, and each is sent to distinct sources, I don't think combining them is as simple nor as accurate as it should be. Each of these commands should be faithfully represented as they occurred, so I'd prefer to see spans 1-1 with Mongo commands. Just to be clear, your PR as its written right now solves the problem for you, correct? If so, I think the Hash approach is acceptable as a strategy, but I'm particularly interested in creating a test which replicates this authentication problem, where multiple starts will fire prior to any succeed/fail. The best way to do that I think would to mimic the authentication in a test that produces these spans. But if that doesn't work or is too complex to implement, we can simply mock this and force the "start" event to trigger twice by some other more aggressive means and verify the tracing code doesn't produce unfinished spans. |
I noticed with debug logging on that a number of our sidekiq traces were being lost (sub-spans were marked as incomplete, so it was dropping many of our sidekiq spans). I was able to get it working when I disabled the Mongo spans, so I hunted around and figured out that the :datadog_mongo_span thread-local variable was being overwritten when a sidekiq job had multiple Mongo queries in it. I patched the code here to use a hash with the event.request_id as the key and I no longer see any of our traces being dropped!
Rebased on the latest master and added a spec that reproduces the issue using LDAP/SASL. I verified your change fixes the problem. @zachmccormick Can you give this PR a try again and verify it fixes your issue? If so, we'll merge this and put it in the next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasonable strategy.
Looks good to me! FYI there are also |
I noticed with debug logging on that a number of our sidekiq traces were
being lost (sub-spans were marked as incomplete, so it was dropping many
of our sidekiq spans). I was able to get it working when I disabled the
Mongo spans, so I hunted around and figured out that the
:datadog_mongo_span thread-local variable was being overwritten when a
sidekiq job had multiple Mongo queries in it. I patched the code here to
use a hash with the event.request_id as the key and I no longer see any
of our traces being dropped!
The ticket in question that I raised originally is here: https://help.datadoghq.com/hc/en-us/requests/184670?page=1