-
Notifications
You must be signed in to change notification settings - Fork 240
[CSHARP-969] - OpenTelemetry tracing #608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CSHARP-969] - OpenTelemetry tracing #608
Conversation
Hi @joao-r-reis , a bit later than I was expecting but here is the PR so we can proceed with the feature :) |
We didn't really cover how we were going to implement this part and now that I'm thinking a bit more about it we would have to add a new At the same time I'm thinking whether this will actually be useful and what would we lose if we just made the OpenTelemetry extension hardcode the operation name to I can't do a proper review of the PR right now but I'll try finding some time soon to look at it, thanks for the contribution again! |
hey @joao-r-reis how's it going? do you think you will be able to look at this soon? This particular subject is becoming an issue and we need to understand how we will move forward in the near future. thanks |
Hey, I'll do a proper review today but it was my impression that this was still a draft and there was still work left to be done. |
Hi @joao-r-reis , it was my bad. I didn't realized it was marked as draft. I believe this pull request implements what we discussed in technical document minus the operation name. Regarding that, I agree that we can change it to something more generic like the hardcoded string you mentioned above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good! Added a few comments.
I think initially we discussed that the OnNodeSuccess
and OnNodeFailure
methods were more complex to implement so they were going to be out of scope for now but it turns out they are super easy to implement after all. The OnNodeFailure
is pretty much already implemented in your changes and I added a comment that says where you can add the OnNodeSuccess
call.
src/Extensions/Cassandra.OpenTelemetry/CassandraInstrumentation.cs
Outdated
Show resolved
Hide resolved
src/Cassandra/Builder.cs
Outdated
@@ -1011,6 +1014,12 @@ public Builder WithMetrics(IDriverMetricsProvider driverMetricsProvider, DriverM | |||
return this; | |||
} | |||
|
|||
public Builder WithRequestTracker(IRequestTracker trace) | |||
{ | |||
_driverTracer = trace; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_requestTracker
or something that doesn't mention traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not addressed I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done now.
The build is still failing due to some analyzers that are not enabled by default when you build locally. If you want to build with all the analyzers enabled locally you can follow these instructions. |
Also make sure all elements of the public API are properly documented with xml docs (I already added a few comments about this on the review). |
namespace Cassandra.IntegrationTests.OpenTelemetry | ||
{ | ||
[Category(TestCategory.Short)] | ||
public class OpenTelemetryTests : SharedClusterTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a test that verifies that the activity context is correctly propagated? I don't think there is one at the moment that verifies this unless I missed it. If I'm not mistaken one of the use cases for this is to properly correlate these traces with (for example) an API call so the context needs to be properly propagated correct? Ideally we would have a test for statements, one for the mapper and one for LINQ. And each test should use both sync and async variants of the execute methods.
We should also test that the context propagation works correctly with auto paging, although I'm not 100% clear what behavior we want in this case. I'm assuming the context we want is the one that is active the moment where the enumerator pages the next page so the current changes might already do that but we need a test for it. The auto paging occurs at public async Task FetchMoreResultsAsync()
of RowSet.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The context propagation tests can be straightforward. I will include a method that will be called after the database requests and check if the traceId is the same and also that the database span id is the parent of the method span id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on the tests currently 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know when I can take a look at the tests 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently I have the following tests:
- When the Keyspace is available check if DbOperation and DbName attributes are included;
- When the Keyspace is not available available check if DbName attribute is not included;
- With default builder options, the statement is not included as an attribute ;
- With the option
IncludeDatabaseStatement
the statement is included as an attribute; - A test with
Execute
andExecute
async methods in which I verify if the trace is created and also if the node span is child of session span; - A test with
Mapper
andMapper
async methods in which I verify if the trace is created and also if the node span is child of session span;
Currently missing is the LINQ test and the context propagation that you mentioned above.
I also wanna make a test with a retry to see if multiple spans are generated that are child of the same session span.
I will try to finish in the next few days so, if you have time available, you can start checking the current tests :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joao-r-reis I finished the tests. Since my previous comment I included
- A test to validate the activity when executing using LINQ
- A test to validate if the same TraceId is used when invoking a different method with a call to Cassandra in the mix
- A test to validate if, when a host is down and a retry happens, if it is produced more than one node activity for the same session activity. During this test, I realized that if multiple retries happens on the same host, the node level activity would be reused instead of being created a new one, so I fixed it in this commit 13f3eec
- A test for pagination
Hi @joao-r-reis , back at it again after some time off. Will start working on your review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good, I think most discussions are resolved now, there's only a couple of things left to address. IIUC you're still working on the tests right? I can take another look once that's done.
using System.Threading.Tasks; | ||
using OpenTelemetry.Trace; | ||
|
||
namespace Cassandra.OpenTelemetry.Implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the Implementation
folder/namespace adds much value here, we can just put this class on the "root" directory, also makes it easier for users without an IDE that auto suggests namespace imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was following the https://github.com/open-telemetry/opentelemetry-dotnet-contrib structure as a guide but it doesn't make much sense in our implementation. I changed it in the last commit
{ | ||
var hostInfo = new HostTrackingInfo { Host = host }; | ||
|
||
await _requestTracker.OnNodeErrorAsync(r, hostInfo, ex).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a minor optimization that can be done here, I know this was true some years ago but I don't know if modern .NET versions have made this irrelevant:
In cases like this where there's only one await
and it's the last "statement" of the method then we can just remove async await
and just return the task itself.
public Task OnNodeRequestError(
Host host,
RequestErrorType errorType,
RetryDecision.RetryDecisionType decision,
RequestTrackingInfo r,
Exception ex)
{
var hostInfo = new HostTrackingInfo { Host = host };
return _requestTracker.OnNodeErrorAsync(r, hostInfo, ex);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be done for the other methods of this class. Again it's a very minor thing and not that important but it does make it so there's one less async/await
state machine that the compiler generates (unless modern .NET versions make this optimization automatically).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly don't know if the latest .NET versions have made the optimization but given that the package is .NET Standard, which can be executed by a .NET Framework runtime, I think it's safer to make the optimization.
Changed it in the last commit
|
||
Task OnRequestFailure(Exception ex, RequestTrackingInfo requestTrackingInfo); | ||
|
||
Task OnRequestSuccess(RequestTrackingInfo requestTrackingInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably add Async
suffix to the names of these methods to make it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please review
src/Cassandra/Builder.cs
Outdated
@@ -1011,6 +1014,12 @@ public Builder WithMetrics(IDriverMetricsProvider driverMetricsProvider, DriverM | |||
return this; | |||
} | |||
|
|||
public Builder WithRequestTracker(IRequestTracker trace) | |||
{ | |||
_driverTracer = trace; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not addressed I think.
src/Cassandra/RequestTrackingInfo.cs
Outdated
|
||
public ConcurrentDictionary<string, object> Items { get; } | ||
|
||
public IStatement Statement { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a public setter
here? Ideally the driver should set this when it creates this object and these properties should be immutable (well the Items
dictionary can be modified but the properties themselves shouldn't).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it so the statement is passed by constructor and the public property is getter only
namespace Cassandra.IntegrationTests.OpenTelemetry | ||
{ | ||
[Category(TestCategory.Short)] | ||
public class OpenTelemetryTests : SharedClusterTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know when I can take a look at the tests 👍
@@ -389,7 +396,7 @@ private void HandleRequestError(IRequestError error, Host host) | |||
break; | |||
} | |||
|
|||
_requestObserver.OnRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType); | |||
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType, _requestTrackingInfo, ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm we might need to ensure this gets called before _parent.SetCompleted
because otherwise the "OnRequestFailure" will get called before "OnNodeRequestError" on the request tracker...
And if the retry decision is Retry
then the new "child trace" will be started before this one ends.
Something like this would work:
switch (retryInformation.Decision.DecisionType)
{
case RetryDecision.RetryDecisionType.Rethrow:
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Rethrow, _requestTrackingInfo, ex);
_parent.SetCompleted(ex);
break;
case RetryDecision.RetryDecisionType.Ignore:
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Ignore, _requestTrackingInfo, ex);
// The error was ignored by the RetryPolicy, return an empty rowset
_parent.SetCompleted(null, FillRowSet(RowSet.Empty(), null));
break;
case RetryDecision.RetryDecisionType.Retry:
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Retry, _requestTrackingInfo, ex);
//Retry the Request using the new consistency level
Retry(retryInformation.Decision.RetryConsistencyLevel, retryInformation.Decision.UseCurrentHost, host);
break;
default:
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType, _requestTrackingInfo, ex);
}
The above isn't pretty but I don't think there's pretty and efficient solutions for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it in the last commit. Please review
Adds OpenTelemetry extension package with support for session level tracing.
On it |
@joao-r-reis Included documentation in this commit I also included another commit to null check the |
Looks good, I added a comment about null pointer exceptions, I think there's still a chance that they happen, it would be nice to have a unit test or something to validate this. Can you also add a docs page on the new |
The API docs link to |
Trying to finish until the end of the day |
Oh also apparently
|
I'm currently trying this out with a local app and I just realized that the extension method should probably be changed from |
I was actually able to reproduce a scenario where the |
The statement CQL string is not being added, just the type of the statement. It's not a blocker for this PR though, it requires some changes to the statement classes so I'll work on this as a follow up PR after this one is merged. |
Actually don't worry about the |
@joao-r-reis Made the changes you requested and some more in today's commits:
|
Regarding the extension method name, |
The CQL string is included in some use cases given that the test So, maybe we are lacking some other use cases |
Yeah it works with SimpleStatements but not BatchStatements or BoundStatements, I already have a fix on my branch though so I'll merge that once this PR is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome stuff! I'll merge this when CI finishes assuming it's green
Nice. It's a big PR but a good foundation for the next steps. Thanks for all the guidance and help. There are still some other attributes that can be included or some behavior that may need to be fixed. I'm available to help if needed. |
Yeah, changing the attributes should be simple to do going forward (and don't constitute a breaking change) so even if they aren't 100% perfect on the first release it's not a big deal. |
I'll tag you on my follow up PR but it's mostly a justfyi in case you're interested knowing about the changes I make (should be very small though) |
Absolutely. Cheers! |
As discussed in the OpenTelemetry proposal, this PR includes the session level tracing following the OpenTelemetry standard.
@joao-r-reis There are two things that are included in the proposal that are missing in this:
server.address
andserver.port
attributes because the Host information is only set in the Node level tracker methodsRequestTrackingInfo
is instantiated in theRequestHandler
but the method invocation is made before in theSession
class. One solution is to instantiate the tracker at the session level so we can access the method, but I wanted to see your opinion before going for that route in this PR.