Skip to content

[CSHARP-969] - OpenTelemetry tracing #608

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Sep 17, 2024

Conversation

simaoribeiro
Copy link
Contributor

As discussed in the OpenTelemetry proposal, this PR includes the session level tracing following the OpenTelemetry standard.

@joao-r-reis There are two things that are included in the proposal that are missing in this:

  • The server.address and server.port attributes because the Host information is only set in the Node level tracker methods
  • The operation name, as this is a more of a big deal, as the RequestTrackingInfo is instantiated in the RequestHandler but the method invocation is made before in the Session class. One solution is to instantiate the tracker at the session level so we can access the method, but I wanted to see your opinion before going for that route in this PR.

@simaoribeiro simaoribeiro changed the title [CSHARP-969] - OpenTelemetry session level tracing Draft: [CSHARP-969] - OpenTelemetry session level tracing May 21, 2024
@simaoribeiro
Copy link
Contributor Author

Hi @joao-r-reis , a bit later than I was expecting but here is the PR so we can proceed with the feature :)

@joao-r-reis
Copy link
Collaborator

The operation name, as this is a more of a big deal, as the RequestTrackingInfo is instantiated in the RequestHandler but the method invocation is made before in the Session class. One solution is to instantiate the tracker at the session level so we can access the method, but I wanted to see your opinion before going for that route in this PR.

We didn't really cover how we were going to implement this part and now that I'm thinking a bit more about it we would have to add a new Enum or just pass a string in the TrackingInfo but it's a bit awkward because it really only would be useful for this specific implementation of the RequestTracker.

At the same time I'm thinking whether this will actually be useful and what would we lose if we just made the OpenTelemetry extension hardcode the operation name to Execute... If we were to implement the new "field" on the tracker info object at best we would have 2 operation names: Execute and ExecuteAsync but then there's some cases where LINQ and Mapper call ExecuteAsync even if the top level method is a synchronous one so I'm actually leaning towards just hardcoding a string like CassandraRequest as the operation name in the extension (it should be a non existing driver method to avoid confusion I think)... And maybe in the future we can revisit this and potentially add a new field to the tracking info object (which wouldn't be a breaking change).


I can't do a proper review of the PR right now but I'll try finding some time soon to look at it, thanks for the contribution again!

@zehenrique
Copy link

hey @joao-r-reis how's it going?

do you think you will be able to look at this soon? This particular subject is becoming an issue and we need to understand how we will move forward in the near future.

thanks

@joao-r-reis
Copy link
Collaborator

hey @joao-r-reis how's it going?

do you think you will be able to look at this soon? This particular subject is becoming an issue and we need to understand how we will move forward in the near future.

thanks

Hey, I'll do a proper review today but it was my impression that this was still a draft and there was still work left to be done.

@simaoribeiro simaoribeiro changed the title Draft: [CSHARP-969] - OpenTelemetry session level tracing [CSHARP-969] - OpenTelemetry session level tracing Jun 26, 2024
@simaoribeiro
Copy link
Contributor Author

Hi @joao-r-reis , it was my bad. I didn't realized it was marked as draft. I believe this pull request implements what we discussed in technical document minus the operation name.

Regarding that, I agree that we can change it to something more generic like the hardcoded string you mentioned above.

Copy link
Collaborator

@joao-r-reis joao-r-reis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good! Added a few comments.

I think initially we discussed that the OnNodeSuccess and OnNodeFailure methods were more complex to implement so they were going to be out of scope for now but it turns out they are super easy to implement after all. The OnNodeFailure is pretty much already implemented in your changes and I added a comment that says where you can add the OnNodeSuccess call.

@@ -1011,6 +1014,12 @@ public Builder WithMetrics(IDriverMetricsProvider driverMetricsProvider, DriverM
return this;
}

public Builder WithRequestTracker(IRequestTracker trace)
{
_driverTracer = trace;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_requestTracker or something that doesn't mention traces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please review

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not addressed I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now.

@joao-r-reis
Copy link
Collaborator

The build is still failing due to some analyzers that are not enabled by default when you build locally. If you want to build with all the analyzers enabled locally you can follow these instructions.

@joao-r-reis
Copy link
Collaborator

Also make sure all elements of the public API are properly documented with xml docs (I already added a few comments about this on the review).

namespace Cassandra.IntegrationTests.OpenTelemetry
{
[Category(TestCategory.Short)]
public class OpenTelemetryTests : SharedClusterTest
Copy link
Collaborator

@joao-r-reis joao-r-reis Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test that verifies that the activity context is correctly propagated? I don't think there is one at the moment that verifies this unless I missed it. If I'm not mistaken one of the use cases for this is to properly correlate these traces with (for example) an API call so the context needs to be properly propagated correct? Ideally we would have a test for statements, one for the mapper and one for LINQ. And each test should use both sync and async variants of the execute methods.

We should also test that the context propagation works correctly with auto paging, although I'm not 100% clear what behavior we want in this case. I'm assuming the context we want is the one that is active the moment where the enumerator pages the next page so the current changes might already do that but we need a test for it. The auto paging occurs at public async Task FetchMoreResultsAsync() of RowSet.cs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The context propagation tests can be straightforward. I will include a method that will be called after the database requests and check if the traceId is the same and also that the database span id is the parent of the method span id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on the tests currently 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know when I can take a look at the tests 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I have the following tests:

  • When the Keyspace is available check if DbOperation and DbName attributes are included;
  • When the Keyspace is not available available check if DbName attribute is not included;
  • With default builder options, the statement is not included as an attribute ;
  • With the option IncludeDatabaseStatement the statement is included as an attribute;
  • A test with Execute and Execute async methods in which I verify if the trace is created and also if the node span is child of session span;
  • A test with Mapper and Mapper async methods in which I verify if the trace is created and also if the node span is child of session span;

Currently missing is the LINQ test and the context propagation that you mentioned above.
I also wanna make a test with a retry to see if multiple spans are generated that are child of the same session span.

I will try to finish in the next few days so, if you have time available, you can start checking the current tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joao-r-reis I finished the tests. Since my previous comment I included

  • A test to validate the activity when executing using LINQ
  • A test to validate if the same TraceId is used when invoking a different method with a call to Cassandra in the mix
  • A test to validate if, when a host is down and a retry happens, if it is produced more than one node activity for the same session activity. During this test, I realized that if multiple retries happens on the same host, the node level activity would be reused instead of being created a new one, so I fixed it in this commit 13f3eec
  • A test for pagination

@simaoribeiro
Copy link
Contributor Author

Hi @joao-r-reis , back at it again after some time off. Will start working on your review

Copy link
Collaborator

@joao-r-reis joao-r-reis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good, I think most discussions are resolved now, there's only a couple of things left to address. IIUC you're still working on the tests right? I can take another look once that's done.

using System.Threading.Tasks;
using OpenTelemetry.Trace;

namespace Cassandra.OpenTelemetry.Implementation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the Implementation folder/namespace adds much value here, we can just put this class on the "root" directory, also makes it easier for users without an IDE that auto suggests namespace imports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was following the https://github.com/open-telemetry/opentelemetry-dotnet-contrib structure as a guide but it doesn't make much sense in our implementation. I changed it in the last commit

{
var hostInfo = new HostTrackingInfo { Host = host };

await _requestTracker.OnNodeErrorAsync(r, hostInfo, ex).ConfigureAwait(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a minor optimization that can be done here, I know this was true some years ago but I don't know if modern .NET versions have made this irrelevant:

In cases like this where there's only one await and it's the last "statement" of the method then we can just remove async await and just return the task itself.

public Task OnNodeRequestError(
            Host host,
            RequestErrorType errorType,
            RetryDecision.RetryDecisionType decision,
            RequestTrackingInfo r,
            Exception ex)
        {
            var hostInfo = new HostTrackingInfo { Host = host };

            return _requestTracker.OnNodeErrorAsync(r, hostInfo, ex);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be done for the other methods of this class. Again it's a very minor thing and not that important but it does make it so there's one less async/await state machine that the compiler generates (unless modern .NET versions make this optimization automatically).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't know if the latest .NET versions have made the optimization but given that the package is .NET Standard, which can be executed by a .NET Framework runtime, I think it's safer to make the optimization.
Changed it in the last commit


Task OnRequestFailure(Exception ex, RequestTrackingInfo requestTrackingInfo);

Task OnRequestSuccess(RequestTrackingInfo requestTrackingInfo);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably add Async suffix to the names of these methods to make it consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please review

@@ -1011,6 +1014,12 @@ public Builder WithMetrics(IDriverMetricsProvider driverMetricsProvider, DriverM
return this;
}

public Builder WithRequestTracker(IRequestTracker trace)
{
_driverTracer = trace;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not addressed I think.


public ConcurrentDictionary<string, object> Items { get; }

public IStatement Statement { get; set; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a public setter here? Ideally the driver should set this when it creates this object and these properties should be immutable (well the Items dictionary can be modified but the properties themselves shouldn't).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it so the statement is passed by constructor and the public property is getter only

namespace Cassandra.IntegrationTests.OpenTelemetry
{
[Category(TestCategory.Short)]
public class OpenTelemetryTests : SharedClusterTest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know when I can take a look at the tests 👍

@@ -389,7 +396,7 @@ private void HandleRequestError(IRequestError error, Host host)
break;
}

_requestObserver.OnRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType);
_requestObserver.OnNodeRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType, _requestTrackingInfo, ex);
Copy link
Collaborator

@joao-r-reis joao-r-reis Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm we might need to ensure this gets called before _parent.SetCompleted because otherwise the "OnRequestFailure" will get called before "OnNodeRequestError" on the request tracker...

And if the retry decision is Retry then the new "child trace" will be started before this one ends.

Something like this would work:

switch (retryInformation.Decision.DecisionType)
    {
        case RetryDecision.RetryDecisionType.Rethrow:
            _requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Rethrow, _requestTrackingInfo, ex);
            _parent.SetCompleted(ex);
            break;
        case RetryDecision.RetryDecisionType.Ignore:
            _requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Ignore, _requestTrackingInfo, ex);
            // The error was ignored by the RetryPolicy, return an empty rowset
            _parent.SetCompleted(null, FillRowSet(RowSet.Empty(), null));
            break;
        case RetryDecision.RetryDecisionType.Retry:
            _requestObserver.OnNodeRequestError(host, retryInformation.Reason, RetryDecision.RetryDecisionType.Retry, _requestTrackingInfo, ex);
            //Retry the Request using the new consistency level
            Retry(retryInformation.Decision.RetryConsistencyLevel, retryInformation.Decision.UseCurrentHost, host);
            break;
        default: 
           _requestObserver.OnNodeRequestError(host, retryInformation.Reason, retryInformation.Decision.DecisionType, _requestTrackingInfo, ex);
    }

The above isn't pretty but I don't think there's pretty and efficient solutions for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it in the last commit. Please review

@simaoribeiro
Copy link
Contributor Author

On it

@simaoribeiro
Copy link
Contributor Author

@joao-r-reis Included documentation in this commit
d9c697b

I also included another commit to null check the request.Statement that was missing
f5585cd

@joao-r-reis
Copy link
Collaborator

joao-r-reis commented Sep 16, 2024

Looks good, I added a comment about null pointer exceptions, I think there's still a chance that they happen, it would be nice to have a unit test or something to validate this.

Can you also add a docs page on the new Request Tracker interface? It doesn't have to be long since we can link the API docs of the interface there. Maybe a quick comment on how it can be leveraged, a link to the interface's API docs and a link to the opentelemetry page

@joao-r-reis
Copy link
Collaborator

The API docs link to IRequestTracker will be https://docs.datastax.com/en/drivers/csharp/latest/api/Cassandra.IRequestTracker.html I believe

@simaoribeiro
Copy link
Contributor Author

Trying to finish until the end of the day

@joao-r-reis
Copy link
Collaborator

Oh also apparently StartActivity can also return null so this can also throw a NPE:

 if (activity.IsAllDataRequested)
 {
     if (!string.IsNullOrEmpty(request.Statement.Keyspace))
     {
         activity.AddTag("db.name", request.Statement.Keyspace);
     }

     if (_instrumentationOptions.IncludeDatabaseStatement && request.Statement != null)
     {
         activity.AddTag("db.statement", request.Statement.ToString());
     }
 }

@joao-r-reis
Copy link
Collaborator

I'm currently trying this out with a local app and I just realized that the extension method should probably be changed from AddOpenTelemetryInstrumentation to WithOpenTelemetryInstrumentation because that's what almost all the other builder methods are named like.

@joao-r-reis
Copy link
Collaborator

I was actually able to reproduce a scenario where the Activity is null so we should just return right away if that happens and we need to add checks to other methods because the activity might not exist on the context dictionary.

@joao-r-reis
Copy link
Collaborator

joao-r-reis commented Sep 17, 2024

Activity.Tags:
    db.system: cassandra
    db.operation: Session Request
    db.statement: Cassandra.BoundStatement

The statement CQL string is not being added, just the type of the statement. It's not a blocker for this PR though, it requires some changes to the statement classes so I'll work on this as a follow up PR after this one is merged.

@joao-r-reis
Copy link
Collaborator

Actually don't worry about the AddOpenTelemetryInstrumentation rename either, I'll handle that one too, let's just make sure we're checking if the Activity is null then I'll merge the PR and work on a follow up PR to take care of these 2 things I brought up.

@simaoribeiro
Copy link
Contributor Author

@joao-r-reis Made the changes you requested and some more in today's commits:

  • Included docs for Request Tracker
  • Included unit tests OpenTelemetryTests to null check the activity and request for both node and session events
  • Fixed the naming of OnNodeStartAsync method (was missing the Async part unlike the other events)
  • Changed the name of some Activity tags according to the updated 1.27.0 version of Cassandra OTEL semantic conventions

@simaoribeiro
Copy link
Contributor Author

Regarding the extension method name, WithOpenTelemetry is more in line with the rest of Builder methods so makes perfectly sense to change it

@simaoribeiro
Copy link
Contributor Author

The CQL string is included in some use cases given that the test AddOpenTelemetry_WithIncludeDatabaseStatementOption_DbStatementIsIncludedAsAttribute is asserting the value of the activity tag against the query that is included in session.Execute(statement);

So, maybe we are lacking some other use cases

@joao-r-reis
Copy link
Collaborator

The CQL string is included in some use cases given that the test AddOpenTelemetry_WithIncludeDatabaseStatementOption_DbStatementIsIncludedAsAttribute is asserting the value of the activity tag against the query that is included in session.Execute(statement);

So, maybe we are lacking some other use cases

Yeah it works with SimpleStatements but not BatchStatements or BoundStatements, I already have a fix on my branch though so I'll merge that once this PR is merged

Copy link
Collaborator

@joao-r-reis joao-r-reis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff! I'll merge this when CI finishes assuming it's green

@simaoribeiro
Copy link
Contributor Author

Nice. It's a big PR but a good foundation for the next steps. Thanks for all the guidance and help.

There are still some other attributes that can be included or some behavior that may need to be fixed. I'm available to help if needed.

@joao-r-reis
Copy link
Collaborator

Yeah, changing the attributes should be simple to do going forward (and don't constitute a breaking change) so even if they aren't 100% perfect on the first release it's not a big deal.

@joao-r-reis joao-r-reis merged commit bd41516 into datastax:master Sep 17, 2024
2 checks passed
@joao-r-reis
Copy link
Collaborator

I'll tag you on my follow up PR but it's mostly a justfyi in case you're interested knowing about the changes I make (should be very small though)

@simaoribeiro
Copy link
Contributor Author

Absolutely. Cheers!

@joao-r-reis joao-r-reis changed the title [CSHARP-969] - OpenTelemetry session level tracing [CSHARP-969] - OpenTelemetry tracing Sep 18, 2024
@joao-r-reis joao-r-reis added this to the 3.22.0 milestone Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants