
feat: GRPC instrumentation #854

Conversation


@ahayworth ahayworth commented Jun 30, 2021

This PR adds GRPC auto-instrumentation. The following features are implemented:

  • Client / Server "unary" RPCs (single request/response)
  • Client / Server "server streaming" RPCs (single resquest / stream of responses)
  • Client / Server "client streaming" RPCs (stream of requests / single response)
  • Client / Server "bi-directional streaming" RPCs (stream of requests / stream of responses)
  • Log events for individual sent / received messages during a stream's lifecycle (a sketch of such an event follows this list)
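
To make the last item concrete, here is a rough sketch of what one of those per-message events might look like under the OpenTelemetry gRPC semantic conventions. The event name and attribute keys come from the spec; the helper method and how it is called are made up for illustration, not code from this PR.

```ruby
# Illustrative only: record a "message" event on whichever span the
# instrumentation is tracking for this call.
def add_message_event(span, type:, id:)
  span.add_event(
    'message',
    attributes: {
      'message.type' => type, # 'SENT' or 'RECEIVED'
      'message.id' => id      # per-direction counter, starting at 1
    }
  )
end
```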

The following features are not yet implemented, but will be finished in this PR in short order:

  • Tests, of any kind
  • net.peer.* and net.transport semantic convention attributes
  • Span error handling (this might actually be working, but I haven't tested it yet)

I might not implement:

  • message.compressed_size and message.uncompressed_size semantic convention attributes. I think I might be able to get at those sizes, but I'm not completely certain. They are not listed as a MUST in the spec, so I think it's okay to leave them out.

Things of note to reviewers:

  • As stated before, I'm not completely done here. But if you have feedback or any thoughts, they are more than welcome at this point.
  • I initially tried to use client / server interceptors, but they only handle the beginning of a call and are thus unsuited to instrumenting a long-running call.
  • Clients using the Ruby gRPC library can invoke a service with the return_op: true flag ... and then they get back a version of an ActiveCall that can actually be invoked later. So we don't necessarily know that the RPC call will take place when ClientStub#request_response is called, for instance. That leads to some annoying duplication throughout, and to passing spans around as instance attributes on ActiveCall / BidiCall objects.
  • If a client does not pass a block to a streaming call, they receive an Enumerator that lazily reads in the actual responses when iterated. This causes additional annoyance, because we don't really know what spans may be active in the context when the client actually starts to iterate through the responses - another reason we pass the spans around as instance variables. (Both behaviors are sketched below.)

I'm not sure that this approach is a good one, but it feels okay to me. I'm worried primarily that we won't clean up spans correctly (what if a client never enumerates their results?) but I'm not sure I see another way to do it while still adhering to the GRPC event spec requirements.
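
To illustrate the client-side behaviors described above, here is a rough sketch against the stock Ruby gRPC client API. The Echo service, its stub, and marshal_proc / unmarshal_proc are made-up placeholders, and the ClientStub#request_response signature shown is assumed from its documented keyword options; this is not code from the PR.

```ruby
# Hypothetical Echo service client, for illustration only.
stub = Echo::Stub.new('localhost:50051', :this_channel_is_insecure)

# Deferred execution: with return_op: true, ClientStub#request_response
# returns an operation wrapping the ActiveCall rather than performing the
# RPC; nothing happens on the wire until #execute is called later.
op = stub.request_response('/echo.Echo/Echo', Echo::EchoRequest.new(message: 'hi'),
                           marshal_proc, unmarshal_proc, return_op: true)
response = op.execute

# Lazy streaming reads: a server-streaming call made without a block returns
# an Enumerator, so responses are only pulled when it is iterated - possibly
# much later, under a different "current" span.
responses = stub.stream_echo(Echo::EchoRequest.new(message: 'hi'))
responses.each { |r| puts r.message }
```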

Expand me for a bunch of screenshots: [six screenshots attached, dated Jun 30, 2021]

Fixes open-telemetry/opentelemetry-ruby-contrib#10
Fixes #96

@ahayworth ahayworth marked this pull request as ready for review August 4, 2021 16:52
@ahayworth
Contributor Author

Behold, I am alive!

I think this PR is ready for testing and review. Sadly I don't have much of a real application to test this with. If anyone does, I'd greatly appreciate it!

# on the client side, it's a little harder in the accept loop on the server. So, we provide
# a way for server-side ActiveCalls to set the span after the ActiveCall is created.
def span=(span)
  @span = span
Contributor

I am unfamiliar with the execution model here. Does this require a threadsafe writer?

Contributor Author

I don't think so. I will admit that fundamentally I am not 100% sure. This ActiveCall setter is only called in two places:

  • when the ActiveCall is initialized (which sets the "correct" span for client side things using the ActiveCall class)
  • when we start to handle a call on the server-side - just after it is dispatched for processing, but technically after the ActiveCall was already created.

The main reason I added it was because on the server-side, I couldn't find a great way to cleanly patch the accept loop that handles and schedules all of the server methods without re-implementing a lot of that code myself. That felt like a great way to introduce bugs to me 😆.

The "accept loop" (so to speak) is at RpcServer#loop_handle_server_calls - and the bit I did choose to patch was RpcDesc#run_server_method, which is called from within RpcServer#loop_handle_server_calls. If you take a peek at the handling code, you can probably see why I chose to do it this way. However, a fresh set of eyes would be appreciated sometime, because I may well have overlooked the easy way to do this 🤣

Anyway - back to the thread-safety question: the reason I think this is okay is that we only ever change this variable in one place, and that is when we dispatch the server-side ActiveCall to a thread for processing. A different ActiveCall is created for each incoming request, and my read of the work-processing system within the GRPC code is that requests are explicitly scheduled, with their context, to a single thread, rather than multiple threads potentially pulling the same work item. Given that scope and the way the GRPC library does things overall, I don't immediately see a way in which there could be concurrent access to this setter.

That said ... the GRPC library is rather confusing, so I could very well be wrong. I'd say I'm about 95% certain we're fine here.
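
For readers following along, here is a stripped-down sketch of the shape of that server-side patch. It is not the exact diff from this PR: the module and tracer names are illustrative, and the run_server_method signature is assumed to be (active_call, mth, ...); the rpc.* attributes and the active_call.span assignment mirror the hunks discussed elsewhere in this review.

```ruby
# Illustrative sketch of prepending into RpcDesc#run_server_method so the
# server span can be handed to the per-request ActiveCall.
module RpcDescPatch
  def run_server_method(active_call, mth, *rest)
    tracer = OpenTelemetry.tracer_provider.tracer('grpc-sketch')
    attributes = {
      'rpc.system' => 'grpc',
      'rpc.service' => mth.owner.service_name,
      'rpc.method' => mth.name.to_s
    }
    tracer.in_span("#{mth.owner.service_name}/#{mth.name}", attributes: attributes, kind: :server) do |span|
      # The span captured when the ActiveCall was created in the accept loop
      # is likely a non-recording one, so replace it with the server span.
      active_call.span = span if active_call.respond_to?(:span=)
      super
    end
  end
end

# ::GRPC::RpcDesc.prepend(RpcDescPatch)
```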

# handle the log message requirements of the OpenTelemetry semantic conventions.
def run_batch(ops)
  if ops.keys.include?(::GRPC::Core::CallOps::SEND_MESSAGE)
    OpenTelemetry::Trace.current_span.add_event(
Contributor

I wish I had a better grasp of the execution model here. I keep thinking back to the way that the Rack instrumentation keeps track of its own Span instead of using the "current" span in case users add custom interceptors that generate their own spans and I want to make sure the gRPC span is the one that is augmented with the event.

Contributor Author

I wish I had a better grasp of the execution model here.

Me too, honestly. 🤣 This one is rough, for sure.

I keep thinking back to the way that the Rack instrumentation keeps track of its own Span instead of using the "current" span in case users add custom interceptors that generate their own spans and I want to make sure the gRPC span is the one that is augmented with the event.

I think this is an excellent concern. I'm going to try and refactor it along those lines. I think the scenario you described could happen if someone wrote their own interceptor, and that would definitely be a really upsetting / embarrassing bug for us. Not the first one I'd have written, either. 🤦

I think what really makes this hard is that the execution model of the official GRPC library is just all over the place - and the bit we're discussing here is actually a C extension ... and a lot of work happens in that C extension, which just makes it all harder.

That said, I think I see a way to get to the state we want. 👍
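
In the spirit of that Rack-style approach, a minimal sketch of the idea (names are illustrative; this is not the code in this PR): keep a handle to the span the instrumentation created for the call and prefer it over current_span when recording events.

```ruby
# Illustrative only: hold on to the span this instrumentation started for the
# call, and fall back to the context's current span only if none was recorded,
# so a span created by a user-supplied interceptor can't absorb our events.
def otel_event_span
  @otel_span || OpenTelemetry::Trace.current_span
end
```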

# The server's ActiveCall is created during the accept loop, and so the
# current_span saved off when created will likely be a non-recording API
# span. So we replace the saved-off span with whatever we just created.
active_call.span = span
Contributor

Something I do not quite grok: why does this not rely on current_span, since it is passed to with_span on L32?

Contributor Author

You know - at this point it's been so long since I wrote it that I don't precisely recall why. I suspect it's something like what you were talking about in another area - I wanted to make sure we're augmenting the correct span, so the stuff inside the ActiveCall patches all operates on the @span variable instead of current_span. But there are places (the C extension, etc.) where getting at the current ActiveCall isn't possible, so I then need to use current_span ... and it gets to be a mess.

Like I said elsewhere, I'm going to try harder to encapsulate "our spans" more cleanly, and then as part of that refactor I'll try to make all of this stuff a lot more clear somehow.

Contributor

We should also look at other SDK implementations as well and see if the generated code looks similar to see how they are handling it. That may help us puzzle through some options.

attrs = {
  'rpc.system' => 'grpc',
  'rpc.service' => mth.owner.service_name,
  'rpc.method' => mth.name.to_s
Contributor

@ericmustin ericmustin Aug 5, 2021

small nit (was working on something related and figured I'd see how this was implemented; haven't done a full review)

I think semantic conventions, in addition to some of the above, require either net.peer.ip or net.peer.name: https://github.com/open-telemetry/opentelemetry-specification/blob/87a5ed7f0d4c403e2b336f275ce3e7fd66a8041b/specification/metrics/semantic_conventions/rpc.md#attributes

Contributor Author

Ah, you're right. It's even on the TODO list for the PR and I neglected to do it. 😆
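
For future reference, one hedged way those attributes might be derived, assuming the peer address is available as a string such as "ipv4:127.0.0.1:50051" (for example via the call's #peer method, if present). The helper name is made up and this is not code from the PR.

```ruby
# Illustrative only: split a gRPC peer string like "ipv4:127.0.0.1:50051"
# into net.* semantic convention attributes.
def peer_attributes(peer)
  host, _, port = peer.sub(/\A\w+:/, '').rpartition(':')
  {
    'net.peer.ip' => host,
    'net.peer.port' => port,
    'net.transport' => 'ip_tcp'
  }
end

peer_attributes('ipv4:127.0.0.1:50051')
# => { "net.peer.ip" => "127.0.0.1", "net.peer.port" => "50051", "net.transport" => "ip_tcp" }
```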

Comment on lines +40 to +42
'rpc.system' => 'grpc',
'rpc.service' => service_name.to_s,
'rpc.method' => method_name.to_s
Contributor

Contributor Author

Yup - I need to do that still. I think I punted on it because it looked rather hard to figure out, but I suspect we can figure it out. 😄

@ahayworth ahayworth marked this pull request as draft August 6, 2021 14:17
@ahayworth
Contributor Author

Putting this back into draft; the comments from @arielvalentin and @ericmustin were pretty helpful - and I need to do a major revision to try to get this ready to merge. 😄 Thanks!

@jimeh
Contributor

jimeh commented Nov 2, 2021

What's progress on this looking like? How much is left to get it across the line?

I've got a quick and dirty GRPC interceptor implementation working within a project, but I'm coming up against some limitations with how the GRPC interceptors are implemented in Ruby - which is why I assume you're not using them here.

I'd be happy to spare a couple of evenings to try and help out if it's of any use :)

And if it helps, regarding message.uncompressed_size, you should be able to get that by simply calling .to_proto.size on GRPC request and response objects.
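
A tiny sketch of that suggestion, assuming the request/response objects are generated google-protobuf messages that respond to #to_proto; the helper name is made up.

```ruby
# Illustrative only: use the serialized byte length of a protobuf message for
# the message.uncompressed_size attribute suggested above.
def uncompressed_size_attribute(message)
  { 'message.uncompressed_size' => message.to_proto.size }
end
```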

@ahayworth
Contributor Author

@jimeh it's definitely stalled; my day job no longer requires Ruby, and I don't have as much free and clear time to devote to otel-ruby :(

The general status was, however:

  • Needs to be cleaned up overall. The flow is confusing and I think it's likely incorrect in places.
  • Needs testing. Even when I wrote this, I didn't have a GRPC ruby app that I could do much testing with.

I'd be happy to work on it with you, or for you to take over the branch if you have the motivation to do so!

@jimeh
Contributor

jimeh commented Nov 9, 2021

@ahayworth Thanks for the info :)

I've actually ended up taking quite a different approach in an attempt to quickly hack something together for my day job that just about works. It turned out to not be that quick, but I think I have it all working now with message events, error handling, etc. It lacks automated tests at the moment though, so there are no guarantees around those promises... lol

I'll try and get things cleaned up and organized so I can put together a draft PR here sometime soon. Any help you're able and willing to offer at that point @ahayworth would be greatly appreciated :)

As for the details of my implementation: it all operates through gRPC client and server interceptors, which means a smaller footprint of monkey-patched / prepended classes, so it's a bit easier to follow how it works (I think/hope). However, due to the limitations of interceptors, the bulk of the tracing is done by monkey-patching methods on individual instances of objects exposed to interceptors - both long-lived and per-request objects. It's kind of horrible, yet somewhat elegant, and reasonably easy to follow the logical flow of... I think :P
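
For anyone curious what the interceptor-based shape looks like, here is a minimal hedged sketch of a client interceptor around a unary call. It is illustrative only, not the implementation described above; the class, tracer name, and commented-out usage are made up.

```ruby
require 'grpc'
require 'opentelemetry-sdk'

# Illustrative sketch: GRPC::ClientInterceptor#request_response wraps a unary
# call, so a span can be opened around it and the trace context injected into
# the outgoing metadata.
class OtelClientInterceptor < GRPC::ClientInterceptor
  def request_response(request: nil, call: nil, method: nil, metadata: nil)
    tracer = OpenTelemetry.tracer_provider.tracer('grpc-client-sketch')
    tracer.in_span(method.to_s, kind: :client, attributes: { 'rpc.system' => 'grpc' }) do
      OpenTelemetry.propagation.inject(metadata) # propagate context as gRPC metadata
      yield
    end
  end
end

# Hypothetical usage: pass the interceptor when constructing a stub.
# stub = Echo::Stub.new('localhost:50051', :this_channel_is_insecure,
#                       interceptors: [OtelClientInterceptor.new])
```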

@ahayworth
Contributor Author

@jimeh I look forward to the PR, that's awesome!

@robertlaurin
Contributor

Hey @jimeh, did you still have any time/bandwidth to share a draft with us? We're looking to get some GRPC instrumentation out, and I'd be interested to see what your approach looks like.

@ahayworth
Contributor Author

I'm going to go ahead and close this out - it's unlikely that I'll be able to work on it now that my day job does not involve ruby. However, hopefully it's instructive for a future attempt!

@ahayworth ahayworth closed this Feb 1, 2022
@jimeh
Contributor

jimeh commented Feb 15, 2022

@robertlaurin Apologies for the silence, it's been a busy few months, and my focus has actually ended up being on the Golang side of our project that uses gRPC and otel. But I'm planning to put aside a couple of evenings this week or next to extract the Ruby implementation I wrote to get a draft PR submitted. After that, next steps are tests, and fixing a bug where message events are sometimes assigned to the wrong span during high volumes of parallel gRPC requests.

@ahayworth Thanks for your efforts on this PR. It was a really valuable resource for me end of last year when I went about my "quick and dirty hack" implementation. I hope you won't mind if I poke you when I open the above mentioned PR within the next week or two :)
