Skip to content

async_hooks-based tracing doesn't work for gRPC server (in Node 8) #661

Closed
@kjin

Description

@kjin

From #656

@mrene wrote:

I have a similar issue (node 8.9.4, async/await with GCLOUD_TRACE_NEW_CONTEXT, grpc 1.7.3 server) - I get valid traces for the first call and then I start seeing grpc: Cannot create nested root spans.
I then added some more tracing inside my own methods and I started seeing Custom Span API: creating child for an already closed span ... grpc:/.... After a while, when the traces get pushed, I get another valid trace then the same errors repeat themselves.

The errors relating to the custom spans are reported because cls.getRootContext() returns a span that is closed (the span from the last call). Debugging this into cls-ah revealed some weird behaviour between grpc-native-core and async_hooks. Every root span gets created while both the executionAsyncId and triggerAsyncId are 0; the before callback also does not execute before handling a new call and therefore the context variable within cls-ah points to the last used context.

cls-ah seems to do the following:

  • maintain a global context object which is used as cls
  • maintain a hash of context objects, indexed by executionAsyncId
  • whenever an init event occurs:
    • copy the current global context object into the hash (regardless of the parent id specified by the callback)
  • whenever an enter event occurs:
    • if the hash contains the entered id, set the global context object to its content

In this scenario, because we're lacking an enter, the first grpc call trace is valid because the context object is initially initialized to {} and we process to creating a root span because the context doesn't contain any previous root span. Upon the 2nd call, the context still contains the previous root span (in a closed state), only errors arise. When the traces get flushed, an enter event occurs and sets the global context object back to its initial value (because the trace writer context was created before the first grpc call, so it copied the initial value).

At this point this looks more like a grpc bug although it could either be handled in the plugin (using the AsyncResource API or within grpc-native-core - I'm sure there's a param to carry over execution context ids when switching from native->js).

It seems the AsyncHooks embedder's api hasn't landed in NAN yet (nodejs/node#13254) so a JS fix is the only possible option.

I've got a hacked up version of the plugin to validate the entire hypothesis and it patches the problem.

I'd like to get some of your opinions before opening a PR, got some comments/questions here:

  • Copying the current context from init might not be the right thing, since init can happen on its own without enter
    • Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
    • Every async context will probably point to the same object
      • context is set back to its original value in runAndReturn but as soon as more get created, they will be set back (through before and init) and the original value will not be restored.
      • This could be mitigated by implementing after() and resetting values accordingly.
  • This feels like it emitDestroy() could be called by span#endSpan() since it basically knows whenever it's done
  • runAndReturn() should probably set contexts[uid] as well as context

What do you think?

@kjin wrote:

@mrene Thanks for investigating the async_hooks-based tracing issue! Please go ahead and open the PR.

I've been thinking through the observations you had, here are my thoughts so far:

Copying the current context from init might not be the right thing, since init can happen on its own without enter

Copying the current context in init is necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?

Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
Every async context will probably point to the same object

I think you were correct in creating a new AsyncResource for each request being handled, so that subsequent requests after the first one don't point to the same object.

context is set back to its original value in runAndReturn but as soon as more get created, they will be set back (through before and init) and the original value will not be restored.
This could be mitigated by implementing after() and resetting values accordingly.

I believe implementing after() to reset context would effectively be a no-op, since before() will run anyway for the next async resource. Would you be able to further articulate what the potential problem could be here? EDIT: See new comment.

This feels like it emitDestroy() could be called by span#endSpan() since it basically knows whenever it's done

I'm open to (and for this and other reasons, actively thinking about) the idea of creating our own AsyncResource object for root spans and emitting the lifecycle events correspondingly. Let's talk about it in a separate issue (please feel free to open it if I haven't already).

runAndReturn() should probably set contexts[uid] as well as context

What should contexts[uid] be set to?

@mrene After looking into it more, I believe your reasoning is correct that after should be implemented to reset the previous context.

I'll open a PR to implement after soon-ish, probably tomorrow, and gRPC server tracing should just work (with NEW_CONTEXT-based tracing, at least) without need for additional changes to the plugin. That being said, it might still make sense for you to open a PR against gRPC to add async_hooks embedder support. (if you feel motivated; I can do it as well 😄 )

@mrene wrote:

Copying the current context in init is necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?

I mean init shouldn't be constructed as before - the runtime isn't switching to it yet. one of its argument is the triggerAsyncId. I agree that the context should be copied, just that it should be copied from contexts[triggerAsyncId] rather than the context global var being set at that moment. (In most cases those will be the same but it's possible to pass another parent in the async_hooks API)

What should contexts[uid] be set to?

The same thing, basically context = contexts[uid] = {}; whenever it is being initialized. This would make sure that the context is preserved if an enter event to the same context occurs afterwards. It'll give that context its own object, which it doesn't have now (only context initialized while this one is active will be correctly linked).

Thinking about this more makes me realize that running this without creating another AsyncResource could cause the same id to be re-used twice, so it might not be advisable until this is implemented.

I noticed a small difference in behaviour with continuation-local-storage, it does not store any data unless the current context has been activated, whereas cls-ah initializes an empty object so things can technically be saved/fetched even if runAndReturn hasn't been called. Nothing critical but it might hide subtle bugs.

Metadata

Metadata

Assignees

Labels

api: cloudtraceIssues related to the googleapis/cloud-trace-nodejs API.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions