Description
From #656
@mrene wrote:
I have a similar issue (node 8.9.4, async/await with
GCLOUD_TRACE_NEW_CONTEXT
, grpc 1.7.3 server) - I get valid traces for the first call and then I start seeinggrpc: Cannot create nested root spans.
I then added some more tracing inside my own methods and I started seeingCustom Span API: creating child for an already closed span ... grpc:/...
. After a while, when the traces get pushed, I get another valid trace then the same errors repeat themselves.The errors relating to the custom spans are reported because
cls.getRootContext()
returns a span that is closed (the span from the last call). Debugging this intocls-ah
revealed some weird behaviour betweengrpc-native-core
andasync_hooks
. Every root span gets created while both theexecutionAsyncId
andtriggerAsyncId
are 0; thebefore
callback also does not execute before handling a new call and therefore thecontext
variable withincls-ah
points to the last used context.cls-ah seems to do the following:
- maintain a global context object which is used as cls
- maintain a hash of context objects, indexed by
executionAsyncId
- whenever an
init
event occurs:
- copy the current global context object into the hash (regardless of the parent id specified by the callback)
- whenever an
enter
event occurs:
- if the hash contains the entered id, set the global context object to its content
In this scenario, because we're lacking an
enter
, the first grpc call trace is valid because the context object is initially initialized to{}
and we process to creating a root span because the context doesn't contain any previous root span. Upon the 2nd call, the context still contains the previous root span (in a closed state), only errors arise. When the traces get flushed, anenter
event occurs and sets the globalcontext
object back to its initial value (because the trace writer context was created before the first grpc call, so it copied the initial value).At this point this looks more like a grpc bug although it could either be handled in the plugin (using the
AsyncResource
API or withingrpc-native-core
- I'm sure there's a param to carry over execution context ids when switching from native->js).
It seems the
AsyncHooks
embedder's api hasn't landed in NAN yet (nodejs/node#13254) so a JS fix is the only possible option.I've got a hacked up version of the plugin to validate the entire hypothesis and it patches the problem.
I'd like to get some of your opinions before opening a PR, got some comments/questions here:
- Copying the current context from
init
might not be the right thing, sinceinit
can happen on its own withoutenter
- Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
- Every async context will probably point to the same object
context
is set back to its original value inrunAndReturn
but as soon as more get created, they will be set back (throughbefore
andinit
) and the original value will not be restored.- This could be mitigated by implementing
after()
and resetting values accordingly.- This feels like it
emitDestroy()
could be called byspan#endSpan()
since it basically knows whenever it's donerunAndReturn()
should probably setcontexts[uid]
as well ascontext
What do you think?
@kjin wrote:
@mrene Thanks for investigating the
async_hooks
-based tracing issue! Please go ahead and open the PR.I've been thinking through the observations you had, here are my thoughts so far:
Copying the current context from
init
might not be the right thing, sinceinit
can happen on its own withoutenter
Copying the current context in
init
is necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?Part of the problem is that contexts created after the first span directly inherit it even if they don't share a parent context
Every async context will probably point to the same objectI think you were correct in creating a new
AsyncResource
for each request being handled, so that subsequent requests after the first one don't point to the same object.
context
is set back to its original value inrunAndReturn
but as soon as more get created, they will be set back (throughbefore
andinit
) and the original value will not be restored.
This could be mitigated by implementingafter()
and resetting values accordingly.
I believe implementingEDIT: See new comment.after()
to reset context would effectively be a no-op, sincebefore()
will run anyway for the next async resource. Would you be able to further articulate what the potential problem could be here?This feels like it
emitDestroy()
could be called byspan#endSpan()
since it basically knows whenever it's doneI'm open to (and for this and other reasons, actively thinking about) the idea of creating our own
AsyncResource
object for root spans and emitting the lifecycle events correspondingly. Let's talk about it in a separate issue (please feel free to open it if I haven't already).
runAndReturn()
should probably setcontexts[uid]
as well ascontext
What should
contexts[uid]
be set to?
@mrene After looking into it more, I believe your reasoning is correct that
after
should be implemented to reset the previous context.I'll open a PR to implement
after
soon-ish, probably tomorrow, and gRPC server tracing should just work (with NEW_CONTEXT-based tracing, at least) without need for additional changes to the plugin. That being said, it might still make sense for you to open a PR against gRPC to add async_hooks embedder support. (if you feel motivated; I can do it as well 😄 )
@mrene wrote:
Copying the current context in init is necessary to preserve it across asynchronous boundaries. Maybe I'm misunderstanding what you're saying?
I mean
init
shouldn't be constructed asbefore
- the runtime isn't switching to it yet. one of its argument is thetriggerAsyncId
. I agree that the context should be copied, just that it should be copied fromcontexts[triggerAsyncId]
rather than thecontext
global var being set at that moment. (In most cases those will be the same but it's possible to pass another parent in the async_hooks API)What should contexts[uid] be set to?
The same thing, basically
context = contexts[uid] = {};
whenever it is being initialized. This would make sure that the context is preserved if anenter
event to the same context occurs afterwards. It'll give that context its own object, which it doesn't have now (only context initialized while this one is active will be correctly linked).Thinking about this more makes me realize that running this without creating another
AsyncResource
could cause the same id to be re-used twice, so it might not be advisable until this is implemented.I noticed a small difference in behaviour with
continuation-local-storage
, it does not store any data unless the current context has been activated, whereascls-ah
initializes an empty object so things can technically be saved/fetched even ifrunAndReturn
hasn't been called. Nothing critical but it might hide subtle bugs.