-
Notifications
You must be signed in to change notification settings - Fork 847
Description
Bug Report
Thread local state is broken if that thread interacts with get_default before the global subscriber is set (only if the global subscriber is then later set).
Version
tracing 0.1.37
tracing-subscriber 0.3.17
but the same code exists on the master branch with minor alterations which doesn't impact the bug.
Platform
Linux
Description
We had some memory leaks on applications scaling with requests, there was some sense that this was coming from the tracing library but no knowledge as to exactly where. We had also had missing spans and other weird bugs that seemed to be connected to tracing.
An absurd amount of checking later I found the source to be this:
https://github.com/tokio-rs/tracing/blob/master/tracing-core/src/dispatch.rs#L1033
Since get_or_insert_with is used here, if the global dispatcher is not set the first time that a span is interacted with thread a path that reaches get_default that thread will have Dispatch::none() set on its thread-local.
This would be fine for any application which never collects spans, but it creates a memory leak if you have some thread that manages to interact with get_default before the global dispatcher is set up, because that thread's state will be broken for the duration of the application's lifetime.
So in our case, we had a single path in setup which created a span and made that thread's dispatcher be set to Dispatch::none().
Then every time that thread picks up a future running under an open span which it then exits or closes, it will end up here, for example: https://github.com/tokio-rs/tracing/blob/master/tracing-subscriber/src/registry/sharded.rs#LL303C28-L303C28.
It will get Dispatch::none() since the thread-local state was set before the global dispatcher was up, and then it will never decrement the ref-counter for that span, causing it to leak for the entire application's runtime.
Pretty bad fix
The commit that fixed it is here EmbarkStudios@b178f31 but it's not a great solution (for more reasons than the fact where I wrote two functions just because of the different signatures).
The meaningful change there is that instead of eagerly setting the thread's local state forever to none, it always looks if a global dispatch is set if it has no local dispatch set.
This is problematic for applications which have library-dependencies that use tracing but doesn't use tracing themselves.
In those cases, the overhead goes from checking a thread local and a static atomic once, and then just a thread local after that, to checking a thread local and a static atomic every single time. So it's a performance hit for applications which doesn't ever set up a subscriber.
Workaround
A way of working around this problem is never interacting with spans before setting the global subscriber, or alternatively if the setup can be run single threaded, setting up a Default-guard on that thread, and then dropping that after setting up the global subscriber.
We actually did something like that except for the single-thread setup, so that didn't help in our case.
Solutions
Something like a global dispatch guard would be pretty nice, I haven't thought that much about the implementation details, but the same way that set_default sets a thread-local dispatcher that is removed when dropped. A guard on the global dispatch which can then be swapped in with an initialized dispatch which keeps threads for setting their local state before it's dropped would probably be pretty nice.
Impact
This might be a potential DOS-vulnerability and affects anyone who breaks thread local state by interacting with get_default before setting a global dispatch. Anything that triggers spans to be generated on that/those broken threads will cause memory leakage for the application. People who set the global dispatch in their application by tracing::dispatcher::set_global_default(tracing::dispatcher::Dispatch::new(subscriber)) before that would be unaffected.